1
Cullen JN, Friedenberg SG. Whole Animal Genome Sequencing: user-friendly, rapid, containerized pipelines for processing, variant discovery, and annotation of short-read whole genome sequencing data. G3 (Bethesda, Md.) 2023; 13:jkad117. [PMID: 37243692 PMCID: PMC10411559 DOI: 10.1093/g3journal/jkad117] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 02/24/2023] [Revised: 02/24/2023] [Accepted: 05/20/2023] [Indexed: 05/29/2023]
Abstract
Advancements in massively parallel short-read sequencing technologies and the associated decreasing costs have led to large and diverse variant discovery efforts across species. However, processing high-throughput short-read sequencing data can be challenging with potential pitfalls and bioinformatics bottlenecks in generating reproducible results. Although a number of pipelines exist that address these challenges, these are often geared toward human or traditional model organism species and can be difficult to configure across institutions. Whole Animal Genome Sequencing (WAGS) is an open-source set of user-friendly, containerized pipelines designed to simplify the process of identifying germline short (SNP and indel) and structural variants (SVs) geared toward the veterinary community but adaptable to any species with a suitable reference genome. We present a description of the pipelines [adapted from the best practices of the Genome Analysis Toolkit (GATK)], along with benchmarking data from both the preprocessing and joint genotyping steps, consistent with a typical user workflow.
Affiliation(s)
- Jonah N Cullen
- Department of Veterinary Clinical Sciences, College of Veterinary Medicine, University of Minnesota, 1352 Boyd Ave, Saint Paul, MN 55108, USA
- Steven G Friedenberg
- Department of Veterinary Clinical Sciences, College of Veterinary Medicine, University of Minnesota, 1352 Boyd Ave, Saint Paul, MN 55108, USA
2
Ahmed Z, Renart EG, Mishra D, Zeeshan S. JWES: a new pipeline for whole genome/exome sequence data processing, management, and gene-variant discovery, annotation, prediction, and genotyping. FEBS Open Bio 2021; 11:2441-2452. [PMID: 34370400 PMCID: PMC8409305 DOI: 10.1002/2211-5463.13261] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 06/02/2021] [Revised: 07/18/2021] [Accepted: 08/02/2021] [Indexed: 01/07/2023]
Abstract
Whole genome and exome sequencing (WGS/WES) are the most popular next‐generation sequencing (NGS) methodologies and are at present often used to detect rare and common genetic variants of clinical significance. We emphasize that automated sequence data processing, management, and visualization should be an indispensable component of modern WGS and WES data analysis for sequence assembly, variant detection (SNPs, SVs), imputation, and resolution of haplotypes. In this manuscript, we present a newly developed findable, accessible, interoperable, and reusable (FAIR) bioinformatics‐genomics pipeline Java based Whole Genome/Exome Sequence Data Processing Pipeline (JWES) for efficient variant discovery and interpretation, and big data modeling and visualization. JWES is a cross‐platform, user‐friendly, product line application, that entails three modules: (a) data processing, (b) storage, and (c) visualization. The data processing module performs a series of different tasks for variant calling, the data storage module efficiently manages high‐volume gene‐variant data, and the data visualization module supports variant data interpretation with Circos graphs. The performance of JWES was tested and validated in‐house with different experiments, using Microsoft Windows, macOS Big Sur, and UNIX operating systems. JWES is an open‐source and freely available pipeline, allowing scientists to take full advantage of all the computing resources available, without requiring much computer science knowledge. We have successfully applied JWES for processing, management, and gene‐variant discovery, annotation, prediction, and genotyping of WGS and WES data to analyze variable complex disorders. In summary, we report the performance of JWES with some reproducible case studies, using open access and in‐house generated, high‐quality datasets.
Affiliation(s)
- Zeeshan Ahmed
- Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA; Department of Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers Biomedical and Health Sciences, New Brunswick, NJ, USA
- Eduard Gibert Renart
- Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Deepshikha Mishra
- Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Saman Zeeshan
- Rutgers Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
3
Ahmed Z, Renart EG, Zeeshan S. Genomics pipelines to investigate susceptibility in whole genome and exome sequenced data for variant discovery, annotation, prediction and genotyping. PeerJ 2021; 9:e11724. [PMID: 34395068 PMCID: PMC8320519 DOI: 10.7717/peerj.11724] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/29/2021] [Accepted: 06/14/2021] [Indexed: 12/12/2022]
Abstract
Over the last few decades, genomics has been leading toward an audacious future and has been changing our views about conducting biomedical research, studying diseases, and understanding diversity in our society across the human species. Whole genome and exome sequencing (WGS/WES) are two of the most popular next-generation sequencing (NGS) methodologies currently used to detect genetic variations of clinical significance. Investigating WGS/WES data for variant discovery and genotyping is based on the nexus of different data analytic applications. Although several bioinformatics applications have been developed, and many of those are freely available and published, timely finding and interpreting genetic variants are still challenging tasks among diagnostic laboratories and clinicians. In this study, we are interested in understanding, evaluating, and reporting the current state of solutions available to process NGS data of variable lengths and types for the identification of variants, alleles, and haplotypes. Within this scope, we consulted high-quality peer-reviewed literature published in the last 10 years. We focused on the standalone and networked bioinformatics applications proposed to efficiently process WGS and WES data and support downstream analysis for gene-variant discovery, annotation, prediction, and interpretation. We discuss our findings in this manuscript, which include but are not limited to the set of operations, workflow, data handling, involved tools, technologies and algorithms, and limitations of the assessed applications.
Affiliation(s)
- Zeeshan Ahmed
- Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA; Department of Medicine, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Eduard Gibert Renart
- Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Saman Zeeshan
- Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
4
Musacchia F, Ciolfi A, Mutarelli M, Bruselles A, Castello R, Pinelli M, Basu S, Banfi S, Casari G, Tartaglia M, Nigro V. VarGenius executes cohort-level DNA-seq variant calling and annotation and allows to manage the resulting data through a PostgreSQL database. BMC Bioinformatics 2018; 19:477. [PMID: 30541431 PMCID: PMC6291943 DOI: 10.1186/s12859-018-2532-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 06/12/2018] [Accepted: 11/21/2018] [Indexed: 02/06/2023]
Abstract
BACKGROUND Targeted resequencing has become the most used and cost-effective approach for identifying causative mutations of Mendelian diseases both for diagnostics and research purposes. Due to very rapid technological progress, NGS laboratories are expanding their capabilities to address the increasing number of analyses. Several open source tools are available to build a generic variant calling pipeline, but a tool able to simultaneously execute multiple analyses, organize, and categorize the samples is still missing. RESULTS Here we describe VarGenius, a Linux based command line software able to execute customizable pipelines for the analysis of multiple targeted resequencing data using parallel computing. VarGenius provides a database to store the output of the analysis (calling quality statistics, variant annotations, internal allelic variant frequencies) and sample information (personal data, genotypes, phenotypes). VarGenius can also perform the "joint analysis" of hundreds of samples with a single command, drastically reducing the time for the configuration and execution of the analysis. VarGenius executes the standard pipeline of the Genome Analysis Tool-Kit (GATK) best practices (GBP) for germinal variant calling, annotates the variants using Annovar, and generates a user-friendly output displaying the results through a web page. VarGenius has been tested on a parallel computing cluster with 52 machines with 120GB of RAM each. Under this configuration, a 50 M whole exome sequencing (WES) analysis for a family was executed in about 7 h (trio or quartet); a joint analysis of 30 WES in about 24 h and the parallel analysis of 34 single samples from a 1 M panel in about 2 h. CONCLUSIONS We developed VarGenius, a "master" tool that faces the increasing demand of heterogeneous NGS analyses and allows maximum flexibility for downstream analyses. It paves the way to a different kind of analysis, centered on cohorts rather than on singleton. 
Patient and variant information are stored into the database and any output file can be accessed programmatically. VarGenius can be used for routine analyses by biomedical researchers with basic Linux skills providing additional flexibility for computational biologists to develop their own algorithms for the comparison and analysis of data. The software is freely available at: https://github.com/frankMusacchia/VarGenius.
Affiliation(s)
- F. Musacchia
- Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
- A. Ciolfi
- Genetics and Rare Diseases Research Division, Bambino Gesù Children’s Hospital, Istituto di Ricovero e Cura a Carattere Scientifico, Rome, Italy
- M. Mutarelli
- Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
- A. Bruselles
- Department of Oncology and Molecular Medicine, Istituto Superiore di Sanità, Rome, Italy
- R. Castello
- Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
- M. Pinelli
- Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
- S. Basu
- Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- S. Banfi
- Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
- Università degli studi della Campania “Luigi Vanvitelli”, Caserta, Italy
- G. Casari
- Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
- M. Tartaglia
- Genetics and Rare Diseases Research Division, Bambino Gesù Children’s Hospital, Istituto di Ricovero e Cura a Carattere Scientifico, Rome, Italy
- V. Nigro
- Telethon Institute for Genetics and Medicine, Viale Campi Flegrei, 34, 80078 Pozzuoli (Naples), Italy
- Università degli studi della Campania “Luigi Vanvitelli”, Caserta, Italy
5
Pan C, McInnes G, Deflaux N, Snyder M, Bingham J, Datta S, Tsao PS. Cloud-based interactive analytics for terabytes of genomic variants data. Bioinformatics 2017; 33:3709-3715. [PMID: 28961771 PMCID: PMC5860318 DOI: 10.1093/bioinformatics/btx468] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 03/02/2017] [Revised: 06/30/2017] [Accepted: 07/25/2017] [Indexed: 12/30/2022]
Abstract
MOTIVATION Large scale genomic sequencing is now widely used to decipher questions in diverse realms such as biological function, human diseases, evolution, ecosystems, and agriculture. With the quantity and diversity these data harbor, a robust and scalable data handling and analysis solution is desired. RESULTS We present interactive analytics using a cloud-based columnar database built on Dremel to perform information compression, comprehensive quality controls, and biological information retrieval in large volumes of genomic data. We demonstrate such Big Data computing paradigms can provide orders of magnitude faster turnaround for common genomic analyses, transforming long-running batch jobs submitted via a Linux shell into questions that can be asked from a web browser in seconds. Using this method, we assessed a study population of 475 deeply sequenced human genomes for genomic call rate, genotype and allele frequency distribution, variant density across the genome, and pharmacogenomic information. AVAILABILITY AND IMPLEMENTATION Our analysis framework is implemented in Google Cloud Platform and BigQuery. Codes are available at https://github.com/StanfordBioinformatics/mvp_aaa_codelabs. CONTACT cuiping@stanford.edu or ptsao@stanford.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Cuiping Pan
- VA Palo Alto Health Care System, Palo Alto Epidemiology Research and Information Center for Genomics, CA, USA
- Department of Genetics, Stanford University, CA, USA
- Gregory McInnes
- VA Palo Alto Health Care System, Palo Alto Epidemiology Research and Information Center for Genomics, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, CA, USA
- Nicole Deflaux
- Google, Mountain View, CA, USA
- Verily Life Sciences, South San Francisco, CA, USA
- Michael Snyder
- Department of Genetics, Stanford University, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, CA, USA
- Jonathan Bingham
- Google, Mountain View, CA, USA
- Verily Life Sciences, South San Francisco, CA, USA
- Somalee Datta
- VA Palo Alto Health Care System, Palo Alto Epidemiology Research and Information Center for Genomics, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, CA, USA
- Philip S Tsao
- VA Palo Alto Health Care System, Palo Alto Epidemiology Research and Information Center for Genomics, CA, USA
- Division of Cardiovascular Medicine, Stanford University, Stanford, CA, USA
6
MC-GenomeKey: a multicloud system for the detection and annotation of genomic variants. BMC Bioinformatics 2017; 18:49. [PMID: 28107819 PMCID: PMC5248509 DOI: 10.1186/s12859-016-1454-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 05/20/2016] [Accepted: 12/24/2016] [Indexed: 12/28/2022]
Abstract
Background Next Generation Genome sequencing techniques became affordable for massive sequencing efforts devoted to clinical characterization of human diseases. However, the cost of providing cloud-based data analysis of the mounting datasets remains a concerning bottleneck for providing cost-effective clinical services. To address this computational problem, it is important to optimize the variant analysis workflow and the used analysis tools to reduce the overall computational processing time, and concomitantly reduce the processing cost. Furthermore, it is important to capitalize on the use of the recent development in the cloud computing market, which have witnessed more providers competing in terms of products and prices. Results In this paper, we present a new package called MC-GenomeKey (Multi-Cloud GenomeKey) that efficiently executes the variant analysis workflow for detecting and annotating mutations using cloud resources from different commercial cloud providers. Our package supports Amazon, Google, and Azure clouds, as well as, any other cloud platform based on OpenStack. Our package allows different scenarios of execution with different levels of sophistication, up to the one where a workflow can be executed using a cluster whose nodes come from different clouds. MC-GenomeKey also supports scenarios to exploit the spot instance model of Amazon in combination with the use of other cloud platforms to provide significant cost reduction. To the best of our knowledge, this is the first solution that optimizes the execution of the workflow using computational resources from different cloud providers. Conclusions MC-GenomeKey provides an efficient multicloud based solution to detect and annotate mutations. The package can run in different commercial cloud platforms, which enables the user to seize the best offers. 
The package also provides a reliable means to make use of the low-cost spot instance model of Amazon, as it provides an efficient solution to the sudden termination of spot machines as a result of a sudden price increase. The package has a web-interface and it is available for free for academic use.
7
Hintzsche J, Kim J, Yadav V, Amato C, Robinson SE, Seelenfreund E, Shellman Y, Wisell J, Applegate A, McCarter M, Box N, Tentler J, De S, Robinson WA, Tan AC. IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples. J Am Med Inform Assoc 2016; 23:721-30. [PMID: 27026619 DOI: 10.1093/jamia/ocw022] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Received: 08/31/2015] [Accepted: 02/01/2016] [Indexed: 11/14/2022]
Abstract
OBJECTIVE: Currently, there is a disconnect between finding a patient's relevant molecular profile and predicting actionable therapeutics. Here we develop and implement the Integrating Molecular Profiles with Actionable Therapeutics (IMPACT) analysis pipeline, linking variants detected from whole-exome sequencing (WES) to actionable therapeutics. METHODS AND MATERIALS: The IMPACT pipeline contains 4 analytical modules: detecting somatic variants, calling copy number alterations, predicting drugs against deleterious variants, and analyzing tumor heterogeneity. We tested the IMPACT pipeline on whole-exome sequencing data in The Cancer Genome Atlas (TCGA) lung adenocarcinoma samples with known EGFR mutations. We also used IMPACT to analyze melanoma patient tumor samples before treatment, after BRAF-inhibitor treatment, and after BRAF- and MEK-inhibitor treatment. RESULTS: IMPACT correctly identified known EGFR mutations in the TCGA lung adenocarcinoma samples. IMPACT linked these EGFR mutations to the appropriate Food and Drug Administration (FDA)-approved EGFR inhibitors. For the melanoma patient samples, we identified NRAS p.Q61K as an acquired resistance mutation to BRAF-inhibitor treatment. We also identified CDKN2A deletion as a novel acquired resistance mutation to BRAFi/MEKi inhibition. The IMPACT analysis pipeline predicts these somatic variants to actionable therapeutics. We observed the clonal dynamic in the tumor samples after various treatments. We showed that IMPACT not only helped in successful prioritization of clinically relevant variants but also linked these variations to possible targeted therapies. CONCLUSION: IMPACT provides a new bioinformatics strategy to delineate candidate somatic variants and actionable therapies. This approach can be applied to other patient tumor samples to discover effective drug targets for personalized medicine. IMPACT is publicly available at http://tanlab.ucdenver.edu/IMPACT.
Affiliation(s)
- Jennifer Hintzsche
- Division of Medical Oncology, Department of Medicine, School of Medicine
- Jihye Kim
- Division of Medical Oncology, Department of Medicine, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Vinod Yadav
- Division of Biomedical Informatics and Personalized Medicine, Department of Medicine, School of Medicine
- Carol Amato
- Division of Medical Oncology, Department of Medicine, School of Medicine
- Steven E Robinson
- Division of Medical Oncology, Department of Medicine, School of Medicine
- Eric Seelenfreund
- Division of Medical Oncology, Department of Medicine, School of Medicine
- Yiqun Shellman
- Department of Dermatology, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Joshua Wisell
- Department of Pathology, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Allison Applegate
- Division of Medical Oncology, Department of Medicine, School of Medicine
- Martin McCarter
- Department of Surgery, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Neil Box
- Department of Dermatology, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- John Tentler
- Division of Medical Oncology, Department of Medicine, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Subhajyoti De
- Division of Biomedical Informatics and Personalized Medicine, Department of Medicine, School of Medicine Department of Biostatistics and Informatics, Colorado School of Public Health University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- William A Robinson
- Division of Medical Oncology, Department of Medicine, School of Medicine University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Aik Choon Tan
- Division of Medical Oncology, Department of Medicine, School of Medicine Department of Biostatistics and Informatics, Colorado School of Public Health University of Colorado Cancer Center All: University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
8
Roy S, LaFramboise WA, Nikiforov YE, Nikiforova MN, Routbort MJ, Pfeifer J, Nagarajan R, Carter AB, Pantanowitz L. Next-Generation Sequencing Informatics: Challenges and Strategies for Implementation in a Clinical Environment. Arch Pathol Lab Med 2016; 140:958-75. [PMID: 26901284 DOI: 10.5858/arpa.2015-0507-ra] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Indexed: 11/06/2022]
Abstract
CONTEXT: Next-generation sequencing (NGS) is revolutionizing the discipline of laboratory medicine, with a deep and direct impact on patient care. Although it empowers clinical laboratories with unprecedented genomic sequencing capability, NGS has brought along obvious and obtrusive informatics challenges. Bioinformatics and clinical informatics are separate disciplines with typically a small degree of overlap, but they have been brought together by the enthusiastic adoption of NGS in clinical laboratories. The result has been a collaborative environment for the development of novel informatics solutions. Sustaining NGS-based testing in a regulated clinical environment requires institutional support to build and maintain a practical, robust, scalable, secure, and cost-effective informatics infrastructure. OBJECTIVE: To discuss the novel NGS informatics challenges facing pathology laboratories today and offer solutions and future developments to address these obstacles. DATA SOURCES: The published literature pertaining to NGS informatics was reviewed. The coauthors, experts in the fields of molecular pathology, precision medicine, and pathology informatics, also contributed their experiences. CONCLUSIONS: The boundary between bioinformatics and clinical informatics has significantly blurred with the introduction of NGS into clinical molecular laboratories. Next-generation sequencing technology and the data derived from these tests, if managed well in the clinical laboratory, will redefine the practice of medicine. In order to sustain this progress, adoption of smart computing technology will be essential. Computational pathologists will be expected to play a major role in rendering diagnostic and theranostic services by leveraging "Big Data" and modern computing tools.
Affiliation(s)
- Liron Pantanowitz
- From the Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania (Drs Roy, LaFramboise, Nikiforov, Nikiforova, and Pantanowitz); the Department of Pathology, MD Anderson Cancer Center, Houston, Texas (Dr Routbort); the Department of Pathology and Immunology, Washington University School of Medicine, St Louis, Missouri (Drs Pfeifer and Nagarajan); PierianDx, St Louis, Missouri (Dr Nagarajan); and the Department of Pathology and Laboratory Medicine, Children's Healthcare of Atlanta, Atlanta, Georgia (Dr Carter)
9
Abstract
Next-generation sequencing (NGS) approaches are highly applicable to clinical studies. We review recent advances in sequencing technologies, as well as their benefits and tradeoffs, to provide an overview of clinical genomics from study design to computational analysis. Sequencing technologies enable genomic, transcriptomic, and epigenomic evaluations. Studies that use a combination of whole genome, exome, mRNA, and bisulfite sequencing are now feasible due to decreasing sequencing costs. Single-molecule sequencing increases read length, and the MinION™ nanopore sequencer offers a uniquely portable option at a lower cost. Many of the published comparisons we review here address the challenges associated with different sequencing methods. Overall, NGS techniques, coupled with continually improving analysis algorithms, are useful for clinical studies in many realms, including cancer, chronic illness, and neurobiology. We, and others in the field, anticipate the clinical use of NGS approaches will continue to grow, especially as we shift into an era of precision medicine.
Affiliation(s)
- Priyanka Vijay
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York. Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medical College, New York, New York
- Alexa B.R. McIntyre
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York. Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medical College, New York, New York
- Christopher E. Mason
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York. Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York. Feil Family Brain and Mind Research Institute, New York, New York
- Jeffrey P. Greenfield
- Department of Neurological Surgery, New York-Presbyterian Hospital, Weill Cornell Medical College, New York, New York
- Sheng Li
- Department of Neurological Surgery, New York-Presbyterian Hospital, Weill Cornell Medical College, New York, New York
10
Abstract
High-throughput platforms such as microarray, mass spectrometry, and next-generation sequencing are producing an increasing volume of omics data that needs large data storage and computing power. Cloud computing offers massively scalable computing and storage, data sharing, and on-demand access to resources and applications anytime and anywhere, and thus it may represent the key technology for facing those issues. In fact, in recent years it has been adopted for the deployment of different bioinformatics solutions and services both in academia and in industry. Despite this, cloud computing presents several issues regarding the security and privacy of data, which are particularly important when analyzing patient data, such as in personalized medicine. This chapter reviews the main academic and industrial cloud-based bioinformatics solutions, with a special focus on microarray data analysis solutions, and underlines the main issues and problems related to the use of such platforms for the storage and analysis of patient data.
Affiliation(s)
- Barbara Calabrese
- Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy
- Mario Cannataro
- Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy.
11
Shameer K, Tripathi LP, Kalari KR, Dudley JT, Sowdhamini R. Interpreting functional effects of coding variants: challenges in proteome-scale prediction, annotation and assessment. Brief Bioinform 2015; 17:841-62. [PMID: 26494363 DOI: 10.1093/bib/bbv084] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 05/14/2015] [Indexed: 12/20/2022]
Abstract
Accurate assessment of genetic variation in human DNA sequencing studies remains a nontrivial challenge in clinical genomics and genome informatics. Ascribing functional roles and/or clinical significances to single nucleotide variants identified from a next-generation sequencing study is an important step in genome interpretation. Experimental characterization of all the observed functional variants is yet impractical; thus, the prediction of functional and/or regulatory impacts of the various mutations using in silico approaches is an important step toward the identification of functionally significant or clinically actionable variants. The relationships between genotypes and the expressed phenotypes are multilayered and biologically complex; such relationships present numerous challenges and at the same time offer various opportunities for the design of in silico variant assessment strategies. Over the past decade, many bioinformatics algorithms have been developed to predict functional consequences of single nucleotide variants in the protein coding regions. In this review, we provide an overview of the bioinformatics resources for the prediction, annotation and visualization of coding single nucleotide variants. We discuss the currently available approaches and major challenges from the perspective of protein sequence, structure, function and interactions that require consideration when interpreting the impact of putatively functional variants. We also discuss the relevance of incorporating integrated workflows for predicting the biomedical impact of the functionally important variations encoded in a genome, exome or transcriptome. Finally, we propose a framework to classify variant assessment approaches and strategies for incorporation of variant assessment within electronic health records.
|
12
|
Souilmi Y, Lancaster AK, Jung JY, Rizzo E, Hawkins JB, Powles R, Amzazi S, Ghazal H, Tonellato PJ, Wall DP. Scalable and cost-effective NGS genotyping in the cloud. BMC Med Genomics 2015; 8:64. [PMID: 26470712 PMCID: PMC4608296 DOI: 10.1186/s12920-015-0134-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 04/16/2015] [Accepted: 09/11/2015] [Indexed: 12/20/2022] Open
Abstract
Background: While next-generation sequencing (NGS) costs have plummeted in recent years, cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data can be accurately rendered to medically actionable reports within a time window of hours and at scales of economy in the tens of dollars. Results: We take a step towards addressing this challenge by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows, making optimal use of high-performance compute clusters. Here we show that the Amazon Web Services (AWS) implementation of GenomeKey via COSMOS provides a fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets. Conclusions: Our systematic benchmarking reveals important new insights and considerations for achieving clinical turnaround of whole genome analysis, covering optimization and workflow management, including strategic batching of individual genomes and efficient cluster resource configuration.
Affiliation(s)
- Yassine Souilmi
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, MA, 02115, USA.
- Department of Biology, Mohamed Vth University, 4 Ibn Battouta Avenue, B.P: 1014RP, Rabat, Morocco.
- Alex K Lancaster
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, MA, 02115, USA.
- Department of Pathology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, 02215, USA.
- Jae-Yoon Jung
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, MA, 02115, USA.
- Ettore Rizzo
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, via Ferrata 1, Pavia, 27100, Italy.
- Jared B Hawkins
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, MA, 02115, USA.
- Ryan Powles
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, MA, 02115, USA.
- Saaïd Amzazi
- Department of Biology, Mohamed Vth University, 4 Ibn Battouta Avenue, B.P: 1014RP, Rabat, Morocco.
- Hassan Ghazal
- Department of Biology, Mohamed First University, Oujda, Nador, Morocco.
- Peter J Tonellato
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, MA, 02115, USA.
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02215, USA.
- Dennis P Wall
- Department of Pediatrics and Psychiatry (by courtesy), Division of Systems Medicine & Program in Biomedical Informatics, Stanford University, Stanford, CA, 94305, USA.
|
13
|
Boulund F, Sjögren A, Kristiansson E. Tentacle: distributed quantification of genes in metagenomes. Gigascience 2015; 4:40. [PMID: 26351566 PMCID: PMC4562114 DOI: 10.1186/s13742-015-0078-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 02/04/2015] [Accepted: 08/05/2015] [Indexed: 01/28/2023] Open
Abstract
Background: In metagenomics, microbial communities are sequenced at increasingly high resolution, generating datasets with billions of DNA fragments. Novel methods that can efficiently process the growing volumes of sequence data are necessary for the accurate analysis and interpretation of existing and upcoming metagenomes. Findings: Here we present Tentacle, a novel framework that uses distributed computational resources for gene quantification in metagenomes. Tentacle is implemented using a dynamic master-worker approach in which DNA fragments are streamed via a network and processed in parallel on worker nodes. Tentacle is modular, extensible, and comes with support for six commonly used sequence aligners. It is easy to adapt Tentacle to different applications in metagenomics and easy to integrate into existing workflows. Conclusions: Evaluations show that Tentacle scales very well with increasing computing resources. We illustrate the versatility of Tentacle on three different use cases. Tentacle is written for Linux in Python 2.7 and is published as open source under the GNU General Public License (v3). Documentation, tutorials, installation instructions, and the source code are freely available online at: http://bioinformatics.math.chalmers.se/tentacle.
Affiliation(s)
- Fredrik Boulund
- Division of Statistics, Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
- Anders Sjögren
- Division of Statistics, Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
- Erik Kristiansson
- Division of Statistics, Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
|
14
|
Shringarpure SS, Carroll A, De La Vega FM, Bustamante CD. Inexpensive and Highly Reproducible Cloud-Based Variant Calling of 2,535 Human Genomes. PLoS One 2015; 10:e0129277. [PMID: 26110529 PMCID: PMC4482534 DOI: 10.1371/journal.pone.0129277] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 01/15/2015] [Accepted: 05/06/2015] [Indexed: 01/22/2023] Open
Abstract
Population-scale sequencing of whole human genomes is becoming economically feasible; however, data management and analysis remain a formidable challenge for many research groups. Large sequencing studies, like the 1000 Genomes Project, have improved our understanding of human demography and the effect of rare genetic variation in disease. Variant calling on datasets of hundreds or thousands of genomes is time-consuming, expensive, and not easily reproducible given the myriad components of a variant calling pipeline. Here, we describe a cloud-based pipeline for joint variant calling in large samples using the Real Time Genomics population caller. We deployed the population caller on the Amazon cloud with the DNAnexus platform in order to achieve low-cost variant calling. Using our pipeline, we were able to identify 68.3 million variants in 2,535 samples from Phase 3 of the 1000 Genomes Project. By performing the variant calling in a parallel manner, the data were processed within 5 days at a compute cost of $7.33 per sample (a total cost of $18,590 for completed jobs and $21,805 for all jobs). Analysis of cost dependence and running time on the data size suggests that, given near-linear scalability, cloud computing can be a cheap and efficient platform for analyzing even larger sequencing studies in the future.
Affiliation(s)
- Francisco M. De La Vega
- Department of Genetics, Stanford University, Stanford, California 94305, USA
- Real Time Genomics, Inc. San Bruno, California 94066, USA
- Carlos D. Bustamante
- Department of Genetics, Stanford University, Stanford, California 94305, USA
|
15
|
Gao X, Xu J, Starmer J. Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses. BMC Res Notes 2015; 8:72. [PMID: 25889517 PMCID: PMC4376134 DOI: 10.1186/s13104-015-1027-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 07/09/2014] [Accepted: 02/23/2015] [Indexed: 12/26/2022] Open
Abstract
Background: Whole-exome sequencing (WES) is a popular next-generation sequencing technology used by numerous laboratories with various levels of statistical and analytical expertise. Centralized databases, such as the Sequence Read Archive and the European Nucleotide Archive, allow data to be reanalyzed by independent labs to confirm results and derive additional insights. Access to new and shared data highlights the necessity for software that both lowers the statistical and analytical expertise required to generate results and promotes reproducible methodology among laboratories. Findings: We have developed fastq2vcf, a pipeline that automates the genomic variant calling process using multiple callers. Fastq2vcf offers improved flexibility, efficiency, and reproducibility by seamlessly integrating several leading sequencing analysis tools. It outputs not only the annotated variant call set for each caller, but also the consensus variant call set shared by different callers. Furthermore, it can be customized and extended easily. Conclusions: Our software tool automatically generates executable command lines for a variety of tools required for analyzing WES data. It is also highly configurable and provides users with complete control of the processing procedure, making it easy to submit and track jobs in both single workstation and parallelized computing environments. By using this pipeline, WES analysis can be easily reproduced.
Affiliation(s)
- Xiaoyi Gao
- Department of Ophthalmology and Visual Sciences, University of Illinois at Chicago, Chicago, IL, 60612, USA.
- Jianpeng Xu
- Department of Ophthalmology and Visual Sciences, University of Illinois at Chicago, Chicago, IL, 60612, USA.
- Joshua Starmer
- Department of Genetics, University of North Carolina-Chapel Hill, Chapel Hill, NC, 27599, USA.
- Carolina Center for Genome Sciences, The University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA.
- Lineberger Comprehensive Cancer Center, The University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA.
|
16
|
Maji RK, Sarkar A, Khatua S, Dasgupta S, Ghosh Z. PVT: an efficient computational procedure to speed up next-generation sequence analysis. BMC Bioinformatics 2014; 15:167. [PMID: 24894600 PMCID: PMC4063226 DOI: 10.1186/1471-2105-15-167] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/02/2014] [Accepted: 05/07/2014] [Indexed: 12/05/2022] Open
Abstract
Background: High-throughput Next-Generation Sequencing (NGS) techniques are advancing genomics and molecular biology research. This technology generates very large volumes of data, which poses a major challenge for scientists seeking an efficient, cost- and time-effective way to analyse them. Further, for the different types of NGS data, there are certain common challenging steps involved in the analysis. Spliced alignment is one such fundamental step in NGS data analysis, and it is extremely computationally intensive as well as time-consuming. Serious problems exist even with the most widely used spliced alignment tools. TopHat is one such widely used tool which, although it supports multithreading, does not efficiently utilize computational resources in terms of CPU utilization and memory. Here we introduce PVT (Pipelined Version of TopHat), in which we take a modular approach by breaking TopHat’s serial execution into a pipeline of multiple stages, thereby increasing the degree of parallelization and computational resource utilization. Thus we address the inefficiencies in TopHat so as to analyze large NGS data efficiently. Results: We analysed the SRA datasets SRX026839 and SRX026838, consisting of single-end reads, and the SRA dataset SRR1027730, consisting of paired-end reads. We used TopHat v2.0.8 to analyse these datasets and noted the CPU usage, memory footprint and execution time during spliced alignment. With this basic information, we designed PVT, a pipelined version of TopHat that removes the redundant computational steps during ‘spliced alignment’ and breaks the job into a pipeline of multiple stages (each comprising different step(s)) to improve its resource utilization, thus reducing the execution time. Conclusions: PVT provides an improvement over TopHat for spliced alignment in NGS data analysis. PVT thus resulted in the reduction of the execution time to ~23% for the single-end read dataset. Further, PVT designed for paired-end reads showed an improved performance of ~41% over TopHat (for the chosen data) with respect to execution time. Moreover, we propose PVT-Cloud, which implements the PVT pipeline in a cloud computing system.
Affiliation(s)
- Zhumur Ghosh
- Bioinformatics Centre, Bose Institute, Kolkata 700054, India.
|