1
Copeland I, Wonkam-Tingang E, Gupta-Malhotra M, Hashmi SS, Han Y, Jajoo A, Hall NJ, Hernandez PP, Lie N, Liu D, Xu J, Rosenfeld J, Haldipur A, Desire Z, Coban-Akdemir ZH, Scott DA, Li Q, Chao HT, Zaske AM, Lupski JR, Milewicz DM, Shete S, Posey JE, Hanchard NA. Exome sequencing implicates ancestry-related Mendelian variation at SYNE1 in childhood-onset essential hypertension. JCI Insight 2024; 9:e172152. [PMID: 38716726 PMCID: PMC11141928 DOI: 10.1172/jci.insight.172152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Accepted: 03/19/2024] [Indexed: 05/12/2024] Open
Abstract
Childhood-onset essential hypertension (COEH) is an uncommon form of hypertension that manifests in childhood or adolescence and, in the United States, disproportionately affects children of African ancestry. The etiology of COEH is unknown, but its childhood onset, low prevalence, high heritability, and skewed ancestral demography suggest the potential to identify rare genetic variation segregating in a Mendelian manner among affected individuals and thereby implicate genes important to disease pathogenesis. However, no COEH genes have been reported to date. Here, we identify recessive segregation of rare and putatively damaging missense variation in the spectrin domain of spectrin repeat containing nuclear envelope protein 1 (SYNE1), a cardiovascular candidate gene, in 3 of 16 families with early-onset COEH without an antecedent family history. By leveraging exome sequence data from an additional 48 COEH families, 1,700 in-house trios, and publicly available data sets, we demonstrate that compound heterozygous SYNE1 variation in these COEH individuals occurred more often than expected by chance and that this class of biallelic rare variation was significantly enriched among individuals of African genetic ancestry. Using in vitro shRNA knockdown of SYNE1, we show that reduced SYNE1 expression resulted in a substantial decrease in the elasticity of smooth muscle vascular cells that could be rescued by pharmacological inhibition of the downstream RhoA/Rho-associated protein kinase pathway. These results provide insights into the molecular genetics and underlying pathophysiology of COEH and suggest a role for precision therapeutics in the future.
Affiliation(s)
- Ian Copeland
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Edmond Wonkam-Tingang
  - Childhood Complex Disease Genomics Section, National Human Genome Research Institute, NIH, Bethesda, USA
- S. Shahrukh Hashmi
  - Department of Pediatrics, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Yixing Han
  - Childhood Complex Disease Genomics Section, National Human Genome Research Institute, NIH, Bethesda, USA
- Aarti Jajoo
  - Childhood Complex Disease Genomics Section, National Human Genome Research Institute, NIH, Bethesda, USA
- Nancy J. Hall
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - US Department of Agriculture Agricultural Research Service Children’s Nutrition Research Center, Baylor College of Medicine, Houston, Texas, USA
- Paula P. Hernandez
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - US Department of Agriculture Agricultural Research Service Children’s Nutrition Research Center, Baylor College of Medicine, Houston, Texas, USA
- Natasha Lie
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - Childhood Complex Disease Genomics Section, National Human Genome Research Institute, NIH, Bethesda, USA
  - US Department of Agriculture Agricultural Research Service Children’s Nutrition Research Center, Baylor College of Medicine, Houston, Texas, USA
- Dan Liu
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Jun Xu
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Jill Rosenfeld
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - Baylor Genetics, Houston, Texas, USA
- Aparna Haldipur
  - Childhood Complex Disease Genomics Section, National Human Genome Research Institute, NIH, Bethesda, USA
- Zelene Desire
  - Childhood Complex Disease Genomics Section, National Human Genome Research Institute, NIH, Bethesda, USA
- Zeynep H. Coban-Akdemir
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - Human Genetics Center, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Daryl A. Scott
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - Texas Children’s Hospital, Houston, Texas, USA
  - Department of Molecular Physiology and Biophysics
- Qing Li
  - Childhood Complex Disease Genomics Section, National Human Genome Research Institute, NIH, Bethesda, USA
- Hsiao-Tuan Chao
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - Division of Neurology and Developmental Neuroscience, Department of Pediatrics; and
  - Department of Neuroscience, Baylor College of Medicine, Houston, Texas, USA
  - Cain Pediatric Neurology Research Foundation Laboratories, Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital and Baylor College of Medicine, Houston, Texas, USA
  - McNair Medical Institute, The Robert and Janice McNair Foundation, Houston, Texas, USA
- Ana M. Zaske
  - Department of Pediatrics, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- James R. Lupski
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - Texas Children’s Hospital, Houston, Texas, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
- Dianna M. Milewicz
  - Department of Internal Medicine, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Sanjay Shete
  - The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Jennifer E. Posey
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - McNair Medical Institute, The Robert and Janice McNair Foundation, Houston, Texas, USA
- Neil A. Hanchard
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
  - Childhood Complex Disease Genomics Section, National Human Genome Research Institute, NIH, Bethesda, USA
2
Ahmed Z, Renart EG, Mishra D, Zeeshan S. JWES: a new pipeline for whole genome/exome sequence data processing, management, and gene-variant discovery, annotation, prediction, and genotyping. FEBS Open Bio 2021; 11:2441-2452. [PMID: 34370400 PMCID: PMC8409305 DOI: 10.1002/2211-5463.13261] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/02/2021] [Revised: 07/18/2021] [Accepted: 08/02/2021] [Indexed: 01/07/2023] Open
Abstract
Whole genome and exome sequencing (WGS/WES) are the most popular next‐generation sequencing (NGS) methodologies and are at present often used to detect rare and common genetic variants of clinical significance. We emphasize that automated sequence data processing, management, and visualization should be an indispensable component of modern WGS and WES data analysis for sequence assembly, variant detection (SNPs, SVs), imputation, and resolution of haplotypes. In this manuscript, we present a newly developed findable, accessible, interoperable, and reusable (FAIR) bioinformatics‐genomics pipeline, the Java‐based Whole Genome/Exome Sequence Data Processing Pipeline (JWES), for efficient variant discovery and interpretation, and big data modeling and visualization. JWES is a cross‐platform, user‐friendly, product‐line application that comprises three modules: (a) data processing, (b) storage, and (c) visualization. The data processing module performs a series of tasks for variant calling, the data storage module efficiently manages high‐volume gene‐variant data, and the data visualization module supports variant data interpretation with Circos graphs. The performance of JWES was tested and validated in‐house with different experiments, using Microsoft Windows, macOS Big Sur, and UNIX operating systems. JWES is an open‐source and freely available pipeline, allowing scientists to take full advantage of all the computing resources available, without requiring much computer science knowledge. We have successfully applied JWES for processing, management, and gene‐variant discovery, annotation, prediction, and genotyping of WGS and WES data to analyze variable complex disorders. In summary, we report the performance of JWES with some reproducible case studies, using open‐access and in‐house generated, high‐quality datasets.
Affiliation(s)
- Zeeshan Ahmed
  - Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
  - Department of Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers Biomedical and Health Sciences, New Brunswick, NJ, USA
- Eduard Gibert Renart
  - Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Deepshikha Mishra
  - Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Saman Zeeshan
  - Rutgers Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
3
Ahmed Z, Renart EG, Zeeshan S. Genomics pipelines to investigate susceptibility in whole genome and exome sequenced data for variant discovery, annotation, prediction and genotyping. PeerJ 2021; 9:e11724. [PMID: 34395068 PMCID: PMC8320519 DOI: 10.7717/peerj.11724] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/29/2021] [Accepted: 06/14/2021] [Indexed: 12/12/2022] Open
Abstract
Over the last few decades, genomics has been leading toward an audacious future and changing our views about conducting biomedical research, studying diseases, and understanding diversity in our society across the human species. Whole genome and exome sequencing (WGS/WES) are two of the most popular next-generation sequencing (NGS) methodologies currently used to detect genetic variations of clinical significance. Investigating WGS/WES data for variant discovery and genotyping relies on a nexus of different data analytic applications. Although several bioinformatics applications have been developed, many of them freely available and published, finding and interpreting genetic variants in a timely manner remains challenging for diagnostic laboratories and clinicians. In this study, we are interested in understanding, evaluating, and reporting the current state of solutions available to process NGS data of variable lengths and types for the identification of variants, alleles, and haplotypes. Within this scope, we consulted high-quality peer-reviewed literature published in the last 10 years. We focused on the standalone and networked bioinformatics applications proposed to efficiently process WGS and WES data and to support downstream analysis for gene-variant discovery, annotation, prediction, and interpretation. We discuss our findings in this manuscript, which include but are not limited to the set of operations, workflow, data handling, involved tools, technologies and algorithms, and limitations of the assessed applications.
Affiliation(s)
- Zeeshan Ahmed
  - Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
  - Department of Medicine, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Eduard Gibert Renart
  - Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
- Saman Zeeshan
  - Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
4
Bhardwaj A, Bag SK. PLANET-SNP pipeline: PLants based ANnotation and Establishment of True SNP pipeline. Genomics 2019; 111:1066-1077. [PMID: 31533899 DOI: 10.1016/j.ygeno.2018.07.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/02/2017] [Revised: 06/10/2018] [Accepted: 07/02/2018] [Indexed: 12/30/2022]
Abstract
Accurate prediction of SNPs (single nucleotide polymorphisms) from high-throughput sequencing data is a challenging problem with the potential to reveal variation within plant species. For extracting useful information from bulk data, machine learning (ML) could lead to the development of an accurate model based on learning from prior information. We performed state-of-the-art, in-depth learning on six different plant species. A comparative evaluation of five different algorithms showed that Random Forest substantially outperformed the others in selecting potential SNPs, with markedly improved prediction accuracy under 10-fold cross-validation, and it was integrated into a system known as PLANET-SNP. We present an accurate method to extract potential SNPs with user-specific customizable parameters. It will facilitate the identification of efficient and functional SNPs in an easy and intuitive way. The PLANET-SNP pipeline is very flexible in terms of data input and output formats. The PLANET-SNP pipeline is available at http://www.ncgd.nbri.res.in/PLANET-SNP-Pipeline.aspx.
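The 10-fold cross-validation mentioned in this abstract can be illustrated with a minimal, standard-library Python sketch. It is not the PLANET-SNP implementation: the function names `k_fold_indices` and `cross_validate`, and the `fit` and `score` callables, are hypothetical and stand in for whatever classifier (for example a Random Forest) is being compared.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(fit, score, X, y, k=10, seed=0):
    """Return the mean held-out score over k folds.

    fit(X_train, y_train) -> model and score(model, X_test, y_test) -> float
    are supplied by the caller (hypothetical interface).
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(
            score(model, [X[i] for i in test_idx], [y[i] for i in test_idx])
        )
    return sum(scores) / len(scores)
```

Each sample is held out exactly once, so the averaged score estimates performance on unseen data; comparing several algorithms with the same `seed` keeps the folds identical across them.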
Affiliation(s)
- Archana Bhardwaj
  - Academy of Scientific and Innovative Research (AcSIR), CSIR-NBRI Campus, Lucknow, India
  - Computational Biology Lab, Council of Scientific and Industrial Research - National Botanical Research Institute (CSIR-NBRI), Rana Pratap Marg, Lucknow, Uttar Pradesh 226001, India
- Sumit K Bag
  - Academy of Scientific and Innovative Research (AcSIR), CSIR-NBRI Campus, Lucknow, India
  - Computational Biology Lab, Council of Scientific and Industrial Research - National Botanical Research Institute (CSIR-NBRI), Rana Pratap Marg, Lucknow, Uttar Pradesh 226001, India
5
Rasnic R, Brandes N, Zuk O, Linial M. Substantial batch effects in TCGA exome sequences undermine pan-cancer analysis of germline variants. BMC Cancer 2019; 19:783. [PMID: 31391007 PMCID: PMC6686424 DOI: 10.1186/s12885-019-5994-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 11/18/2018] [Accepted: 07/30/2019] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND In recent years, research on cancer predisposition germline variants has emerged as a prominent field. The identification of somatic mutations is based on a reliable mapping of the patient's germline variants. In addition, the statistics of germline variant frequencies in healthy individuals and cancer patients are the basis for seeking candidate cancer predisposition genes. The Cancer Genome Atlas (TCGA) is one of the main sources of such data, providing a diverse collection of molecular data including deep sequencing for more than 30 types of cancer from > 10,000 patients. METHODS Our hypothesis in this study is that whole exome sequences from blood samples of cancer patients are not expected to show systematic differences among cancer types. To test this hypothesis, we analyzed common and rare germline variants across six cancer types, covering 2241 samples from TCGA. In our analysis we accounted for inherent variables in the data including the different variant calling protocols, sequencing platforms, and ethnicity. RESULTS We report on substantial batch effects in germline variants associated with cancer types. We attribute the effect to the specific sequencing centers that produced the data. Specifically, we measured 30% variability in the number of reported germline variants per sample across sequencing centers. The batch effect is further expressed in nucleotide composition and variant frequencies. Importantly, the batch effect causes substantial differences in germline variant distribution patterns across numerous genes, including prominent cancer predisposition genes such as BRCA1, RET, MAX, and KRAS. For most known cancer predisposition genes, we found a distinct batch-dependent difference in germline variants. CONCLUSION TCGA germline data are subject to strong batch effects, with substantial variability among TCGA sequencing centers. We claim that those batch effects are consequential for numerous TCGA pan-cancer studies. In particular, these effects may compromise the reliability of such studies and their power to detect new cancer predisposition genes. Furthermore, interpretation of pan-cancer analyses should be revisited in view of the source of the genomic data after accounting for the reported batch effects.
Affiliation(s)
- Roni Rasnic
  - The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
- Nadav Brandes
  - The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
- Or Zuk
  - Department of Statistics, The Hebrew University of Jerusalem, Jerusalem, Israel
- Michal Linial
  - Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
6
Wang Y, Li G, Ma M, He F, Song Z, Zhang W, Wu C. GT-WGS: an efficient and economic tool for large-scale WGS analyses based on the AWS cloud service. BMC Genomics 2018; 19:959. [PMID: 29363427 PMCID: PMC5780748 DOI: 10.1186/s12864-017-4334-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Whole-genome sequencing (WGS) plays an increasingly important role in clinical practice and public health. Because of the large data size, WGS data analysis is usually compute-intensive and IO-intensive. Currently it usually takes 30 to 40 h to finish a 50× WGS analysis task, which is far from the ideal speed required by the industry. Furthermore, the high-end infrastructure required by WGS computing is costly in terms of time and money. In this paper, we aim to improve the time efficiency of WGS analysis and minimize the cost by elastic cloud computing. RESULTS We developed a distributed system, GT-WGS, for large-scale WGS analyses utilizing Amazon Web Services (AWS). Our system won the first prize in the Wind and Cloud challenge held by the Genomics and Cloud Technology Alliance conference (GCTA) committee. The system makes full use of the dynamic pricing mechanism of AWS. We evaluated the performance of GT-WGS with a 55× WGS dataset (400GB fastq) provided by the GCTA 2017 competition. In the best case, it took only 18.4 min to finish the analysis, and the AWS cost of the whole process was only 16.5 US dollars. The results of GT-WGS are 99.9% consistent with those of the Genome Analysis Toolkit (GATK) best practice. We also evaluated the performance of GT-WGS on a real-world dataset provided by the XiangYa hospital, which consists of a 5× whole-genome dataset with 500 samples; on average, GT-WGS finished one 5× WGS analysis task in 2.4 min at a cost of $3.6. CONCLUSIONS WGS is already playing an important role in guiding therapeutic intervention. However, its application is limited by the time cost and computing cost. GT-WGS excelled as an efficient and affordable WGS analysis tool to address this problem. The demo video and supplementary materials of GT-WGS can be accessed at https://github.com/Genetalks/wgs_analysis_demo .
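To put the figures quoted in this abstract in perspective, the following Python sketch works through the arithmetic. It is only a back-of-the-envelope illustration: the function names are hypothetical, the cohort totals assume the 500 samples are processed one after another at the reported per-sample averages, and the speedup compares a 50× baseline with a 55× run, so it is a rough ratio rather than a like-for-like benchmark.

```python
# Figures taken from the abstract above.
SAMPLES = 500              # 5x whole-genome samples in the hospital dataset
MINUTES_PER_SAMPLE = 2.4   # reported average analysis time per 5x sample
USD_PER_SAMPLE = 3.6       # reported average cost per 5x sample
BEST_CASE_MINUTES = 18.4   # reported best case for the 55x dataset


def cohort_totals(samples, minutes_per_sample, usd_per_sample):
    """Return (total hours, total USD), assuming sequential processing."""
    total_hours = samples * minutes_per_sample / 60
    total_usd = samples * usd_per_sample
    return total_hours, total_usd


def speedup(baseline_hours, observed_minutes):
    """Ratio of a baseline runtime (hours) to an observed runtime (minutes)."""
    return baseline_hours * 60 / observed_minutes


hours, usd = cohort_totals(SAMPLES, MINUTES_PER_SAMPLE, USD_PER_SAMPLE)
print(f"500-sample cohort: about {hours:.0f} h of analysis, about ${usd:.0f}")
print(f"Speedup vs 30-40 h baseline: {speedup(30, BEST_CASE_MINUTES):.0f}x "
      f"to {speedup(40, BEST_CASE_MINUTES):.0f}x")
```

Under these assumptions the whole cohort corresponds to roughly 20 hours of analysis and about 1,800 US dollars, and the best-case run is on the order of a hundred times faster than the 30 to 40 h baseline.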
Affiliation(s)
- Yiqi Wang
  - School of Computer Science, National University of Defense Technology, Changsha, 410000, China
- Gen Li
  - Genetalks Biotech. Co., Ltd, Beijing, 100000, China
- Mark Ma
  - Genetalks Biotech. Co., Ltd, Beijing, 100000, China
- Fazhong He
  - Department of Clinical Pharmacology, Xiangya Hospital, Central South University, Changsha, 410000, China
- Zhuo Song
  - Genetalks Biotech. Co., Ltd, Beijing, 100000, China
- Wei Zhang
  - Department of Clinical Pharmacology, Xiangya Hospital, Central South University, Changsha, 410000, China
- Chengkun Wu
  - School of Computer Science, National University of Defense Technology, Changsha, 410000, China
7
Whole-exome sequencing and microRNA profiling reveal PI3K/AKT pathway’s involvement in juvenile myelomonocytic leukemia. Quantitative Biology 2018. [DOI: 10.1007/s40484-017-0125-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 02/07/2023]
8
Mashl RJ, Scott AD, Huang KL, Wyczalkowski MA, Yoon CJ, Niu B, DeNardo E, Yellapantula VD, Handsaker RE, Chen K, Koboldt DC, Ye K, Fenyö D, Raphael BJ, Wendl MC, Ding L. GenomeVIP: a cloud platform for genomic variant discovery and interpretation. Genome Res 2017; 27:1450-1459. [PMID: 28522612 PMCID: PMC5538560 DOI: 10.1101/gr.211656.116] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 06/21/2016] [Accepted: 05/03/2017] [Indexed: 12/12/2022]
Abstract
Identifying genomic variants is a fundamental first step toward the understanding of the role of inherited and acquired variation in disease. The accelerating growth in the corpus of sequencing data that underpins such analysis is making the data-download bottleneck more evident, placing substantial burdens on the research community to keep pace. As a result, the search for alternative approaches to the traditional “download and analyze” paradigm on local computing resources has led to a rapidly growing demand for cloud-computing solutions for genomics analysis. Here, we introduce the Genome Variant Investigation Platform (GenomeVIP), an open-source framework for performing genomics variant discovery and annotation using cloud- or local high-performance computing infrastructure. GenomeVIP orchestrates the analysis of whole-genome and exome sequence data using a set of robust and popular task-specific tools, including VarScan, GATK, Pindel, BreakDancer, Strelka, and Genome STRiP, through a web interface. GenomeVIP has been used for genomic analysis in large-data projects such as the TCGA PanCanAtlas and in other projects, such as the ICGC Pilots, CPTAC, ICGC-TCGA DREAM Challenges, and the 1000 Genomes SV Project. Here, we demonstrate GenomeVIP's ability to provide high-confidence annotated somatic, germline, and de novo variants of potential biological significance using publicly available data sets.
Affiliation(s)
- R Jay Mashl
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
- Adam D Scott
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
- Kuan-Lin Huang
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
- Christopher J Yoon
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
- Beifang Niu
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
- Erin DeNardo
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
- Venkata D Yellapantula
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
- Robert E Handsaker
  - Stanley Center for Psychiatric Research, Broad Institute, Cambridge, Massachusetts 02142, USA
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA
- Ken Chen
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA
- Daniel C Koboldt
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
- Kai Ye
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
- David Fenyö
  - Langone Medical Center, New York University, New York, New York 10016, USA
- Benjamin J Raphael
  - Department of Computer Science and Center for Computational Molecular Biology, Brown University, Providence, Rhode Island 02912, USA
- Michael C Wendl
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Department of Genetics, Washington University, St. Louis, Missouri 63108, USA
  - Department of Mathematics, Washington University, St. Louis, Missouri 63108, USA
- Li Ding
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
  - Department of Genetics, Washington University, St. Louis, Missouri 63108, USA
  - Siteman Cancer Center, Washington University, St. Louis, Missouri 63108, USA
9
He KY, Ge D, He MM. Big Data Analytics for Genomic Medicine. Int J Mol Sci 2017; 18:ijms18020412. [PMID: 28212287 PMCID: PMC5343946 DOI: 10.3390/ijms18020412] [Citation(s) in RCA: 104] [Impact Index Per Article: 14.9] [Received: 10/24/2016] [Revised: 02/08/2017] [Accepted: 02/09/2017] [Indexed: 12/25/2022] Open
Abstract
Genomic medicine attempts to build individualized strategies for diagnostic or therapeutic decision-making by utilizing patients’ genomic information. Big Data analytics uncovers hidden patterns, unknown correlations, and other insights by examining large and varied data sets. While integrating and manipulating diverse genomic data and comprehensive electronic health records (EHRs) on a Big Data infrastructure presents challenges, it also provides a feasible opportunity to develop an efficient and effective approach to identify clinically actionable genetic variants for individualized diagnosis and therapy. In this paper, we review the challenges of manipulating large-scale next-generation sequencing (NGS) data and diverse clinical data derived from EHRs for genomic medicine. We introduce possible solutions for different challenges in manipulating, managing, and analyzing genomic and clinical data to implement genomic medicine. We also present a practical Big Data toolset for identifying clinically actionable genetic variants using high-throughput NGS data and EHRs.
Affiliation(s)
- Karen Y He
  - Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106, USA
- Max M He
  - BioSciKin Co., Ltd., Nanjing 210042, China
  - Computation and Informatics in Biology and Medicine, University of Wisconsin-Madison, Madison, WI 53706, USA
10
MC-GenomeKey: a multicloud system for the detection and annotation of genomic variants. BMC Bioinformatics 2017; 18:49. [PMID: 28107819 PMCID: PMC5248509 DOI: 10.1186/s12859-016-1454-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 05/20/2016] [Accepted: 12/24/2016] [Indexed: 12/28/2022] Open
Abstract
Background Next-generation genome sequencing techniques have become affordable for massive sequencing efforts devoted to the clinical characterization of human diseases. However, the cost of providing cloud-based data analysis of the mounting datasets remains a concerning bottleneck for providing cost-effective clinical services. To address this computational problem, it is important to optimize the variant analysis workflow and the analysis tools used, in order to reduce the overall computational processing time and, concomitantly, the processing cost. Furthermore, it is important to capitalize on recent developments in the cloud computing market, which has witnessed more providers competing in terms of products and prices. Results In this paper, we present a new package called MC-GenomeKey (Multi-Cloud GenomeKey) that efficiently executes the variant analysis workflow for detecting and annotating mutations using cloud resources from different commercial cloud providers. Our package supports Amazon, Google, and Azure clouds, as well as any other cloud platform based on OpenStack. Our package allows different scenarios of execution with different levels of sophistication, up to the one where a workflow can be executed using a cluster whose nodes come from different clouds. MC-GenomeKey also supports scenarios that exploit the spot instance model of Amazon in combination with other cloud platforms to provide significant cost reduction. To the best of our knowledge, this is the first solution that optimizes the execution of the workflow using computational resources from different cloud providers. Conclusions MC-GenomeKey provides an efficient multicloud-based solution to detect and annotate mutations. The package can run on different commercial cloud platforms, which enables the user to seize the best offers. The package also provides a reliable means of using the low-cost spot instance model of Amazon, as it provides an efficient solution to the sudden termination of spot machines as a result of a sudden price increase. The package has a web interface and is available for free for academic use.
11
Roy-Chowdhuri S, Roy S, Monaco SE, Routbort MJ, Pantanowitz L. Big data from small samples: Informatics of next-generation sequencing in cytopathology. Cancer Cytopathol 2016; 125:236-244. [PMID: 27918649 DOI: 10.1002/cncy.21805] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 09/02/2016] [Revised: 10/13/2016] [Accepted: 10/17/2016] [Indexed: 12/12/2022]
Abstract
The rapid adoption of next-generation sequencing (NGS) in clinical molecular laboratories has redefined the practice of cytopathology. Instead of simply being used as a diagnostic tool, cytopathology has evolved into a practice providing important genomic information that guides clinical management. The recent emphasis on maximizing limited-volume cytology samples for ancillary molecular studies, including NGS, requires cytopathologists not only to be more involved in specimen collection and processing techniques but also to be aware of downstream testing and informatics issues. For the integration of molecular informatics into the clinical workflow, it is important to understand the computational components of the NGS workflow by which raw sequence data are transformed into clinically actionable genomic information and to address the challenges of having a robust and sustainable informatics infrastructure for NGS-based testing in a clinical environment. Adapting to needs ranging from specimen procurement to report delivery is crucial for the optimal utilization of cytology specimens to accommodate requests from clinicians to improve patient care. This review presents a broad overview of the various aspects of informatics in the context of NGS-based testing of cytology specimens. Cancer Cytopathol 2017;125:236-244. © 2016 American Cancer Society.
Affiliation(s)
- Sinchita Roy-Chowdhuri
  - Division of Pathology and Laboratory Medicine, The University of Texas MD Anderson Cancer Center, Houston, Texas
- Somak Roy
  - Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania
- Sara E Monaco
  - Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania
- Mark J Routbort
  - Division of Pathology and Laboratory Medicine, The University of Texas MD Anderson Cancer Center, Houston, Texas
- Liron Pantanowitz
  - Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania
|
12
|
Tebani A, Afonso C, Marret S, Bekri S. Omics-Based Strategies in Precision Medicine: Toward a Paradigm Shift in Inborn Errors of Metabolism Investigations. Int J Mol Sci 2016; 17:ijms17091555. [PMID: 27649151 PMCID: PMC5037827 DOI: 10.3390/ijms17091555] [Citation(s) in RCA: 105] [Impact Index Per Article: 13.1] [Received: 07/27/2016] [Revised: 09/06/2016] [Accepted: 09/07/2016] [Indexed: 12/20/2022] Open
Abstract
The rise of technologies that simultaneously measure thousands of data points represents the heart of systems biology. These technologies have had a huge impact on the discovery of next-generation diagnostics, biomarkers, and drugs in the precision medicine era. Systems biology aims to achieve systemic exploration of complex interactions in biological systems. Driven by high-throughput omics technologies and the computational surge, it enables multi-scale and insightful overviews of cells, organisms, and populations. Precision medicine capitalizes on these conceptual and technological advancements and stands on two main pillars: data generation and data modeling. High-throughput omics technologies allow the retrieval of comprehensive and holistic biological information, whereas computational capabilities enable high-dimensional data modeling and, therefore, accessible and user-friendly visualization. Furthermore, bioinformatics has enabled comprehensive multi-omics and clinical data integration for insightful interpretation. Despite their promise, the translation of these technologies into clinically actionable tools has been slow. In this review, we present state-of-the-art multi-omics data analysis strategies in a clinical context. The challenges of omics-based biomarker translation are discussed. Perspectives regarding the use of multi-omics approaches for inborn errors of metabolism (IEM) are presented by introducing a new paradigm shift in addressing IEM investigations in the post-genomic era.
Affiliation(s)
- Abdellah Tebani
  - Department of Metabolic Biochemistry, Rouen University Hospital, 76031 Rouen, France.
  - Normandie University, UNIROUEN, INSERM, CHU Rouen, Laboratoire NeoVasc ERI28, 76000 Rouen, France.
  - Normandie University, UNIROUEN, INSA Rouen, CNRS, COBRA, 76000 Rouen, France.
- Carlos Afonso
  - Normandie University, UNIROUEN, INSA Rouen, CNRS, COBRA, 76000 Rouen, France.
- Stéphane Marret
  - Normandie University, UNIROUEN, INSERM, CHU Rouen, Laboratoire NeoVasc ERI28, 76000 Rouen, France.
  - Department of Neonatal Pediatrics, Intensive Care and Neuropediatrics, Rouen University Hospital, 76031 Rouen, France.
- Soumeya Bekri
  - Department of Metabolic Biochemistry, Rouen University Hospital, 76031 Rouen, France.
  - Normandie University, UNIROUEN, INSERM, CHU Rouen, Laboratoire NeoVasc ERI28, 76000 Rouen, France.
|
13
|
Huang Z, Rustagi N, Veeraraghavan N, Carroll A, Gibbs R, Boerwinkle E, Venkata MG, Yu F. A hybrid computational strategy to address WGS variant analysis in >5000 samples. BMC Bioinformatics 2016; 17:361. [PMID: 27612449 PMCID: PMC5018196 DOI: 10.1186/s12859-016-1211-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 02/24/2016] [Accepted: 08/25/2016] [Indexed: 11/22/2022] Open
Abstract
Background: The decreasing costs of sequencing are driving the need for cost effective and real time variant calling of whole genome sequencing data. The scale of these projects is far beyond the capacity of typical computing resources available with most research labs. Other infrastructures like the cloud AWS environment and supercomputers also have limitations due to which large scale joint variant calling becomes infeasible, and infrastructure specific variant calling strategies either fail to scale up to large datasets or abandon joint calling strategies. Results: We present a high throughput framework including multiple variant callers for single nucleotide variant (SNV) calling, which leverages hybrid computing infrastructure consisting of cloud AWS, supercomputers and local high performance computing infrastructures. We present a novel binning approach for large scale joint variant calling and imputation which can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis on Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset in which joint calling, imputation and phasing of over 5300 whole genome samples were completed in under 6 weeks using four state-of-the-art callers. The callers used were SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. We used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL) for the computation. AWS was used for joint calling of 180 TB of BAM files, and ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and only transferred a total of 6 TB of data across the platforms.
Conclusions: Even with increasing sizes of whole genome datasets, ensemble joint calling of SNVs for low coverage data can be accomplished in a scalable, cost effective and fast manner by using heterogeneous computing platforms without compromising on the quality of variants.
Affiliation(s)
- Zhuoyi Huang
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Navin Rustagi
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Richard Gibbs
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Eric Boerwinkle
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
  - Human Genetics Center, University of Texas Health Science Center, Houston, TX, USA
- Fuli Yu
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
|
14
|
Leelananda SP, Kloczkowski A, Jernigan RL. Fold-specific sequence scoring improves protein sequence matching. BMC Bioinformatics 2016; 17:328. [PMID: 27578239 PMCID: PMC5006591 DOI: 10.1186/s12859-016-1198-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/21/2016] [Accepted: 08/24/2016] [Indexed: 11/10/2022] Open
Abstract
Background: Sequence matching is extremely important for applications throughout biology, particularly for discovering information such as functional and evolutionary relationships, and also for discriminating between unimportant and disease mutants. At present the functions of a large fraction of genes are unknown; improvements in sequence matching will improve gene annotations. Universal amino acid substitution matrices such as Blosum62 are used to measure sequence similarities and to identify distant homologues, regardless of the structure class. However, such single matrices do not take into account important structural information evident within the different topologies of proteins and treat substitutions within all protein folds identically. Others have suggested that the use of structural information can lead to significant improvements in sequence matching but this has not yet been very effective. Here we develop novel substitution matrices that include not only general sequence information but also have a topology specific component that is unique for each CATH topology. This novel feature of using a combination of sequence and structure information for each protein topology significantly improves the sequence matching scores for the sequence pairs tested. We have used a novel multi-structure alignment method for each homology level of CATH in order to extract topological information. Results: We obtain statistically significant improved sequence matching scores for 73 % of the alpha helical test cases. On average, 61 % of the test cases showed improvements in homology detection when structure information was incorporated into the substitution matrices. On average z-scores for homology detection are improved by more than 54 % for all cases, and some individual cases have z-scores more than twice those obtained using generic matrices.
Our topology specific similarity matrices also outperform other traditional similarity matrices and single matrix based structure methods. When the default amino acid substitution matrix in the Psi-blast algorithm is replaced by our structure-based matrices, the structure matching is significantly improved over conventional Psi-blast. It also outperforms results obtained for the corresponding HMM profiles generated for each topology. Conclusions: We show that by incorporating topology-specific structure information in addition to sequence information into specific amino acid substitution matrices, the sequence matching scores and homology detection are significantly improved. Our topology specific similarity matrices outperform other traditional similarity matrices and single matrix based structure methods, and also show improvement over conventional Psi-blast and HMM profile based methods in sequence matching. The results support the discriminatory ability of the new amino acid similarity matrices to distinguish between distant homologs and structurally dissimilar pairs.
Affiliation(s)
- Sumudu P Leelananda
  - Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA
  - Laurence H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA
  - Present Address: 2120 Newman and Wolfrom Laboratory, The Ohio State University, 100 W 18th Ave, Columbus, OH, 43210, USA
  - Present Address: Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, OH, 43205, USA
- Andrzej Kloczkowski
  - Present Address: Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, OH, 43205, USA
  - Present Address: Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, 43205, USA
- Robert L Jernigan
  - Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA
  - Laurence H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA
|
15
|
Menon R, Patel NV, Mohapatra A, Joshi CG. VDAP-GUI: a user-friendly pipeline for variant discovery and annotation of raw next-generation sequencing data. 3 Biotech 2016; 6:68. [PMID: 28330138 PMCID: PMC4754298 DOI: 10.1007/s13205-016-0382-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 06/19/2015] [Accepted: 10/15/2015] [Indexed: 12/03/2022] Open
Abstract
Even though next-generation sequencing (NGS) has become an invaluable tool in molecular biology, several laboratories with NGS facilities lack trained bioinformaticians for data analysis. Here, focusing on the variant detection application of NGS analysis, we have developed a fully automated pipeline, namely Variant Discovery and Annotation Tool-Graphical User Interface (VDAP-GUI), which detects and annotates single nucleotide polymorphisms and insertions/deletions from raw sequence reads. VDAP-GUI consolidates several proven methods in each step such as quality control, trimming, mapping, variant detection and annotation. It supports multiple NGS platforms and has four methodological choices for variant detection. Further, it can re-analyze existing data with alternate thresholds and generates easily interpretable reports in HTML and tab-delimited formats. Using VDAP-GUI, we have analyzed a publicly available human whole-exome sequence dataset. VDAP-GUI is developed using Perl/Tk programming, and is available for free download and use at http://sourceforge.net/projects/vdapgui/.
|
16
|
Roy S, LaFramboise WA, Nikiforov YE, Nikiforova MN, Routbort MJ, Pfeifer J, Nagarajan R, Carter AB, Pantanowitz L. Next-Generation Sequencing Informatics: Challenges and Strategies for Implementation in a Clinical Environment. Arch Pathol Lab Med 2016; 140:958-75. [PMID: 26901284 DOI: 10.5858/arpa.2015-0507-ra] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Indexed: 11/06/2022]
Abstract
CONTEXT: Next-generation sequencing (NGS) is revolutionizing the discipline of laboratory medicine, with a deep and direct impact on patient care. Although it empowers clinical laboratories with unprecedented genomic sequencing capability, NGS has brought along obvious and obtrusive informatics challenges. Bioinformatics and clinical informatics are separate disciplines with typically a small degree of overlap, but they have been brought together by the enthusiastic adoption of NGS in clinical laboratories. The result has been a collaborative environment for the development of novel informatics solutions. Sustaining NGS-based testing in a regulated clinical environment requires institutional support to build and maintain a practical, robust, scalable, secure, and cost-effective informatics infrastructure. OBJECTIVE: To discuss the novel NGS informatics challenges facing pathology laboratories today and offer solutions and future developments to address these obstacles. DATA SOURCES: The published literature pertaining to NGS informatics was reviewed. The coauthors, experts in the fields of molecular pathology, precision medicine, and pathology informatics, also contributed their experiences. CONCLUSIONS: The boundary between bioinformatics and clinical informatics has significantly blurred with the introduction of NGS into clinical molecular laboratories. Next-generation sequencing technology and the data derived from these tests, if managed well in the clinical laboratory, will redefine the practice of medicine. In order to sustain this progress, adoption of smart computing technology will be essential. Computational pathologists will be expected to play a major role in rendering diagnostic and theranostic services by leveraging "Big Data" and modern computing tools.
Affiliation(s)
- Liron Pantanowitz
  - From the Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania (Drs Roy, LaFramboise, Nikiforov, Nikiforova, and Pantanowitz); the Department of Pathology, MD Anderson Cancer Center, Houston, Texas (Dr Routbort); the Department of Pathology and Immunology, Washington University School of Medicine, St Louis, Missouri (Drs Pfeifer and Nagarajan); PierianDx, St Louis, Missouri (Dr Nagarajan); and the Department of Pathology and Laboratory Medicine, Children's Healthcare of Atlanta, Atlanta, Georgia (Dr Carter)
|
17
|
Reisman S, Hatzopoulos T, Läufer K, Thiruvathukal GK, Putonti C. A Polyglot Approach to Bioinformatics Data Integration: A Phylogenetic Analysis of HIV-1. Evol Bioinform Online 2016; 12:23-7. [PMID: 26819543 PMCID: PMC4718148 DOI: 10.4137/ebo.s32757] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/09/2015] [Revised: 10/18/2015] [Accepted: 10/25/2015] [Indexed: 02/04/2023] Open
Abstract
As sequencing technologies continue to drop in price and increase in throughput, new challenges emerge for the management and accessibility of genomic sequence data. We have developed a pipeline for facilitating the storage, retrieval, and subsequent analysis of molecular data, integrating both sequence and metadata. Taking a polyglot approach involving multiple languages, libraries, and persistence mechanisms, sequence data can be aggregated from publicly available and local repositories. Data are exposed in the form of a RESTful web service, formatted for easy querying, and retrieved for downstream analyses. As a proof of concept, we have developed a resource for annotated HIV-1 sequences. Phylogenetic analyses were conducted for >6,000 HIV-1 sequences, revealing that spatial and temporal factors influence the evolution of the individual genes uniquely. Nevertheless, signatures of origin can be extrapolated despite increased globalization. The approach developed here can easily be customized for any species of interest.
Affiliation(s)
- Steven Reisman
  - Bioinformatics Program, Loyola University Chicago, Chicago, IL, USA
  - Department of Computer Science, Loyola University Chicago, Chicago, IL, USA
  - Department of Biology, Loyola University Chicago, Chicago, IL, USA
- Thomas Hatzopoulos
  - Bioinformatics Program, Loyola University Chicago, Chicago, IL, USA
  - Department of Computer Science, Loyola University Chicago, Chicago, IL, USA
- Konstantin Läufer
  - Bioinformatics Program, Loyola University Chicago, Chicago, IL, USA
  - Department of Computer Science, Loyola University Chicago, Chicago, IL, USA
- George K Thiruvathukal
  - Bioinformatics Program, Loyola University Chicago, Chicago, IL, USA
  - Department of Computer Science, Loyola University Chicago, Chicago, IL, USA
- Catherine Putonti
  - Bioinformatics Program, Loyola University Chicago, Chicago, IL, USA
  - Department of Computer Science, Loyola University Chicago, Chicago, IL, USA
  - Department of Biology, Loyola University Chicago, Chicago, IL, USA
|
18
|
A Comprehensive Review of Emerging Computational Methods for Gene Identification. JOURNAL OF INFORMATION PROCESSING SYSTEMS 2016. [DOI: 10.3745/jips.04.0023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/03/2022]
|
19
|
Kovatch P, Costa A, Giles Z, Fluder E, Cho HM, Mazurkova S. Big Omics Data Experience. SC ... CONFERENCE PROCEEDINGS. SC (CONFERENCE : SUPERCOMPUTING) 2015; 2015. [PMID: 30788464 DOI: 10.1145/2807591.2807595] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/14/2022]
Abstract
As personalized medicine becomes more integrated into healthcare, the rate at which human genomes are being sequenced is rising quickly together with a concomitant acceleration in compute and storage requirements. To achieve the most effective solution for genomic workloads without re-architecting the industry-standard software, we performed a rigorous analysis of usage statistics, benchmarks and available technologies to design a system for maximum throughput. We share our experiences designing a system optimized for the "Genome Analysis ToolKit (GATK) Best Practices" whole genome DNA and RNA pipeline based on an evaluation of compute, workload and I/O characteristics. The characteristics of genomic-based workloads are vastly different from those of traditional HPC workloads, requiring different configurations of the scheduler and the I/O subsystem to achieve reliability, performance and scalability. By understanding how our researchers and clinicians work, we were able to employ techniques not only to speed up their workflow yielding improved and repeatable performance, but also to make more efficient use of storage and compute resources.
Affiliation(s)
- Patricia Kovatch
  - Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY 10029, 212-241-6500
- Anthony Costa
  - Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY 10029, 212-241-6500
- Zachary Giles
  - Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY 10029, 212-241-6500
- Eugene Fluder
  - Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY 10029, 212-241-6500
- Hyung Min Cho
  - Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY 10029, 212-241-6500
- Svetlana Mazurkova
  - Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY 10029, 212-241-6500
|
20
|
Shringarpure SS, Carroll A, De La Vega FM, Bustamante CD. Inexpensive and Highly Reproducible Cloud-Based Variant Calling of 2,535 Human Genomes. PLoS One 2015; 10:e0129277. [PMID: 26110529 PMCID: PMC4482534 DOI: 10.1371/journal.pone.0129277] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 01/15/2015] [Accepted: 05/06/2015] [Indexed: 01/22/2023] Open
Abstract
Population scale sequencing of whole human genomes is becoming economically feasible; however, data management and analysis remain a formidable challenge for many research groups. Large sequencing studies, like the 1000 Genomes Project, have improved our understanding of human demography and the effect of rare genetic variation in disease. Variant calling on datasets of hundreds or thousands of genomes is time-consuming, expensive, and not easily reproducible given the myriad components of a variant calling pipeline. Here, we describe a cloud-based pipeline for joint variant calling in large samples using the Real Time Genomics population caller. We deployed the population caller on the Amazon cloud with the DNAnexus platform in order to achieve low-cost variant calling. Using our pipeline, we were able to identify 68.3 million variants in 2,535 samples from Phase 3 of the 1000 Genomes Project. By performing the variant calling in a parallel manner, the data were processed within 5 days at a compute cost of $7.33 per sample (a total cost of $18,590 for completed jobs and $21,805 for all jobs). Analysis of cost dependence and running time on the data size suggests that, given near linear scalability, cloud computing can be a cheap and efficient platform for analyzing even larger sequencing studies in the future.
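The per-sample figure quoted in this abstract can be checked with simple arithmetic. The sketch below is illustrative only: the function name `cost_per_sample` is hypothetical, and the dollar totals and sample count are taken directly from the abstract. It suggests that the $7.33 figure corresponds to the completed-jobs total, while counting all jobs would give a somewhat higher average.

```python
def cost_per_sample(total_cost_usd: float, n_samples: int) -> float:
    """Return the average compute cost per sample, rounded to cents."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return round(total_cost_usd / n_samples, 2)


# Figures quoted in the abstract (not recomputed from any raw billing data).
N_SAMPLES = 2535
completed_jobs = cost_per_sample(18590, N_SAMPLES)  # completed jobs only
all_jobs = cost_per_sample(21805, N_SAMPLES)        # including failed or rerun jobs

print(f"completed jobs: ${completed_jobs:.2f} per sample")
print(f"all jobs:       ${all_jobs:.2f} per sample")
```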
Affiliation(s)
- Francisco M. De La Vega
  - Department of Genetics, Stanford University, Stanford, California 94305, USA
  - Real Time Genomics, Inc. San Bruno, California 94066, USA
- Carlos D. Bustamante
  - Department of Genetics, Stanford University, Stanford, California 94305, USA
|
21
|
Ma J, Purcell H, Showalter L, Aagaard KM. Mitochondrial DNA sequence variation is largely conserved at birth with rare de novo mutations in neonates. Am J Obstet Gynecol 2015; 212:530.e1-8. [PMID: 25687567 DOI: 10.1016/j.ajog.2015.02.009] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 11/13/2014] [Revised: 01/29/2015] [Accepted: 02/09/2015] [Indexed: 12/21/2022]
Abstract
OBJECTIVE: Mitochondrial DNA (mtDNA) encodes the proteins of the electron transfer chain to produce adenosine triphosphate through oxidative phosphorylation, and is essential to sustain life. mtDNA differs from the nuclear genome inasmuch as it is solely maternally inherited (non-mendelian patterning), and shows a relatively high rate of mutation due to the absence of error checking capacity. While it is generally assumed that most new mutations accumulate through the process of heteroplasmy, it is unknown whether mutations initiated in the mother are inherited, occur in utero, or occur and accumulate early in life. The purpose of this study is to examine the maternally heritable and de novo mutation rate in the fetal mtDNA through high-fidelity sequencing from a large population-based cohort. STUDY DESIGN: Samples were obtained from 90 matched maternal (blood) and fetal (placental) pairs. In addition, a smaller cohort (n = 5) of maternal (blood), fetal (placental), and neonatal (cord blood) trios were subjected to DNA extraction and shotgun sequencing. The whole genome was sequenced on the Illumina HiSeq platform (Illumina Inc., San Diego, CA), and haplogroups and mtDNA variants were identified through mapping to reference mitochondrial genomes (NC_012920). RESULTS: We observed 665 single nucleotide polymorphisms and 82 insertion-deletion variants in the cohort at large. We achieved high sequencing depth of the mtDNA, with an average coverage of 65X (range, 20-171X). The proportions of haplogroups identified in the cohort are consistent with the patients' self-identified ethnicity (>90% Hispanic), and all maternal-fetal pairs mapped to the identical haplogroup. Only variants from samples with average depth >20X and allele frequency >1% were included for further analysis.
While the majority of the maternal-fetal pairs (>90%) demonstrated identical variants at the single nucleotide level, we observed rare mitochondrial single nucleotide polymorphism discordance between maternal and fetal mitochondrial genomes. CONCLUSION: In this first in-depth sequencing analysis of mtDNA from maternal-fetal pairs at the time of birth, a low rate of de novo mutations appears in the fetal mitochondrial genome. This implies that these mutations likely arise from the maternal heteroplasmic pool (eg, in the oocyte), and accumulate later in the offspring's life. These findings have key implications for both the occurrence and screening for mitochondrial disorders.
|
22
|
Kelly BJ, Fitch JR, Hu Y, Corsmeier DJ, Zhong H, Wetzel AN, Nordquist RD, Newsom DL, White P. Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics. Genome Biol 2015; 16:6. [PMID: 25600152 PMCID: PMC4333267 DOI: 10.1186/s13059-014-0577-x] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.9] [Received: 10/21/2014] [Accepted: 12/23/2014] [Indexed: 12/18/2022] Open
Abstract
While advances in genome sequencing technology make population-scale genomics a possibility, current approaches for analysis of these data rely upon parallelization strategies that have limited scalability, complex implementation and lack reproducibility. Churchill, a balanced regional parallelization strategy, overcomes these challenges, fully automating the multiple steps required to go from raw sequencing reads to variant discovery. Through implementation of novel deterministic parallelization techniques, Churchill allows computationally efficient analysis of a high-depth whole genome sample in less than two hours. The method is highly scalable, enabling full analysis of the 1000 Genomes raw sequence dataset in a week using cloud resources. http://churchill.nchri.org/.
Affiliation(s)
- Peter White
  - Center for Microbial Pathogenesis, The Research Institute at Nationwide Children's Hospital, 700 Children's Drive, Columbus 43205, OH, USA
  - Department of Pediatrics, College of Medicine, The Ohio State University, Columbus, Ohio, USA
|
23
|
Kumar P, Al-Shafai M, Al Muftah WA, Chalhoub N, Elsaid MF, Aleem AA, Suhre K. Evaluation of SNP calling using single and multiple-sample calling algorithms by validation against array base genotyping and Mendelian inheritance. BMC Res Notes 2014; 7:747. [PMID: 25339461 PMCID: PMC4216909 DOI: 10.1186/1756-0500-7-747] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 05/06/2014] [Accepted: 10/03/2014] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND: With diminishing costs of next generation sequencing (NGS), whole genome analysis becomes a standard tool for identifying genetic causes of inherited diseases. Commercial NGS service providers in general not only provide raw genomic reads, but further deliver SNP calls to their clients. However, the question for the user arises whether to use the SNP data as is, or process the raw sequencing data further through more sophisticated SNP calling pipelines with more advanced algorithms. RESULTS: Here we report a detailed comparison of SNPs called using the popular GATK multiple-sample calling protocol to SNPs delivered as part of a 40x whole genome sequencing project by Illumina Inc of 171 human genomes of Arab descent (108 unrelated Qatari genomes, 19 trios, and 2 families with rare diseases) and compare them to variants provided by the Illumina CASAVA pipeline. GATK multi-sample calling identifies more variants than the CASAVA pipeline. The additional variants from GATK are robust for Mendelian consistencies but weak in terms of statistical parameters such as TsTv ratio. However, these additional variants do not make a difference in detecting the causative variants in the studied phenotype. CONCLUSION: Both pipelines, GATK multi-sample calling and Illumina CASAVA single sample calling, have highly similar performance in SNP calling at the level of putatively causative variants.
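The TsTv ratio mentioned in this abstract is the number of transitions (A<->G or C<->T) divided by the number of transversions (all other single-base changes) in a callset, and it is commonly used as a rough quality indicator. The following is a minimal sketch of that calculation, not the authors' code; the function name `ts_tv_ratio` and the toy input are assumptions made for illustration.

```python
BASES = {"A", "C", "G", "T"}
# Transitions swap purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snvs):
    """Compute the transition/transversion ratio for (ref, alt) allele pairs.

    Pairs that are not single-base substitutions (indels, ambiguous bases,
    identical alleles) are skipped.
    """
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ts / tv


# Toy callset: four transitions, two transversions, one indel that is ignored.
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"),
         ("A", "C"), ("G", "T"), ("AT", "A")]
print(ts_tv_ratio(calls))
```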
Affiliation(s)
- Karsten Suhre
  - Weill Cornell Medical College in Qatar, Education City, Doha, Qatar.
|
24
|
Madduri RK, Sulakhe D, Lacinski L, Liu B, Rodriguez A, Chard K, Dave UJ, Foster IT. Experiences Building Globus Genomics: A Next-Generation Sequencing Analysis Service using Galaxy, Globus, and Amazon Web Services. CONCURRENCY AND COMPUTATION : PRACTICE & EXPERIENCE 2014; 26:2266-2279. [PMID: 25342933 PMCID: PMC4203657 DOI: 10.1002/cpe.3274] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Indexed: 05/10/2023]
Abstract
We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing (NGS) genomic data. This system achieves a high degree of end-to-end automation that encompasses every stage of data analysis including initial data retrieval from remote sequencing centers or storage (via the Globus file transfer system); specification, configuration, and reuse of multi-step processing pipelines (via the Galaxy workflow system); creation of custom Amazon Machine Images and on-demand resource acquisition via a specialized elastic provisioner (on Amazon EC2); and efficient scheduling of these pipelines over many processors (via the HTCondor scheduler). The system allows biomedical researchers to perform rapid analysis of large NGS datasets in a fully automated manner, without software installation or a need for any local computing infrastructure. We report performance and cost results for some representative workloads.
Affiliation(s)
- Ravi K Madduri
- Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL
- Dinanath Sulakhe
- Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL
- Lukasz Lacinski
- Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL
- Bo Liu
- Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL
- Alex Rodriguez
- Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL
- Kyle Chard
- Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL
- Utpal J Dave
- Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL
- Ian T Foster
- Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL
25
Shyr C, Kushniruk A, Wasserman WW. Usability study of clinical exome analysis software: top lessons learned and recommendations. J Biomed Inform 2014; 51:129-36. [PMID: 24860971 DOI: 10.1016/j.jbi.2014.05.004] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 10/29/2013] [Revised: 04/30/2014] [Accepted: 05/06/2014] [Indexed: 10/25/2022]
Abstract
OBJECTIVES New DNA sequencing technologies have revolutionized the search for genetic disruptions. Targeted sequencing of all protein coding regions of the genome, called exome analysis, is actively used in research-oriented genetics clinics, with the transition to exomes as a standard procedure underway. This transition is challenging; identification of potentially causal mutation(s) amongst ∼10^6 variants requires specialized computation in combination with expert assessment. This study analyzes the usability of user interfaces for clinical exome analysis software. There are two study objectives: (1) To ascertain the key features of successful user interfaces for clinical exome analysis software based on the perspective of expert clinical geneticists, (2) To assess user-system interactions in order to reveal strengths and weaknesses of existing software, inform future design, and accelerate the clinical uptake of exome analysis. METHODS Surveys, interviews, and cognitive task analysis were performed for the assessment of two next-generation exome sequence analysis software packages. The subjects included ten clinical geneticists who interacted with the software packages using the "think aloud" method. Subjects' interactions with the software were recorded in their clinical office within an urban research and teaching hospital. All major user interface events (from the user interactions with the packages) were time-stamped and annotated with coding categories to identify usability issues in order to characterize desired features and deficiencies in the user experience. RESULTS We detected 193 usability issues, the majority of which concern interface layout and navigation, and the resolution of reports. Our study highlights gaps in specific software features typical within exome analysis. The clinicians perform best when the flow of the system is structured into well-defined yet customizable layers for incorporation within the clinical workflow. The results highlight opportunities to dramatically accelerate clinician analysis and interpretation of patient genomic data. CONCLUSION We present the first application of usability methods to evaluate software interfaces in the context of exome analysis. Our results highlight how the study of user responses can lead to identification of usability issues and challenges and reveal software reengineering opportunities for improving clinical next-generation sequencing analysis. While the evaluation focused on two distinctive software tools, the results are general and should inform active and future software development for genome analysis software. As large-scale genome analysis becomes increasingly common in healthcare, it is critical that efficient and effective software interfaces are provided to accelerate clinical adoption of the technology. Implications for improved design of such applications are discussed.
Affiliation(s)
- Casper Shyr
- Centre for Molecular Medicine and Therapeutics, Child & Family Research Institute, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada; Bioinformatics Graduate Program, University of British Columbia, 2329 West Mall, Vancouver, BC V6T 1Z4, Canada
- Andre Kushniruk
- School of Health Information Science, University of Victoria, 3800 Finnerty Rd., Victoria, BC V8P 5C2, Canada
- Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, Child & Family Research Institute, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada; Department of Medical Genetics, University of British Columbia, 2329 West Mall, Vancouver, BC V6T 1Z4, Canada.
26
Stormbow: A Cloud-Based Tool for Reads Mapping and Expression Quantification in Large-Scale RNA-Seq Studies. ISRN Bioinformatics 2013; 2013:481545. [PMID: 25937948 PMCID: PMC4393068 DOI: 10.1155/2013/481545] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 07/08/2013] [Accepted: 08/07/2013] [Indexed: 01/31/2023]
Abstract
RNA-Seq is becoming a promising replacement to microarrays in transcriptome profiling and differential gene expression study. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow has been tested by practically applying it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Utilizing Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost effective, and open-source based tool for large-scale RNA-Seq data analysis. Stormbow can be freely downloaded and can be used out of box to process Illumina RNA-Seq datasets.
27
Lin CF, Valladares O, Childress DM, Klevak E, Geller ET, Hwang YC, Tsai EA, Schellenberg GD, Wang LS. DRAW+SneakPeek: analysis workflow and quality metric management for DNA-seq experiments. Bioinformatics 2013; 29:2498-500. [PMID: 23943636 PMCID: PMC3777113 DOI: 10.1093/bioinformatics/btt422] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/17/2022] Open
Abstract
Summary: We report our new DRAW+SneakPeek software for DNA-seq analysis. DNA resequencing analysis workflow (DRAW) automates the workflow of processing raw sequence reads including quality control, read alignment and variant calling on high-performance computing facilities such as Amazon elastic compute cloud. SneakPeek provides an effective interface for reviewing dozens of quality metrics reported by DRAW, so users can assess the quality of data and diagnose problems in their sequencing procedures. Both DRAW and SneakPeek are freely available under the MIT license, and are available as Amazon machine images to be used directly on Amazon cloud with minimal installation. Availability: DRAW+SneakPeek is released under the MIT license and is available for academic and nonprofit use for free. The information about source code, Amazon machine images and instructions on how to install and run DRAW+SneakPeek locally and on Amazon elastic compute cloud is available at the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (http://www.niagads.org/) and Wang lab Web site (http://wanglab.pcbi.upenn.edu/). Contact: gerardsc@mail.med.upenn.edu or lswang@mail.med.upenn.edu
Affiliation(s)
- Chiao-Feng Lin
- Department of Pathology and Laboratory Medicine and Institute for Biomedical Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA, Department of Physics, University of Washington, Seattle, WA 98105, USA, Genomics and Computational Biology Graduate Group, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA and Department of Pathology and Laboratory Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA