1
Koh GCC, Nanda AS, Rinaldi G, Boushaki S, Degasperi A, Badja C, Pregnall AM, Zhao SJ, Chmelova L, Black D, Heskin L, Dias J, Young J, Memari Y, Shooter S, Czarnecki J, Brown MA, Davies HR, Zou X, Nik-Zainal S. A redefined InDel taxonomy provides insights into mutational signatures. Nat Genet 2025; 57:1132-1141. [PMID: 40210680 DOI: 10.1038/s41588-025-02152-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2024] [Accepted: 03/04/2025] [Indexed: 04/12/2025]
Abstract
Despite their deleterious effects, small insertions and deletions (InDels) have received far less attention than substitutions. Here we generated isogenic CRISPR-edited human cellular models of postreplicative repair dysfunction (PRRd), including individual and combined gene edits of DNA mismatch repair (MMR) and replicative polymerases (Pol ε and Pol δ). Unique, diverse InDel mutational footprints were revealed. However, the prevailing InDel classification framework was unable to discriminate these InDel signatures from background mutagenesis and from each other. To address this, we developed an alternative InDel classification system that considers flanking sequences and informative motifs (for example, longer homopolymers), enabling unambiguous InDel classification into 89 subtypes. Through focused characterization of seven tumor types from the 100,000 Genomes Project, we uncovered 37 InDel signatures; 27 were new. In addition to unveiling previously hidden biological insights, we also developed PRRDetect, a highly specific classifier of PRRd status in tumors, with potential implications for immunotherapies.
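The abstract mentions that flanking sequence and homopolymer length are informative for classifying InDels. As a rough illustration of that idea only, the Python sketch below labels a 1 bp deletion by the length of the homopolymer run it falls in. The function names and the `long_run` threshold are assumptions made for this example; this is not the 89-subtype taxonomy from the paper.

```python
def homopolymer_run_length(ref: str, pos: int) -> int:
    """Length of the run of identical bases in `ref` that contains index `pos`."""
    base = ref[pos]
    start = pos
    while start > 0 and ref[start - 1] == base:
        start -= 1
    end = pos
    while end < len(ref) - 1 and ref[end + 1] == base:
        end += 1
    return end - start + 1


def describe_single_base_deletion(ref: str, pos: int, long_run: int = 6) -> str:
    """Label a 1 bp deletion at 0-based `pos` by base and homopolymer context.

    `long_run` is an arbitrary illustrative threshold, not a value from the paper.
    """
    run = homopolymer_run_length(ref, pos)
    if run == 1:
        context = "non-repeat"
    elif run < long_run:
        context = "short homopolymer"
    else:
        context = "long homopolymer"
    return f"del {ref[pos]} in {context} (run={run})"


if __name__ == "__main__":
    print(describe_single_base_deletion("ACGTTTTTTTGCA", 5))
    print(describe_single_base_deletion("ACGTACGT", 2))
```

A real classifier would also need to handle insertions, multi-base events and left-alignment of the variant, which this sketch ignores.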
Affiliation(s)
- Gene Ching Chiek Koh
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
  - School of Medical and Life Sciences, Sunway University, Sunway City, Malaysia
- Arjun Scott Nanda
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Giuseppe Rinaldi
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Soraya Boushaki
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Andrea Degasperi
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Cherif Badja
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Andrew Marcel Pregnall
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
- Salome Jingchen Zhao
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Lucia Chmelova
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Daniella Black
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Laura Heskin
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- João Dias
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Jamie Young
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Yasin Memari
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Scott Shooter
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Jan Czarnecki
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Matthew Arthur Brown
  - Genomics England, Queen Mary University of London, Dawson Hall, Charterhouse Square, London, UK
- Helen Ruth Davies
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Xueqing Zou
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK
- Serena Nik-Zainal
  - Department of Genomic Medicine, School of Clinical Medicine, University of Cambridge, Cambridge, UK
  - Early Cancer Institute, Department of Oncology, University of Cambridge, Cambridge, UK

2
Sealock JM, Ivankovic F, Liao C, Chen S, Churchhouse C, Karczewski KJ, Howrigan DP, Neale BM. Tutorial: guidelines for quality filtering of whole-exome and whole-genome sequencing data for population-scale association analyses. Nat Protoc 2025:10.1038/s41596-025-01169-1. [PMID: 40155705 DOI: 10.1038/s41596-025-01169-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2024] [Accepted: 03/04/2025] [Indexed: 04/01/2025]
Abstract
Genetic sequencing technologies are powerful tools for identifying rare variants and genes associated with Mendelian and complex traits; indeed, whole-exome and whole-genome sequencing are increasingly popular methods for population-scale genetic studies. However, careful quality control steps should be taken to ensure study accuracy and reproducibility, and sequencing data require extensive quality filtering to delineate true variants from technical artifacts. Although processing standards are harmonized across pipelines to call variants from sequencing reads, there currently exists no standardized pipeline for conducting quality filtering on variant-level datasets for the purpose of population-scale association analysis. In this Tutorial, we discuss key quality control parameters, provide guidelines for conducting quality filtering of samples and variants, and compare commonly used software programs for quality control of samples, variants and genotypes from sequencing data. As sequencing data continue to gain popularity in genetic research, establishing standardized quality control practices is crucial to ensure consistent, reliable and reproducible results across studies.
Affiliation(s)
- Julia M Sealock
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Franjo Ivankovic
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Calwing Liao
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Siwei Chen
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Claire Churchhouse
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Konrad J Karczewski
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Daniel P Howrigan
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Benjamin M Neale
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA

3
Zhou X, Li J, Chen L, Guo M, Liang R, Pan Y. The genomic pattern of insertion/deletion variations during rice improvement. BMC Genomics 2024; 25:1263. [PMID: 39741238 DOI: 10.1186/s12864-024-11178-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2024] [Accepted: 12/20/2024] [Indexed: 01/02/2025] Open
Abstract
BACKGROUND: Rice is one of the most important staple crops, and its genetic improvement plays a crucial role in agricultural production and food security. Although extensive research has used single nucleotide polymorphism (SNP) data to explore the genetic basis of important agronomic traits in rice improvement, reports on the role of other types of variation, such as insertions and deletions (INDELs), are still limited.
RESULTS: In this study, we extracted INDELs from resequencing data of 148 improved rice varieties. We identified 938,585 INDELs and found that the number of variants decreases as variant length increases, with 89.0% of INDELs being 2-10 bp. The highest number of INDELs was found on chromosome 1, and the lowest on chromosome 10. INDELs were unevenly distributed across the genome, forming a total of 33 hotspot regions. Of all INDELs, 47.0% were located within 2 kb upstream or downstream of genes. Using phenotypic data for five agronomic traits (heading date, flag leaf length, flag leaf width, panicle number, and plant height) together with INDEL data in a genome-wide association study (GWAS), we identified 6,331 significant loci involving 157 cloned genes. Haplotype analysis of candidate genes revealed INDELs affecting important functional genes, such as OsMED25 and OsRRMh related to heading date, and MOC2 related to plant height.
CONCLUSIONS: Our work analyzed the variation patterns of INDELs in rice improvement and identified INDELs associated with agronomic traits. These results will provide valuable genetic and material resources for the genetic improvement of rice.
Affiliation(s)
- Xia Zhou
  - Urban Construction School, Beijing City University, Beijing, 101300, China
- Jilong Li
  - State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, Chinese Academy of Sciences, Beijing, 100093, China
  - China National Botanical Garden, Beijing, 100093, China
- Lei Chen
  - Rice Research Institute, Guangxi Key Laboratory of Rice Genetics and Breeding, Guangxi Academy of Agricultural Sciences, Nanning, 530007, China
- Minjie Guo
  - Peanut Institute, Kaifeng Academy of Agricultural and Forestry Sciences, Kaifeng, 475004, China
- Renmin Liang
  - Hechi Agricultural Science Research Institute, Guangxi Academy of Agricultural Sciences, Hechi, 546306, China
- Yinghua Pan
  - Rice Research Institute, Guangxi Key Laboratory of Rice Genetics and Breeding, Guangxi Academy of Agricultural Sciences, Nanning, 530007, China

4
Sarwal V, Niehus S, Ayyala R, Kim M, Sarkar A, Chang S, Lu A, Rajkumar N, Darci-Maher N, Littman R, Chhugani K, Soylev A, Comarova Z, Wesel E, Castellanos J, Chikka R, Distler MG, Eskin E, Flint J, Mangul S. A comprehensive benchmarking of WGS-based deletion structural variant callers. Brief Bioinform 2022; 23:bbac221. [PMID: 35753701 PMCID: PMC9294411 DOI: 10.1093/bib/bbac221] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/08/2021] [Revised: 04/30/2022] [Accepted: 05/11/2022] [Indexed: 01/10/2023] Open
Abstract
Advances in whole-genome sequencing (WGS) promise to enable accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from WGS data presents a substantial number of challenges, and a plethora of SV detection methods have been developed. Currently, evidence that investigators can use to select appropriate SV detection tools is lacking. In this article, we have evaluated the performance of SV detection tools on mouse and human WGS data using a comprehensive polymerase chain reaction-confirmed gold standard set of SVs and the genome-in-a-bottle variant set, respectively. In contrast to previous benchmarking studies, our gold standard dataset included a complete set of SVs, allowing us to report both the precision and the sensitivity of the SV detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as SV detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low- and ultralow-pass sequencing data as well as for different deletion length categories.
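To make the two reported metrics concrete, the Python sketch below computes precision and sensitivity for a set of deletion calls against a gold standard. The matching rule (same chromosome, both breakpoints within a tolerance `tol`) and all names are assumptions chosen for illustration; the abstract does not state the matching criterion the authors used.

```python
from typing import List, Tuple

Deletion = Tuple[str, int, int]  # (chromosome, start, end)


def matches(call: Deletion, truth: Deletion, tol: int) -> bool:
    """True if both breakpoints of `call` lie within `tol` bp of `truth`."""
    return (
        call[0] == truth[0]
        and abs(call[1] - truth[1]) <= tol
        and abs(call[2] - truth[2]) <= tol
    )


def precision_sensitivity(
    calls: List[Deletion], truths: List[Deletion], tol: int = 100
) -> Tuple[float, float]:
    """Greedy one-to-one matching of calls to gold-standard deletions."""
    matched_truth = set()
    true_positive_calls = 0
    for call in calls:
        for i, truth in enumerate(truths):
            if i not in matched_truth and matches(call, truth, tol):
                matched_truth.add(i)
                true_positive_calls += 1
                break
    # Precision: fraction of calls that are correct.
    precision = true_positive_calls / len(calls) if calls else 0.0
    # Sensitivity: fraction of gold-standard deletions that were found.
    sensitivity = len(matched_truth) / len(truths) if truths else 0.0
    return precision, sensitivity


if __name__ == "__main__":
    truths = [("chr1", 1000, 2000), ("chr1", 5000, 5600), ("chr2", 300, 900)]
    calls = [("chr1", 1010, 1990), ("chr2", 10000, 10500)]
    print(precision_sensitivity(calls, truths))
```

Sensitivity can only be computed because the gold standard is assumed complete, which is the point the abstract makes about its dataset.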
Affiliation(s)
- Varuni Sarwal
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
  - Indian Institute of Technology Delhi, Hauz Khas, New Delhi, Delhi 110016, India
- Sebastian Niehus
  - Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Str. 2, 10178 Berlin, Germany
  - Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Charitéplatz 1, 10117 Berlin, Germany
- Ram Ayyala
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Minyoung Kim
  - Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089
- Aditya Sarkar
  - School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, Kamand, Mandi, Himachal Pradesh 175001, India
- Sei Chang
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Angela Lu
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Neha Rajkumar
  - Department of Bioengineering, University of California Los Angeles, Los Angeles, CA 90095
- Nicholas Darci-Maher
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Russell Littman
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Karishma Chhugani
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089-9121
- Arda Soylev
  - Department of Computer Engineering, Konya Food and Agriculture University, Konya, Turkey
- Zoia Comarova
  - Department of Civil and Environmental Engineering, University of Southern California, Los Angeles, CA, United States
- Emily Wesel
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Jacqueline Castellanos
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Rahul Chikka
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Margaret G Distler
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
- Eleazar Eskin
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA 90095, USA
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, 695 Charles E. Young Drive South, Box 708822, Los Angeles, CA 90095, USA
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, 73-235 CHS, Los Angeles, CA 90095, USA
- Jonathan Flint
  - Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles, 760 Westwood Plaza, Los Angeles, CA 90095, USA
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089-9121

5
Kubicová Z, Roussel S, Félix B, Cabanová L. Genomic Diversity of Listeria monocytogenes Isolates From Slovakia (2010 to 2020). Front Microbiol 2021; 12:729050. [PMID: 34795648 PMCID: PMC8593459 DOI: 10.3389/fmicb.2021.729050] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/22/2021] [Accepted: 10/01/2021] [Indexed: 12/13/2022] Open
Abstract
Over the past 11 years, the Slovak National Reference Laboratory has collected a panel of 988 Listeria monocytogenes isolates in Slovakia, which were isolated from various food sectors (61%), food-processing environments (13.7%), animals with listeriosis symptoms (21.2%), and human cases (4.1%). We serotyped these isolates by the agglutination method, which revealed the highest prevalence (61.1%) for serotype 1/2a and the lowest (4.7%) for serotype 1/2c, although these represented the majority of isolates from the meat sector. The distribution of clonal complexes (CCs), analyzed on 176 isolates, demonstrated that CC11-ST451 (15.3%) was the most prevalent CC, particularly in food (14.8%) and animal isolates (17.5%). CC11-ST451, followed by CC7, CC14, and CC37, were the most prevalent CCs in the milk sector, and CC9 and CC8 in the meat sector. CC11-ST451 is probably widely distributed in Slovakia, mainly in the milk and dairy product sectors, posing a possible threat to public health. A potential indication of CC9 persistence was observed in one meat facility between 2014 and 2018, highlighting its general meat-related distribution and potential for persistence worldwide.
Affiliation(s)
- Zuzana Kubicová
  - State Veterinary and Food Institute (SVFI), Dolny Kubin, Slovakia
- Sophie Roussel
  - Maisons-Alfort Laboratory for Food Safety, Salmonella and Listeria Unit, University of Paris-Est, French Agency for Food, Environmental and Occupational Health & Safety (ANSES), Maisons-Alfort, France
- Benjamin Félix
  - Maisons-Alfort Laboratory for Food Safety, Salmonella and Listeria Unit, University of Paris-Est, French Agency for Food, Environmental and Occupational Health & Safety (ANSES), Maisons-Alfort, France
- Lenka Cabanová
  - State Veterinary and Food Institute (SVFI), Dolny Kubin, Slovakia

6
Zanti M, Michailidou K, Loizidou MA, Machattou C, Pirpa P, Christodoulou K, Spyrou GM, Kyriacou K, Hadjisavvas A. Performance evaluation of pipelines for mapping, variant calling and interval padding, for the analysis of NGS germline panels. BMC Bioinformatics 2021; 22:218. [PMID: 33910496 PMCID: PMC8080428 DOI: 10.1186/s12859-021-04144-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/16/2021] [Accepted: 04/15/2021] [Indexed: 11/10/2022] Open
Abstract
Background: Next-generation sequencing (NGS) represents a significant advancement in clinical genetics. However, its use creates several technical, data interpretation and management challenges. It is essential to follow a consistent data analysis pipeline to achieve the highest possible accuracy and avoid false variant calls. Herein, we aimed to compare the performance of twenty-eight combinations of NGS data analysis pipeline compartments, including short-read mapping (BWA-MEM, Bowtie2, Stampy), variant calling (GATK-HaplotypeCaller, GATK-UnifiedGenotyper, SAMtools) and interval padding (null, 50 bp, 100 bp) methods, along with a commercially available pipeline (BWA Enrichment, Illumina®). Fourteen germline DNA samples from breast cancer patients were sequenced using a targeted NGS panel approach and subjected to data analysis.
Results: We highlight that interval padding is required for the accurate detection of intronic variants, including spliceogenic pathogenic variants (PVs). In addition, using nearly default parameters, the BWA Enrichment algorithm failed to detect these spliceogenic PVs and a missense PV in the TP53 gene. We also recommend the BWA-MEM algorithm for sequence alignment, whereas variant calling should be performed using a combination of variant calling algorithms: GATK-HaplotypeCaller and SAMtools for the accurate detection of insertions/deletions, and GATK-UnifiedGenotyper for the efficient detection of single nucleotide variants.
Conclusions: These findings have important implications for the identification of clinically actionable variants through panel testing in a clinical laboratory setting, where dedicated bioinformatics personnel might not always be available. The results also reveal the necessity of improving the existing tools and/or developing new pipelines to generate more reliable and more consistent data.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04144-1.
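Interval padding, as evaluated in this abstract, extends each target region by a fixed number of bases so that nearby intronic positions fall inside the analysed intervals. The Python sketch below shows one way this could be done, assuming BED-like `(chromosome, start, end)` tuples; the function name and the merging of overlapping padded intervals are choices made for this example and are not taken from the paper.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), BED-like coordinates


def pad_intervals(intervals: List[Interval], padding: int) -> List[Interval]:
    """Extend each interval by `padding` bp on both sides and merge overlaps.

    Starts are clipped at 0. Ends are NOT clipped to chromosome length,
    because this sketch has no access to chromosome sizes.
    """
    padded = sorted(
        (chrom, max(0, start - padding), end + padding)
        for chrom, start, end in intervals
    )
    merged: List[Interval] = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


if __name__ == "__main__":
    targets = [("chr17", 1000, 1200), ("chr17", 1260, 1400), ("chr13", 500, 600)]
    for pad in (0, 50, 100):  # the three padding settings named in the abstract
        print(pad, pad_intervals(targets, pad))
```

With 50 bp of padding the two hypothetical chr17 targets merge, because the 60 bp gap between them is fully covered.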
Affiliation(s)
- Maria Zanti
  - Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
  - Bioinformatics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyriaki Michailidou
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
  - Biostatistics Unit, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Maria A Loizidou
  - Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
- Christina Machattou
  - Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Panagiota Pirpa
  - Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyproula Christodoulou
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
  - Neurogenetics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- George M Spyrou
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
  - Bioinformatics Department, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
- Kyriacos Kyriacou
  - Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus
- Andreas Hadjisavvas
  - Department of Electron Microscopy/Molecular Pathology, The Cyprus Institute of Neurology and Genetics, 2371, Nicosia, Cyprus
  - Cyprus School of Molecular Medicine, 2371, Nicosia, Cyprus

7
Zhang W, Ghosh D. A general approach to sensitivity analysis for Mendelian randomization. Statistics in Biosciences 2021; 13:34-55. [PMID: 33737984 DOI: 10.1007/s12561-020-09280-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 10/24/2022]
Abstract
Mendelian Randomization (MR) represents a class of instrumental variable methods using genetic variants. It has become popular in epidemiological studies to account for unmeasured confounders when estimating the effect of an exposure on an outcome. The success of Mendelian Randomization depends on three critical assumptions, which are difficult to verify. Therefore, sensitivity analysis methods are needed for evaluating results and drawing plausible conclusions. We propose a general and easy-to-apply approach to conducting sensitivity analysis for Mendelian Randomization studies. Bound et al. (1995) derived a formula for the asymptotic bias of the instrumental variable estimator. Based on their work, we derive a new sensitivity analysis formula. The parameters in the formula include sensitivity parameters such as the correlation between instruments and the unmeasured confounder, the direct effect of instruments on the outcome and the strength of the instruments. In our simulation studies, we examined our approach in various scenarios using either individual SNPs or an unweighted allele score as instruments. Using a previously published dataset from a bone mineral density study, we demonstrate that our proposed method is a useful tool for MR studies, and that investigators can combine their domain knowledge with our method to obtain bias-corrected results and draw informed conclusions on the scientific plausibility of their findings.
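For orientation, the standard textbook form of the asymptotic bias of a single-instrument IV estimator is given below. The notation (outcome Y, exposure X, instrument Z, error term ε) is assumed here for illustration; this is the general starting point for such analyses, not the new sensitivity formula derived in the paper.

```latex
% Assumed structural model (illustrative notation, not the paper's):
%   Y = \beta X + \varepsilon, with a single instrument Z.
\hat{\beta}_{\mathrm{IV}}
  = \frac{\widehat{\operatorname{Cov}}(Z, Y)}{\widehat{\operatorname{Cov}}(Z, X)},
\qquad
\operatorname{plim}\, \hat{\beta}_{\mathrm{IV}}
  = \beta + \frac{\operatorname{Cov}(Z, \varepsilon)}{\operatorname{Cov}(Z, X)}
```

The second term is zero only when the instrument is uncorrelated with the error term. It grows when Cov(Z, ε) is non-zero (for example through a confounder or a direct effect of the instrument on the outcome) or when Cov(Z, X) is small (a weak instrument), which corresponds to the kinds of sensitivity parameters the abstract lists.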
Affiliation(s)
- Weiming Zhang
  - Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, Colorado, U.S.A
- Debashis Ghosh
  - Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, Colorado, U.S.A

8
Hashimoto S, Noguchi E, Bando H, Miyadera H, Morii W, Nakamura T, Hara H. Neoantigen prediction in human breast cancer using RNA sequencing data. Cancer Sci 2021; 112:465-475. [PMID: 33155341 PMCID: PMC7780012 DOI: 10.1111/cas.14720] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 08/04/2020] [Revised: 10/08/2020] [Accepted: 11/02/2020] [Indexed: 12/30/2022] Open
Abstract
Neoantigens have attracted attention as biomarkers or therapeutic targets. However, accurate prediction of neoantigens is still challenging, particularly with respect to accuracy and cost. Variant detection using RNA sequencing (RNA-seq) data has been reported to be a low-accuracy but cost-effective tool, but the feasibility of RNA-seq data for neoantigen prediction has not been fully examined. In the present study, we used whole-exome sequencing (WES) and RNA-seq data of tumor and matched normal samples from six breast cancer patients to evaluate the utility of RNA-seq data instead of WES data in variant calling to detect neoantigen candidates. Somatic variants were called in three protocols using: (i) tumor and normal WES data (DNA method, Dm); (ii) tumor and normal RNA-seq data (RNA method, Rm); and (iii) a combination of tumor RNA-seq and normal WES data (Combination method, Cm). We found that the Rm had both high false-positive and high false-negative rates because this method depended greatly on the expression status of normal transcripts. When we compared the results of Dm with those of Cm, only 14% of the neoantigen candidates detected in Dm were identified in Cm, but the majority of the missed candidates lacked coverage or variant allele reads in the tumor RNA. In contrast, about 70% of the neoepitope candidates with higher expression and rich mutant transcripts could be detected in Cm. Our results showed that Cm could be an efficient and cost-effective approach to predict highly expressed neoantigens in tumor samples.
Affiliation(s)
- Sachie Hashimoto
  - Department of Breast and Endocrine Surgery, Graduate School of Comprehensive Human Sciences, University of Tsukuba, Ibaraki, Japan
- Emiko Noguchi
  - Department of Medical Genetics, Faculty of Medicine, University of Tsukuba, Ibaraki, Japan
- Hiroko Bando
  - Department of Breast and Endocrine Surgery, Faculty of Medicine, University of Tsukuba, Ibaraki, Japan
- Hiroko Miyadera
  - Department of Medical Genetics, Faculty of Medicine, University of Tsukuba, Ibaraki, Japan
- Wataru Morii
  - Department of Medical Genetics, Faculty of Medicine, University of Tsukuba, Ibaraki, Japan
- Takako Nakamura
  - Department of Medical Genetics, Faculty of Medicine, University of Tsukuba, Ibaraki, Japan
- Hisato Hara
  - Department of Breast and Endocrine Surgery, Faculty of Medicine, University of Tsukuba, Ibaraki, Japan

9
Skuza L, Filip E, Szućko I, Bocianowski J. SPInDel Analysis of the Non-Coding Regions of cpDNA as a More Useful Tool for the Identification of Rye (Poaceae: Secale) Species. Int J Mol Sci 2020; 21:ijms21249421. [PMID: 33321948 PMCID: PMC7762986 DOI: 10.3390/ijms21249421] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/21/2020] [Revised: 12/07/2020] [Accepted: 12/08/2020] [Indexed: 01/09/2023] Open
Abstract
Secale is a small but very diverse genus from the tribe Triticeae (family Poaceae), which includes annual, perennial, self-pollinating and open-pollinating, cultivated, weedy and wild species of various phenotypes. Despite its high economic importance, classification of this genus, comprising 3–8 species, is inconsistent. This has significantly slowed progress in the breeding of rye, which could be enriched with functional traits derived from wild rye species. Our previous research has suggested the utility of non-coding sequences of chloroplast and mitochondrial DNA in studies on closely related species of the genus Secale. Here we applied the SPInDel (Species Identification by Insertions/Deletions) approach, which targets hypervariable genomic regions containing multiple insertions/deletions (indels) and exhibiting extensive length variability. We analysed a total of 140 and 210 non-coding sequences from cpDNA and mtDNA, respectively. The resulting data highlight regions which may represent useful molecular markers with respect to closely related species of the genus Secale; however, we found the chloroplast genome to be more informative. These molecular markers include non-coding regions of chloroplast DNA (atpB-rbcL and trnT-trnL) and non-coding regions of mitochondrial DNA (nad1B-nad1C and rrn5/rrn18). Our results demonstrate the utility of the SPInDel concept for the characterisation of Secale species.
Affiliation(s)
- Lidia Skuza
- Institute of Biology, University of Szczecin, 13 Wąska, 71-415 Szczecin, Poland; (E.F.); (I.S.)
- The Centre for Molecular Biology and Biotechnology, University of Szczecin, 13 Wąska, 71-415 Szczecin, Poland
- Correspondence:
| | - Ewa Filip
- Institute of Biology, University of Szczecin, 13 Wąska, 71-415 Szczecin, Poland; (E.F.); (I.S.)
- The Centre for Molecular Biology and Biotechnology, University of Szczecin, 13 Wąska, 71-415 Szczecin, Poland
| | - Izabela Szućko
- Institute of Biology, University of Szczecin, 13 Wąska, 71-415 Szczecin, Poland; (E.F.); (I.S.)
- The Centre for Molecular Biology and Biotechnology, University of Szczecin, 13 Wąska, 71-415 Szczecin, Poland
| | - Jan Bocianowski
- Department of Mathematical and Statistical Methods, Faculty of Agronomy and Bioengineering, Poznań University of Life Sciences, 28 Wojska Polskiego, 60-637 Poznań, Poland;
|
10
|
Rabinowitz T, Shomron N. Genome-wide noninvasive prenatal diagnosis of monogenic disorders: Current and future trends. Comput Struct Biotechnol J 2020; 18:2463-2470. [PMID: 33005308 PMCID: PMC7509788 DOI: 10.1016/j.csbj.2020.09.003] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/22/2020] [Revised: 08/17/2020] [Accepted: 09/01/2020] [Indexed: 02/09/2023] Open
Abstract
Noninvasive prenatal diagnosis (NIPD) is a risk-free alternative to invasive methods for prenatal diagnosis, e.g. amniocentesis. NIPD is based on the presence of fetal DNA within the mother’s plasma cell-free DNA (cfDNA). Though currently available for various monogenic diseases through detection of point mutations, NIPD is limited to detecting one mutation or up to several genes simultaneously. Noninvasive prenatal whole exome/genome sequencing (WES/WGS) has demonstrated genome-wide detection of fetal point mutations in a few studies. However, genome-wide NIPD of monogenic disorders currently faces several challenges and limitations, mainly due to the small amounts of cfDNA and fetal-derived fragments, and the deep coverage required. Several approaches have been suggested for addressing these hurdles, based on various technologies and algorithms. The first relevant software tool, Hoobari, recently became available. Here we review the approaches proposed and the paths required to make genome-wide monogenic NIPD widely available in the clinic.
Affiliation(s)
- Tom Rabinowitz
- Faculty of Medicine and Edmond J Safra Center for Bioinformatics, Tel Aviv University, Tel Aviv 69978, Israel
| | - Noam Shomron
- Faculty of Medicine and Edmond J Safra Center for Bioinformatics, Tel Aviv University, Tel Aviv 69978, Israel
|
11
|
Li M, Tang L, Wu FX, Pan Y, Wang J. CSA: a web service for the complete process of ChIP-Seq analysis. BMC Bioinformatics 2019; 20:515. [PMID: 31874601 PMCID: PMC6929326 DOI: 10.1186/s12859-019-3090-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/28/2019] [Accepted: 09/10/2019] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND Chromatin immunoprecipitation sequencing (ChIP-seq) is a technology that combines chromatin immunoprecipitation (ChIP) with next-generation sequencing (NGS) to analyze protein interactions with DNA. At present, most ChIP-seq analysis tools are command-line programs that lack user-friendly interfaces. Although some web services with graphical interfaces have been developed for ChIP-seq analysis, these sites cannot provide a comprehensive analysis of ChIP-seq from raw data to downstream analysis. RESULTS In this study, we develop a web service for the whole process of ChIP-Seq Analysis (CSA), which covers mapping, quality control, peak calling, and downstream analysis. In addition, CSA provides a customization function for users to define their own workflows. Visualizations of the mapping, peak calling, motif finding, and pathway analysis results are also provided in CSA. For the different types of ChIP-seq datasets, CSA can provide the corresponding tool to perform the analysis. Moreover, CSA can detect differences in ChIP signals between ChIP samples and controls to identify absolute binding sites. CONCLUSIONS The two case studies demonstrate the effectiveness of CSA, which can complete the whole procedure of ChIP-seq analysis. CSA provides a web interface for users and implements the visualization of every analysis step. The website of CSA is available at http://CompuBio.csu.edu.cn.
Affiliation(s)
- Min Li
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Li Tang
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Fang-Xiang Wu
- Division of Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, SKS7N5A9, Saskatoon, Canada
| | - Yi Pan
- Department of Computer Science, Georgia State University, GA30303, Atlanta, USA
| | - Jianxin Wang
- School of Computer Science and Engineering, Central South University, Changsha, China
|
12
|
Li D, Kim W, Wang L, Yoon KA, Park B, Park C, Kong SY, Hwang Y, Baek D, Lee ES, Won S. Comparison of INDEL Calling Tools with Simulation Data and Real Short-Read Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:1635-1644. [PMID: 30004886 DOI: 10.1109/tcbb.2018.2854793] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 06/08/2023]
Abstract
Insertions and deletions (INDELs) comprise a significant proportion of human genetic variation, and recent papers have revealed that many human diseases may be attributable to INDELs. With the development of next-generation sequencing (NGS) technology, many statistical/computational tools have been developed for calling INDELs. However, there are differences among those tools, and comparisons among them have been limited. In order to better understand these inter-tool differences, five popular and publicly available INDEL calling tools (GATK HaplotypeCaller, Platypus, VarScan2, Scalpel, and GotCloud) were evaluated using simulation data, 1000 Genomes Project data, and family-based sequencing data. The accuracy of INDEL calling by each tool was mainly evaluated by concordance rates. Family-based sequencing data, which consisted of 49 individuals from eight Korean families, were used to calculate Mendelian error rates. Our comparison results show that GATK HaplotypeCaller usually performs the best and that joint calling with Platypus can lead to additional improvements in accuracy. The results of this study provide important information regarding future directions for variant detection and algorithm development.
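The abstract does not say how the Mendelian error rate was computed. As a minimal sketch under a common definition (a genotype call in a child is an error when it cannot be formed from one allele of each parent), the calculation might look as follows. The function names, the trio layout, and the toy genotypes are hypothetical and are not taken from the paper.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can be formed from one paternal and one maternal allele."""
    return any(sorted((pat, mat)) == sorted(child) for pat, mat in product(father, mother))


def mendelian_error_rate(trios):
    """Fraction of (father, mother, child) genotype triples that violate Mendelian inheritance."""
    trios = list(trios)
    if not trios:
        return 0.0
    errors = sum(not is_mendelian_consistent(*trio) for trio in trios)
    return errors / len(trios)


# Toy genotypes at four INDEL sites (0 = reference allele, 1 = INDEL allele); illustrative only.
toy_trios = [
    ((0, 0), (0, 1), (0, 1)),  # consistent: INDEL allele inherited from the mother
    ((0, 0), (0, 0), (0, 1)),  # error: neither parent carries the INDEL allele
    ((0, 1), (0, 1), (1, 1)),  # consistent: one INDEL allele from each parent
    ((1, 1), (0, 0), (0, 0)),  # error: the child must inherit an INDEL allele from the father
]
rate = mendelian_error_rate(toy_trios)
```

With the toy input above, two of the four sites are inconsistent. A lower rate for one caller than another on the same families would suggest fewer genotyping errors, although de novo mutations also count as apparent errors under this definition.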
|
13
|
Crespo-Piazuelo D, Criado-Mesas L, Revilla M, Castelló A, Fernández AI, Folch JM, Ballester M. Indel detection from Whole Genome Sequencing data and association with lipid metabolism in pigs. PLoS One 2019; 14:e0218862. [PMID: 31246983 PMCID: PMC6597088 DOI: 10.1371/journal.pone.0218862] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/12/2018] [Accepted: 06/11/2019] [Indexed: 12/15/2022] Open
Abstract
The selection in commercial swine breeds for meat-production efficiency has been increasing over the past decades, reducing the intramuscular fat content, which has changed the sensorial and technological properties of pork. Through processes of natural adaptation and selective breeding, the accumulation of mutations has driven the genetic divergence between pig breeds. The most common and well-studied mutations are single-nucleotide polymorphisms (SNPs). However, insertions and deletions (indels) usually represent about one fifth of the detected mutations and should also be considered for animal breeding. In the present study, three different programs (Dindel, SAMtools mpileup, and GATK) were used to detect indels from Whole Genome Sequencing data of Iberian boars and Landrace sows. A total of 1,928,746 indels were detected by all three programs. The VEP tool predicted that 1,289 indels may have a high impact on protein sequence and function. Ten indels inside genes related to lipid metabolism were genotyped in pigs from three different backcrosses of Iberian origin, yielding different allelic frequencies in each backcross. Genome-Wide Association Studies performed in the Longissimus dorsi muscle found an association between an indel located in the C1q and TNF related 12 (C1QTNF12) gene and the amount of eicosadienoic acid (C20:2(n-6)).
Affiliation(s)
- Daniel Crespo-Piazuelo
- Plant and Animal Genomics, Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, Bellaterra, Spain
- Departament de Ciència Animal i dels Aliments, Facultat de Veterinària, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
- * E-mail:
| | - Lourdes Criado-Mesas
- Plant and Animal Genomics, Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, Bellaterra, Spain
| | - Manuel Revilla
- Plant and Animal Genomics, Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, Bellaterra, Spain
- Departament de Ciència Animal i dels Aliments, Facultat de Veterinària, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
| | - Anna Castelló
- Plant and Animal Genomics, Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, Bellaterra, Spain
- Departament de Ciència Animal i dels Aliments, Facultat de Veterinària, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
| | - Ana I. Fernández
- Departamento de Mejora Genética Animal, Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Madrid, Spain
| | - Josep M. Folch
- Plant and Animal Genomics, Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, Bellaterra, Spain
- Departament de Ciència Animal i dels Aliments, Facultat de Veterinària, Universitat Autònoma de Barcelona (UAB), Bellaterra, Spain
| | - Maria Ballester
- Departament de Genètica i Millora Animal, Institut de Recerca i Tecnologia Agroalimentàries (IRTA), Caldes de Montbui, Spain
|
14
|
Kumaran M, Subramanian U, Devarajan B. Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data. BMC Bioinformatics 2019; 20:342. [PMID: 31208315 PMCID: PMC6580603 DOI: 10.1186/s12859-019-2928-9] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 07/14/2018] [Accepted: 05/31/2019] [Indexed: 12/30/2022] Open
Abstract
Background Whole exome sequencing (WES) is a cost-effective method that identifies clinical variants, but it demands accurate variant calling tools. Currently available tools have variable accuracy in predicting specific clinical variants. However, it may be possible to find the best combination of aligner and variant caller for accurately detecting single nucleotide variants (SNVs) and small insertions and deletions (InDels) separately. Moreover, many important aspects of InDel detection, particularly InDel length in base pairs, are overlooked when comparing the performance of tools. Results We assessed the performance of variant calling pipelines using the combinations of four variant callers and five aligners on human NA12878 and simulated exome data. We used high confidence variant calls from the Genome in a Bottle (GiaB) consortium for validation, and GRCh37 and GRCh38 as the human reference genomes. Based on the performance metrics, both BWA and Novoalign aligners performed better with DeepVariant and SAMtools callers for detecting SNVs, and with DeepVariant and GATK for InDels. Furthermore, we obtained similar results on human NA24385 and NA24631 exome data from GiaB. Conclusion In this study, DeepVariant with BWA and Novoalign performed best for accurately detecting SNVs and InDels. The accuracy of variant calling was improved by merging the top performing pipelines. The results of our study provide useful recommendations for analysis of WES data in clinical genomics.
Affiliation(s)
- Manojkumar Kumaran
- Department of Bioinformatics, Aravind Medical Research Foundation, Madurai, Tamil Nadu, 625020, India.,School of Chemical and Biotechnology, SASTRA (Deemed to be University), Thanjavur, Tamil Nadu, 613401, India
| | - Umadevi Subramanian
- Department of Bioinformatics, Aravind Medical Research Foundation, Madurai, Tamil Nadu, 625020, India
| | - Bharanidharan Devarajan
- Department of Bioinformatics, Aravind Medical Research Foundation, Madurai, Tamil Nadu, 625020, India.
|
15
|
Rabinowitz T, Polsky A, Golan D, Danilevsky A, Shapira G, Raff C, Basel-Salmon L, Matar RT, Shomron N. Bayesian-based noninvasive prenatal diagnosis of single-gene disorders. Genome Res 2019; 29:428-438. [PMID: 30787035 PMCID: PMC6396420 DOI: 10.1101/gr.235796.118] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 02/07/2018] [Accepted: 01/23/2019] [Indexed: 12/04/2022]
Abstract
In the last decade, noninvasive prenatal diagnosis (NIPD) has emerged as an effective procedure for early detection of inherited diseases during pregnancy. This technique is based on using cell-free DNA (cfDNA) and fetal cfDNA (cffDNA) in maternal blood, and hence, has minimal risk for the mother and fetus compared with invasive techniques. NIPD is currently used for identifying chromosomal abnormalities (in some instances) and for single-gene disorders (SGDs) of paternal origin. However, for SGDs of maternal origin, sensitivity poses a challenge that limits the testing to one genetic disorder at a time. Here, we present a Bayesian method for the NIPD of monogenic diseases that is independent of the mode of inheritance and parental origin. Furthermore, we show that accounting for differences in the length distribution of fetal- and maternal-derived cfDNA fragments results in increased accuracy. Our model is the first to predict inherited insertions–deletions (indels). The method described can serve as a general framework for the NIPD of SGDs; this will facilitate easy integration of further improvements. One such improvement that is presented in the current study is a machine learning model that corrects errors based on patterns found in previously processed data. Overall, we show that next-generation sequencing (NGS) can be used for the NIPD of a wide range of monogenic diseases, simultaneously. We believe that our study will lead to the achievement of a comprehensive NIPD for monogenic diseases.
Affiliation(s)
- Tom Rabinowitz
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, 6997801, Israel
| | - Avital Polsky
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, 6997801, Israel
| | - David Golan
- Faculty of Industrial Engineering and Management, Technion, Haifa, 3200003, Israel
| | - Artem Danilevsky
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, 6997801, Israel
| | - Guy Shapira
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, 6997801, Israel
| | - Chen Raff
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, 6997801, Israel
| | - Lina Basel-Salmon
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, 6997801, Israel.,Raphael Recanati Genetic Institute, Rabin Medical Center, Beilinson Hospital, Petah Tikva, 4941494, Israel
| | - Reut Tomashov Matar
- Raphael Recanati Genetic Institute, Rabin Medical Center, Beilinson Hospital, Petah Tikva, 4941494, Israel
| | - Noam Shomron
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, 6997801, Israel
|
16
|
Sun Z, Bhagwate A, Prodduturi N, Yang P, Kocher JPA. Indel detection from RNA-seq data: tool evaluation and strategies for accurate detection of actionable mutations. Brief Bioinform 2018; 18:973-983. [PMID: 27473065 PMCID: PMC5862335 DOI: 10.1093/bib/bbw069] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 02/16/2016] [Indexed: 11/29/2022] Open
Abstract
Driver somatic mutations are a hallmark of a tumor that can be used for diagnosis and targeted therapy. Mutations are primarily detected from tumor DNA. Transcriptome profiling by RNA sequencing (RNA-seq), which captures the dynamics of gene activity, is becoming increasingly popular; it measures not only gene expression but also structural variations such as mutations and fusion transcripts. Although single-nucleotide variants (SNVs) can be easily identified from RNA-seq, intermediate-length insertions/deletions (indels longer than 2 bases but shorter than the sequence reads) cause significant challenges and are ignored by most RNA-seq analysis tools. This study evaluates commonly used RNA-seq analysis programs along with variant and somatic mutation callers in a series of data sets with simulated and known indels. The aim is to develop strategies for accurate indel detection. Our results show that RNA-seq alignment is the most important step for indel identification and that the evaluated programs vary widely in their sensitivity to map sequence reads with indels, from not sensitive at all to reasonably sensitive. The sensitivity is affected by sequence read length. Most variant calling programs rely on hard evidence of indels marked in the alignment, and the programs with realignment may use soft-clipped reads for indel inference. Based on these observations, we provide practical recommendations for indel detection when different RNA-seq aligners are used and demonstrate the best option with highly reliable results. With careful customization of bioinformatics algorithms, RNA-seq can be reliably used for both SNV and indel mutation detection to support clinical decision-making.
Affiliation(s)
- Zhifu Sun
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA
- Corresponding author: Zhifu Sun, Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 First St SW, Rochester, MN 55905, USA. Tel.: 507-266-1894; Fax: 507-284-0360; E-mail:
| | - Aditya Bhagwate
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA
| | - Naresh Prodduturi
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA
| | - Ping Yang
- Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Jean-Pierre A Kocher
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA
|
17
|
Kotelnikova EA, Pyatnitskiy M, Paleeva A, Kremenetskaya O, Vinogradov D. Practical aspects of NGS-based pathways analysis for personalized cancer science and medicine. Oncotarget 2018; 7:52493-52516. [PMID: 27191992 PMCID: PMC5239569 DOI: 10.18632/oncotarget.9370] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 10/02/2015] [Accepted: 04/18/2016] [Indexed: 12/17/2022] Open
Abstract
Nowadays, the personalized approach to health care and cancer care in particular is becoming more and more popular and is taking an important place in the translational medicine paradigm. In some cases, detection of the patient-specific individual mutations that point to a targeted therapy has already become a routine practice for clinical oncologists. Wider panels of genetic markers are also on the market which cover a greater number of possible oncogenes including those with lower reliability of resulting medical conclusions. In light of the large availability of high-throughput technologies, it is very tempting to use complete patient-specific New Generation Sequencing (NGS) or other "omics" data for cancer treatment guidance. However, there are still no gold standard methods and protocols to evaluate them. Here we will discuss the clinical utility of each of the data types and describe a systems biology approach adapted for single patient measurements. We will try to summarize the current state of the field focusing on the clinically relevant case-studies and practical aspects of data processing.
Affiliation(s)
- Ekaterina A Kotelnikova
- Personal Biomedicine, Moscow, Russia.,A. A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia.,Institute Biomedical Research August Pi Sunyer (IDIBAPS), Hospital Clinic of Barcelona, Barcelona, Spain
| | - Mikhail Pyatnitskiy
- Personal Biomedicine, Moscow, Russia.,Orekhovich Institute of Biomedical Chemistry, Moscow, Russia.,Pirogov Russian National Research Medical University, Moscow, Russia
| | | | - Olga Kremenetskaya
- Personal Biomedicine, Moscow, Russia.,Center for Theoretical Problems of Physicochemical Pharmacology, Russian Academy of Sciences, Moscow, Russia
| | - Dmitriy Vinogradov
- Personal Biomedicine, Moscow, Russia.,A. A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia.,Lomonosov Moscow State University, Moscow, Russia
|
18
|
Wadapurkar RM, Vyas R. Computational analysis of next generation sequencing data and its applications in clinical oncology. Informatics in Medicine Unlocked 2018. [DOI: 10.1016/j.imu.2018.05.003] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 12/12/2022] Open
|
19
|
Felten A, Vila Nova M, Durimel K, Guillier L, Mistou MY, Radomski N. First gene-ontology enrichment analysis based on bacterial coregenome variants: insights into adaptations of Salmonella serovars to mammalian- and avian-hosts. BMC Microbiol 2017; 17:222. [PMID: 29183286 PMCID: PMC5706153 DOI: 10.1186/s12866-017-1132-1] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 08/17/2017] [Accepted: 11/16/2017] [Indexed: 12/13/2022] Open
Abstract
Background Many of the bacterial genomic studies exploring the evolutionary processes of host adaptation focus on the accessory genome, describing how the gains and losses of genes can explain the colonization of new habitats. In contrast, we developed a new approach focusing on the coregenome in order to describe the host adaptation of Salmonella serovars. Methods In the present work, we propose bioinformatic tools allowing (i) robust phylogenetic inference based on SNPs and recombination events, (ii) identification of fixed SNPs and InDels distinguishing homoplastic and non-homoplastic coregenome variants, and (iii) gene-ontology enrichment analyses to describe metabolic processes involved in adaptation of Salmonella enterica subsp. enterica to mammalian- (S. Dublin), multi- (S. Enteritidis), and avian- (S. Pullorum and S. Gallinarum) hosts. Results The ‘VARCall’ workflow produced a robust phylogenetic inference confirming that the monophyletic clade S. Dublin diverged from the polyphyletic clade S. Enteritidis which includes the divergent clades S. Pullorum and S. Gallinarum (i). The scripts ‘phyloFixedVar’ and ‘FixedVar’ detected non-synonymous and non-homoplastic fixed variants supporting the phylogenetic reconstruction (ii). The scripts ‘GetGOxML’ and ‘EveryGO’ identified representative metabolic pathways related to host adaptation using the first gene-ontology enrichment analysis based on bacterial coregenome variants (iii). Conclusions We propose in the present manuscript a new coregenome approach coupling identification of fixed SNPs and InDels with regard to inferred phylogenetic clades, and gene-ontology enrichment analysis in order to describe the adaptation of Salmonella serovars Dublin (i.e. mammalian-hosts), Enteritidis (i.e. multi-hosts), Pullorum (i.e. avian-hosts) and Gallinarum (i.e. avian-hosts) at the coregenome scale. All these polyvalent bioinformatic tools can be applied to other bacterial genera without additional development.
Affiliation(s)
- Arnaud Felten
- Université PARIS-EST, Anses, Laboratory for food safety, Maisons-Alfort, France
| | - Meryl Vila Nova
- Université PARIS-EST, Anses, Laboratory for food safety, Maisons-Alfort, France
| | - Kevin Durimel
- Université PARIS-EST, Anses, Laboratory for food safety, Maisons-Alfort, France
| | - Laurent Guillier
- Université PARIS-EST, Anses, Laboratory for food safety, Maisons-Alfort, France
| | - Michel-Yves Mistou
- Université PARIS-EST, Anses, Laboratory for food safety, Maisons-Alfort, France
| | - Nicolas Radomski
- Université PARIS-EST, Anses, Laboratory for food safety, Maisons-Alfort, France.
|
20
|
Abstract
Storing biologically equivalent indels as distinct entries in databases causes data redundancy, and misleads downstream analysis. It is thus desirable to have a unified system for identifying and representing equivalent indels. Moreover, a unified system is also desirable to compare the indel calling results produced by different tools. This paper describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system, which also can be used to compare different indel calling results. UPS-indel identifies 15% redundant indels in dbSNP, 29% in COSMIC coding, and 13% in COSMIC noncoding datasets across all human chromosomes, higher than previously reported. Comparing the performance of UPS-indel with existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants shows that UPS-indel is able to identify 456,352 more redundant indels in dbSNP; 2,118 more in COSMIC coding, and 553 more in COSMIC noncoding indel dataset in addition to the ones reported jointly by these tools. Moreover, comparing UPS-indel to state-of-the-art approaches for indel call set comparison demonstrates its clear superiority in finding common indels among call sets. UPS-indel is theoretically proven to find all equivalent indels, and thus exhaustive.
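To make the notion of equivalent indels concrete, the following minimal sketch treats two indel records as equivalent when applying each to the same reference yields the same mutated sequence. This is a naive illustration of the redundancy problem and is not the UPS-indel algorithm; the function names, the 0-based coordinates, and the toy reference are assumptions made for this example.

```python
def apply_variant(reference: str, pos: int, ref: str, alt: str) -> str:
    """Return the sequence obtained by replacing `ref` with `alt` at 0-based `pos`."""
    if reference[pos:pos + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference at this position")
    return reference[:pos] + alt + reference[pos + len(ref):]


def are_equivalent(reference: str, variant_a: tuple, variant_b: tuple) -> bool:
    """Two variants are equivalent if they produce the same mutated sequence."""
    return apply_variant(reference, *variant_a) == apply_variant(reference, *variant_b)


# Toy reference containing a CA repeat: deleting any single "CA" unit gives the same result.
reference = "GGCACACATT"
deletion_left = (1, "GCA", "G")    # deletion reported at the left end of the repeat
deletion_right = (5, "ACA", "A")   # the same deletion reported further to the right
different = (1, "GCAC", "G")       # a longer deletion that is not equivalent

same = are_equivalent(reference, deletion_left, deletion_right)
distinct = are_equivalent(reference, deletion_left, different)
```

Here `deletion_left` and `deletion_right` have different coordinates and alleles but both turn the reference into `GGCACATT`, so a database storing both would hold a redundant entry. Comparing full mutated sequences is only practical for toy inputs; real tools work on local coordinates instead.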
|
21
|
Song X, Haghighi A, Iliuta IA, Pei Y. Molecular diagnosis of autosomal dominant polycystic kidney disease. Expert Rev Mol Diagn 2017; 17:885-895. [PMID: 28724316 DOI: 10.1080/14737159.2017.1358088] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 01/06/2023]
Abstract
INTRODUCTION Autosomal dominant polycystic kidney disease (ADPKD) is the most common inherited kidney disease and accounts for 5-10% of end-stage renal disease in developed countries. Mutations in PKD1 and PKD2 account for a majority of cases. Mutation screening of PKD1 is technically challenging, largely due to the complexity resulting from duplication of its first 33 exons in six highly homologous pseudogenes (i.e. PKD1P1-P6). A protocol using locus-specific long-range and nested PCR has enabled comprehensive PKD1 mutation screening but is labor-intensive and costly. Here, the authors review how recent advances in Next Generation Sequencing are poised to transform and extend molecular diagnosis of ADPKD. Areas covered: Key original research articles and reviews of the topic published in English from 1957 to 2017, identified through PubMed. Expert commentary: The authors review current and evolving approaches using targeted resequencing or whole genome sequencing for screening typical as well as challenging cases (e.g. cases with no detectable PKD1 and PKD2 mutations, which may be due to somatic mosaicism or other cystic disease; and complex genetics such as bilineal disease).
Affiliation(s)
- Xuewen Song
- a Division of Nephrology , University Health Network and University of Toronto , Toronto , ON , Canada
| | - Amirreza Haghighi
- a Division of Nephrology , University Health Network and University of Toronto , Toronto , ON , Canada
| | - Ioan-Andrei Iliuta
- a Division of Nephrology , University Health Network and University of Toronto , Toronto , ON , Canada
| | - York Pei
- a Division of Nephrology , University Health Network and University of Toronto , Toronto , ON , Canada
|
22
|
Kim BY, Park JH, Jo HY, Koo SK, Park MH. Optimized detection of insertions/deletions (INDELs) in whole-exome sequencing data. PLoS One 2017; 12:e0182272. [PMID: 28792971 PMCID: PMC5549930 DOI: 10.1371/journal.pone.0182272] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 02/23/2017] [Accepted: 07/14/2017] [Indexed: 12/22/2022] Open
Abstract
Insertion and deletion (INDEL) mutations, the most common type of structural variation, are associated with several human diseases. The detection of INDELs through next-generation sequencing (NGS) is becoming more common due to the decrease in costs, the increase in efficiency, and sensitivity improvements demonstrated by the various sequencing platforms and analytical tools. However, there are still many errors associated with INDEL variant calling, and distinguishing INDELs from errors in NGS remains challenging. To evaluate INDEL calling from whole-exome sequencing (WES) data, we performed Sanger sequencing for all INDELs called by several calling algorithms. We compared the performance of four algorithms (i.e. GATK, SAMtools, Dindel, and Freebayes) for INDEL detection from the same sample. We examined the sensitivity and positive predictive value (PPV) of GATK (90.2 and 89.5%, respectively), SAMtools (75.3 and 94.4%, respectively), Dindel (90.1 and 88.6%, respectively), and Freebayes (80.1 and 94.4%, respectively). GATK had the highest sensitivity. Furthermore, we identified INDELs with high PPV (intersection of 4 algorithms: 98.7%; intersection of 3 algorithms: 97.6%; intersection of GATK and SAMtools: 97.6%). We present two key sources of difficulty in accurate INDEL detection: 1) the presence of repeats, and 2) heterozygous INDELs. Based on these results, we suggest accessible algorithms that selectively reduce error rates and thereby facilitate INDEL detection. Our study may also serve as a basis for understanding the accuracy and completeness of INDEL detection.
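The sensitivity and PPV figures above follow from standard definitions, which can be sketched as below. This is a minimal illustration assuming each call set and the validated truth set are plain Python sets of INDEL identifiers; the identifiers and counts are invented for the example and are not data from the study.

```python
def sensitivity_and_ppv(called: set, truth: set) -> tuple:
    """Return (sensitivity, PPV) of a call set against a validated truth set."""
    true_pos = len(called & truth)    # calls confirmed by validation
    false_pos = len(called - truth)   # calls not confirmed
    false_neg = len(truth - called)   # validated INDELs the caller missed
    sensitivity = true_pos / (true_pos + false_neg) if truth else 0.0
    ppv = true_pos / (true_pos + false_pos) if called else 0.0
    return sensitivity, ppv


# Hypothetical toy call sets; "truth" stands in for validated INDELs.
truth = {"indel1", "indel2", "indel3", "indel4"}
caller_a = {"indel1", "indel2", "indel3", "indel5"}
caller_b = {"indel1", "indel2", "indel6"}

stats_a = sensitivity_and_ppv(caller_a, truth)
stats_b = sensitivity_and_ppv(caller_b, truth)
# Keeping only INDELs reported by both callers trades sensitivity for PPV.
stats_both = sensitivity_and_ppv(caller_a & caller_b, truth)
```

In this toy case the intersection of the two call sets contains no false positives, which mirrors the pattern in the abstract where intersecting algorithms gives the highest PPV, at the cost of missing INDELs that only one caller finds.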
Affiliation(s)
- Bo-Young Kim
- Division of Intractable Diseases, Center for Biomedical Sciences, Korea National Institute of Health, Chungcheongbuk-do, South Korea
| | | | - Hye-Yeong Jo
- Division of Intractable Diseases, Center for Biomedical Sciences, Korea National Institute of Health, Chungcheongbuk-do, South Korea
| | - Soo Kyung Koo
- Division of Intractable Diseases, Center for Biomedical Sciences, Korea National Institute of Health, Chungcheongbuk-do, South Korea
- * E-mail: (SKK); (MHP)
| | - Mi-Hyun Park
- Division of Intractable Diseases, Center for Biomedical Sciences, Korea National Institute of Health, Chungcheongbuk-do, South Korea
- * E-mail: (SKK); (MHP)
| |
Collapse
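The sensitivity and PPV figures quoted in the abstract above, and the PPV of caller intersections, follow from simple set arithmetic once each call set is compared with a validated truth set. The following is a minimal sketch of that arithmetic, not the authors' pipeline: the `Indel` tuple, the function name `sensitivity_and_ppv`, and all toy variants are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Indel(NamedTuple):
    """Hypothetical key for one call: position plus reference and alternate alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def sensitivity_and_ppv(called: set, truth: set) -> tuple:
    """Sensitivity = TP / |truth|; PPV = TP / |called|."""
    true_positives = len(called & truth)
    sensitivity = true_positives / len(truth) if truth else 0.0
    ppv = true_positives / len(called) if called else 0.0
    return sensitivity, ppv


# Toy data (made up): four validated INDELs and two imperfect call sets.
a = Indel("chr1", 100, "AT", "A")
b = Indel("chr1", 250, "G", "GCC")
c = Indel("chr2", 400, "CAA", "C")
d = Indel("chr3", 900, "T", "TA")
false_1 = Indel("chr2", 777, "G", "GT")
false_2 = Indel("chr4", 50, "AC", "A")

truth = {a, b, c, d}
caller_a = {a, b, c, false_1}
caller_b = {a, b, false_2}
consensus = caller_a & caller_b  # calls supported by both callers

for name, calls in [("caller_a", caller_a), ("caller_b", caller_b), ("consensus", consensus)]:
    sens, ppv = sensitivity_and_ppv(calls, truth)
    print(f"{name}: sensitivity={sens:.2f} PPV={ppv:.2f}")
```

In this toy case the intersection trades sensitivity for PPV, which is the same direction of effect the abstract reports for intersected call sets.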
|
23
|
Chintalapati M, Dannemann M, Prüfer K. Using the Neandertal genome to study the evolution of small insertions and deletions in modern humans. BMC Evol Biol 2017; 17:179. [PMID: 28778150 PMCID: PMC5543596 DOI: 10.1186/s12862-017-1018-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2017] [Accepted: 07/19/2017] [Indexed: 12/24/2022] Open
Abstract
Background: Small insertions and deletions occur in humans at a lower rate compared to nucleotide changes, but evolve under more constraint than nucleotide changes. While the evolution of insertions and deletions has been investigated using ape outgroups, the now available genome of a Neandertal can shed light on the evolution of indels in more recent times. Results: We used the Neandertal genome together with several primate outgroup genomes to differentiate between human insertion/deletion changes that likely occurred before the split from Neandertals and those that likely arose later. Changes that pre-date the split from Neandertals show a smaller proportion of deletions than those that occurred later. The presence of a Neandertal-shared allele in Europeans or Asians but the absence in Africans was used to detect putatively introgressed indels in Europeans and Asians. A larger proportion of these variants reside in intergenic regions compared to other modern human variants, and some variants are linked to SNPs that have been associated with traits in modern humans. Conclusions: Our results are in agreement with earlier results that suggested that deletions evolve under more constraint than insertions. When considering Neandertal introgressed variants, we find some evidence that negative selection affected these variants more than other variants segregating in modern humans. Among introgressed variants we also identify indels that may influence the phenotype of their carriers. In particular, an introgressed deletion associated with a decrease in the time to menarche may constitute an example of a former Neandertal-specific trait contributing to modern human phenotypic diversity.
Affiliation(s)
- Michael Dannemann
- Max Planck Institute for Evolutionary Anthropology, 04103, Leipzig, Germany
- Kay Prüfer
- Max Planck Institute for Evolutionary Anthropology, 04103, Leipzig, Germany
|
24
|
Replication Errors Made During Oogenesis Lead to Detectable De Novo mtDNA Mutations in Zebrafish Oocytes with a Low mtDNA Copy Number. Genetics 2016; 204:1423-1431. [PMID: 27770035 DOI: 10.1534/genetics.116.194035] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2016] [Accepted: 10/13/2016] [Indexed: 01/30/2023] Open
Abstract
Of all pathogenic mitochondrial DNA (mtDNA) mutations in humans, ∼25% are de novo, although the occurrence in oocytes has never been directly assessed. We used next-generation sequencing to detect point mutations directly in the mtDNA of 3-15 individual mature oocytes and three somatic tissues from eight zebrafish females. Various statistical and biological filters allowed reliable detection of de novo variants with heteroplasmy ≥1.5%. In total, we detected 38 de novo base substitutions, but no insertions or deletions. These 38 de novo mutations were present in 19 of 103 mature oocytes, indicating that ∼20% of the mature oocytes carry at least one de novo mutation with heteroplasmy ≥1.5%. This frequency of de novo mutations is close to that deduced from the reported error rate of polymerase gamma, the mitochondrial replication enzyme, implying that mtDNA replication errors made during oogenesis are a likely explanation. Substantial variation in the mutation prevalence among mature oocytes can be explained by the highly variable mtDNA copy number, since we previously reported that ∼20% of the primordial germ cells have a mtDNA copy number of ≤73 and would lead to detectable mutation loads. In conclusion, replication errors made during oogenesis are an important source of de novo mtDNA base substitutions, and their location and heteroplasmy level determine their significance.
|
25
|
From next-generation resequencing reads to a high-quality variant data set. Heredity (Edinb) 2016; 118:111-124. [PMID: 27759079 DOI: 10.1038/hdy.2016.102] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2016] [Revised: 09/03/2016] [Accepted: 09/06/2016] [Indexed: 12/11/2022] Open
Abstract
Sequencing has revolutionized biology by permitting the analysis of genomic variation at an unprecedented resolution. High-throughput sequencing is fast and inexpensive, making it accessible for a wide range of research topics. However, the produced data contain subtle but complex types of errors, biases and uncertainties that impose several statistical and computational challenges to the reliable detection of variants. To tap the full potential of high-throughput sequencing, a thorough understanding of the data produced as well as the available methodologies is required. Here, I review several commonly used methods for generating and processing next-generation resequencing data, discuss the influence of errors and biases together with their resulting implications for downstream analyses and provide general guidelines and recommendations for producing high-quality single-nucleotide polymorphism data sets from raw reads by highlighting several sophisticated reference-based methods representing the current state of the art.
|
26
|
Abstract
When reporting research findings, scientists document the steps they followed so that others can verify and build upon the research. When those steps have been described in sufficient detail that others can retrace the steps and obtain similar results, the research is said to be reproducible. Computers play a vital role in many research disciplines and present both opportunities and challenges for reproducibility. Computers can be programmed to execute analysis tasks, and those programs can be repeated and shared with others. The deterministic nature of most computer programs means that the same analysis tasks, applied to the same data, will often produce the same outputs. However, in practice, computational findings often cannot be reproduced because of complexities in how software is packaged, installed, and executed, and because of limitations associated with how scientists document analysis steps. Many tools and techniques are available to help overcome these challenges; here we describe seven such strategies. With a broad scientific audience in mind, we describe the strengths and limitations of each approach, as well as the circumstances under which each might be applied. No single strategy is sufficient for every scenario; thus we emphasize that it is often useful to combine approaches.
Affiliation(s)
- Stephen R Piccolo
- Department of Biology, Brigham Young University, Provo, UT, 84602, USA
- Michael B Frampton
- Department of Computer Science, Brigham Young University, Provo, UT, USA
|
27
|
Deng X, den Bakker HC, Hendriksen RS. Genomic Epidemiology: Whole-Genome-Sequencing-Powered Surveillance and Outbreak Investigation of Foodborne Bacterial Pathogens. Annu Rev Food Sci Technol 2016; 7:353-74. [PMID: 26772415 DOI: 10.1146/annurev-food-041715-033259] [Citation(s) in RCA: 124] [Impact Index Per Article: 13.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
As we are approaching the twentieth anniversary of PulseNet, a network of public health and regulatory laboratories that has changed the landscape of foodborne illness surveillance through molecular subtyping, public health microbiology is undergoing another transformation brought about by so-called next-generation sequencing (NGS) technologies that have made whole-genome sequencing (WGS) of foodborne bacterial pathogens a realistic and superior alternative to traditional subtyping methods. Routine, real-time, and widespread application of WGS in food safety and public health is on the horizon. Technological, operational, and policy challenges are still present and being addressed by an international and multidisciplinary community of researchers, public health practitioners, and other stakeholders.
Affiliation(s)
- Xiangyu Deng
- Center for Food Safety and Department of Food Science and Technology, University of Georgia, Griffin, Georgia 30269
- Henk C den Bakker
- International Center for Food Industry Excellence, Department of Animal and Food Sciences, Texas Tech University, Lubbock, Texas 79409
- Rene S Hendriksen
- National Food Institute, Research Group of Genomic Epidemiology, Technical University of Denmark, Kongens Lyngby, DK-2800 Denmark
|
28
|
Next Generation Sequencing Data and Proteogenomics. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2016; 926:11-19. [DOI: 10.1007/978-3-319-42316-6_2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
|
29
|
Yang R, Nelson AC, Henzler C, Thyagarajan B, Silverstein KAT. ScanIndel: a hybrid framework for indel detection via gapped alignment, split reads and de novo assembly. Genome Med 2015; 7:127. [PMID: 26643039 PMCID: PMC4671222 DOI: 10.1186/s13073-015-0251-2] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2015] [Accepted: 11/18/2015] [Indexed: 12/30/2022] Open
Abstract
Comprehensive identification of insertions/deletions (indels) across the full size spectrum from second generation sequencing is challenging due to the relatively short read length inherent in the technology. Different indel calling methods exist but are limited in detection to specific sizes with varying accuracy and resolution. We present ScanIndel, an integrated framework for detecting indels with multiple heuristics including gapped alignment, split reads and de novo assembly. Using simulation data, we demonstrate ScanIndel’s superior sensitivity and specificity relative to several state-of-the-art indel callers across various coverage levels and indel sizes. ScanIndel yields higher predictive accuracy with lower computational cost compared with existing tools for both targeted resequencing data from tumor specimens and high coverage whole-genome sequencing data from the human NIST standard NA12878. Thus, we anticipate ScanIndel will improve indel analysis in both clinical and research settings. ScanIndel is implemented in Python, and is freely available for academic use at https://github.com/cauyrd/ScanIndel.
Affiliation(s)
- Rendong Yang
- Supercomputing Institute for Advanced Computational Research, University of Minnesota, 117 Pleasant St. SE, RM 541, Minneapolis, MN, 55455, USA
- Andrew C Nelson
- Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN, 55455, USA
- Christine Henzler
- Supercomputing Institute for Advanced Computational Research, University of Minnesota, 117 Pleasant St. SE, RM 541, Minneapolis, MN, 55455, USA
- Bharat Thyagarajan
- Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN, 55455, USA
- Kevin A T Silverstein
- Supercomputing Institute for Advanced Computational Research, University of Minnesota, 117 Pleasant St. SE, RM 541, Minneapolis, MN, 55455, USA
|
30
|
Gézsi A, Bolgár B, Marx P, Sarkozy P, Szalai C, Antal P. VariantMetaCaller: automated fusion of variant calling pipelines for quantitative, precision-based filtering. BMC Genomics 2015; 16:875. [PMID: 26510841 PMCID: PMC4625715 DOI: 10.1186/s12864-015-2050-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2015] [Accepted: 10/06/2015] [Indexed: 11/10/2022] Open
Abstract
Background: The low concordance between different variant calling methods still poses a challenge for the widespread application of next-generation sequencing in research and clinical practice. A wide range of variant annotations can be used for filtering call sets in order to improve the precision of the variant calls, but the choice of the appropriate filtering thresholds is not straightforward. Variant quality score recalibration provides an alternative solution to hard filtering, but it requires large-scale, genomic data. Results: We evaluated germline variant calling pipelines based on BWA and Bowtie 2 aligners in combination with GATK UnifiedGenotyper, GATK HaplotypeCaller, FreeBayes and SAMtools variant callers, using simulated and real benchmark sequencing data (NA12878 with Illumina Platinum Genomes). We argue that these pipelines are not merely discordant, but they extract complementary useful information. We introduce VariantMetaCaller to test the hypothesis that the automated fusion of measurement related information allows better performance than the recommended hard-filtering settings or recalibration and the fusion of the individual call sets without using annotations. VariantMetaCaller uses Support Vector Machines to combine multiple information sources generated by variant calling pipelines and estimates probabilities of variants. This novel method had significantly higher sensitivity and precision than the individual variant callers in all target region sizes, ranging from a few hundred kilobases to whole exomes. We also demonstrated that VariantMetaCaller supports a quantitative, precision based filtering of variants under wider conditions. Specifically, the computed probabilities of the variants can be used to order the variants, and for a given threshold, probabilities can be used to estimate precision. Precision then can be directly translated to the number of true called variants, or equivalently, to the number of false calls, which allows finding problem-specific balance between sensitivity and precision. Conclusions: VariantMetaCaller can be applied to small target regions and whole exomes as well, and it can be used in cases of organisms for which highly accurate variant call sets are not yet available; it can therefore be a viable alternative to hard filtering in cases where variant quality score recalibration cannot be used. VariantMetaCaller is freely available at http://bioinformatics.mit.bme.hu/VariantMetaCaller.
Affiliation(s)
- András Gézsi
- Department of Genetics, Cell- and Immunobiology, Semmelweis University, Nagyvárad tér 4, Budapest, H-1089, Hungary; Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary
- Bence Bolgár
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary
- Péter Marx
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary
- Peter Sarkozy
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary
- Csaba Szalai
- Department of Genetics, Cell- and Immunobiology, Semmelweis University, Nagyvárad tér 4, Budapest, H-1089, Hungary
- Péter Antal
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest, H-1117, Hungary
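The abstract above states that per-variant probabilities can be used to estimate precision at a chosen threshold, and that precision translates into expected numbers of true and false calls. A minimal sketch of that idea follows. It assumes the probabilities are well calibrated and uses the mean retained probability as the precision estimate; this estimator, the function name `expected_precision`, and the toy probabilities are illustrative assumptions, not VariantMetaCaller's actual code or API.

```python
def expected_precision(probabilities, threshold):
    """Return (n_kept, expected_true_calls, expected_precision) at a threshold.

    Assumes each probability is a calibrated estimate that the variant is real,
    so the expected number of true calls is the sum of the retained probabilities.
    """
    kept = [p for p in probabilities if p >= threshold]
    if not kept:
        return 0, 0.0, 0.0
    expected_true = sum(kept)
    return len(kept), expected_true, expected_true / len(kept)


# Toy probabilities (made up) for five candidate variants.
variant_probs = [0.99, 0.95, 0.90, 0.60, 0.20]

n_kept, n_true, precision = expected_precision(variant_probs, threshold=0.90)
n_false = n_kept - n_true  # expected number of false calls among retained variants
print(f"kept={n_kept} expected_true={n_true:.2f} "
      f"expected_false={n_false:.2f} precision={precision:.3f}")
```

Sweeping the threshold over the sorted probabilities would give the sensitivity versus precision trade-off that the abstract describes as problem-specific.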
|
31
|
Hasan MS, Wu X, Zhang L. Performance evaluation of indel calling tools using real short-read data. Hum Genomics 2015; 9:20. [PMID: 26286629 PMCID: PMC4545535 DOI: 10.1186/s40246-015-0042-2] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2015] [Accepted: 07/20/2015] [Indexed: 12/16/2022] Open
Abstract
Background: Insertion and deletion (indel), a common form of genetic variation, has been shown to cause or contribute to human genetic diseases and cancer. With the advance of next-generation sequencing technology, many indel calling tools have been developed; however, evaluation and comparison of these tools using large-scale real data are still scant. Here we evaluated seven popular and publicly available indel calling tools, GATK Unified Genotyper, VarScan, Pindel, SAMtools, Dindel, GATK HaplotypeCaller, and Platypus, using low-coverage data from 78 human genomes from the 1000 Genomes project. Results: Comparing indels called by these tools with a known set of indels, we found that Platypus outperforms other tools. In addition, a high percentage of known indels still remain undetected and the number of common indels called by all seven tools is very low. Conclusion: All these findings indicate the necessity of improving the existing tools or developing new algorithms to achieve reliable and consistent indel calling results.
Affiliation(s)
- Xiaowei Wu
- Department of Statistics, Virginia Tech, Blacksburg, VA, 24061, USA
- Liqing Zhang
- Department of Computer Science, Virginia Tech, Blacksburg, VA, 24061, USA
|
32
|
Liu J, Qu J, Yang C, Tang D, Li J, Lan H, Rong T. Development of genome-wide insertion and deletion markers for maize, based on next-generation sequencing data. BMC Genomics 2015; 16:601. [PMID: 26269146 PMCID: PMC4535256 DOI: 10.1186/s12864-015-1797-5] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2014] [Accepted: 07/24/2015] [Indexed: 12/11/2022] Open
Abstract
Background: Insertions and deletions (indels) are the most abundant form of structural variation in all genomes. Indels have been increasingly recognized as an important source of molecular markers due to high-density occurrence, cost-effectiveness, and ease of genotyping. Coupled with developments in bioinformatics, next-generation sequencing (NGS) platforms enable the discovery of millions of indel polymorphisms by comparing the whole genome sequences of individuals within a species. Results: A total of 1,973,746 unique indels were identified in 345 maize genomes, with an overall density of 958.79 indels/Mbp, and an average allele number of 2.76, ranging from 2 to 107. There were 264,214 indels with polymorphism information content (PIC) values greater than or equal to 0.5, accounting for 13.39% of overall indels. Of these highly polymorphic indels, we designed primer pairs for 83,481 and 29,403 indels with major allele differences (i.e. the size difference between the most and second most frequent alleles) greater than or equal to 3 and 8 bp, respectively, based on the differing resolution capabilities of gel electrophoresis. The accuracy of our indel markers was experimentally validated, and among 100 indel markers, average accuracy was approximately 90%. In addition, we also validated the polymorphism of the indel markers. Of 100 highly polymorphic indel markers, all had polymorphisms with average PIC values of 0.54. Conclusions: The maize genome is rich in indel polymorphisms. Intriguingly, the level of polymorphism in genic regions of the maize genome was higher than that in intergenic regions. The polymorphic indel markers developed from this study may enhance the efficiency of genetic research and marker-assisted breeding in maize.
Affiliation(s)
- Jian Liu
- Maize Research Institute, Sichuan Agricultural University, Chengdu, 611130, China; Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan Agricultural University, Chengdu, 611130, China
- Jingtao Qu
- Maize Research Institute, Sichuan Agricultural University, Chengdu, 611130, China
- Cong Yang
- Maize Research Institute, Sichuan Agricultural University, Chengdu, 611130, China
- Dengguo Tang
- Maize Research Institute, Sichuan Agricultural University, Chengdu, 611130, China
- Jingwei Li
- Maize Research Institute, Sichuan Agricultural University, Chengdu, 611130, China
- Hai Lan
- Maize Research Institute, Sichuan Agricultural University, Chengdu, 611130, China; Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan Agricultural University, Chengdu, 611130, China
- Tingzhao Rong
- Maize Research Institute, Sichuan Agricultural University, Chengdu, 611130, China; Key Laboratory of Biology and Genetic Improvement of Maize in Southwest Region, Ministry of Agriculture, Sichuan Agricultural University, Chengdu, 611130, China
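The abstract above filters indels on polymorphism information content (PIC) of at least 0.5 but does not state how PIC was computed. The sketch below assumes the standard Botstein et al. (1980) definition, PIC = 1 − Σ pᵢ² − Σ_{i<j} 2 pᵢ² pⱼ², where pᵢ are allele frequencies; the function name `pic` and the example counts are hypothetical.

```python
from itertools import combinations


def pic(allele_counts):
    """Polymorphism information content from allele counts (or frequencies).

    Assumes the Botstein et al. (1980) definition; inputs are normalized
    so that raw counts and frequencies give the same result.
    """
    total = sum(allele_counts)
    if total <= 0:
        raise ValueError("allele counts must sum to a positive value")
    freqs = [count / total for count in allele_counts]
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


# Two equally frequent alleles: the largest PIC a biallelic marker can reach.
print(f"biallelic 50/50:   {pic([1, 1]):.3f}")
# Three equally frequent alleles exceed the 0.5 cut-off used in the abstract.
print(f"triallelic equal:  {pic([1, 1, 1]):.3f}")
```

Under this definition a biallelic marker cannot exceed 0.375, so a PIC ≥ 0.5 filter necessarily selects indels with three or more alleles, which is consistent with the multi-allelic indels (average allele number 2.76) described in the abstract.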
|
33
|
Boschiero C, Gheyas AA, Ralph HK, Eory L, Paton B, Kuo R, Fulton J, Preisinger R, Kaiser P, Burt DW. Detection and characterization of small insertion and deletion genetic variants in modern layer chicken genomes. BMC Genomics 2015; 16:562. [PMID: 26227840 PMCID: PMC4563830 DOI: 10.1186/s12864-015-1711-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2014] [Accepted: 06/22/2015] [Indexed: 01/17/2023] Open
Abstract
Background: Small insertions and deletions (InDels) constitute the second most abundant class of genetic variants and have been found to be associated with many traits and diseases. The present study reports on the detection and characterisation of about 883 K high quality InDels from the whole-genome analysis of several modern layer chicken lines from diverse breeds. Results: To reduce the error rates seen in InDel detection, this study used the consensus set from two InDel-calling packages, SAMtools and Dindel, as well as stringent post-filtering criteria. By analysing sequence data from 163 chickens from 11 commercial and 5 experimental layer lines, this study detected about 883 K high quality consensus InDels with 93% validation rate and an average density of 0.78 InDels/kb over the genome. Certain chromosomes, viz. GGAZ, 16, 22 and 25, showed very low densities of InDels whereas the highest rate was observed on GGA6. In spite of the higher recombination rates on microchromosomes, the InDel density on these chromosomes was generally lower relative to macrochromosomes, possibly due to their higher gene density. About 43-87% of the InDels were found to be fixed within each line. The majority of detected InDels (86%) were 1-5 bases and about 63% were non-repetitive in nature while the rest were tandem repeats of various motif types. Functional annotation identified 613 frameshift, 465 non-frameshift and 10 stop-gain/loss InDels. Apart from the frameshift and stop-gain/loss InDels that are expected to affect the translation of protein sequences and their biological activity, 33% of the non-frameshift InDels were predicted as evolutionary intolerant with potential impact on protein functions. Moreover, about 2.5% of the InDels coincided with the most-conserved elements previously mapped on the chicken genome and are likely to define functional elements. InDels potentially affecting protein function were found to be enriched for certain gene classes, e.g. those associated with cell proliferation, chromosome and Golgi organization, spermatogenesis, and muscle contraction. Conclusions: The large catalogue of InDels presented in this study, along with associated information such as functional annotation and estimated allele frequency, is expected to serve as a rich resource for application in future research and breeding in the chicken.
Affiliation(s)
- Clarissa Boschiero
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK; Current Address: Departamento de Zootecnia, University of Sao Paulo/ESALQ, Piracicaba, SP, 13418-900, Brazil
- Almas A Gheyas
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- Hannah K Ralph
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- Lel Eory
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- Bob Paton
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- Richard Kuo
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- Pete Kaiser
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- David W Burt
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
|
34
|
Schmid M, Smith J, Burt DW, Aken BL, Antin PB, Archibald AL, Ashwell C, Blackshear PJ, Boschiero C, Brown CT, Burgess SC, Cheng HH, Chow W, Coble DJ, Cooksey A, Crooijmans RPMA, Damas J, Davis RVN, de Koning DJ, Delany ME, Derrien T, Desta TT, Dunn IC, Dunn M, Ellegren H, Eöry L, Erb I, Farré M, Fasold M, Fleming D, Flicek P, Fowler KE, Frésard L, Froman DP, Garceau V, Gardner PP, Gheyas AA, Griffin DK, Groenen MAM, Haaf T, Hanotte O, Hart A, Häsler J, Hedges SB, Hertel J, Howe K, Hubbard A, Hume DA, Kaiser P, Kedra D, Kemp SJ, Klopp C, Kniel KE, Kuo R, Lagarrigue S, Lamont SJ, Larkin DM, Lawal RA, Markland SM, McCarthy F, McCormack HA, McPherson MC, Motegi A, Muljo SA, Münsterberg A, Nag R, Nanda I, Neuberger M, Nitsche A, Notredame C, Noyes H, O'Connor R, O'Hare EA, Oler AJ, Ommeh SC, Pais H, Persia M, Pitel F, Preeyanon L, Prieto Barja P, Pritchett EM, Rhoads DD, Robinson CM, Romanov MN, Rothschild M, Roux PF, Schmidt CJ, Schneider AS, Schwartz MG, Searle SM, Skinner MA, Smith CA, Stadler PF, Steeves TE, Steinlein C, Sun L, Takata M, Ulitsky I, Wang Q, Wang Y, Warren WC, Wood JMD, Wragg D, Zhou H. Third Report on Chicken Genes and Chromosomes 2015. Cytogenet Genome Res 2015; 145:78-179. [PMID: 26282327 PMCID: PMC5120589 DOI: 10.1159/000430927] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022] Open
Affiliation(s)
- Michael Schmid
- Department of Human Genetics, University of Würzburg, Würzburg, Germany
|
35
|
Jiang Y, Turinsky AL, Brudno M. The missing indels: an estimate of indel variation in a human genome and analysis of factors that impede detection. Nucleic Acids Res 2015; 43:7217-28. [PMID: 26130710 PMCID: PMC4551921 DOI: 10.1093/nar/gkv677] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2015] [Accepted: 06/19/2015] [Indexed: 12/22/2022] Open
Abstract
With the development of High-Throughput Sequencing (HTS) thousands of human genomes have now been sequenced. Whenever different studies analyze the same genome they usually agree on the amount of single-nucleotide polymorphisms, but differ dramatically on the number of insertion and deletion variants (indels). Furthermore, there is evidence that indels are often severely under-reported. In this manuscript we derive the total number of indel variants in a human genome by combining data from different sequencing technologies, while assessing the indel detection accuracy. Our estimate of approximately 1 million indels in a Yoruban genome is much higher than the results reported in several recent HTS studies. We identify two key sources of difficulties in indel detection: the insufficient coverage, read length or alignment quality; and the presence of repeats, including short interspersed elements and homopolymers/dimers. We quantify the effect of these factors on indel detection. The quality of sequencing data plays a major role in improving indel detection by HTS methods. However, many indels exist in long homopolymers and repeats, where their detection is severely impeded. The true number of indel events is likely even higher than our current estimates, and new techniques and technologies will be required to detect them.
Affiliation(s)
- Yue Jiang
- Centre for Computational Medicine, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada; Center for Biomedical Informatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, 150001, China
- Andrei L Turinsky
- Centre for Computational Medicine, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Michael Brudno
- Centre for Computational Medicine, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada; Program in Genetics and Genome Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada; Department of Computer Science, University of Toronto, Toronto, ON, M5S 3G4, Canada
|
36
|
Bianco E, Nevado B, Ramos-Onsins SE, Pérez-Enciso M. A deep catalog of autosomal single nucleotide variation in the pig. PLoS One 2015; 10:e0118867. [PMID: 25789620 PMCID: PMC4366260 DOI: 10.1371/journal.pone.0118867] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 10/20/2014] [Accepted: 12/27/2014] [Indexed: 12/31/2022] Open
Abstract
A comprehensive catalog of variability in a given species is useful for many important purposes, e.g., designing high density arrays or pinpointing potential mutations of economic or physiological interest. Here we provide a genomewide, worldwide catalog of single nucleotide variants by simultaneously analyzing the shotgun sequence of 128 pigs and five suid outgroups. Despite the high SNP missing rate of some individuals (up to 88%), we retrieved over 48 million high quality variants. Of them, we were able to assess the ancestral allele of more than 39M biallelic SNPs. We found SNPs in 21,455 out of the 25,322 annotated genes in pig assembly 10.2. The annotation showed that more than 40% of the variants were novel variants, not present in dbSNP. Surprisingly, we found a large variability in transition/transversion rate along the genome, which is very well explained (R² = 0.79) primarily by genome differences in CpG content and recombination rate. The number of SNPs per window also varied but was less dependent on known factors such as gene density, missing rate or recombination (R² = 0.48). When we divided the samples into four groups, Asian wild boar (ASWB), Asian domestics (ASDM), European wild boar (EUWB) and European domestics (EUDM), we found a marked correlation in allele frequencies between domestics and wild boars within Asia and within Europe, but not across continents, due to the large evolutionary distance between pigs of both continents (~1.2 MYA). In general, the porcine species showed a small percentage of SNPs exclusive to each population group. EUWB and EUDM were predicted to harbor a larger fraction of potentially deleterious mutations, according to the SIFT algorithm, than Asian samples, perhaps a result of background selection being less effective due to a lower effective population size in Europe.
Affiliation(s)
- Erica Bianco
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, Bellaterra, Spain
- Universitat Autònoma de Barcelona, Department of Animal Science, Bellaterra, Spain
- Bruno Nevado
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, Bellaterra, Spain
- Universitat Autònoma de Barcelona, Department of Animal Science, Bellaterra, Spain
- Miguel Pérez-Enciso
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, Bellaterra, Spain
- Universitat Autònoma de Barcelona, Department of Animal Science, Bellaterra, Spain
- Institut Català de Recerca I Estudis Avançats (ICREA), Carrer de Lluís Companys 23, Barcelona, Spain
37
Challis D, Antunes L, Garrison E, Banks E, Evani US, Muzny D, Poplin R, Gibbs RA, Marth G, Yu F. The distribution and mutagenesis of short coding INDELs from 1,128 whole exomes. BMC Genomics 2015; 16:143. [PMID: 25765891 PMCID: PMC4352271 DOI: 10.1186/s12864-015-1333-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 08/12/2014] [Accepted: 02/09/2015] [Indexed: 12/30/2022] Open
Abstract
Background: Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls. Results: This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%. Conclusions: In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans.
Affiliation(s)
- Danny Challis
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA; Present address: Monsanto Company, Ankeny, IA, 50021, USA.
- Lilian Antunes
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA; Present address: Washington University School of Medicine, Saint Louis, MO, 63110, USA.
- Erik Garrison
- Department of Biology, Boston College, Wellcome Trust Sanger Institute, Chestnut Hill, MA, 02467, USA.
- Eric Banks
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA.
- Uday S Evani
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA; Present address: New York Genome Center, New York, NY, 10013, USA.
- Donna Muzny
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.
- Ryan Poplin
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA.
- Richard A Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.
- Gabor Marth
- Department of Biology, Boston College, Wellcome Trust Sanger Institute, Chestnut Hill, MA, 02467, USA; Present address: Department of Human Genetics and Utah Center for Genetic Discovery, University of Utah School of Medicine, Salt Lake City, UT, 84112, USA.
- Fuli Yu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA; Institute of Neurology, Tianjin Medical University General Hospital, Tianjin, 300052, China.
38
Godoy TF, Moreira GCM, Boschiero C, Gheyas AA, Gasparin G, Paduan M, Andrade SCS, Montenegro H, Burt DW, Ledur MC, Coutinho LL. SNP and INDEL detection in a QTL region on chicken chromosome 2 associated with muscle deposition. Anim Genet 2015; 46:158-63. [PMID: 25690762 DOI: 10.1111/age.12271] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Accepted: 12/09/2014] [Indexed: 11/28/2022]
Abstract
Genetic improvement is important for the poultry industry, contributing to increased efficiency of meat production and quality. Because breast muscle is the most valuable part of the chicken carcass, knowledge of polymorphisms influencing this trait can help breeding programs. Therefore, the complete genome of 18 chickens from two different experimental lines (broiler and layer) from EMBRAPA was sequenced, and SNPs and INDELs were detected in a QTL region for breast muscle deposition on chicken chromosome 2 between microsatellite markers MCW0185 and MCW0264 (105,849-112,649 kb). Initially, 94,674 unique SNPs and 10,448 unique INDELs were identified in the target region. After quality filtration, 77% of the SNPs (85,765) and 60% of the INDELs (7828) were retained. The studied region contains 66 genes, and functional annotation of the filtered variants identified 517 SNPs and three INDELs in exonic regions. Of these, 357 SNPs were classified as synonymous, 153 as non-synonymous, three as stopgain, four INDELs as frameshift and three INDELs as non-frameshift. These exonic mutations were identified in 37 of the 66 genes from the target region, three of which are related to muscle development (DTNA, RB1CC1 and MOS). Fifteen non-tolerated SNPs were detected in several genes (MEP1B, PRKDC, NSMAF, TRAPPC8, SDR16C5, CHD7, ST18 and RB1CC1). These loss-of-function and exonic variants present in genes related to muscle development can be considered candidate variants for further studies in chickens. Further association studies should be performed with these candidate mutations as should validation in commercial populations to allow a better explanation of QTL effects.
Affiliation(s)
- T F Godoy
- Departamento de Zootecnia, ESALQ/USP, Av. Pádua Dias 11, Piracicaba, São Paulo, 13419-900, Brazil
39
Chen W, Zhang L. The pattern of DNA cleavage intensity around indels. Sci Rep 2015; 5:8333. [PMID: 25660536 PMCID: PMC4321175 DOI: 10.1038/srep08333] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 09/15/2014] [Accepted: 01/07/2015] [Indexed: 12/22/2022] Open
Abstract
Indels (insertions and deletions) are the second most common form of genetic variation in eukaryotic genomes and are responsible for a multitude of genetic diseases. Despite their significance, detailed molecular mechanisms for indel generation are still unclear. Here we examined 2,656,597 small human and mouse germline indels, 16,742 human somatic indels, 10,599 large human insertions, and 5,822 large chimpanzee insertions and systematically analyzed the patterns of DNA cleavage intensities in the 200 base pair regions surrounding these indels. Our results show that DNA cleavage intensities close to the start and end points of indels are significantly lower than other regions, for both small human germline and somatic indels and also for mouse small indels. Compared to small indels, the patterns of DNA cleavage intensity around large indels are more complex, and there are two low intensity regions near each end of the indels that are approximately 13 bp apart from each other. Detailed analyses of a subset of indels show that there is a slight difference in cleavage intensity distribution between insertion indels and deletion indels that could be attributed to their respective enrichment of different repetitive elements. These results will provide new insight into indel generation mechanisms.
Affiliation(s)
- Wei Chen
- [1] Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan, China 063000 [2] Department of Computer Science, Virginia Tech, Blacksburg VA 24060
- Liqing Zhang
- Department of Computer Science, Virginia Tech, Blacksburg VA 24060
40
Moreira GCM, Godoy TF, Boschiero C, Gheyas A, Gasparin G, Andrade SCS, Paduan M, Montenegro H, Burt DW, Ledur MC, Coutinho LL. Variant discovery in a QTL region on chromosome 3 associated with fatness in chickens. Anim Genet 2015; 46:141-7. [PMID: 25643900 DOI: 10.1111/age.12263] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Accepted: 12/01/2014] [Indexed: 12/17/2022]
Abstract
Abdominal fat content is an economically important trait in commercially bred chickens. Although many quantitative trait loci (QTL) related to fat deposition have been detected, the resolution for these regions is low and functional variants are still unknown. The current study was conducted aiming at increasing resolution for a region previously shown to have a QTL associated with fat deposition, to detect novel variants from this region and to annotate those variants to delineate potentially functional ones as candidates for future studies. To achieve this, 18 chickens from a parental generation used in a reciprocal cross between broiler and layer lines were sequenced using the Illumina next-generation platform with an initial coverage of 18X/chicken. The discovery of genetic variants was performed in a QTL region located on chromosome 3 between microsatellite markers LEI0161 and ADL0371 (33,595,706-42,632,651 bp). A total of 136,054 unique SNPs and 15,496 unique INDELs were detected in this region, and after quality filtering, 123,985 SNPs and 11,298 INDELs were retained. Of these variants, 386 SNPs and 15 INDELs were located in coding regions of genes related to important metabolic pathways. Loss-of-function variants were identified in several genes, and six of those, namely LOC771163, EGLN1, GNPAT, FAM120B, THBS2 and GGPS1, were related to fat deposition. Therefore, these loss-of-function variants are candidate mutations for conducting further studies on this important trait in chickens.
Affiliation(s)
- G C M Moreira
- Departamento de Zootecnia, USP/ESALQ, Piracicaba, SP, 13418-900, Brazil
41
Ghoneim DH, Myers JR, Tuttle E, Paciorkowski AR. Comparison of insertion/deletion calling algorithms on human next-generation sequencing data. BMC Res Notes 2014; 7:864. [PMID: 25435282 PMCID: PMC4265454 DOI: 10.1186/1756-0500-7-864] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.1] [Received: 07/31/2014] [Accepted: 11/21/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Insertions/deletions (indels) are the second most common type of genomic variant and the most common type of structural variant. Identification of indels in next generation sequencing data is a challenge, and algorithms commonly used for indel detection have not been compared on a research cohort of human subject genomic data. Guidelines for the optimal detection of biologically significant indels are limited. We analyzed three sets of human next generation sequencing data (48 samples of a 200 gene target exon sequencing, 45 samples of whole exome sequencing, and 2 samples of whole genome sequencing) using three algorithms for indel detection (Pindel, Genome Analysis Tool Kit's UnifiedGenotyper and HaplotypeCaller). RESULTS: We observed variation in indel calls across the three algorithms. The intersection of the three tools comprised only 5.70% of targeted exon, 19.52% of whole exome, and 14.25% of whole genome indel calls. The majority of the discordant indels were of lower read depth and likely to be false positives. When software parameters were kept consistent across the three targets, HaplotypeCaller produced the most reliable results. Pindel results did not validate well without adjustments to parameters to account for varied read depth and number of samples per run. Adjustments to Pindel's M (minimum support for event) parameter improved both concordance and validation rates. Pindel was able to identify large deletions that surpassed the length capabilities of the GATK algorithms. CONCLUSIONS: Despite the observed variability in indel identification, we discerned strengths among the individual algorithms on specific data sets. This allowed us to suggest best practices for indel calling. Pindel's low validation rate of indel calls made in targeted exon sequencing suggests that HaplotypeCaller is better suited for short indels and multi-sample runs in targets with very high read depth. Pindel allows for optimization of minimum support for events and is best used for detection of larger indels at lower read depths.
Affiliation(s)
- Dalia H Ghoneim
- Center for Neural Development and Disease, University of Rochester Medical Center, 601 Elmwood Avenue, Rochester, NY, USA.
42
Abo RP, Ducar M, Garcia EP, Thorner AR, Rojas-Rudilla V, Lin L, Sholl LM, Hahn WC, Meyerson M, Lindeman NI, Van Hummelen P, MacConaill LE. BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers. Nucleic Acids Res 2014; 43:e19. [PMID: 25428359 PMCID: PMC4330340 DOI: 10.1093/nar/gku1211] [Citation(s) in RCA: 145] [Impact Index Per Article: 13.2] [Indexed: 11/30/2022] Open
Abstract
Genomic structural variation (SV), a common hallmark of cancer, has important predictive and therapeutic implications. However, accurately detecting SV using high-throughput sequencing data remains challenging, especially for ‘targeted’ resequencing efforts. This is critically important in the clinical setting where targeted resequencing is frequently being applied to rapidly assess clinically actionable mutations in tumor biopsies in a cost-effective manner. We present BreaKmer, a novel approach that uses a ‘kmer’ strategy to assemble misaligned sequence reads for predicting insertions, deletions, inversions, tandem duplications and translocations at base-pair resolution in targeted resequencing data. Variants are predicted by realigning an assembled consensus sequence created from sequence reads that were abnormally aligned to the reference genome. Using targeted resequencing data from tumor specimens with orthogonally validated SV, non-tumor samples and whole-genome sequencing data, BreaKmer had a 97.4% overall sensitivity for known events and predicted 17 positively validated, novel variants. Relative to four publicly available algorithms, BreaKmer detected SV with increased sensitivity and limited calls in non-tumor samples, key features for variant analysis of tumor specimens in both the clinical and research settings.
Affiliation(s)
- Ryan P Abo
- Center for Cancer Genome Discovery and Department of Medical Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02215, USA
- Matthew Ducar
- Center for Cancer Genome Discovery and Department of Medical Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02215, USA
- Elizabeth P Garcia
- Department of Pathology, Brigham and Women's Hospital, Boston, MA 02215, USA
- Aaron R Thorner
- Center for Cancer Genome Discovery and Department of Medical Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02215, USA
- Ling Lin
- Center for Cancer Genome Discovery and Department of Medical Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02215, USA
- Lynette M Sholl
- Department of Pathology, Brigham and Women's Hospital, Boston, MA 02215, USA
- William C Hahn
- Center for Cancer Genome Discovery and Department of Medical Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02215, USA; Broad Institute of Harvard and MIT, Cambridge, MA 02141, USA
- Matthew Meyerson
- Center for Cancer Genome Discovery and Department of Medical Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02215, USA; Department of Pathology, Brigham and Women's Hospital, Boston, MA 02215, USA; Broad Institute of Harvard and MIT, Cambridge, MA 02141, USA
- Neal I Lindeman
- Department of Pathology, Brigham and Women's Hospital, Boston, MA 02215, USA
- Paul Van Hummelen
- Center for Cancer Genome Discovery and Department of Medical Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02215, USA
- Laura E MacConaill
- Center for Cancer Genome Discovery and Department of Medical Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02215, USA; Department of Pathology, Brigham and Women's Hospital, Boston, MA 02215, USA
43
Ward RM, Schmieder R, Highnam G, Mittelman D. Big data challenges and opportunities in high-throughput sequencing. Systems Biomedicine 2014. [DOI: 10.4161/sysb.24470] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 01/10/2023]
44
Yan Y, Yi G, Sun C, Qu L, Yang N. Genome-wide characterization of insertion and deletion variation in chicken using next generation sequencing. PLoS One 2014; 9:e104652. [PMID: 25133774 PMCID: PMC4136736 DOI: 10.1371/journal.pone.0104652] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Received: 04/21/2014] [Accepted: 07/10/2014] [Indexed: 12/30/2022] Open
Abstract
Insertion and deletion (INDEL) is one of the main events contributing to genetic and phenotypic diversity, yet it receives less attention than SNP and large structural variation. To gain a better knowledge of INDEL variation in the chicken genome, we applied next generation sequencing on 12 diverse chicken breeds at an average effective depth of 8.6. Over 1.3 million non-redundant short INDELs (1-49 bp) were obtained, the vast majority (92.48%) of which were novel. Follow-up validation assays confirmed that most (88.00%) of the randomly selected INDELs represent true variations. The majority (95.76%) of INDELs were less than 10 bp. Both the detected number and affected bases were larger for deletions than insertions. In total, INDELs covered 3.8 Mbp, corresponding to 0.36% of the chicken genome. The average genomic INDEL density was estimated as 0.49 per kb. INDELs were ubiquitous and distributed in a non-uniform fashion across chromosomes, with lower INDEL density in micro-chromosomes than in others, and some functional regions like exons and UTRs were prone to fewer INDELs than introns and intergenic regions. Nearly 620,253 INDELs fell in genic regions, 1,765 (0.28%) of which were located in exons, spanning 1,358 (7.56%) unique Ensembl genes. Many of them are associated with economically important traits and some are the homologues of human disease-related genes. We demonstrate that sequencing multiple individuals at a medium depth offers a promising way for reliable identification of INDELs. The coding INDELs are valuable candidates for further elucidation of the association between genotypes and phenotypes. The chicken INDELs revealed by our study can be useful for future studies, including development of INDEL markers, construction of high density linkage map, INDEL array design, and hopefully, molecular breeding programs in chicken.
Affiliation(s)
- Yiyuan Yan
- Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Guoqiang Yi
- Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Congjiao Sun
- Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Lujiang Qu
- Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Ning Yang
- Department of Animal Genetics and Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China
45
Chambers JC, Abbott J, Zhang W, Turro E, Scott WR, Tan ST, Afzal U, Afaq S, Loh M, Lehne B, O'Reilly P, Gaulton KJ, Pearson RD, Li X, Lavery A, Vandrovcova J, Wass MN, Miller K, Sehmi J, Oozageer L, Kooner IK, Al-Hussaini A, Mills R, Grewal J, Panoulas V, Lewin AM, Northwood K, Wander GS, Geoghegan F, Li Y, Wang J, Aitman TJ, McCarthy MI, Scott J, Butcher S, Elliott P, Kooner JS. The South Asian genome. PLoS One 2014; 9:e102645. [PMID: 25115870 PMCID: PMC4130493 DOI: 10.1371/journal.pone.0102645] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Received: 10/09/2013] [Accepted: 06/21/2014] [Indexed: 12/15/2022] Open
Abstract
The genetic sequence variation of people from the Indian subcontinent, who comprise one-quarter of the world's population, is not well described. We carried out whole genome sequencing of 168 South Asians, along with whole-exome sequencing of 147 South Asians to provide deeper characterisation of coding regions. We identify 12,962,155 autosomal sequence variants, including 2,946,861 new SNPs and 312,738 novel indels. This catalogue of SNPs and indels amongst South Asians provides the first comprehensive map of genetic variation in this major human population, and reveals evidence for selective pressures on genes involved in skin biology, metabolism, infection and immunity. Our results will accelerate the search for the genetic variants underlying susceptibility to disorders such as type-2 diabetes and cardiovascular disease, which are highly prevalent amongst South Asians.
Affiliation(s)
- John C. Chambers
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Imperial College Healthcare NHS Trust, London, United Kingdom
- MRC-HPA Centre for Environment and Health, Imperial College London, Norfolk Place, London, United Kingdom
- Ealing Hospital NHS Trust, Southall, Middlesex, United Kingdom
- James Abbott
- Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London, United Kingdom
- Weihua Zhang
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Ealing Hospital NHS Trust, Southall, Middlesex, United Kingdom
- Ernest Turro
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Computational Biology and Statistics, University of Cambridge, Cambridge, United Kingdom
- William R. Scott
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Sian-Tsung Tan
- Ealing Hospital NHS Trust, Southall, Middlesex, United Kingdom
- NHLI, Imperial College London, Hammersmith Hospital, London, United Kingdom
- Uzma Afzal
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Saima Afaq
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Marie Loh
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Benjamin Lehne
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Paul O'Reilly
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Kyle J. Gaulton
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
- Richard D. Pearson
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
- Xinzhong Li
- Institute of Clinical Sciences, Imperial College London, London, United Kingdom
- Royal Brompton and Harefield Hospitals NHS Trust, London, United Kingdom
- Anita Lavery
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Jana Vandrovcova
- MRC Clinical Sciences Centre, Imperial College London, London, United Kingdom
- Mark N. Wass
- Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London, United Kingdom
- Kathryn Miller
- Ealing Hospital NHS Trust, Southall, Middlesex, United Kingdom
- Joban Sehmi
- Ealing Hospital NHS Trust, Southall, Middlesex, United Kingdom
- NHLI, Imperial College London, Hammersmith Hospital, London, United Kingdom
- Abtehale Al-Hussaini
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Rebecca Mills
- Ealing Hospital NHS Trust, Southall, Middlesex, United Kingdom
- Jagvir Grewal
- Ealing Hospital NHS Trust, Southall, Middlesex, United Kingdom
- Alexandra M. Lewin
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Korrinne Northwood
- MRC Clinical Sciences Centre, Imperial College London, London, United Kingdom
- Gurpreet S. Wander
- Hero DMC Heart Institute, Dayanand Medical College and Hospital, Ludhiana, India
- Frank Geoghegan
- Ealing Hospital NHS Trust, Southall, Middlesex, United Kingdom
- Timothy J. Aitman
- MRC Clinical Sciences Centre, Imperial College London, London, United Kingdom
- Mark I. McCarthy
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
- Oxford Centre for Diabetes, Endocrinology & Metabolism, University of Oxford, Churchill Hospital, Oxford, United Kingdom
- Oxford NIHR Biomedical Research Centre, Churchill Hospital, Oxford, United Kingdom
- James Scott
- NHLI, Imperial College London, Hammersmith Hospital, London, United Kingdom
- Sarah Butcher
- Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London, United Kingdom
- Paul Elliott
- Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, United Kingdom
- Imperial College Healthcare NHS Trust, London, United Kingdom
- MRC-HPA Centre for Environment and Health, Imperial College London, Norfolk Place, London, United Kingdom
- Jaspal S. Kooner
- Imperial College Healthcare NHS Trust, London, United Kingdom
- Ealing Hospital NHS Trust, Southall, Middlesex, United Kingdom
- NHLI, Imperial College London, Hammersmith Hospital, London, United Kingdom
46
Van Geystelen A, Wenseleers T, Decorte R, Caspers MJL, Larmuseau MHD. In silico detection of phylogenetic informative Y-chromosomal single nucleotide polymorphisms from whole genome sequencing data. Electrophoresis 2014; 35:3102-10. [DOI: 10.1002/elps.201300459] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 09/17/2013] [Revised: 12/03/2013] [Accepted: 01/07/2014] [Indexed: 12/25/2022]
Affiliation(s)
- Anneleen Van Geystelen
- Laboratory of Forensic Genetics and Molecular Archaeology; UZ Leuven, Leuven, Belgium
- Laboratory of Socioecology and Social Evolution; Department of Biology; KU Leuven, Leuven, Belgium
- Tom Wenseleers
- Laboratory of Socioecology and Social Evolution; Department of Biology; KU Leuven, Leuven, Belgium
- Ronny Decorte
- Laboratory of Forensic Genetics and Molecular Archaeology; UZ Leuven, Leuven, Belgium
- Biomedical Forensic Sciences; Department of Imaging & Pathology; KU Leuven, Leuven, Belgium
- Maarten J. L. Caspers
- Laboratory of Forensic Genetics and Molecular Archaeology; UZ Leuven, Leuven, Belgium
- Laboratory of Biodiversity and Evolutionary Genomics; Department of Biology; KU Leuven, Leuven, Belgium
- Maarten H. D. Larmuseau
- Laboratory of Forensic Genetics and Molecular Archaeology; UZ Leuven, Leuven, Belgium
- Biomedical Forensic Sciences; Department of Imaging & Pathology; KU Leuven, Leuven, Belgium
- Laboratory of Biodiversity and Evolutionary Genomics; Department of Biology; KU Leuven, Leuven, Belgium
47
Cao MD, Tasker E, Willadsen K, Imelfort M, Vishwanathan S, Sureshkumar S, Balasubramanian S, Bodén M. Inferring short tandem repeat variation from paired-end short reads. Nucleic Acids Res 2013; 42:e16. [PMID: 24353318 PMCID: PMC3919575 DOI: 10.1093/nar/gkt1313] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Indexed: 12/31/2022]
Abstract
The advances of high-throughput sequencing offer an unprecedented opportunity to study genetic variation. This is challenged by the difficulty of resolving variant calls in repetitive DNA regions. We present a Bayesian method to estimate repeat-length variation from paired-end sequence read data. The method makes variant calls based on deviations in sequence fragment sizes, allowing the analysis of repeats at lengths of relevance to a range of phenotypes. We demonstrate the method’s ability to detect and quantify changes in repeat lengths from short read genomic sequence data across genotypes. We use the method to estimate repeat variation among 12 strains of Arabidopsis thaliana and demonstrate experimentally that our method compares favourably against existing methods. Using this method, we have identified all repeats across the genome, which are likely to be polymorphic. In addition, our predicted polymorphic repeats also included the only known repeat expansion in A. thaliana, suggesting an ability to discover potential unstable repeats.
Affiliation(s)
- Minh Duc Cao
- School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, St Lucia QLD 4072, Australia, Clayton School of Information Technology, Monash University, Clayton, VIC 3800, Australia, School of Biological Sciences, Monash University, Melbourne, Australia and Advanced Water Management Centre, The University of Queensland, Queensland, Australia
|
48
|
Dorn C, Grunert M, Sperling SR. Application of high-throughput sequencing for studying genomic variations in congenital heart disease. Brief Funct Genomics 2013; 13:51-65. [PMID: 24095982 DOI: 10.1093/bfgp/elt040] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 12/30/2022]
Abstract
Congenital heart diseases (CHD) represent the most common birth defect in humans. The majority of cases are caused by a combination of complex genetic alterations and environmental influences. In the past, many disease-causing mutations have been identified; however, there is still a large proportion of cardiac malformations with unknown precise origin. High-throughput sequencing technologies established during the last years offer novel opportunities to further study the genetic background underlying the disease. In this review, we provide a roadmap for designing and analyzing high-throughput sequencing studies focused on CHD, but also with general applicability to other complex diseases. The three main next-generation sequencing (NGS) platforms including their particular advantages and disadvantages are presented. To identify potentially disease-related genomic variations and genes, different filtering steps and gene prioritization strategies are discussed. In addition, available control datasets based on NGS are summarized. Finally, we provide an overview of current studies already using NGS technologies and showing that these techniques will help to further unravel the complex genetics underlying CHD.
Affiliation(s)
- Cornelia Dorn
- Department of Cardiovascular Genetics, Experimental and Clinical Research Center (ECRC), Charité-University Medicine Berlin and Max Delbrück Center (MDC) for Molecular Medicine, Lindenberger Weg 80, 13125 Berlin, Germany. Department of Biochemistry, Free University Berlin, Berlin, Germany. Tel.: +49-(0)30-450540123; Fax: +49-(0)30-84131699;
|
49
|
Chong Z, Zhai W, Li C, Gao M, Gong Q, Ruan J, Li J, Jiang L, Lv X, Hungate E, Wu CI. The evolution of small insertions and deletions in the coding genes of Drosophila melanogaster. Mol Biol Evol 2013; 30:2699-708. [PMID: 24077769 DOI: 10.1093/molbev/mst167] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/24/2023]
Abstract
Studies of protein evolution have focused on amino acid substitutions with much less systematic analysis of insertions and deletions (indels) in protein coding genes. We hence surveyed 7,500 genes between Drosophila melanogaster and D. simulans, using D. yakuba as an outgroup for this purpose. The evolutionary rate of coding indels is indeed low, at only 3% of that of nonsynonymous substitutions. As coding indels follow a geometric distribution in size and tend to fall in low-complexity regions of proteins, it is unclear whether selection or mutation underlies this low rate. To resolve the issue, we collected genomic sequences from an isogenic African line of D. melanogaster (ZS30) at a high coverage of 70× and analyzed indel polymorphism between ZS30 and the reference genome. In comparing polymorphism and divergence, we found that the divergence to polymorphism ratio (i.e., fixation index) for smaller indels (size ≤ 10 bp) is very similar to that for synonymous changes, suggesting that most of the within-species polymorphism and between-species divergence for indels are selectively neutral. Interestingly, deletions of larger sizes (size ≥ 11 bp and ≤ 30 bp) have a much higher fixation index than synonymous mutations and 44.4% of fixed middle-sized deletions are estimated to be adaptive. To our surprise, this pattern is not found for insertions. Protein indel evolution appears to be in a dynamic flux of neutrally driven expansion (insertions) together with adaptive-driven contraction (deletions), and these observations provide important insights for understanding the fitness of new mutations as well as the evolutionary driving forces for genomic evolution in Drosophila species.
Affiliation(s)
- Zechen Chong
- Center for Computational Biology and Laboratory of Disease Genomics and Individualized Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
|
50
|
Rogan PK, Zou GY. Best Practices for Evaluating Mutation Prediction Methods. Hum Mutat 2013; 34:1581-2. [DOI: 10.1002/humu.22401] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 11/28/2012] [Accepted: 08/13/2013] [Indexed: 01/07/2023]
Affiliation(s)
- Peter K. Rogan
- Departments of Biochemistry, Schulich School of Medicine and Dentistry, Western University, Ontario, Canada
- Department of Computer Science, Faculty of Science, Western University, Ontario, Canada
- Guang Yong Zou
- Department of Epidemiology and Biostatistics, Schulich School of Medicine and Dentistry, Western University, Ontario, Canada
|