1
Xie S, Isaacs K, Becker G, Murdoch BM. A computational framework for improving genetic variants identification from 5,061 sheep sequencing data. J Anim Sci Biotechnol 2023; 14:127. [PMID: 37779189 PMCID: PMC10544426 DOI: 10.1186/s40104-023-00923-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/25/2023] [Accepted: 08/01/2023] [Indexed: 10/03/2023]
Abstract
BACKGROUND Pan-genomics is a recently emerging strategy that can be utilized to provide a more comprehensive characterization of genetic variation. Joint calling is routinely used to combine identified variants across multiple related samples. However, the improvement in variant identification gained from the mutual support information of multiple samples remains quite limited for population-scale genotyping.
RESULTS In this study, we developed a computational framework for joint calling of genetic variants from 5,061 sheep by incorporating sequencing error and optimizing the mutual support information from multiple samples' data. Variants were accurately identified from multiple samples in four steps: (1) probabilities of variants from two widely used algorithms, GATK and Freebayes, were calculated with a Poisson model incorporating base sequencing error potential; (2) variants with high mapping quality, or consistently identified in at least two samples by GATK and Freebayes, were used to construct the raw high-confidence identification (rHID) variants database; (3) the high-confidence variants identified in a single sample were ordered by probability value and controlled by false discovery rate (FDR) using the rHID database; (4) to avoid eliminating potentially true variants from the rHID database, the variants that failed FDR were re-examined to rescue potentially true variants and ensure highly accurate variant identification. The results indicated that the percentage of concordant SNPs and Indels from Freebayes and GATK improved significantly, by 12%-32%, after applying our new method compared with raw variants. The method also found low-frequency variants in individual sheep that are involved in several traits, including nipple number (GPC5), scrapie pathology (PAPSS2), seasonal reproduction and litter size (GRM1), coat color (RAB27A), and lentivirus susceptibility (TMEM154).
CONCLUSION The new method used the computational strategy to reduce the number of false positives, and simultaneously improve the identification of genetic variants. This strategy did not incur any extra cost by using any additional samples or sequencing data information and advantageously identified rare variants which can be important for practical applications of animal breeding.
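Steps (1) and (3) of the abstract above can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the error-only Poisson tail probability, the default error rate of 0.001, the empirical FDR estimate (fraction of ranked calls absent from the rHID set), and the function names `error_only_pvalue` and `fdr_filter`. It is not the authors' implementation.

```python
import math


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def error_only_pvalue(alt_reads, depth, error_rate=0.001):
    """Probability of seeing at least `alt_reads` alternate reads from
    sequencing error alone (assumed model: errors ~ Poisson(depth * error_rate))."""
    return poisson_tail(alt_reads, depth * error_rate)


def fdr_filter(calls, rhid, max_fdr=0.05):
    """Rank calls by p-value and keep the longest prefix whose estimated FDR
    (share of calls not found in the high-confidence set `rhid`) stays within
    `max_fdr`. `calls` is a list of (variant_id, pvalue) pairs."""
    ranked = sorted(calls, key=lambda call: call[1])
    keep = 0
    misses = 0
    for rank, (variant_id, _) in enumerate(ranked, start=1):
        if variant_id not in rhid:
            misses += 1
        if misses / rank <= max_fdr:
            keep = rank
    return [variant_id for variant_id, _ in ranked[:keep]]


# Hypothetical data: three calls supported by the rHID set, one unsupported.
calls = [("a", 1e-9), ("b", 1e-6), ("c", 1e-3), ("d", 0.2)]
kept = fdr_filter(calls, rhid={"a", "b", "c"})
```

Calls dropped by `fdr_filter` would be the candidates for the re-examination described in step (4), which this sketch does not attempt to model.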
Affiliation(s)
- Shangqian Xie
- Department of Animal, Veterinary & Food Sciences, University of Idaho, Moscow, ID, USA
- Gabrielle Becker
- Department of Animal, Veterinary & Food Sciences, University of Idaho, Moscow, ID, USA
- Brenda M Murdoch
- Department of Animal, Veterinary & Food Sciences, University of Idaho, Moscow, ID, USA.
2
Özdemir Özdoğan G, Kaya H. Next-Generation Sequencing Data Analysis on Pool-Seq and Low-Coverage Retinoblastoma Data. Interdiscip Sci 2020; 12:302-310. [PMID: 32519123 DOI: 10.1007/s12539-020-00374-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/07/2019] [Revised: 04/26/2020] [Accepted: 05/22/2020] [Indexed: 12/31/2022]
Abstract
Next-generation sequencing (NGS) refers to massively parallel, or deep, deoxyribonucleic acid (DNA) sequencing technology, which has revolutionized genomic research in recent years. Although the cost of generating NGS data has decreased since the technology first emerged, it can still be a problem. Hence, new strategies such as pool-seq and low-coverage NGS have been developed to overcome the cost problem. Although these strategies reduce cost, it is important to establish whether they are efficient in NGS studies. We applied a bioinformatics pipeline to pool-seq and low-coverage retinoblastoma data obtained from tumor samples only. Retinoblastoma is a childhood eye malignancy that is initiated by RB1 mutation or MYCN amplification and can lead to loss of vision in one or both eyes, and sometimes even loss of life. We applied our pipeline to the retinoblastoma data and to two other data sets, both to test its validity and to compare performance. High-confidence variant calls from the Genome in a Bottle Consortium were used for these purposes. Our pipeline called a higher number of variants than a standard pipeline for all three data sets, and its recall and F-score values were notably better. We further present our results on the disease data in terms of variants, variant types, and disease-related genes. This study provides a guideline for running an NGS data analysis pipeline on pool-seq and low-coverage sequencing data in conjunction. To obtain more conclusive outcomes for these two strategies, we recommend using cancer data with higher mutation rates and larger pools.
Affiliation(s)
- Hilal Kaya
- Department of Computer Engineering, Ankara Yildirim Beyazit University, 06010, Ankara, Turkey.
3
Alzu'bi AA, Zhou L, Watzlaf VJM. Genetic Variations and Precision Medicine. Perspectives in Health Information Management 2019; 16:1a. [PMID: 31019429 PMCID: PMC6462879] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
The time and costs associated with the sequencing of a human genome have decreased significantly in recent years. Many people have chosen to have their genomes sequenced to receive genomics-based personalized healthcare services. To reach the goal of genomics-based precision medicine, health information management (HIM) professionals need to manage and analyze patients' genomic data. Two important pieces of information from the genome sequence are the risk of genetic diseases and the specific medication or pharmacogenomic results for the individual patient, both of which are linked to a patient's genetic variations. In this review article, we introduce genetic variations, including their data types, relevant databases, and some currently available analysis methods and systems. HIM professionals can choose to use these databases, methods, and systems in the management and analysis of patients' genomic data.
Affiliation(s)
- Amal Adel Alzu'bi
- The Department of Computer Information Systems at Jordan University of Science and Technology in Irbid, Jordan
- Leming Zhou
- The Department of Health Information Management at the University of Pittsburgh in Pittsburgh, PA
- Valerie J M Watzlaf
- The Department of Health Information Management at the University of Pittsburgh in Pittsburgh, PA
4
Abstract
Poisson-like behavior for event count data is ubiquitous in nature. At the same time, differencing of such counts arises in the course of data processing in a variety of areas of application. As a result, the Skellam distribution – defined as the distribution of the difference of two independent Poisson random variables – is a natural candidate for approximating the difference of Poisson-like event counts. However, in many contexts strict independence, whether between counts or among events within counts, is not a tenable assumption. Here we characterize the accuracy in approximating the difference of Poisson-like counts by a Skellam random variable. Our results fully generalize existing, more limited, results in this direction and, at the same time, our derivations are significantly more concise and elegant. We illustrate the potential impact of these results in the context of problems from network analysis and image processing, where various forms of weak dependence can be expected.
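As a small illustration of the definition above, the sketch below computes the Skellam probability mass function directly as the distribution of the difference of two independent Poisson counts, by summing over the second count. The function names, the truncation point `n_max`, and the example means 3 and 2 are assumptions chosen for illustration; the sketch covers only the strictly independent case, not the weakly dependent settings the abstract addresses.

```python
import math


def poisson_pmf(n, mu):
    """P(N = n) for N ~ Poisson(mu)."""
    return math.exp(-mu) * mu**n / math.factorial(n)


def skellam_pmf(k, mu1, mu2, n_max=80):
    """P(N1 - N2 = k) for independent N1 ~ Poisson(mu1), N2 ~ Poisson(mu2).

    Computed as sum over j of P(N1 = k + j) * P(N2 = j), truncated at n_max.
    """
    start = max(0, -k)  # k + j must be non-negative
    return sum(
        poisson_pmf(k + j, mu1) * poisson_pmf(j, mu2)
        for j in range(start, n_max)
    )


# Check the known moments: mean = mu1 - mu2, variance = mu1 + mu2.
mu1, mu2 = 3.0, 2.0
support = range(-40, 41)
total = sum(skellam_pmf(k, mu1, mu2) for k in support)
mean = sum(k * skellam_pmf(k, mu1, mu2) for k in support)
variance = sum((k - mean) ** 2 * skellam_pmf(k, mu1, mu2) for k in support)
```

The truncated sums are adequate for small means such as these; larger means would need a wider support and a larger `n_max`.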
5
Gupta P, Reddaiah B, Salava H, Upadhyaya P, Tyagi K, Sarma S, Datta S, Malhotra B, Thomas S, Sunkum A, Devulapalli S, Till BJ, Sreelakshmi Y, Sharma R. Next-generation sequencing (NGS)-based identification of induced mutations in a doubly mutagenized tomato (Solanum lycopersicum) population. The Plant Journal: for Cell and Molecular Biology 2017; 92:495-508. [PMID: 28779536 DOI: 10.1111/tpj.13654] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 05/23/2017] [Revised: 07/25/2017] [Accepted: 07/26/2017] [Indexed: 05/21/2023]
Abstract
The identification of mutations in targeted genes has been significantly simplified by the advent of TILLING (Targeting Induced Local Lesions In Genomes), speeding up the functional genomic analysis of animals and plants. Next-generation sequencing (NGS) is gradually replacing classical TILLING for mutation detection, as it allows the analysis of a large number of amplicons in short durations. The NGS approach was used to identify mutations in a population of Solanum lycopersicum (tomato) that was doubly mutagenized by ethylmethane sulphonate (EMS). Twenty-five genes belonging to carotenoids and folate metabolism were PCR-amplified and screened to identify potentially beneficial alleles. To augment efficiency, the 600-bp amplicons were directly sequenced in a non-overlapping manner in Illumina MiSeq, obviating the need for a fragmentation step before library preparation. A comparison of the different pooling depths revealed that heterozygous mutations could be identified up to 128-fold pooling. An evaluation of six different software programs (camba, crisp, gatk unified genotyper, lofreq, snver and vipr) revealed that no software program was robust enough to predict mutations with high fidelity. Among these, crisp and camba predicted mutations with lower false discovery rates. The false positives were largely eliminated by considering only mutations commonly predicted by two different software programs. The screening of 23.47 Mb of tomato genome yielded 75 predicted mutations, 64 of which were confirmed by Sanger sequencing with an average mutation density of 1/367 Kb. Our results indicate that NGS combined with multiple variant detection tools can reduce false positives and significantly speed up the mutation discovery rate.
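The two-caller consensus filter and the summary figures quoted in the abstract above can be sketched in a few lines of Python. The variant tuples and the helper name `consensus` are hypothetical and serve only as illustration; the screened length (23.47 Mb) and the counts (75 predicted, 64 confirmed) are taken from the abstract.

```python
def consensus(calls_a, calls_b):
    """Keep only the variants reported by both callers."""
    return set(calls_a) & set(calls_b)


# Hypothetical calls, keyed as (gene, position, ref, alt).
caller_1 = [("geneA", 120, "G", "A"), ("geneA", 455, "C", "T"), ("geneB", 77, "G", "A")]
caller_2 = [("geneA", 120, "G", "A"), ("geneB", 77, "G", "A"), ("geneB", 301, "C", "T")]
shared = consensus(caller_1, caller_2)

# Figures from the abstract.
screened_kb = 23.47 * 1000   # 23.47 Mb screened, expressed in kb
predicted = 75               # mutations predicted
confirmed = 64               # mutations confirmed by Sanger sequencing

confirmation_rate = confirmed / predicted   # about 0.853
density_kb = screened_kb / confirmed        # about 367 kb per confirmed mutation
```

The computed density of roughly one mutation per 367 kb agrees with the figure reported in the abstract, which indicates that the density is based on the 64 confirmed mutations rather than the 75 predicted ones.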
Affiliation(s)
- Prateek Gupta
- Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Bodanapu Reddaiah
- Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Hymavathi Salava
- Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Pallawi Upadhyaya
- Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Kamal Tyagi
- Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Supriya Sarma
- Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Sneha Datta
- Plant Breeding and Genetics Laboratory, IAEA Seibersdorf Laboratories, Reaktorstrasse 1, Seibersdorf, Austria
- Bharti Malhotra
- Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Sherinmol Thomas
- Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Anusha Sunkum
- Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Sameera Devulapalli
- Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Bradley John Till
- Plant Breeding and Genetics Laboratory, IAEA Seibersdorf Laboratories, Reaktorstrasse 1, Seibersdorf, Austria
- Yellamaraju Sreelakshmi
- Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
- Rameshwar Sharma
- Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
6
Kugelman JR, Wiley MR, Nagle ER, Reyes D, Pfeffer BP, Kuhn JH, Sanchez-Lockhart M, Palacios GF. Error baseline rates of five sample preparation methods used to characterize RNA virus populations. PLoS One 2017; 12:e0171333. [PMID: 28182717 PMCID: PMC5300104 DOI: 10.1371/journal.pone.0171333] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 08/05/2016] [Accepted: 01/18/2017] [Indexed: 11/19/2022]
Abstract
Individual RNA viruses typically occur as populations of genomes that differ slightly from each other due to mutations introduced by the error-prone viral polymerase. Understanding the variability of RNA virus genome populations is critical for understanding virus evolution because individual mutant genomes may gain evolutionary selective advantages and give rise to dominant subpopulations, possibly even leading to the emergence of viruses resistant to medical countermeasures. Reverse transcription of virus genome populations followed by next-generation sequencing is the only available method to characterize variation for RNA viruses. However, both steps may lead to the introduction of artificial mutations, thereby skewing the data. To better understand how such errors are introduced during sample preparation, we determined and compared error baseline rates of five different sample preparation methods by analyzing in vitro transcribed Ebola virus RNA from an artificial plasmid-based system. These methods included: shotgun sequencing from plasmid DNA or in vitro transcribed RNA as a basic “no amplification” method, amplicon sequencing from the plasmid DNA or in vitro transcribed RNA as a “targeted” amplification method, sequence-independent single-primer amplification (SISPA) as a “random” amplification method, rolling circle reverse transcription sequencing (CirSeq) as an advanced “no amplification” method, and Illumina TruSeq RNA Access as a “targeted” enrichment method. The measured error frequencies indicate that RNA Access offers the best tradeoff between sensitivity and sample preparation error (1.4 × 10⁻⁵) of all compared methods.
Affiliation(s)
- Jeffrey R. Kugelman
- Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), Fort Detrick, Frederick, Maryland, United States of America
- Michael R. Wiley
- Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), Fort Detrick, Frederick, Maryland, United States of America
- Elyse R. Nagle
- Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), Fort Detrick, Frederick, Maryland, United States of America
- Daniel Reyes
- Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), Fort Detrick, Frederick, Maryland, United States of America
- Brad P. Pfeffer
- Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), Fort Detrick, Frederick, Maryland, United States of America
- Jens H. Kuhn
- Integrated Research Facility at Fort Detrick (IRF-Frederick), National Institute of Allergy and Infectious Diseases, National Institutes of Health, Fort Detrick, Frederick, Maryland, United States of America
- Mariano Sanchez-Lockhart
- Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), Fort Detrick, Frederick, Maryland, United States of America
- Gustavo F. Palacios
- Center for Genome Sciences, United States Army Medical Research Institute of Infectious Diseases (USAMRIID), Fort Detrick, Frederick, Maryland, United States of America
7
Artigas MS, Wain LV, Shrine N, McKeever TM, UK BiLEVE, Sayers I, Hall IP, Tobin MD. Targeted Sequencing of Lung Function Loci in Chronic Obstructive Pulmonary Disease Cases and Controls. PLoS One 2017; 12:e0170222. [PMID: 28114305 PMCID: PMC5256917 DOI: 10.1371/journal.pone.0170222] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 07/03/2016] [Accepted: 01/01/2017] [Indexed: 12/15/2022]
Abstract
Chronic obstructive pulmonary disease (COPD) is the third leading cause of death worldwide; smoking is the main risk factor for COPD, but genetic factors are also relevant contributors. Genome-wide association studies (GWAS) of the lung function measures used in the diagnosis of COPD have identified a number of loci; however, association signals are often broad, and collectively these loci explain only a small proportion of the heritability. In order to examine the association with COPD risk of genetic variants down to low allele frequencies, to aid fine-mapping of association signals and to explain more of the missing heritability, we undertook a targeted sequencing study in 300 COPD cases and 300 smoking controls for 26 loci previously reported to be associated with lung function. We used a pooled sequencing approach, with 12 pools of 25 individuals each, enabling high depth (30x) coverage per sample to be achieved. This pooled design maximised sample size and therefore power, but led to challenges during variant calling, since sequencing error rates and minor allele frequencies for rare variants can be very similar. For this reason we employed a rigorous quality control pipeline for variant detection, which included the use of 3 independent calling algorithms. In order to avoid false positive associations, we also developed tests to detect variants with potential batch effects and removed them before undertaking association testing. We tested for the effects of single variants and the combined effect of rare variants within a locus. We followed up the top signals with data available (only 67% of collapsing methods signals) in 4,249 COPD cases and 11,916 smoking controls from UK Biobank. We provide suggestive evidence for the combined effect of rare variants on COPD risk in TNXB and in sliding windows within MECOM and upstream of HHIP. These findings can lead to an improved understanding of the molecular pathways involved in the development of COPD.
Affiliation(s)
- María Soler Artigas
- Genetic Epidemiology Group, Department of Health Sciences, University of Leicester, Leicester, United Kingdom
- Louise V. Wain
- Genetic Epidemiology Group, Department of Health Sciences, University of Leicester, Leicester, United Kingdom
- National Institute for Health Research (NIHR), Leicester Respiratory Biomedical Research Unit, Glenfield Hospital, Leicester, United Kingdom
- Nick Shrine
- Genetic Epidemiology Group, Department of Health Sciences, University of Leicester, Leicester, United Kingdom
- Tricia M. McKeever
- Division of Respiratory Medicine, Queen’s Medical Centre, University of Nottingham, Nottingham, United Kingdom
- Ian Sayers
- Division of Respiratory Medicine, Queen’s Medical Centre, University of Nottingham, Nottingham, United Kingdom
- Ian P. Hall
- Division of Respiratory Medicine, Queen’s Medical Centre, University of Nottingham, Nottingham, United Kingdom
- Martin D. Tobin
- Genetic Epidemiology Group, Department of Health Sciences, University of Leicester, Leicester, United Kingdom
- National Institute for Health Research (NIHR), Leicester Respiratory Biomedical Research Unit, Glenfield Hospital, Leicester, United Kingdom
8
Duitama J, Kafuri L, Tello D, Leiva AM, Hofinger B, Datta S, Lentini Z, Aranzales E, Till B, Ceballos H. Deep Assessment of Genomic Diversity in Cassava for Herbicide Tolerance and Starch Biosynthesis. Comput Struct Biotechnol J 2017; 15:185-194. [PMID: 28179981 PMCID: PMC5295625 DOI: 10.1016/j.csbj.2017.01.002] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 10/31/2016] [Revised: 12/26/2016] [Accepted: 01/10/2017] [Indexed: 12/16/2022]
Abstract
Cassava is one of the most important food security crops in tropical countries, and a competitive resource for the starch, food, feed and ethanol industries. However, genomics research in this crop is much less developed than in other economically important crops such as rice or maize. The International Center for Tropical Agriculture (CIAT) maintains the largest cassava germplasm collection in the world. Unfortunately, the genetic potential of this diversity for breeding programs remains underexploited due to the difficulties in phenotypic screening and the lack of deep genomic information about the different accessions. A chromosome-level assembly of the cassava reference genome was released this year, and only a handful of studies have been conducted, mainly to find quantitative trait loci (QTL) in breeding populations with limited variability. This work presents the results of pooled targeted resequencing of more than 1500 cassava accessions from the CIAT germplasm collection to obtain a dataset of more than 2000 variants within genes related to starch functional properties and herbicide tolerance. Results of twelve bioinformatic pipelines for variant detection in pooled samples were compared to ensure the quality of the variant calling process. Predictions of functional impact were performed using two separate methods to prioritize interesting variation for genotyping and cultivar selection. Targeted resequencing, either by pooled samples or by similar approaches such as Ecotilling or capture, emerges as a cost-effective alternative to whole genome sequencing for identifying interesting alleles of genes related to relevant traits within large germplasm collections.
Affiliation(s)
- Jorge Duitama
- Agrobiodiversity Research Area, International Center for Tropical Agriculture (CIAT), Cali, Colombia
- Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia
- Lina Kafuri
- Plant Breeding and Genetics Laboratory, Joint FAO/IAEA Division, International Atomic Energy Agency, Seibersdorf, Austria
- Department of Biological Sciences, School of Natural Sciences, Universidad Icesi, Cali, Colombia
- Daniel Tello
- Plant Breeding and Genetics Laboratory, Joint FAO/IAEA Division, International Atomic Energy Agency, Seibersdorf, Austria
- Department of Biological Sciences, School of Natural Sciences, Universidad Icesi, Cali, Colombia
- Ana María Leiva
- Agrobiodiversity Research Area, International Center for Tropical Agriculture (CIAT), Cali, Colombia
- Bernhard Hofinger
- Plant Breeding and Genetics Laboratory, Joint FAO/IAEA Division, International Atomic Energy Agency, Seibersdorf, Austria
- Sneha Datta
- Plant Breeding and Genetics Laboratory, Joint FAO/IAEA Division, International Atomic Energy Agency, Seibersdorf, Austria
- Zaida Lentini
- Department of Biological Sciences, School of Natural Sciences, Universidad Icesi, Cali, Colombia
- Ericson Aranzales
- Agrobiodiversity Research Area, International Center for Tropical Agriculture (CIAT), Cali, Colombia
- Bradley Till
- Plant Breeding and Genetics Laboratory, Joint FAO/IAEA Division, International Atomic Energy Agency, Seibersdorf, Austria
- Hernán Ceballos
- Agrobiodiversity Research Area, International Center for Tropical Agriculture (CIAT), Cali, Colombia
9
Functional Impact of An ADHD-Associated DIRAS2 Promoter Polymorphism. Neuropsychopharmacology 2016; 41:3025-3031. [PMID: 27364329 PMCID: PMC5101550 DOI: 10.1038/npp.2016.113] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 04/21/2016] [Revised: 06/20/2016] [Accepted: 06/24/2016] [Indexed: 12/17/2022]
Abstract
The DIRAS2 gene encodes a small Ras GTPase of so far unknown function. In a previous study, we described the association with attention-deficit/hyperactivity disorder (ADHD) of DIRAS2 rs1412005, as well as of a haplotype that contains this polymorphism and is located in the promoter region of the gene. In the present study, we searched for rare variants within or near the DIRAS2 gene that might be associated with ADHD using next-generation sequencing. As we were not able to detect any rare variants associated with the disease, we sought to establish a functional role of DIRAS2 rs1412005 on the molecular or systems level. First, we investigated whether it has an influence on gene expression by means of a luciferase-based promoter assay. We could demonstrate that the minor risk allele is associated with increased expression of the reporter gene. Next, we aimed to identify differences in response inhibition between risk-allele and non-risk-allele carriers in children suffering from ADHD and healthy control individuals by analyzing event-related potentials in the electroencephalogram during a Go/NoGo task. Risk-allele carriers showed a changed NoGo anteriorization. Therefore, our results suggest an impact of the investigated polymorphism on prefrontal response control in children with ADHD. These results imply that the promoter polymorphism is not only the associated variant but also itself a causal variant. Further research is thus warranted to clarify the mechanisms linking DIRAS2 to ADHD.
10
Greenwood JM, Ezquerra AL, Behrens S, Branca A, Mallet L. Current analysis of host–parasite interactions with a focus on next generation sequencing data. Zoology 2016; 119:298-306. [DOI: 10.1016/j.zool.2016.06.010] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 12/07/2015] [Revised: 06/22/2016] [Accepted: 06/22/2016] [Indexed: 01/21/2023]
11
Gao L, Li L, Ye J, Zhu X, Shen N, Zhang X, Wang D, Gao Y, Lin H, Wang Y, Liu Y. Identification of a novel mutation in PLA2G6 gene in a Chinese pedigree with familial cortical myoclonic tremor with epilepsy. Seizure 2016; 41:81-5. [PMID: 27513994 DOI: 10.1016/j.seizure.2016.07.013] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 02/25/2016] [Revised: 07/21/2016] [Accepted: 07/21/2016] [Indexed: 11/26/2022]
Abstract
PURPOSE Familial cortical myoclonic tremor with epilepsy (FCMTE) is an epileptic syndrome with autosomal dominant inheritance, of which four genetic subtypes (FCMTE1-4) have been reported. In the present study, we describe the clinical and neurophysiologic features of a newly diagnosed Chinese FCMTE family and investigate the genetic cause of the disease.
METHODS Clinical information was obtained from affected and normal individuals of an FCMTE family comprising 41 members. Electroencephalograms were analyzed in five of six affected members (including the proband). Brain magnetic resonance imaging, somatosensory evoked potential with C-reflex analysis, and magnetoencephalography were performed in the proband. Genomic DNA of three affected and two unaffected individuals was analyzed by whole-exome sequencing to detect genetic mutations.
RESULTS The inheritance pattern of the pedigree was autosomal dominant. A novel missense mutation, c.475C>T (p.Ala159Thr), of PLA2G6 was identified in this family. The mutated locus is highly conserved among other species. The mutation is predicted to have a functional impact and completely co-segregated with the phenotype.
CONCLUSION This study identifies a novel PLA2G6 mutation that is the possible genetic cause of FCMTE in this family. This mutation and the associated clinical features expand the spectrum and phenotypes of PLA2G6-related disorders, including neurodegenerative diseases.
Affiliation(s)
- Lehong Gao
- Department of Neurology, Xuanwu Hospital, Capital Medical University, The Beijing Key Laboratory of Neuromodulation, Beijing 100053, China
- Liping Li
- Department of Neurology, Xuanwu Hospital, Capital Medical University, The Beijing Key Laboratory of Neuromodulation, Beijing 100053, China
- Jing Ye
- Department of Neurology, Xuanwu Hospital, Capital Medical University, The Beijing Key Laboratory of Neuromodulation, Beijing 100053, China
- Xilin Zhu
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing 100101, China
- Ning Shen
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing 100101, China
- Xiating Zhang
- Department of Neurology, Xuanwu Hospital, Capital Medical University, The Beijing Key Laboratory of Neuromodulation, Beijing 100053, China
- Dequan Wang
- Department of Neurology, Xuanwu Hospital, Capital Medical University, The Beijing Key Laboratory of Neuromodulation, Beijing 100053, China
- Yu Gao
- Department of Neurology, Xuanwu Hospital, Capital Medical University, The Beijing Key Laboratory of Neuromodulation, Beijing 100053, China
- Hua Lin
- Department of Neurology, Xuanwu Hospital, Capital Medical University, The Beijing Key Laboratory of Neuromodulation, Beijing 100053, China
- Yuping Wang
- Department of Neurology, Xuanwu Hospital, Capital Medical University, The Beijing Key Laboratory of Neuromodulation, Beijing 100053, China.
- Ying Liu
- State Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing 100101, China.
12
Abstract
Background High-throughput sequencing is a cost-effective method for identifying genetic variation, and it is currently in use on a large scale across the field of biology, including ecology and population genetics. Correctly identifying variable sites and allele frequencies from sequencing data remains challenging, in large part due to artifacts and biases inherent in the sequencing process. Selecting variants that are diagnostic is commonly done using diversity statistics like FST, but these measures are not ideal for the task.
Results Here, we develop a method that directly calculates the expected amount of information gained from observing each variant site. We then develop and implement a conservative estimator that takes into account the uncertainty introduced by sampling bias and sequencing error. This estimator is applied to simulated and real sequencing data, and we discuss how it performs compared to the commonly used existing methods for identifying diagnostic polymorphisms.
Conclusion The expected information content gives an easy-to-interpret measure of the usefulness of variant sites. The results show that we achieve a clear separation between true variants and noise, allowing us to select candidate sites with a high degree of confidence.
13
Shearer A, Eppsteiner R, Booth K, Ephraim S, Gurrola J, Simpson A, Black-Ziegelbein E, Joshi S, Ravi H, Giuffre A, Happe S, Hildebrand M, Azaiez H, Bayazit Y, Erdal M, Lopez-Escamez J, Gazquez I, Tamayo M, Gelvez N, Leal G, Jalas C, Ekstein J, Yang T, Usami SI, Kahrizi K, Bazazzadegan N, Najmabadi H, Scheetz T, Braun T, Casavant T, LeProust E, Smith R. Utilizing ethnic-specific differences in minor allele frequency to recategorize reported pathogenic deafness variants. Am J Hum Genet 2014; 95:445-53. [PMID: 25262649 DOI: 10.1016/j.ajhg.2014.09.001] [Citation(s) in RCA: 120] [Impact Index Per Article: 10.9] [Received: 06/28/2014] [Accepted: 09/08/2014] [Indexed: 10/24/2022] Open
Abstract
Ethnic-specific differences in minor allele frequency impact variant categorization for genetic screening of nonsyndromic hearing loss (NSHL) and other genetic disorders. We sought to evaluate all previously reported pathogenic NSHL variants in the context of a large number of controls from ethnically distinct populations sequenced with orthogonal massively parallel sequencing methods. We used HGMD, ClinVar, and dbSNP to generate a comprehensive list of reported pathogenic NSHL variants and re-evaluated these variants in the context of 8,595 individuals from 12 populations and 6 ethnically distinct major human evolutionary phylogenetic groups from three sources (Exome Variant Server, 1000 Genomes project, and a control set of individuals created for this study, the OtoDB). Of the 2,197 reported pathogenic deafness variants, 325 (14.8%) were present in at least one of the 8,595 controls, indicating a minor allele frequency (MAF) > 0.00006. MAFs ranged as high as 0.72, a level incompatible with pathogenicity for a fully penetrant disease like NSHL. Based on these data, we established MAF thresholds of 0.005 for autosomal-recessive variants (excluding specific variants in GJB2) and 0.0005 for autosomal-dominant variants. Using these thresholds, we recategorized 93 (4.2%) of reported pathogenic variants as benign. Our data show that evaluation of reported pathogenic deafness variants using variant MAFs from multiple distinct ethnicities and sequenced by orthogonal methods provides a powerful filter for determining pathogenicity. The proposed MAF thresholds will facilitate clinical interpretation of variants identified in genetic testing for NSHL. All data are publicly available to facilitate interpretation of genetic variants causing deafness.
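The frequency figures in this abstract follow from simple allele counting. The sketch below is a minimal illustration, not the authors' pipeline: it assumes diploid autosomal genotypes with every control genotyped at the site, and the names `min_observable_maf` and `exceeds_maf_threshold` are hypothetical. The two thresholds are the ones stated in the abstract; the GJB2 exceptions it mentions are not modelled.

```python
# MAF thresholds quoted in the abstract, keyed by inheritance mode.
MAF_THRESHOLDS = {"recessive": 0.005, "dominant": 0.0005}


def min_observable_maf(n_individuals: int, ploidy: int = 2) -> float:
    """Smallest non-zero allele frequency: one allele copy among all chromosomes."""
    return 1 / (ploidy * n_individuals)


def exceeds_maf_threshold(maf: float, inheritance: str) -> bool:
    """True if the variant is more common than the threshold for its inheritance mode."""
    return maf > MAF_THRESHOLDS[inheritance]


# One copy among 8,595 diploid controls is 1 allele in 17,190 chromosomes,
# which is roughly the 0.00006 lower bound cited in the abstract.
floor = min_observable_maf(8595)
print(f"{floor:.7f}")

# The highest MAF reported (0.72) is far above the recessive threshold,
# while a variant seen only once stays below the dominant threshold.
print(exceeds_maf_threshold(0.72, "recessive"))
print(exceeds_maf_threshold(floor, "dominant"))
```

Under these assumptions, a variant observed a single time in the control set cannot be recategorized on frequency alone; only variants seen repeatedly can cross either threshold.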
14
Sequencing pools of individuals — mining genome-wide polymorphism data without big funding. Nat Rev Genet 2014; 15:749-63. [DOI: 10.1038/nrg3803] [Citation(s) in RCA: 512] [Impact Index Per Article: 46.5] [Indexed: 12/16/2022]
15
Zhao Z, Wang W, Wei Z. An empirical Bayes testing procedure for detecting variants in analysis of next generation sequencing data. Ann Appl Stat 2013. [DOI: 10.1214/13-aoas660] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
16
Konczal M, Koteja P, Stuglik MT, Radwan J, Babik W. Accuracy of allele frequency estimation using pooled RNA-Seq. Mol Ecol Resour 2013; 14:381-92. [PMID: 24119300 DOI: 10.1111/1755-0998.12186] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Received: 06/21/2013] [Revised: 09/30/2013] [Accepted: 10/06/2013] [Indexed: 11/28/2022]
Abstract
For nonmodel organisms, genome-wide information that describes functionally relevant variation may be obtained by RNA-Seq following de novo transcriptome assembly. While sequencing has become relatively inexpensive, the preparation of a large number of sequencing libraries remains prohibitively expensive for population genetic analyses of nonmodel species. Pooling samples may then be an attractive alternative. To test whether pooled RNA-Seq accurately predicts true allele frequencies, we analysed the liver transcriptomes of 10 bank voles. Each sample was sequenced both as an individually barcoded library and as part of a pool. Equal amounts of total RNA from each vole were pooled prior to mRNA selection and library construction. Reads were mapped onto the de novo assembled reference transcriptome. High-quality genotypes for individual voles, determined for 23,682 SNPs, provided information on 'true' allele frequencies; allele frequencies estimated from the pool were then compared with these values. 'True' frequencies and those estimated from the pool were highly correlated. Mean relative estimation error was 21% and did not depend on expression level. However, we also observed a minor effect of interindividual variation in gene expression and allele-specific gene expression on allele frequency estimation accuracy. Moreover, we observed a strong negative relationship between minor allele frequency and relative estimation error. Our results indicate that pooled RNA-Seq exhibits accuracy comparable with pooled genome resequencing, but variation in expression level between individuals should be assessed and accounted for. This should help in taking into account the difference in accuracy between conservatively expressed transcripts and those that are variable in expression level.
Affiliation(s)
- M Konczal
- Institute of Environmental Sciences, Jagiellonian University, Gronostajowa 7, 30-387, Kraków, Poland
17
Ferretti L, Ramos-Onsins SE, Pérez-Enciso M. Population genomics from pool sequencing. Mol Ecol 2013; 22:5561-76. [DOI: 10.1111/mec.12522] [Citation(s) in RCA: 109] [Impact Index Per Article: 9.1] [Received: 11/30/2011] [Revised: 08/03/2013] [Accepted: 09/06/2013] [Indexed: 11/30/2022]
Affiliation(s)
- Luca Ferretti
- Center for Research in Agricultural Genomics (CRAG); UAB 08193 Bellaterra Spain
- Miguel Pérez-Enciso
- Center for Research in Agricultural Genomics (CRAG); UAB 08193 Bellaterra Spain
- Department of Animal Science and Food; Faculty of Veterinary; Universitat Autonoma de Barcelona; 08193 Bellaterra Spain
- Institut Català de Recerca i Estudis Avancats (ICREA); Passeig Lluís Companys 23 08010 Barcelona Spain
18
Yu X, Sun S. Comparing a few SNP calling algorithms using low-coverage sequencing data. BMC Bioinformatics 2013; 14:274. [PMID: 24044377 PMCID: PMC3848615 DOI: 10.1186/1471-2105-14-274] [Citation(s) in RCA: 84] [Impact Index Per Article: 7.0] [Received: 05/10/2013] [Accepted: 09/12/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Many Single Nucleotide Polymorphism (SNP) calling programs have been developed to identify Single Nucleotide Variations (SNVs) in next-generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNV identification, especially in single-sample data. Moreover, commonly used SNP calling programs usually include several metrics in their output files for each potential SNP. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validation. RESULTS To explore solutions to the above challenges, we compare the performance of four SNP calling algorithms (SOAPsnp, Atlas-SNP2, SAMtools, and GATK) in a low-coverage single-sample sequencing dataset. Without any post-output filtering, SOAPsnp calls more SNVs than the other programs since it has fewer internal filtering criteria. Atlas-SNP2 has stringent internal filtering criteria; thus it reports the fewest SNVs. The numbers of SNVs called by GATK and SAMtools fall between those of SOAPsnp and Atlas-SNP2. Moreover, we explore the values of key metrics related to SNVs' quality in each algorithm and use them as post-output filtering criteria to filter out low-quality SNVs. Under different coverage cutoff values, we compare the four algorithms and calculate the empirical positive calling rate and sensitivity. Our results show that: 1) the overall agreement of the four calling algorithms is low, especially in non-dbSNPs; 2) the agreement of the four algorithms is similar when using different coverage cutoffs, except that the non-dbSNPs agreement level tends to increase slightly with increasing coverage; 3) SOAPsnp, SAMtools, and GATK have a higher empirical calling rate for dbSNPs compared to non-dbSNPs; and 4) overall, GATK and Atlas-SNP2 have a relatively higher positive calling rate and sensitivity, but GATK calls more SNVs.
CONCLUSIONS Our results show that the agreement between different calling algorithms is relatively low. Thus, more caution should be used in choosing algorithms, setting filtering parameters, and designing validation studies. For reliable SNV calling results, we recommend that users employ more than one algorithm and use metrics related to calling quality and coverage as filtering criteria.
Affiliation(s)
- Xiaoqing Yu
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44106, USA.
19
Morata J, Béjar S, Talavera D, Riera C, Lois S, de Xaxars GM, de la Cruz X. The relationship between gene isoform multiplicity, number of exons and protein divergence. PLoS One 2013; 8:e72742. [PMID: 24023641 PMCID: PMC3758341 DOI: 10.1371/journal.pone.0072742] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 02/27/2013] [Accepted: 07/14/2013] [Indexed: 11/18/2022] Open
Abstract
At present we know that phenotypic differences between organisms arise from a variety of sources, like protein sequence divergence, regulatory sequence divergence, alternative splicing, etc. However, we do not yet have a complete view of how these sources are related. Here we address this problem, studying the relationship between protein divergence and the ability of genes to express multiple isoforms. We used three genome-wide datasets of human-mouse orthologs to study the relationship between isoform multiplicity co-occurrence between orthologs (the fact that two orthologs have more than one isoform) and protein divergence. In all cases our results showed that there was a monotonic dependence between these two properties. We could explain this relationship in terms of a more fundamental one, between exon number of the largest isoform and protein divergence. We found that this last relationship was present, although with variations, in other species (chimpanzee, cow, rat, chicken, zebrafish and fruit fly). In summary, we have identified a relationship between protein divergence and isoform multiplicity co-occurrence and explained its origin in terms of a simple gene-level property. Finally, we discuss the biological implications of these findings for our understanding of inter-species phenotypic differences.
Affiliation(s)
- Jordi Morata
- Department of Structural Biology, Institut de Biologia Molecular de Barcelona (IBMB)-Consejo Superior de Investigaciones Científicas (CSIC), Barcelona, Spain
- Santi Béjar
- Department of Structural Biology, Institut de Biologia Molecular de Barcelona (IBMB)-Consejo Superior de Investigaciones Científicas (CSIC), Barcelona, Spain
- David Talavera
- Faculty of Life Sciences, Manchester University, Manchester, United Kingdom
- Casandra Riera
- Laboratory of Translational Bioinformatics in Neuroscience, Vall d'Hebron Institute of Research (VHIR), Barcelona, Spain
- Sergio Lois
- Laboratory of Translational Bioinformatics in Neuroscience, Vall d'Hebron Institute of Research (VHIR), Barcelona, Spain
- Gemma Mas de Xaxars
- Laboratori de Botànica, Facultat de Farmàcia, Universitat de Barcelona, Barcelona, Spain
- Xavier de la Cruz
- Department of Structural Biology, Institut de Biologia Molecular de Barcelona (IBMB)-Consejo Superior de Investigaciones Científicas (CSIC), Barcelona, Spain
- Laboratory of Translational Bioinformatics in Neuroscience, Vall d'Hebron Institute of Research (VHIR), Barcelona, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
20
Miyagawa M, Naito T, Nishio SY, Kamatani N, Usami SI. Targeted exon sequencing successfully discovers rare causative genes and clarifies the molecular epidemiology of Japanese deafness patients. PLoS One 2013; 8:e71381. [PMID: 23967202 PMCID: PMC3742761 DOI: 10.1371/journal.pone.0071381] [Citation(s) in RCA: 80] [Impact Index Per Article: 6.7] [Received: 11/26/2012] [Accepted: 06/30/2013] [Indexed: 11/19/2022] Open
Abstract
Target exon resequencing using Massively Parallel DNA Sequencing (MPS) is a powerful new strategy to discover causative genes in rare Mendelian disorders such as deafness. We attempted to identify genomic variations responsible for deafness by massive sequencing of the exons of 112 target candidate genes. By the analysis of 216 randomly selected Japanese deafness patients (120 early-onset and 96 late-detected), who had already been evaluated for common genes/mutations by Invader assay and of which 48 had already been diagnosed, we efficiently identified causative mutations and/or mutation candidates in 57 genes. Approximately 86.6% (187/216) of the patients had at least one mutation. Of the 187 patients, in 69 the etiology of the hearing loss was completely explained. To determine which genes have the greatest impact on deafness etiology, the number of mutations was counted, showing that those in GJB2 were by far the most frequent, followed by mutations in SLC26A4, USH2A, GPR98, MYO15A, COL4A5 and CDH23. The present data suggest that targeted exon sequencing of selected genes using MPS technology, followed by an appropriate filtering algorithm, will be able to identify rare responsible genes, including new candidate genes, for individual patients with deafness, and improve molecular diagnosis. In addition, using a large number of patients, the present study clarified the molecular epidemiology of deafness in Japanese patients. GJB2 is the most prevalent causative gene, and the major (commonly found) gene mutations cause 30–40% of deafness, while the remainder of hearing loss is the result of various rare genes/mutations that have been difficult to diagnose by the conventional one-by-one approach. In conclusion, target exon resequencing using MPS technology is a suitable method to discover common and rare causative genes for a highly heterogeneous monogenic disease like hearing loss.
Affiliation(s)
- Maiko Miyagawa
- Department of Otorhinolaryngology, Shinshu University School of Medicine, Asahi, Matsumoto, Japan
- Takehiko Naito
- Department of Otorhinolaryngology, Shinshu University School of Medicine, Asahi, Matsumoto, Japan
- Shin-ya Nishio
- Department of Otorhinolaryngology, Shinshu University School of Medicine, Asahi, Matsumoto, Japan
- Shin-ichi Usami
- Department of Otorhinolaryngology, Shinshu University School of Medicine, Asahi, Matsumoto, Japan
21
Quast C, Cuboni S, Bader D, Altmann A, Weber P, Arloth J, Röh S, Brückl T, Ising M, Kopczak A, Erhardt A, Hausch F, Lucae S, Binder EB. Functional coding variants in SLC6A15, a possible risk gene for major depression. PLoS One 2013; 8:e68645. [PMID: 23874702 PMCID: PMC3712998 DOI: 10.1371/journal.pone.0068645] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 02/08/2013] [Accepted: 05/30/2013] [Indexed: 11/18/2022] Open
Abstract
SLC6A15 is a neuron-specific neutral amino acid transporter that belongs to the solute carrier 6 gene family. This gene family is responsible for presynaptic re-uptake of the majority of neurotransmitters. Convergent data from human studies, animal models and pharmacological investigations suggest a possible role of SLC6A15 in major depressive disorder. In this work, we explored potential functional variants in this gene that could influence the activity of the amino acid transporter and thus downstream neuronal function and possibly the risk for stress-related psychiatric disorders. DNA from 400 depressed patients and 400 controls was screened for genetic variants using a pooled targeted re-sequencing approach. Results were verified by individual re-genotyping, and validated non-synonymous coding variants were tested in an independent sample (N = 1934). Nine variants altering the amino acid sequence were then assessed for their functional effects by measuring SLC6A15 transporter activity in a cellular uptake assay. In total, we identified 405 genetic variants, including twelve non-synonymous variants. While none of the non-synonymous coding variants showed significant differences in case-control associations, two rare non-synonymous variants were associated with a significantly increased maximal ³H-proline uptake as compared to the wildtype sequence. Our data suggest that genetic variants in the SLC6A15 locus change the activity of the amino acid transporter and might thus influence its neuronal function and the risk for stress-related psychiatric disorders. As statistically significant association for rare variants might only be achieved in extremely large samples (N > 70,000), functional exploration may shed light on putatively disease-relevant variants.
Affiliation(s)
- Carina Quast
- Max Planck Institute of Psychiatry, Munich, Germany.
22
Chen Q, Sun F. A unified approach for allele frequency estimation, SNP detection and association studies based on pooled sequencing data using EM algorithms. BMC Genomics 2013; 14 Suppl 1:S1. [PMID: 23369070 PMCID: PMC3549804 DOI: 10.1186/1471-2164-14-s1-s1] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Genome-wide association studies (GWAS) have identified many common polymorphisms associated with complex traits. However, these associated common variants explain only a small fraction of the phenotypic variances, leaving a substantial portion of genetic heritability unexplained. As a result, searches for "missing" heritability are drawing increasing attention, particularly for rare variant studies that often require a large sample size and, thus, extensive sequencing effort. Although the development of next generation sequencing (NGS) technologies has made it possible to sequence a large number of reads economically and efficiently, it is still often cost prohibitive to sequence thousands of individuals that are generally required for association studies. A more efficient and cost-effective design would involve pooling the genetic materials of multiple individuals together and then sequencing the pools, instead of the individuals. This pooled sequencing approach has improved the plausibility of association studies for rare variants, while, at the same time, posed a great challenge to the pooled sequencing data analysis, essentially because individual sample identity is lost, and NGS sequencing errors could be hard to distinguish from low frequency alleles. RESULTS A unified approach for estimating minor allele frequency, SNP calling and association studies based on pooled sequencing data using an expectation maximization (EM) algorithm is developed in this paper. This approach makes it possible to study the effects of minor allele frequency, sequencing error rate, number of pools, number of individuals in each pool, and the sequencing depth on the estimation accuracy of minor allele frequencies. We show that the naive method of estimating minor allele frequencies by taking the fraction of observed minor alleles can be significantly biased, especially for rare variants. 
In contrast, our EM approach can give an unbiased estimate of the minor allele frequency under all scenarios studied in this paper. A SNP calling approach, EM-SNP, for pooled sequencing data based on the EM algorithm is then developed and compared with another recent SNP calling method, SNVer. We show that EM-SNP outperforms SNVer in terms of the fraction of db-SNPs among the called SNPs, as well as transition/transversion (Ti/Tv) ratio. Finally, the EM approach is used to study the association between variants and type I diabetes. CONCLUSIONS The EM-based approach for the analysis of pooled sequencing data can accurately estimate minor allele frequencies, call SNPs, and find associations between variants and complex traits. This approach is especially useful for studies involving rare variants.
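Why the naive estimate is biased for rare variants can be seen with a deliberately simplified error model. The sketch below is not the EM procedure described in the abstract: it assumes a biallelic site and a symmetric per-read error rate that flips one allele into the other, and the function names and numeric values are illustrative assumptions.

```python
def expected_naive_fraction(p: float, e: float) -> float:
    """Expected fraction of reads that show the minor allele.

    A read shows the minor allele if it comes from a minor-allele chromosome
    and is read correctly, or from a major-allele chromosome and is misread.
    """
    return p * (1 - e) + (1 - p) * e


def moment_corrected(f: float, e: float) -> float:
    """Invert the expectation above to recover p from an observed fraction f."""
    return (f - e) / (1 - 2 * e)


true_maf = 0.001    # assumed rare variant frequency in the pool
error_rate = 0.01   # assumed per-read sequencing error rate

naive = expected_naive_fraction(true_maf, error_rate)
print(naive, naive / true_maf)              # naive estimate and its inflation factor
print(moment_corrected(naive, error_rate))  # recovers the assumed true frequency
```

With these assumed values the error term e(1 - 2p) is about ten times larger than the true frequency, so the naive fraction mostly measures sequencing error. For a common variant (p close to 0.5) the same term almost vanishes, which is consistent with the abstract's remark that the bias matters mainly for rare variants.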
Affiliation(s)
- Quan Chen
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089-2910, USA
23
Altmann A, Quast C, Weber P. Detecting rare variants for psychiatric disorders using next generation sequencing: a methods primer. Curr Psychiatry Rep 2013; 15:333. [PMID: 23250814 DOI: 10.1007/s11920-012-0333-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022]
Abstract
Recent advances in massively parallel sequencing (MPS) have had an extensive impact on research in medical genomics. In particular, the analysis of rare variants using MPS promises to lead to a better understanding of complex disorders. Nevertheless, for meaningful studies that address the genetic basis for neuropsychiatric disorders, at least hundreds of patient samples have to be analyzed. This undertaking is still not feasible for single research groups on a whole-genome scale and in individual samples. Thus, researchers increasingly employ strategies for reducing the amount of sequencing efforts, such as target enrichment and non-barcoded sample pooling. This review provides an overview of current technologies, discusses options for reduced experimental designs, and illustrates the successful application of the presented methodologies in a recent study of panic disorder patients. Thereby, it aims to introduce the emerging field of MPS into neuropsychiatric research and might serve as a guide for further studies.
Affiliation(s)
- Andre Altmann
- Department of Neurology & Neurological Sciences, Functional Imaging in Neurodegenerative Disorders Laboratory, Stanford University, Stanford, CA, USA.
24
Quast C, Altmann A, Weber P, Arloth J, Bader D, Heck A, Pfister H, Müller-Myhsok B, Erhardt A, Binder EB. Rare variants in TMEM132D in a case-control sample for panic disorder. Am J Med Genet B Neuropsychiatr Genet 2012; 159B:896-907. [PMID: 22911938 DOI: 10.1002/ajmg.b.32096] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 02/08/2012] [Accepted: 08/03/2012] [Indexed: 11/06/2022]
Abstract
Genome-wide association studies have identified common variants associated with common diseases. Most variants, however, explain only a small proportion of the estimated heritability, suggesting that rare variants might contribute to a larger extent to common diseases than assumed to date. Here, we use next-generation sequencing to test whether such variants contribute to the risk for anxiety disorders by re-sequencing 40 kb including all exons of the TMEM132D locus, which we have previously shown to be associated with panic disorder and anxiety severity measures. DNA from 300 patients suffering from anxiety disorders, mostly panic disorder (84.7%), and 300 healthy controls was screened for the presence of genetic variants using next-generation re-sequencing in a pooled approach. Results were verified by individual re-genotyping. We identified 371 variants, of which 247 had not been reported before, including 15 novel non-synonymous variants. The majority (76%) of these variants had a minor allele frequency of less than 5%. While we did not identify additional common variants in TMEM132D associated with panic disorders, we observed an overrepresentation of presumably functional coding variants in healthy controls as compared to cases, as well as a higher rate of private coding variants in cases, with one non-synonymous coding variant present in four patients but not in any of the matched controls nor in over 5,500 individuals of different ethnic origins from publicly available re-sequencing datasets. Our data suggest that not only common but also putatively functional and/or rare variants within TMEM132D might contribute to the risk of developing anxiety disorders.
Affiliation(s)
- Carina Quast
- Max Planck Institute of Psychiatry, Munich, Germany
25
Shearer AE, Hildebrand MS, Ravi H, Joshi S, Guiffre AC, Novak B, Happe S, LeProust EM, Smith RJH. Pre-capture multiplexing improves efficiency and cost-effectiveness of targeted genomic enrichment. BMC Genomics 2012; 13:618. [PMID: 23148716 PMCID: PMC3534602 DOI: 10.1186/1471-2164-13-618] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 07/06/2012] [Accepted: 11/09/2012] [Indexed: 01/19/2023] Open
Abstract
Background Targeted genomic enrichment (TGE) is a widely used method for isolating and enriching specific genomic regions prior to massively parallel sequencing. To make effective use of sequencer output, barcoding and sample pooling (multiplexing) after TGE and prior to sequencing (post-capture multiplexing) has become routine. While previous reports have indicated that multiplexing prior to capture (pre-capture multiplexing) is feasible, no thorough examination of the effect of this method has been completed on a large number of samples. Here we compare standard post-capture TGE to two levels of pre-capture multiplexing: 12 or 16 samples per pool. We evaluated these methods using standard TGE metrics and determined the ability to identify several classes of genetic mutations in three sets of 96 samples, including 48 controls. Our overall goal was to maximize cost reduction and minimize experimental time while maintaining a high percentage of reads on target and a high depth of coverage at thresholds required for variant detection. Results We adapted the standard post-capture TGE method for pre-capture TGE with several protocol modifications, including redesign of blocking oligonucleotides and optimization of enzymatic and amplification steps. Pre-capture multiplexing reduced costs for TGE by at least 38% and significantly reduced hands-on time during the TGE protocol. We found that pre-capture multiplexing reduced capture efficiency by 23 or 31% for pre-capture pools of 12 and 16, respectively. However, efficiency losses at this step can be compensated for by reducing the number of simultaneously sequenced samples. Pre-capture multiplexing and post-capture TGE performed similarly with respect to variant detection of positive control mutations. In addition, we detected no instances of sample switching due to aberrant barcode identification.
Conclusions Pre-capture multiplexing improves efficiency of TGE experiments with respect to hands-on time and reagent use compared to standard post-capture TGE. A decrease in capture efficiency is observed when using pre-capture multiplexing; however, it does not negatively impact variant detection and can be accommodated by the experimental design.
Affiliation(s)
- A Eliot Shearer
- Department of Otolaryngology - Head & Neck Surgery, University of Iowa Carver College of Medicine, Iowa City, IA 52242, USA
26
Beerenwinkel N, Günthard HF, Roth V, Metzner KJ. Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data. Front Microbiol 2012; 3:329. [PMID: 22973268 PMCID: PMC3438994 DOI: 10.3389/fmicb.2012.00329] [Citation(s) in RCA: 172] [Impact Index Per Article: 13.2] [Received: 06/15/2012] [Accepted: 08/24/2012] [Indexed: 12/17/2022] Open
Abstract
Many viruses, including the clinically relevant RNA viruses HIV (human immunodeficiency virus) and HCV (hepatitis C virus), exist in large populations and display high genetic heterogeneity within and between infected hosts. Assessing intra-patient viral genetic diversity is essential for understanding the evolutionary dynamics of viruses, for designing effective vaccines, and for the success of antiviral therapy. Next-generation sequencing (NGS) technologies allow the rapid and cost-effective acquisition of thousands to millions of short DNA sequences from a single sample. However, this approach entails several challenges in experimental design and computational data analysis. Here, we review the entire process of inferring viral diversity from sample collection to computing measures of genetic diversity. We discuss sample preparation, including reverse transcription and amplification, and the effect of experimental conditions on diversity estimates due to in vitro base substitutions, insertions, deletions, and recombination. The use of different NGS platforms and their sequencing error profiles are compared in the context of various applications of diversity estimation, ranging from the detection of single nucleotide variants (SNVs) to the reconstruction of whole-genome haplotypes. We describe the statistical and computational challenges arising from these technical artifacts, and we review existing approaches, including available software, for their solution. Finally, we discuss open problems, and highlight successful biomedical applications and potential future clinical use of NGS to estimate viral diversity.
Affiliation(s)
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- Swiss Institute of Bioinformatics, Basel, Switzerland
- Huldrych F. Günthard
- Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, University of Zurich, Zurich, Switzerland
- Volker Roth
- Department of Mathematics and Computer Science, University of Basel, Basel, Switzerland
- Karin J. Metzner
- Division of Infectious Diseases and Hospital Epidemiology, University Hospital Zurich, University of Zurich, Zurich, Switzerland
27
Gerstung M, Beisel C, Rechsteiner M, Wild P, Schraml P, Moch H, Beerenwinkel N. Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nat Commun 2012; 3:811. [PMID: 22549840 DOI: 10.1038/ncomms1814] [Citation(s) in RCA: 183] [Impact Index Per Article: 14.1] [Received: 12/23/2011] [Accepted: 03/30/2012] [Indexed: 01/06/2023] Open
|
28
|
Izawa K, Hijikata A, Tanaka N, Kawai T, Saito MK, Goldbach-Mansky R, Aksentijevich I, Yasumi T, Nakahata T, Heike T, Nishikomori R, Ohara O. Detection of base substitution-type somatic mosaicism of the NLRP3 gene with >99.9% statistical confidence by massively parallel sequencing. DNA Res 2012; 19:143-52. [PMID: 22279087 PMCID: PMC3325078 DOI: 10.1093/dnares/dsr047] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.3] [Indexed: 12/17/2022] Open
Abstract
Chronic infantile neurological cutaneous and articular syndrome (CINCA), also known as neonatal-onset multisystem inflammatory disease (NOMID), is a dominantly inherited systemic autoinflammatory disease and is caused by a heterozygous germline gain-of-function mutation in the NLRP3 gene. We recently found a high incidence of NLRP3 somatic mosaicism in apparently mutation-negative CINCA/NOMID patients using subcloning and subsequent capillary DNA sequencing. It is important to rapidly diagnose somatic NLRP3 mosaicism to ensure proper treatment. However, this approach requires large investments of time, cost, and labour that prevent routine genetic diagnosis of low-level somatic NLRP3 mosaicism. We developed a routine pipeline to detect even a low-level allele of NLRP3 with statistical significance using massively parallel DNA sequencing. To address the critical concern of discriminating a low-level allele from sequencing errors, we first constructed error rate maps of 14 polymerase chain reaction products covering the entire coding NLRP3 exons on a Roche 454 GS-FLX sequencer from 50 control samples without mosaicism. Based on these results, we formulated a statistical confidence value for each sequence variation in each strand to discriminate sequencing errors from real genetic variation even in a low-level allele, and thereby detected base substitutions at an allele frequency as low as 1% with 99.9% or higher confidence.
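The idea of scoring a low-level allele against a per-position error rate can be illustrated with a small sketch. This is not the authors' formula, which the abstract does not give. It assumes a plain binomial error model, a hypothetical per-strand error rate taken from control samples, and a cutoff `alpha = 0.001` chosen here to mirror the 99.9% confidence level. All function names are illustrative.

```python
import math


def binom_log_pmf(k: int, n: int, p: float) -> float:
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def error_tail_probability(variant_reads: int, depth: int, error_rate: float) -> float:
    """P(X >= variant_reads) when X ~ Binomial(depth, error_rate).

    A small value means sequencing error alone is unlikely to explain
    the observed number of variant reads.
    """
    if variant_reads <= 0:
        return 1.0
    tail = sum(
        math.exp(binom_log_pmf(k, depth, error_rate))
        for k in range(variant_reads, depth + 1)
    )
    return min(1.0, tail)


def is_confident_variant(forward, reverse, error_rate, alpha=0.001):
    """Require both strands to pass; each strand is (variant_reads, depth).

    Assumption: the same error rate is used for both strands.
    """
    return all(
        error_tail_probability(reads, depth, error_rate) < alpha
        for reads, depth in (forward, reverse)
    )


# Hypothetical counts: about 2% variant reads per strand, 0.2% error rate.
print(is_confident_variant((20, 1000), (18, 1000), error_rate=0.002))
```

Requiring both strands to pass is one simple way to use strand-specific evidence; a real pipeline would use position- and strand-specific error rates from the control error map.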
Affiliation(s)
- Kazushi Izawa
- Department of Pediatrics, Kyoto University Graduate School of Medicine, Kyoto, Japan
|
29
|
Mullen MP, Creevey CJ, Berry DP, McCabe MS, Magee DA, Howard DJ, Killeen AP, Park SD, McGettigan PA, Lucy MC, Machugh DE, Waters SM. Polymorphism discovery and allele frequency estimation using high-throughput DNA sequencing of target-enriched pooled DNA samples. BMC Genomics 2012; 13:16. [PMID: 22235840 PMCID: PMC3315736 DOI: 10.1186/1471-2164-13-16] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 09/20/2011] [Accepted: 01/11/2012] [Indexed: 11/10/2022] Open
Abstract
Background: The central role of the somatotrophic axis in animal post-natal growth, development and fertility is well established. Therefore, the identification of genetic variants affecting quantitative traits within this axis is an attractive goal. However, large sample numbers are a pre-requisite for the identification of genetic variants underlying complex traits and although technologies are improving rapidly, high-throughput sequencing of large numbers of complete individual genomes remains prohibitively expensive. Therefore, using a pooled DNA approach coupled with target enrichment and high-throughput sequencing, the aim of this study was to identify polymorphisms and estimate allele frequency differences across 83 candidate genes of the somatotrophic axis, in 150 Holstein-Friesian dairy bulls divided into two groups divergent for genetic merit for fertility. Results: In total, 4,135 SNPs and 893 indels were identified during the resequencing of the 83 candidate genes. Nineteen percent (n = 952) of variants were located within 5' and 3' UTRs. Seventy-two percent (n = 3,612) were intronic and 9% (n = 464) were exonic, including 65 indels and 236 SNPs resulting in non-synonymous substitutions (NSS). Significant (P < 0.01) mean allele frequency differentials between the low and high fertility groups were observed for 720 SNPs (58 NSS). Allele frequencies for 43 of the SNPs were also determined by genotyping the 150 individual animals (Sequenom® MassARRAY). No significant differences (P > 0.1) were observed between the two methods for any of the 43 SNPs across both pools (i.e., 86 tests in total). Conclusions: The results of the current study support previous findings of the use of DNA sample pooling and high-throughput sequencing as a viable strategy for polymorphism discovery and allele frequency estimation. Using this approach we have characterised the genetic variation within genes of the somatotrophic axis and related pathways, central to mammalian post-natal growth and development and subsequent lactogenesis and fertility. We have identified a large number of variants segregating at significantly different frequencies between cattle groups divergent for calving interval plausibly harbouring causative variants contributing to heritable variation. To our knowledge, this is the first report describing sequencing of targeted genomic regions in any livestock species using groups with divergent phenotypes for an economically important trait.
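As a rough illustration of comparing allele frequencies between two pools, the sketch below estimates each pool's frequency as the fraction of reads carrying the alternate allele and compares the pools with a two-proportion z-test. The abstract does not say which test the authors used, so the z-test is a stand-in chosen here, and the function names and read counts are hypothetical. Treating reads as independent draws is also a simplification for pooled DNA.

```python
import math


def allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Pool-level allele frequency estimated as the alternate-read fraction."""
    return alt_reads / total_reads


def two_proportion_p_value(alt_a: int, total_a: int, alt_b: int, total_b: int) -> float:
    """Two-sided p-value of a two-proportion z-test between pools A and B."""
    pooled = (alt_a + alt_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0.0:
        # Allele absent or fixed in both pools: no evidence of a difference.
        return 1.0
    z = (allele_frequency(alt_a, total_a) - allele_frequency(alt_b, total_b)) / se
    # Two-sided tail of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical read counts at one SNP in the low- and high-fertility pools.
p = two_proportion_p_value(10, 100, 40, 100)
print(p < 0.01)
```

With many SNPs tested at once, a multiple-testing correction would normally be applied on top of a per-SNP threshold such as P < 0.01.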
Affiliation(s)
- Michael P Mullen
- Animal and Bioscience Research Department, Animal and Grassland Research and Innovation Centre, Teagasc, Athenry, Galway, Ireland.
|
30
|
Ding J, Bashashati A, Roth A, Oloumi A, Tse K, Zeng T, Haffari G, Hirst M, Marra MA, Condon A, Aparicio S, Shah SP. Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics 2011; 28:167-75. [PMID: 22084253 PMCID: PMC3259434 DOI: 10.1093/bioinformatics/btr629] [Citation(s) in RCA: 105] [Impact Index Per Article: 7.5] [Indexed: 11/16/2022]
Abstract
Motivation: The study of cancer genomes now routinely involves using next-generation sequencing technology (NGS) to profile tumours for single nucleotide variant (SNV) somatic mutations. However, surprisingly few published bioinformatics methods exist for the specific purpose of identifying somatic mutations from NGS data and existing tools are often inaccurate, yielding intolerably high false prediction rates. As such, the computational problem of accurately inferring somatic mutations from paired tumour/normal NGS data remains an unsolved challenge. Results: We present the comparison of four standard supervised machine learning algorithms for the purpose of somatic SNV prediction in tumour/normal NGS experiments. To evaluate these approaches (random forest, Bayesian additive regression tree, support vector machine and logistic regression), we constructed 106 features representing 3369 candidate somatic SNVs from 48 breast cancer genomes, originally predicted with naive methods and subsequently revalidated to establish ground truth labels. We trained the classifiers on this data (consisting of 1015 true somatic mutations and 2354 non-somatic mutation positions) and conducted a rigorous evaluation of these methods using a cross-validation framework and hold-out test NGS data from both exome capture and whole genome shotgun platforms. All learning algorithms employing predictive discriminative approaches with feature selection improved the predictive accuracy over standard approaches by statistically significant margins. In addition, using unsupervised clustering of the ground truth ‘false positive’ predictions, we noted several distinct classes and present evidence suggesting non-overlapping sources of technical artefacts illuminating important directions for future study. Availability: Software called MutationSeq and datasets are available from http://compbio.bccrc.ca. 
Contact: saparicio@bccrc.ca Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiarui Ding
- Department of Molecular Oncology, BC Cancer Agency, Vancouver, BC, Canada
|
31
|
Wittler R, Chauve C. Consistency-based detection of potential tumor-specific deletions in matched normal/tumor genomes. BMC Bioinformatics 2011; 12 Suppl 9:S21. [PMID: 22152084 PMCID: PMC3283309 DOI: 10.1186/1471-2105-12-s9-s21] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND Structural variations in human genomes, such as insertions, deletion, or rearrangements, play an important role in cancer development. Next-Generation Sequencing technologies have been central in providing ways to detect such variations. Most existing methods however are limited to the analysis of a single genome, and it is only recently that the comparison of closely related genomes has been considered. In particular, a few recent works considered the analysis of data sets obtained by sequencing both tumor and healthy tissues of the same cancer patient. In that context, the goal is to detect variations that are specific to exactly one of the genomes, for example to differentiate between patient-specific and tumor-specific variations. This is a difficult task, especially when facing the additional challenge of the possible contamination of healthy tissues by tumor cells and conversely. RESULTS In the current work, we analyzed a data set of paired-end short-reads, obtained by sequencing tumor tissues and healthy tissues, both from the same cancer patient. Based on a combinatorial notion of conflict between deletions, we show that in the tumor data, more deletions are predicted than there could actually be in a diploid genome. In contrast, the predictions for the data from normal tissues are almost conflict-free. We designed and applied a method, specific to the analysis of such pooled and contaminated data sets, to detect potential tumor-specific deletions. Our method takes the deletion calls from both data sets and assigns reads from the mixed tumor/normal data to the normal one with the goal to minimize the number of reads that need to be discarded to obtain a set of conflict-free deletion clusters. We observed that, on the specific data set we analyze, only a very small fraction of the reads needs to be discarded to obtain a set of consistent deletions. 
CONCLUSIONS We present a framework based on a rigorous definition of consistency between deletions and the assumption that the tumor sample also contains normal cells. A combined analysis of both data sets based on this model allowed a consistent explanation of almost all data, providing a detailed picture of candidate patient- and tumor-specific deletions.
Affiliation(s)
- Roland Wittler
- Department of Mathematics, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
- Technische Fakultät, Universität Bielefeld, Bielefeld, 33594, Germany
- Cedric Chauve
- Department of Mathematics, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
|