1
He GQ, Huang XX, Pei MS, Jin HY, Cheng YZ, Wei TL, Liu HN, Yu YH, Guo DL. Dissection of the Pearl of Csaba pedigree identifies key genomic segments related to early ripening in grape. Plant Physiol 2023; 191:1153-1166. [PMID: 36440478 PMCID: PMC9922404 DOI: 10.1093/plphys/kiac539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Accepted: 11/01/2022] [Indexed: 06/16/2023]
Abstract
Pearl of Csaba (PC) is a valuable backbone parent for early-ripening grapevine (Vitis vinifera) breeding, from which many excellent early-ripening varieties have been bred. However, the genetic basis of the stable inheritance of its early-ripening trait remains largely unknown. Here, the pedigree, consisting of 40 varieties derived from PC, was resequenced to an average depth of ∼30×. Combined with the resequencing data of 24 other late-ripening varieties, 5,795,881 high-quality single nucleotide polymorphisms (SNPs) were identified following a strict filtering pipeline. The population genetic analysis showed that these varieties could be distinguished clearly, and the pedigree was characterized by lower nucleotide diversity and stronger linkage disequilibrium than the non-pedigree varieties. The conserved haplotypes (CHs) transmitted in the pedigree were obtained via identity-by-descent analysis. Subsequently, the key genomic segments were identified by combining analyses of haplotypes, selective signatures, known ripening-related quantitative trait loci (QTLs), and transcriptomic data. The results demonstrated that varieties carrying a superior haplotype, H1, exhibited significantly earlier berry development (one-way ANOVA, P < 0.001). Further analyses indicated that H1 encompassed VIT_16s0039g00720, which encodes a folate/biopterin transporter protein (VvFBT) and carries a missense mutation. VvFBT was specifically and highly expressed during grapevine berry development, particularly at veraison. Exogenous folate treatment advanced the veraison of "Kyoho". This work uncovered core haplotypes and genomic segments related to the early-ripening trait of PC and provided an important reference for the molecular breeding of early-ripening grapevine varieties.
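The comparison of nucleotide diversity between pedigree and non-pedigree varieties can be illustrated with a minimal sketch. It computes the average number of pairwise differences per site (π) for a set of haplotypes. The function name `nucleotide_diversity`, the '0'/'1' string encoding, and the toy haplotypes are assumptions made for illustration and are not taken from the authors' pipeline.

```python
from itertools import combinations


def nucleotide_diversity(haplotypes):
    """Average pairwise differences per site (pi).

    `haplotypes` is a list of equal-length strings of '0'/'1' alleles,
    one string per sampled haplotype (hypothetical encoding).
    """
    if len(haplotypes) < 2:
        raise ValueError("need at least two haplotypes")
    n_sites = len(haplotypes[0])
    total_diffs = 0
    n_pairs = 0
    for a, b in combinations(haplotypes, 2):
        # Count the sites at which this pair of haplotypes differs.
        total_diffs += sum(x != y for x, y in zip(a, b))
        n_pairs += 1
    return total_diffs / (n_pairs * n_sites)


# Toy data (invented): closely related haplotypes vs. unrelated ones.
pedigree = ["0010", "0010", "0011"]
outside = ["0110", "1001", "0011"]

print(nucleotide_diversity(pedigree), nucleotide_diversity(outside))
```

In this toy case the related haplotypes give the lower value, which is the direction of the effect the abstract reports for the pedigree.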
Affiliation(s)
- Guang-Qi He
- College of Horticulture and Plant Protection, Henan University of Science and Technology, Luoyang 471023, China
- Henan Engineering Technology Research Center of Quality Regulation of Horticultural Plants, Henan University of Science and Technology, Luoyang 471023, China
- Xi-Xi Huang
- College of Horticulture and Plant Protection, Henan University of Science and Technology, Luoyang 471023, China
- Mao-Song Pei
- College of Horticulture and Plant Protection, Henan University of Science and Technology, Luoyang 471023, China
- Henan Engineering Technology Research Center of Quality Regulation of Horticultural Plants, Henan University of Science and Technology, Luoyang 471023, China
- Hui-Ying Jin
- College of Horticulture and Plant Protection, Henan University of Science and Technology, Luoyang 471023, China
- Henan Engineering Technology Research Center of Quality Regulation of Horticultural Plants, Henan University of Science and Technology, Luoyang 471023, China
- Yi-Zhe Cheng
- College of Horticulture and Plant Protection, Henan University of Science and Technology, Luoyang 471023, China
- Henan Engineering Technology Research Center of Quality Regulation of Horticultural Plants, Henan University of Science and Technology, Luoyang 471023, China
- Tong-Lu Wei
- College of Horticulture and Plant Protection, Henan University of Science and Technology, Luoyang 471023, China
- Henan Engineering Technology Research Center of Quality Regulation of Horticultural Plants, Henan University of Science and Technology, Luoyang 471023, China
- Hai-Nan Liu
- College of Horticulture and Plant Protection, Henan University of Science and Technology, Luoyang 471023, China
- Henan Engineering Technology Research Center of Quality Regulation of Horticultural Plants, Henan University of Science and Technology, Luoyang 471023, China
- Yi-He Yu
- College of Horticulture and Plant Protection, Henan University of Science and Technology, Luoyang 471023, China
- Henan Engineering Technology Research Center of Quality Regulation of Horticultural Plants, Henan University of Science and Technology, Luoyang 471023, China
- Da-Long Guo
- College of Horticulture and Plant Protection, Henan University of Science and Technology, Luoyang 471023, China
- Henan Engineering Technology Research Center of Quality Regulation of Horticultural Plants, Henan University of Science and Technology, Luoyang 471023, China
2
Huang X, Tatonetti N, LaRow K, Delgoffee B, Mayer J, Page D, Hebbring SJ. E-Pedigrees: a large-scale automatic family pedigree prediction application. Bioinformatics 2021; 37:3966-3968. [PMID: 34086863 PMCID: PMC8570807 DOI: 10.1093/bioinformatics/btab419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2021] [Revised: 04/30/2021] [Accepted: 06/03/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: The use and functionality of Electronic Health Records (EHR) have increased rapidly in the past few decades. EHRs are becoming an important repository of patient health information and can capture family data. Pedigree analysis is a longstanding and powerful approach that can give insight into the underlying genetic and environmental factors in human health, but traditional approaches to identifying and recruiting families are low-throughput and labor-intensive. Therefore, high-throughput methods to automatically construct family pedigrees are needed. RESULTS: We developed a stand-alone application, Electronic Pedigrees, or E-Pedigrees, which combines two validated family prediction algorithms into a single software package for high-throughput pedigree construction. The platform uses patients' basic demographic information and/or emergency contact data to infer parent-child relationships with high accuracy. Importantly, E-Pedigrees allows users to layer in additional pedigree data when available and provides options for applying different logical rules to improve the accuracy of inferred family relationships. This software is fast and easy to use, is compatible with different EHR data sources, and its output is a standard PED file appropriate for multiple downstream analyses. AVAILABILITY AND IMPLEMENTATION: The Python 3.3+ version of the E-Pedigrees application is freely available at https://github.com/xiayuan-huang/E-pedigrees.
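Since the abstract names a standard PED file as the output, a minimal sketch of that six-column layout (family ID, individual ID, father ID, mother ID, sex, phenotype) may help. This does not use or reproduce the E-Pedigrees code: the function `to_ped_rows`, the input dictionary layout, and the IDs are all hypothetical, and the sex codes (1 = male, 2 = female, 0 = unknown) and the missing-phenotype code -9 follow the widely used PLINK convention.

```python
def to_ped_rows(family_id, members):
    """Format inferred relationships as tab-separated PED rows.

    `members` maps an individual ID to a dict with optional keys
    'sex' ('M' or 'F'), 'father', and 'mother' (hypothetical layout).
    """
    sex_code = {"M": "1", "F": "2"}
    rows = []
    for ind_id in sorted(members):
        info = members[ind_id]
        rows.append("\t".join([
            family_id,
            ind_id,
            info.get("father") or "0",           # 0 = parent not in the file
            info.get("mother") or "0",
            sex_code.get(info.get("sex"), "0"),  # 0 = unknown sex
            "-9",                                # phenotype not recorded
        ]))
    return rows


# Invented example family: two founders and one child.
family = {
    "dad": {"sex": "M"},
    "mom": {"sex": "F"},
    "child": {"sex": "F", "father": "dad", "mother": "mom"},
}
rows = to_ped_rows("FAM1", family)
print("\n".join(rows))
```

Founders carry `0` in both parent columns, which is how downstream pedigree tools usually recognise the top generation.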
Affiliation(s)
- Xiayuan Huang
- Department of Biostatistics & Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706, USA
- Nicholas Tatonetti
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Katie LaRow
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Brooke Delgoffee
- Office of Research Computing and Analytics, Marshfield Clinic Research Foundation, Marshfield, WI 54449, USA
- John Mayer
- Office of Research Computing and Analytics, Marshfield Clinic Research Foundation, Marshfield, WI 54449, USA
- David Page
- Department of Biostatistics & Bioinformatics, Duke University, Durham, NC 27710, USA
- Scott J Hebbring
- Center for Precision Medicine Research, Marshfield Clinic Research Foundation, Marshfield, WI 54449, USA
3
Galla SJ, Brown L, Couch-Lewis Ngāi Tahu Te Hapū O Ngāti Wheke Ngāti Waewae Y, Cubrinovska I, Eason D, Gooley RM, Hamilton JA, Heath JA, Hauser SS, Latch EK, Matocq MD, Richardson A, Wold JR, Hogg CJ, Santure AW, Steeves TE. The relevance of pedigrees in the conservation genomics era. Mol Ecol 2021; 31:41-54. [PMID: 34553796 PMCID: PMC9298073 DOI: 10.1111/mec.16192] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 07/05/2021] [Revised: 09/12/2021] [Accepted: 09/17/2021] [Indexed: 01/21/2023]
Abstract
Over the past 50 years conservation genetics has developed a substantive toolbox to inform species management. One of the most long‐standing tools available to manage genetics—the pedigree—has been widely used to characterize diversity and maximize evolutionary potential in threatened populations. Now, with the ability to use high throughput sequencing to estimate relatedness, inbreeding, and genome‐wide functional diversity, some have asked whether it is warranted for conservation biologists to continue collecting and collating pedigrees for species management. In this perspective, we argue that pedigrees remain a relevant tool, and when combined with genomic data, create an invaluable resource for conservation genomic management. Genomic data can address pedigree pitfalls (e.g., founder relatedness, missing data, uncertainty), and in return robust pedigrees allow for more nuanced research design, including well‐informed sampling strategies and quantitative analyses (e.g., heritability, linkage) to better inform genomic inquiry. We further contend that building and maintaining pedigrees provides an opportunity to strengthen trusted relationships among conservation researchers, practitioners, Indigenous Peoples, and Local Communities.
Affiliation(s)
- Stephanie J Galla
- Department of Biological Sciences, Boise State University, Boise, Idaho, USA.,School of Biological Sciences, University of Canterbury, Christchurch, Canterbury, New Zealand
- Liz Brown
- New Zealand Department of Conservation, Twizel, Canterbury, New Zealand
- Ilina Cubrinovska
- School of Biological Sciences, University of Canterbury, Christchurch, Canterbury, New Zealand
- Daryl Eason
- New Zealand Department of Conservation, Invercargill, Southland, New Zealand
- Rebecca M Gooley
- Smithsonian-Mason School of Conservation, Front Royal, Maryland, USA.,Center for Species Survival, Smithsonian Conservation Biology Institute, National Zoological Park, Washington, District of Columbia, USA
- Jill A Hamilton
- Department of Biological Sciences, North Dakota State University, Fargo, North Dakota, USA
- Julie A Heath
- Department of Biological Sciences, Boise State University, Boise, Idaho, USA
- Samantha S Hauser
- Department of Biological Sciences, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA
- Emily K Latch
- Department of Biological Sciences, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA
- Marjorie D Matocq
- Department of Natural Resources and Environmental Science, Program in Ecology, Evolution and Conservation Biology, University of Nevada Reno, Reno, Nevada, USA
- Anne Richardson
- The Isaac Conservation and Wildlife Trust, Christchurch, Canterbury, New Zealand
- Jana R Wold
- School of Biological Sciences, University of Canterbury, Christchurch, Canterbury, New Zealand
- Carolyn J Hogg
- School of Life and Environmental Sciences, University of Sydney, Sydney, NSW, Australia
- Anna W Santure
- School of Biological Sciences, University of Auckland, Auckland, Auckland, New Zealand
- Tammy E Steeves
- School of Biological Sciences, University of Canterbury, Christchurch, Canterbury, New Zealand
4
Liu Y, Wu X, Wang Y. An integrated approach for copy number variation discovery in parent-offspring trios. Brief Bioinform 2021; 22:6306464. [PMID: 34151932 DOI: 10.1093/bib/bbab230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2020] [Revised: 04/27/2021] [Accepted: 05/25/2021] [Indexed: 11/14/2022] Open
Abstract
Whole-genome sequencing (WGS) of parent-offspring trios has become widely used to identify causal copy number variations (CNVs) in rare and complex diseases. Existing CNV detection approaches usually do not make effective use of Mendelian inheritance in parent-offspring trios and yield low accuracy. In this study, we propose a novel integrated approach, TrioCNV2, for jointly detecting CNVs from WGS data of the parent-offspring trio. TrioCNV2 first makes use of the read depth and discordant read pairs to infer approximate locations of CNVs and then employs the split read and local de novo assembly approaches to refine the breakpoints. We use the real WGS data of two parent-offspring trios to demonstrate TrioCNV2's performance and compare it with other CNV detection approaches. The software TrioCNV2 is implemented using a combination of Java and R and is freely available from the website at https://github.com/yongzhuang/TrioCNV2.
Affiliation(s)
- Yongzhuang Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Xiaoliang Wu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
5
Yan Z, Zhu X, Wang Y, Nie Y, Guan S, Kuo Y, Chang D, Li R, Qiao J, Yan L. scHaplotyper: haplotype construction and visualization for genetic diagnosis using single cell DNA sequencing data. BMC Bioinformatics 2020; 21:41. [PMID: 32007105 PMCID: PMC6995221 DOI: 10.1186/s12859-020-3381-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 10/07/2019] [Accepted: 01/22/2020] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND: Haplotyping reveals chromosome blocks inherited from parents by in vitro fertilized (IVF) embryos in preimplantation genetic diagnosis (PGD), enabling the observation of the transmission of disease alleles between generations. However, the methods of haplotyping that are suitable for single cells are limited because a whole genome amplification (WGA) process is performed before sequencing or genotyping in PGD, and true haplotype profiles of embryos need to be constructed based on genotypes that can contain many WGA artifacts. RESULTS: Here, we offer scHaplotyper as a genetic diagnosis tool that reconstructs and visualizes the haplotype profiles of single cells based on the Hidden Markov Model (HMM). scHaplotyper can trace the origin of each haplotype block in the embryo, enabling the detection of the carrier status of disease alleles in each embryo. We applied this method in PGD in two families affected by genetic disorders, and the result was the healthy live births of two children in the two families, demonstrating the clinical application of this method. CONCLUSION: Next generation sequencing (NGS) of preimplantation embryos enables genetic screening for families with genetic disorders, avoiding the birth of affected babies. With the validation and successful clinical application, we showed that scHaplotyper is a convenient and accurate method to screen embryos. More patients with genetic disorders will benefit from the genetic diagnosis of embryos. The source code of scHaplotyper is available at the GitHub repository: https://github.com/yzqheart/scHaplotyper.
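The idea of tracing haplotype blocks through noisy single-cell genotypes with an HMM can be sketched as a two-state Viterbi decoder. This is not the scHaplotyper implementation: the function `viterbi_haplotype_origin`, the encoding of observations as 'A'/'B' (which parental haplotype the observed allele matches at each informative marker), and the `error` and `switch` probabilities are assumptions chosen for illustration.

```python
import math


def viterbi_haplotype_origin(obs, error=0.1, switch=0.01):
    """Most likely inherited parental haplotype ('A' or 'B') per marker.

    obs    : string of 'A'/'B', the haplotype each observed allele matches
    error  : assumed probability that an observation is wrong (e.g. WGA noise)
    switch : assumed probability of a recombination between adjacent markers
    """
    if not obs:
        return ""
    states = ("A", "B")

    def emit(state, o):
        return math.log(1 - error) if state == o else math.log(error)

    def trans(prev, cur):
        return math.log(1 - switch) if prev == cur else math.log(switch)

    # Log-probability of the best path ending in each state.
    score = {s: math.log(0.5) + emit(s, obs[0]) for s in states}
    back = []
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for cur in states:
            best_prev = max(states, key=lambda p: score[p] + trans(p, cur))
            new_score[cur] = score[best_prev] + trans(best_prev, cur) + emit(cur, o)
            pointers[cur] = best_prev
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


print(viterbi_haplotype_origin("AAABAAABBBBB"))
```

With these assumed parameters, the isolated `B` at the fourth marker is treated as noise, while the run of five `B` observations at the end is decoded as a genuine switch of haplotype block.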
Affiliation(s)
- Zhiqiang Yan
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China.,Peking-Tsinghua Center for Life Sciences, Peking University, Beijing, 100871, China.,Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Xiaohui Zhu
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
- Yuqian Wang
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
- Yanli Nie
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
- Shuo Guan
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
- Ying Kuo
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
- Di Chang
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
- Rong Li
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
- Jie Qiao
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China.,Peking-Tsinghua Center for Life Sciences, Peking University, Beijing, 100871, China.,Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China.,Beijing Advanced Innovation Center for Genomics (ICG), Peking University, Beijing, 100871, China
- Liying Yan
- Center for Reproductive Medicine, Department of Obstetrics and Gynecology, Peking University Third Hospital, Beijing, 100191, China.,Key Laboratory of Assisted Reproduction, Ministry of Education, Beijing, 100191, China.,Beijing Key Laboratory of Reproductive Endocrinology and Assisted Reproduction, Beijing, 100191, China
6
Kómár P, Kural D. geck: trio-based comparative benchmarking of variant calls. Bioinformatics 2019; 34:3488-3495. [PMID: 29850774 PMCID: PMC6184596 DOI: 10.1093/bioinformatics/bty415] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/03/2017] [Accepted: 05/22/2018] [Indexed: 12/30/2022] Open
Abstract
Motivation: Classical methods of comparing the accuracies of variant calling pipelines are based on truth sets of variants whose genotypes are previously determined with high confidence. An alternative way of performing benchmarking is based on Mendelian constraints between related individuals. Statistical analysis of Mendelian violations can provide truth set-independent benchmarking information, and enable benchmarking less-studied variants and diverse populations. Results: We introduce a statistical mixture model for comparing two variant calling pipelines from genotype data they produce after running on individual members of a trio. We determine the accuracy of our model by comparing the precision and recall of GATK Unified Genotyper and Haplotype Caller on the high-confidence SNPs of the NIST Ashkenazim trio and the two independent Platinum Genome trios. We show that our method is able to estimate differential precision and recall between the two pipelines with 10⁻³ uncertainty. Availability and implementation: The Python library geck, and usage examples are available at the following URL: https://github.com/sbg/geck, under the GNU General Public License v3. Supplementary information: Supplementary data are available at Bioinformatics online.
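The Mendelian constraint that this kind of benchmarking builds on can be shown with a small sketch. It only checks whether a child's genotype at an autosomal site can be formed from one allele of each parent and tallies the violation rate. It is not the geck mixture model and does not use the geck API. The function names and toy genotypes are hypothetical.

```python
def is_mendelian_consistent(child, father, mother):
    """True if the child's two alleles can come one from each parent.

    Genotypes are unordered pairs of allele codes, e.g. (0, 1).
    Assumes an autosomal site and no de novo mutation.
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def violation_rate(trios):
    """Fraction of (child, father, mother) genotype triples that violate
    Mendelian inheritance."""
    if not trios:
        raise ValueError("no trios given")
    violations = sum(not is_mendelian_consistent(c, f, m) for c, f, m in trios)
    return violations / len(trios)


# Invented genotype calls at three sites of one trio.
calls = [
    ((0, 1), (0, 0), (1, 1)),  # consistent
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # violation: the father has no 1 allele
]
print(violation_rate(calls))
```

A violation shows that at least one of the three calls is wrong (or, rarely, that a de novo event occurred) without saying which one, which is why a statistical model is needed to turn such counts into precision and recall estimates.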
7
Kothiyal P, Wong WSW, Bodian DL, Niederhuber JE. Mendelian Inconsistent Signatures from 1314 Ancestrally Diverse Family Trios Distinguish Biological Variation from Sequencing Error. J Comput Biol 2019; 26:405-419. [PMID: 30942611 PMCID: PMC6533806 DOI: 10.1089/cmb.2018.0253] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 01/04/2023] Open
Abstract
Next-generation sequencing enables advances in the clinical application of genomics by providing high-throughput detection of genomic variation. However, next-generation sequencing technologies, especially whole-genome sequencing (WGS), are often associated with a high false-positive rate. Trio-based WGS can contribute significantly towards improved quality control methods. Mendelian-inconsistent calls (MIC) in parent–child trios are commonly attributed to erroneous sequencing calls, as the true de novo mutation rate is extremely low compared with MIC incidence. Here, we analyzed WGS data from 1314 mother, father, and child trios across ethnically diverse populations with the goal of characterizing MIC. Genotype calls in a trio can be used to assign different signatures to MIC. MIC occur more frequently within repeats but show varying distribution and error mechanisms across repeat types. MIC are enriched within poly-A/T runs in short interspersed nuclear elements. Alignability scores, allele balance, and relative parental read depth vary among MIC signatures and these differences should be considered when designing filters for MIC reduction. MIC cluster in germline deletions and these MIC also segregate with population. Our results provide a basis for making decisions on how each MIC type should be evaluated before discarding them as errors or including them in alternative applications. With the reduction of sequencing cost, family trio whole genome and exome analysis are being performed more routinely in clinical practice. We provide a reference that can be used for annotating MIC with their frequencies in a larger population to aid in the filtering of candidate de novo mutations.
Affiliation(s)
- Prachi Kothiyal
- Inova Translational Medicine Institute, Inova Health System, Falls Church, Virginia
- Wendy S W Wong
- Inova Translational Medicine Institute, Inova Health System, Falls Church, Virginia
- Dale L Bodian
- Inova Translational Medicine Institute, Inova Health System, Falls Church, Virginia
- John E Niederhuber
- Inova Translational Medicine Institute, Inova Health System, Falls Church, Virginia.,Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, Virginia
8
Whalen A, Ros-Freixedes R, Wilson DL, Gorjanc G, Hickey JM. Hybrid peeling for fast and accurate calling, phasing, and imputation with sequence data of any coverage in pedigrees. Genet Sel Evol 2018; 50:67. [PMID: 30563452 PMCID: PMC6299538 DOI: 10.1186/s12711-018-0438-2] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 04/03/2018] [Accepted: 12/11/2018] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND In this paper, we extend multi-locus iterative peeling to provide a computationally efficient method for calling, phasing, and imputing sequence data of any coverage in small or large pedigrees. Our method, called hybrid peeling, uses multi-locus iterative peeling to estimate shared chromosome segments between parents and their offspring at a subset of loci, and then uses single-locus iterative peeling to aggregate genomic information across multiple generations at the remaining loci. RESULTS Using a synthetic dataset, we first analysed the performance of hybrid peeling for calling and phasing genotypes in disconnected families, which contained only a focal individual and its parents and grandparents. Second, we analysed the performance of hybrid peeling for calling and phasing genotypes in the context of a full general pedigree. Third, we analysed the performance of hybrid peeling for imputing whole-genome sequence data to non-sequenced individuals in the population. We found that hybrid peeling substantially increased the number of called and phased genotypes by leveraging sequence information on related individuals. The calling rate and accuracy increased when the full pedigree was used compared to a reduced pedigree of just parents and grandparents. Finally, hybrid peeling imputed accurately whole-genome sequence to non-sequenced individuals. CONCLUSIONS We believe that this algorithm will enable the generation of low cost and high accuracy whole-genome sequence data in many pedigreed populations. We make this algorithm available as a standalone program called AlphaPeel.
Affiliation(s)
- Andrew Whalen
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
- Roger Ros-Freixedes
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
- David L. Wilson
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
- Gregor Gorjanc
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
- John M. Hickey
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
9
Zhou X, Batzoglou S, Sidow A, Zhang L. HAPDeNovo: a haplotype-based approach for filtering and phasing de novo mutations in linked read sequencing data. BMC Genomics 2018; 19:467. [PMID: 29914369 PMCID: PMC6006847 DOI: 10.1186/s12864-018-4867-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 11/17/2017] [Accepted: 06/13/2018] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND De novo mutations (DNMs) are associated with neurodevelopmental and congenital diseases, and their detection can contribute to understanding disease pathogenicity. However, accurate detection is challenging because of their small number relative to the genome-wide false positives in next generation sequencing (NGS) data. Software such as DeNovoGear and TrioDeNovo have been developed to detect DNMs, but at good sensitivity they still produce many false positive calls. RESULTS To address this challenge, we develop HAPDeNovo, a program that leverages phasing information from linked read sequencing, to remove false positive DNMs from candidate lists generated by DNM-detection tools. Short reads from each phasing block are allocated to each of the two haplotypes followed by generating a haploid genotype for each putative DNM. HAPDeNovo removes variants that are called as heterozygous in one of the haplotypes because they are almost certainly false positives. Our experiments on 10X Chromium linked read sequencing trio data reveal that HAPDeNovo eliminates 80 to 99% of false positives regardless of how large the candidate DNM set is. CONCLUSIONS HAPDeNovo leverages the haplotype information from linked read sequencing to remove spurious false positive DNMs effectively, and it increases accuracy of DNM detection dramatically without sacrificing sensitivity.
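The filtering rule described in the abstract (drop a candidate DNM if reads assigned to a single haplotype still look heterozygous) can be sketched as follows. This is an illustration, not the HAPDeNovo code: the function names, the per-haplotype `(ref_reads, alt_reads)` input layout, and the `min_minor_fraction` and `min_depth` thresholds are assumptions.

```python
def haplotype_looks_heterozygous(ref_reads, alt_reads,
                                 min_minor_fraction=0.2, min_depth=4):
    """Heuristic haploid call: a single haplotype should support one allele.

    Returns True when both alleles are well supported among the reads
    assigned to one haplotype. Thresholds are illustrative assumptions.
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return False  # too few reads to claim a mixed signal
    return min(ref_reads, alt_reads) / depth >= min_minor_fraction


def filter_candidates(candidates):
    """Keep candidate DNMs whose two haplotypes each look haploid.

    Each candidate is a dict with an 'id' and 'hap_counts', a pair of
    (ref_reads, alt_reads) tuples, one per haplotype (hypothetical layout).
    """
    kept = []
    for cand in candidates:
        mixed = any(haplotype_looks_heterozygous(ref, alt)
                    for ref, alt in cand["hap_counts"])
        if not mixed:
            kept.append(cand["id"])
    return kept


# Invented read counts for three candidate sites.
candidates = [
    {"id": "A", "hap_counts": [(10, 0), (0, 9)]},  # clean: alt on one haplotype
    {"id": "B", "hap_counts": [(6, 5), (11, 0)]},  # mixed within haplotype 1
    {"id": "C", "hap_counts": [(2, 1), (8, 0)]},   # low depth, not called mixed
]
print(filter_candidates(candidates))
```

A true heterozygous DNM should place all alternate-allele reads on one haplotype, so a haplotype carrying both alleles points to a mapping or sequencing artifact.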
Affiliation(s)
- Xin Zhou
- Department of Computer Science, Stanford University, Stanford, California, 94305, USA
- Serafim Batzoglou
- Department of Computer Science, Stanford University, Stanford, California, 94305, USA
- Arend Sidow
- Department of Pathology, Stanford University School of Medicine, Stanford, California, 94305, USA.,Department of Genetics, Stanford University School of Medicine, Stanford, California, 94305, USA
- Lu Zhang
- Department of Computer Science, Stanford University, Stanford, California, 94305, USA.,Department of Pathology, Stanford University School of Medicine, Stanford, California, 94305, USA
10
Jin ZB, Li Z, Liu Z, Jiang Y, Cai XB, Wu J. Identification of de novo germline mutations and causal genes for sporadic diseases using trio-based whole-exome/genome sequencing. Biol Rev Camb Philos Soc 2017; 93:1014-1031. [PMID: 29154454 DOI: 10.1111/brv.12383] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 01/16/2017] [Revised: 09/28/2017] [Accepted: 10/10/2017] [Indexed: 12/14/2022]
Abstract
Whole-genome or whole-exome sequencing (WGS/WES) of the affected proband together with normal parents (trio) is commonly adopted to identify de novo germline mutations (DNMs) underlying sporadic cases of various genetic disorders. However, our current knowledge of the occurrence and functional effects of DNMs remains limited and accurately identifying the disease-causing DNM from a group of irrelevant DNMs is complicated. Herein, we provide a general-purpose discussion of important issues related to pathogenic gene identification based on trio-based WGS/WES data. Specifically, the relevance of DNMs to human sporadic diseases, current knowledge of DNM biogenesis mechanisms, and common strategies or software tools used for DNM detection are reviewed, followed by a discussion of pathogenic gene prioritization. In addition, several key factors that may affect DNM identification accuracy and causal gene prioritization are reviewed. Based on recent major advances, this review both sheds light on how trio-based WGS/WES technologies can play a significant role in the identification of DNMs and causal genes for sporadic diseases, and also discusses existing challenges.
Affiliation(s)
- Zi-Bing Jin
- Division of Ophthalmic Genetics, The Eye Hospital, School of Ophthalmology & Optometry, Wenzhou Medical University, Wenzhou, 325027, China.,State Key Laboratory of Ophthalmology Optometry and Vision Science, Wenzhou Medical University, Wenzhou, 325027, China
- Zhongshan Li
- Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, 325000, China
- Zhenwei Liu
- Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, 325000, China
- Yi Jiang
- Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, 325000, China
- Xue-Bi Cai
- Division of Ophthalmic Genetics, The Eye Hospital, School of Ophthalmology & Optometry, Wenzhou Medical University, Wenzhou, 325027, China.,State Key Laboratory of Ophthalmology Optometry and Vision Science, Wenzhou Medical University, Wenzhou, 325027, China
| | - Jinyu Wu
- Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, 325000, China
|
11
|
Analysis of population-specific pharmacogenomic variants using next-generation sequencing data. Sci Rep 2017; 7:8416. [PMID: 28871186 PMCID: PMC5583360 DOI: 10.1038/s41598-017-08468-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 12/01/2016] [Accepted: 07/11/2017] [Indexed: 02/03/2023]
Abstract
Functional rare variants in drug-related genes are believed to be highly differentiated between ethnic or racial populations. However, knowledge of the population differentiation (PD) of rare single-nucleotide variants (SNVs) remains widely lacking, and the highest fixation indices (Fst values) from both rare and common variants annotated to specific genes have been used only marginally to understand PD at the gene level. In this study, we propose a new gene-based PD method, PD of Rare and Common variants (PDRC), for analyzing rare variants, inspired by Generalized Cochran-Mantel-Haenszel (GCMH) statistics, to identify highly population-differentiated drug response-related genes (“pharmacogenes”). Through simulation studies, we show that PDRC adequately summarizes the PD of rare and common variants over a specific gene. We also applied the proposed method to a real whole-exome sequencing dataset, consisting of 10,000 datasets from the Type 2 Diabetes Genetic Exploration by Next-generation sequencing in multi-Ethnic Samples (T2D-GENES) initiative and 3,000 datasets from the Genetics of Type 2 diabetes (Go-T2D) repository. Among the 48 genes annotated with Very Important Pharmacogenetic summaries (VIPgenes) in the PharmGKB database, our PD method successfully identified candidate genes with high PD, including ACE, CYP2B6, DPYD, F5, MTHFR, and SCN5A.
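The fixation index referred to in this abstract can be illustrated for a single biallelic variant. The following is a minimal sketch of Wright's Fst computed from per-population allele frequencies, assuming equally weighted populations. It is not the PDRC statistic, and the function name `wright_fst` is hypothetical.

```python
from typing import Sequence


def wright_fst(freqs: Sequence[float]) -> float:
    """Wright's Fst = (H_T - H_S) / H_T for one biallelic variant.

    freqs holds the alternate-allele frequency in each population.
    Populations are weighted equally (an assumption of this sketch).
    """
    if not freqs:
        raise ValueError("at least one population frequency is required")
    p_bar = sum(freqs) / len(freqs)
    # Expected heterozygosity of the pooled population.
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within populations.
    h_s = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    if h_t == 0.0:
        # The variant is monomorphic overall, so there is nothing to differentiate.
        return 0.0
    return (h_t - h_s) / h_t
```

A variant with identical frequencies in all populations gives 0, and a variant fixed for different alleles in two populations gives 1.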
|
12
|
Abstract
MOTIVATION: Read-based phasing deduces the haplotypes of an individual from sequencing reads that cover multiple variants, while genetic phasing takes only genotypes as input and applies the rules of Mendelian inheritance to infer haplotypes within a pedigree of individuals. Combining both into an approach that uses these two independent sources of information (reads and pedigree) has the potential to deliver results better than each individually. RESULTS: We provide a theoretical framework combining read-based phasing with genetic haplotyping, and describe a fixed-parameter algorithm and its implementation for finding an optimal solution. We show that leveraging reads of related individuals jointly in this way yields more phased variants, at a higher accuracy, than when individuals are phased separately, in both simulated and real data. Coverages as low as 2× for each member of a trio yield haplotypes that are as accurate as when analyzed separately at 15× coverage per individual. AVAILABILITY AND IMPLEMENTATION: https://bitbucket.org/whatshap/whatshap CONTACT: t.marschall@mpi-inf.mpg.de.
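The genetic-phasing half of this idea, inferring parental origin from genotypes alone, can be sketched for a single biallelic site in a trio. This is a simplified illustration under the assumption of error-free genotypes, not the fixed-parameter algorithm described in the abstract; `phase_child_site` is a hypothetical name.

```python
from typing import Optional, Tuple

Genotype = Tuple[int, int]  # unordered pair of alleles, e.g. (0, 1)


def phase_child_site(child: Genotype, mother: Genotype,
                     father: Genotype) -> Optional[Tuple[int, int]]:
    """Return (maternal_allele, paternal_allele) for the child.

    Returns None when the site is ambiguous (for example, all three
    individuals are heterozygous) or when the genotypes violate
    Mendelian inheritance.
    """
    candidates = set()
    for m in set(mother):          # allele transmitted by the mother
        for p in set(father):      # allele transmitted by the father
            if sorted((m, p)) == sorted(child):
                candidates.add((m, p))
    if len(candidates) == 1:
        return candidates.pop()
    return None
```

Sites where this function returns None are the ones where read-based evidence can still add information, which is the motivation for combining the two sources.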
Affiliation(s)
- Shilpa Garg
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany Max Planck Institute for Informatics, Saarbrücken, Germany Saarbrücken Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
| | - Marcel Martin
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, SE-17121 Solna, Sweden
| | - Tobias Marschall
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany Max Planck Institute for Informatics, Saarbrücken, Germany
|
13
|
Increasing Generality and Power of Rare-Variant Tests by Utilizing Extended Pedigrees. Am J Hum Genet 2016; 99:846-859. [PMID: 27666371 DOI: 10.1016/j.ajhg.2016.08.015] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 02/13/2016] [Accepted: 08/17/2016] [Indexed: 11/24/2022]
Abstract
Recently, multiple studies have performed whole-exome or whole-genome sequencing to identify groups of rare variants associated with complex traits and diseases. They have primarily utilized case-control study designs that often require thousands of individuals to reach acceptable statistical power. Family-based studies can be more powerful because a rare variant can be enriched in an extended pedigree and segregate with the phenotype. Although many methods have been proposed for using family data to discover rare variants involved in a disease, a majority of them focus on a specific pedigree structure and are designed to analyze either binary or continuously measured outcomes. In this article, we propose RareIBD, a general and powerful approach to identifying rare variants involved in disease susceptibility. Our method can be applied to large extended families of arbitrary structure, including pedigrees with only affected individuals. The method accommodates both binary and quantitative traits. A series of simulation experiments suggest that RareIBD is a powerful test that outperforms existing approaches. In addition, our method accounts for individuals in top generations, which are not usually genotyped in extended families. In contrast to available statistical tests, RareIBD generates accurate p values even when genetic data from these individuals are missing. We applied RareIBD, as well as other methods, to two extended family datasets generated by different genotyping technologies and representing different ethnicities. The analysis of real data confirmed that RareIBD is the only method that properly controls type I error.
|
14
|
Chang LC, Li B, Fang Z, Vrieze S, McGue M, Iacono WG, Tseng GC, Chen W. A computational method for genotype calling in family-based sequencing data. BMC Bioinformatics 2016; 17:37. [PMID: 26772743 PMCID: PMC4715317 DOI: 10.1186/s12859-016-0880-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/26/2015] [Accepted: 01/06/2016] [Indexed: 12/12/2022]
Abstract
Background: Sequencing technologies help researchers detect common and rare variants across the human genome in many individuals, and it is known that jointly calling genotypes across multiple individuals based on linkage disequilibrium (LD) can facilitate the analysis of low to modest coverage sequence data. However, genotype-calling methods for family-based sequence data, particularly for complex families beyond parent-offspring trios, are still lacking. Results: In this study, we first propose an algorithm that considers both LD patterns and familial transmission in nuclear and multi-generational families while retaining computational efficiency. Second, we extend our method to incorporate external reference panels to analyze family-based sequence data with a small sample size. In simulation studies, we show that modeling multiple offspring can dramatically increase genotype calling accuracy and reduce phasing and Mendelian errors, especially at low to modest coverage. In addition, we show that using external panels can greatly facilitate genotype calling of sequencing data with a small number of individuals. We applied our method to a whole genome sequencing study of 1339 individuals at ~10X coverage from the Minnesota Center for Twin and Family Research. Conclusions: The aggregated results show that our methods significantly outperform existing ones that ignore family constraints or LD information. We anticipate that our method will be useful for many ongoing family-based sequencing projects. We have implemented our methods efficiently in a C++ program, FamLDCaller, which is available from http://www.pitt.edu/~wec47/famldcaller.html.
Affiliation(s)
- Lun-Ching Chang
- Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD, 20892, USA.
| | - Bingshan Li
- Department of Molecular Physiology & Biophysics, Vanderbilt University Medical Center, Nashville, TN, 37232, USA.
| | - Zhou Fang
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, 15261, USA.
| | - Scott Vrieze
- Department of Psychology & Neuroscience, Institute for Behavioral Genetics, University of Colorado, Boulder, CO, 80309, USA.
| | - Matt McGue
- Department of Psychology, University of Minnesota, Minneapolis, MN, 55455, USA.
| | - William G Iacono
- Department of Psychology, University of Minnesota, Minneapolis, MN, 55455, USA.
| | - George C Tseng
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, 15261, USA.
| | - Wei Chen
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, 15261, USA.,Division of Pulmonary Medicine, Allergy and Immunology, Children's Hospital of Pittsburgh of UPMC, Pittsburgh, PA, 15224, USA.
|
15
|
Mikhchi A, Honarvar M, Emam Jomeh Kashan N, Zerehdaran S, Aminafshar M. Comparison of three boosting methods in parent-offspring trios for genotype imputation using simulation study. Journal of Animal Science and Technology 2016; 58:1. [PMID: 26740888 PMCID: PMC4702368 DOI: 10.1186/s40781-015-0081-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 04/18/2015] [Accepted: 12/28/2015] [Indexed: 11/30/2022]
Abstract
Background: Genotype imputation is an important process for predicting unknown genotypes; it uses a reference population with dense genotypes to predict missing genotypes, for both human and animal genetic variation, at low cost. Machine learning methods, especially boosting methods, have been used in genetic studies to explore the underlying genetic profile of disease and to build models capable of predicting missing values of a marker. Methods: In this study, strategies and factors affecting the imputation accuracy of parent-offspring trios were compared, from a lower-density SNP panel (5 K) to a higher-density SNP panel (10 K), using three different boosting methods, namely TotalBoost (TB), LogitBoost (LB) and AdaBoost (AB). The methods were applied to simulated data to impute the un-typed SNPs in parent-offspring trios. Four different datasets were simulated: G1 (100 trios with 5 K SNPs), G2 (100 trios with 10 K SNPs), G3 (500 trios with 5 K SNPs), and G4 (500 trios with 10 K SNPs). In all four datasets, the parents were genotyped completely and the offspring were genotyped with a lower-density panel. Results: Comparison of the three methods showed that LB outperformed AB and TB in imputation accuracy. Computation time differed between methods, and AB was the fastest algorithm. Higher SNP densities increased imputation accuracy. A larger number of trios (i.e., 500) improved the performance of LB and TB. Conclusions: All three methods perform well in terms of imputation accuracy, and the denser chip is recommended for imputation of parent-offspring trios.
Affiliation(s)
- Abbas Mikhchi
- Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
| | - Mahmood Honarvar
- Department of Animal Science, Shahr-e-Qods Branch, Islamic Azad University, Tehran, Iran
| | - Nasser Emam Jomeh Kashan
- Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
| | - Saeed Zerehdaran
- Department of Animal Science, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Mehdi Aminafshar
- Department of Animal Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
|
16
|
Liu Y, Liu J, Lu J, Peng J, Juan L, Zhu X, Li B, Wang Y. Joint detection of copy number variations in parent-offspring trios. Bioinformatics 2015; 32:1130-7. [PMID: 26644415 DOI: 10.1093/bioinformatics/btv707] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 04/19/2015] [Accepted: 11/27/2015] [Indexed: 12/15/2022]
Abstract
MOTIVATION: Whole genome sequencing (WGS) of parent-offspring trios is a powerful approach for identifying disease-associated genes via detecting copy number variations (CNVs). Existing approaches, which detect CNVs for each individual in a trio independently, usually yield low detection accuracy. Joint modeling approaches that leverage Mendelian transmission within the parent-offspring trio can be an efficient strategy to improve CNV detection accuracy. RESULTS: In this study, we developed TrioCNV, a novel approach for jointly detecting CNVs in parent-offspring trios from WGS data. Using negative binomial regression, we modeled the read depth signal while considering both GC content bias and mappability bias. Moreover, we incorporated the family relationship and used a hidden Markov model to jointly infer CNVs for the three samples of a parent-offspring trio. Through application to both simulated data and a trio from the 1000 Genomes Project, we showed that TrioCNV outperformed existing approaches. AVAILABILITY AND IMPLEMENTATION: The software TrioCNV, implemented using a combination of Java and R, is freely available at https://github.com/yongzhuang/TrioCNV CONTACT: ydwang@hit.edu.cn SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yongzhuang Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Jian Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Jianguo Lu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Jiajie Peng
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Liran Juan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Xiaolin Zhu
- Institute for Genomic Medicine, Columbia University, New York, NY 10032, University Program in Genetics and Genomics, Duke University Medical School, Durham, NC 27708
| | - Bingshan Li
- Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN 37235 and Center for Quantitative Sciences, Vanderbilt University, Nashville, TN 37235, USA
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
|
17
|
Yan S, Yuan S, Xu Z, Zhang B, Zhang B, Kang G, Byrnes A, Li Y. Likelihood-based complex trait association testing for arbitrary depth sequencing data. Bioinformatics 2015; 31:2955-62. [PMID: 25979475 PMCID: PMC4668777 DOI: 10.1093/bioinformatics/btv307] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/27/2014] [Revised: 05/06/2015] [Accepted: 05/11/2015] [Indexed: 11/14/2022]
Abstract
In next generation sequencing (NGS)-based genetic studies, researchers typically perform genotype calling first and then apply standard genotype-based methods for association testing. However, such a two-step approach ignores genotype calling uncertainty in the association testing step and may incur power loss and/or inflated type-I error. In the recent literature, a few robust and efficient likelihood-based methods, including both likelihood ratio tests (LRT) and score tests, have been proposed to carry out association testing without intermediate genotype calling. These methods take genotype calling uncertainty into account by directly incorporating the genotype likelihood function (GLF) of NGS data into association analysis. However, existing LRT methods are computationally demanding or do not allow covariate adjustment, while existing score tests are not applicable to markers with low minor allele frequency (MAF). We provide an LRT allowing flexible covariate adjustment, develop a statistically more powerful score test and propose a combination strategy (UNC combo) to leverage the advantages of both tests. We have carried out extensive simulations to evaluate the performance of our proposed LRT and score test. Simulations and real data analysis demonstrate the advantages of our proposed combination strategy: it offers a satisfactory trade-off in terms of computational efficiency, applicability (accommodating both common variants and variants with low MAF) and statistical power, particularly for the analysis of quantitative traits, where the power gain can be up to ∼60% when the causal variant is of low frequency (MAF < 0.01). AVAILABILITY AND IMPLEMENTATION: UNC combo and the associated R files, including documentation and examples, are available at http://www.unc.edu/~yunmli/UNCcombo/ CONTACT: yunli@med.unc.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
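To make the idea of carrying genotype uncertainty forward more concrete, the sketch below turns genotype likelihoods into posterior genotype probabilities and an expected allele dosage. It assumes a Hardy-Weinberg prior with a known alternate-allele frequency, which is a common simplification and not the LRT or score test proposed in the paper. The function names are hypothetical.

```python
from typing import List, Sequence


def genotype_posteriors(gl: Sequence[float], alt_freq: float) -> List[float]:
    """Posterior P(G = 0, 1, 2 | reads) from genotype likelihoods.

    gl[g] is P(reads | G = g) for g alternate alleles.
    The prior assumes Hardy-Weinberg proportions (an assumption here).
    """
    if len(gl) != 3:
        raise ValueError("expected three genotype likelihoods")
    q = alt_freq
    prior = [(1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2]
    unnorm = [l * p for l, p in zip(gl, prior)]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("likelihoods and prior have no overlapping support")
    return [u / total for u in unnorm]


def expected_dosage(gl: Sequence[float], alt_freq: float) -> float:
    """Expected number of alternate alleles under the posterior."""
    post = genotype_posteriors(gl, alt_freq)
    return post[1] + 2.0 * post[2]
```

With uninformative likelihoods the dosage falls back to the prior mean of twice the allele frequency, which shows how low-depth sites contribute less information than a hard genotype call would suggest.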
Affiliation(s)
- Song Yan
- Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
| | - Shuai Yuan
- Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
| | - Zheng Xu
- Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
| | - Baqun Zhang
- Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
| | - Bo Zhang
- Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
| | - Guolian Kang
- Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
| | - Andrea Byrnes
- Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
| | - Yun Li
- Department of Biostatistics, Department of Genetics, Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599 USA, Merck Research Laboratories, North Wales, PA, USA, School of Statistics, Renmin University of China, Beijing, People's Republic of China, Department of Statistics, North Carolina State University, Raleigh, NC, 27607 USA, Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA and Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA
|
18
|
Sidore C, Busonero F, Maschio A, Porcu E, Naitza S, Zoledziewska M, Mulas A, Pistis G, Steri M, Danjou F, Kwong A, Ortega Del Vecchyo VD, Chiang CWK, Bragg-Gresham J, Pitzalis M, Nagaraja R, Tarrier B, Brennan C, Uzzau S, Fuchsberger C, Atzeni R, Reinier F, Berutti R, Huang J, Timpson NJ, Toniolo D, Gasparini P, Malerba G, Dedoussis G, Zeggini E, Soranzo N, Jones C, Lyons R, Angius A, Kang HM, Novembre J, Sanna S, Schlessinger D, Cucca F, Abecasis GR. Genome sequencing elucidates Sardinian genetic architecture and augments association analyses for lipid and blood inflammatory markers. Nat Genet 2015; 47:1272-1281. [PMID: 26366554 PMCID: PMC4627508 DOI: 10.1038/ng.3368] [Citation(s) in RCA: 155] [Impact Index Per Article: 17.2] [Received: 12/09/2014] [Accepted: 07/06/2015] [Indexed: 12/31/2022]
Abstract
We report ∼17.6 million genetic variants from whole-genome sequencing of 2,120 Sardinians; 22% are absent from previous sequencing-based compilations and are enriched for predicted functional consequences. Furthermore, ∼76,000 variants common in our sample (frequency >5%) are rare elsewhere (<0.5% in the 1000 Genomes Project). We assessed the impact of these variants on circulating lipid levels and five inflammatory biomarkers. We observe 14 signals, including 2 major new loci, for lipid levels and 19 signals, including 2 new loci, for inflammatory markers. The new associations would have been missed in analyses based on 1000 Genomes Project data, underlining the advantages of large-scale sequencing in this founder population.
Affiliation(s)
- Carlo Sidore
- Istituto di Ricerca Genetica e Biomedica, CNR, Monserrato, Cagliari, Italy.,Center for Statistical Genetics, Ann Arbor, University of Michigan, MI, USA.,Università degli Studi di Sassari, Sassari, Italy
| | - Fabio Busonero
- Istituto di Ricerca Genetica e Biomedica, CNR, Monserrato, Cagliari, Italy.,Center for Statistical Genetics, Ann Arbor, University of Michigan, MI, USA.,University of Michigan, DNA Sequencing Core, Ann Arbor, MI, USA
| | - Andrea Maschio
- Istituto di Ricerca Genetica e Biomedica, CNR, Monserrato, Cagliari, Italy.,Center for Statistical Genetics, Ann Arbor, University of Michigan, MI, USA.,University of Michigan, DNA Sequencing Core, Ann Arbor, MI, USA
| | - Eleonora Porcu
- Istituto di Ricerca Genetica e Biomedica, CNR, Monserrato, Cagliari, Italy.,Center for Statistical Genetics, Ann Arbor, University of Michigan, MI, USA.,Università degli Studi di Sassari, Sassari, Italy
| | - Silvia Naitza
- Istituto di Ricerca Genetica e Biomedica, CNR, Monserrato, Cagliari, Italy
| | | | - Antonella Mulas
- Istituto di Ricerca Genetica e Biomedica, CNR, Monserrato, Cagliari, Italy.,Università degli Studi di Sassari, Sassari, Italy
| | - Giorgio Pistis
- Istituto di Ricerca Genetica e Biomedica, CNR, Monserrato, Cagliari, Italy.,Center for Statistical Genetics, Ann Arbor, University of Michigan, MI, USA.,Università degli Studi di Sassari, Sassari, Italy
| | - Maristella Steri
- Istituto di Ricerca Genetica e Biomedica, CNR, Monserrato, Cagliari, Italy
| | - Fabrice Danjou
- Istituto di Ricerca Genetica e Biomedica, CNR, Monserrato, Cagliari, Italy
| | - Alan Kwong
- Center for Statistical Genetics, Ann Arbor, University of Michigan, MI, USA
| | | | - Charleston W K Chiang
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA, USA
| | | | | | - Ramaiah Nagaraja
- Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Brendan Tarrier
- University of Michigan, DNA Sequencing Core, Ann Arbor, MI, USA
| | | | - Sergio Uzzau
- Porto Conte Ricerche srl, Tramariglio, Alghero, 07041 Italy
| | | | - Rossano Atzeni
- Center for Advanced Studies, Research, and Development in Sardinia (CRS4), AGCT Program, Parco Scientifico e tecnologico della Sardegna, Pula, Italy
| | - Frederic Reinier
- Center for Advanced Studies, Research, and Development in Sardinia (CRS4), AGCT Program, Parco Scientifico e tecnologico della Sardegna, Pula, Italy
| | - Riccardo Berutti
- Università degli Studi di Sassari, Sassari, Italy.,Center for Advanced Studies, Research, and Development in Sardinia (CRS4), AGCT Program, Parco Scientifico e tecnologico della Sardegna, Pula, Italy
| | - Jie Huang
- Human Genetics, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1HH
| | - Nicholas J Timpson
- MRC Integrative Epidemiology Unit at the University of Bristol, University of Bristol, Bristol, United Kingdom
| | - Daniela Toniolo
- Division of Genetics and Cell Biology, San Raffaele Scientific Institute, Milano, Italy
| | - Paolo Gasparini
- DSM-University of Trieste and IRCCS-Burlo Garofolo Children Hospital (Trieste, Italy).,Experimental Genetics Division, Sidra, (Doha, Qatar)
| | - Giovanni Malerba
- Department of Life and Reproduction Sciences, University of Verona, Verona, Italy
| | | | - Eleftheria Zeggini
- Human Genetics, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1HH
| | - Nicole Soranzo
- Human Genetics, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1HH.,Department of Haematology, University of Cambridge, Hills Rd, Cambridge CB2 0AH
| | - Chris Jones
- Center for Advanced Studies, Research, and Development in Sardinia (CRS4), AGCT Program, Parco Scientifico e tecnologico della Sardegna, Pula, Italy
| | - Robert Lyons
- University of Michigan, DNA Sequencing Core, Ann Arbor, MI, USA
| | - Andrea Angius
- Istituto di Ricerca Genetica e Biomedica, CNR, Monserrato, Cagliari, Italy.,Center for Advanced Studies, Research, and Development in Sardinia (CRS4), AGCT Program, Parco Scientifico e tecnologico della Sardegna, Pula, Italy
| | - Hyun M Kang
- Center for Statistical Genetics, Ann Arbor, University of Michigan, MI, USA
| | - John Novembre
- Department of Human Genetics, University of Chicago, IL, USA
| | - Serena Sanna
- Istituto di Ricerca Genetica e Biomedica, CNR, Monserrato, Cagliari, Italy
| | - David Schlessinger
- Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Francesco Cucca
- Istituto di Ricerca Genetica e Biomedica, CNR, Monserrato, Cagliari, Italy.,Università degli Studi di Sassari, Sassari, Italy
| | - Gonçalo R Abecasis
- Center for Statistical Genetics, Ann Arbor, University of Michigan, MI, USA
|
19
|
Li W, Fu G, Rao W, Xu W, Ma L, Guo S, Song Q. GenomeLaser: fast and accurate haplotyping from pedigree genotypes. Bioinformatics 2015; 31:3984-7. [PMID: 26286810 DOI: 10.1093/bioinformatics/btv452] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/24/2015] [Accepted: 07/28/2015] [Indexed: 01/12/2023]
Abstract
We present a software tool called GenomeLaser that determines the haplotypes of each person from unphased high-throughput genotypes in family pedigrees. This method features high accuracy, chromosome-range phasing distance, linear computing, flexible pedigree types and flexible genetic marker types. AVAILABILITY AND IMPLEMENTATION: http://www.4dgenome.com/software/genomelaser.html.
Collapse
Affiliation(s)
- Wenzhi Li
- Department of Neurosurgery, First Affiliated Hospital of Medical School, Xi'an Jiaotong University, Xi'an, Shaanxi, 710061 China, Cardiovascular Research Institute and Department of Medicine, Morehouse School of Medicine, Atlanta, GA, 30310 USA
| | - Guoxing Fu
- 4DGENOME Inc, Atlanta, GA, 30033 USA
| | - Wei Xu
- Cardiovascular Research Institute and Department of Medicine, Morehouse School of Medicine, Atlanta, GA, 30310 USA
| | - Li Ma
- Cardiovascular Research Institute and Department of Medicine, Morehouse School of Medicine, Atlanta, GA, 30310 USA, 4DGENOME Inc, Atlanta, GA, 30033 USA
| | - Shiwen Guo
- Department of Neurosurgery, First Affiliated Hospital of Medical School, Xi'an Jiaotong University, Xi'an, Shaanxi, 710061 China
| | - Qing Song
- Cardiovascular Research Institute and Department of Medicine, Morehouse School of Medicine, Atlanta, GA, 30310 USA, 4DGENOME Inc, Atlanta, GA, 30033 USA and Center of Big Data and Bioinformatics, First Affiliated Hospital of Medical School, Xi'an Jiaotong University, Xi'an, Shaanxi, 710061 China
|
20
|
Huang YS, Ramensky V, Service SK, Jasinska AJ, Jung Y, Choi OW, Cantor RM, Juretic N, Wasserscheid J, Kaplan JR, Jorgensen MJ, Dyer TD, Dewar K, Blangero J, Wilson RK, Warren W, Weinstock GM, Freimer NB. Sequencing strategies and characterization of 721 vervet monkey genomes for future genetic analyses of medically relevant traits. BMC Biol 2015; 13:41. [PMID: 26092298 PMCID: PMC4494155 DOI: 10.1186/s12915-015-0152-2] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 04/24/2015] [Accepted: 06/11/2015] [Indexed: 12/30/2022]
Abstract
Background We report here the first genome-wide high-resolution polymorphism resource for non-human primate (NHP) association and linkage studies, constructed for the Caribbean-origin vervet monkey, or African green monkey (Chlorocebus aethiops sabaeus), one of the most widely used NHPs in biomedical research. We generated this resource by whole genome sequencing (WGS) of monkeys from the Vervet Research Colony (VRC), an NIH-supported research resource for which extensive phenotypic data are available. Results We identified genome-wide single nucleotide polymorphisms (SNPs) by WGS of 721 members of an extended pedigree from the VRC. From high-depth WGS data we identified more than 4 million polymorphic unequivocal segregating sites; by pruning these SNPs based on heterozygosity, quality control filters, and the degree of linkage disequilibrium (LD) between SNPs, we constructed genome-wide panels suitable for genetic association (about 500,000 SNPs) and linkage analysis (about 150,000 SNPs). To further enhance the utility of these resources for linkage analysis, we used a further pruned subset of the linkage panel to generate multipoint identity by descent matrices. Conclusions The genetic and phenotypic resources now available for the VRC and other Caribbean-origin vervets enable their use for genetic investigation of traits relevant to human diseases.
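The LD-based pruning step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the greedy all-pairs strategy, the function names `r_squared` and `ld_prune`, and the 0.5 threshold are assumptions chosen only to show the idea of dropping SNPs that are highly correlated with SNPs already kept.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, threshold=0.5):
    """Greedy pruning: keep a SNP only if its r^2 with every kept SNP is below threshold.

    genotypes -- list of per-SNP dosage vectors, one value per individual
    Returns the indices of the SNPs that were kept (hypothetical helper).
    """
    kept = []
    for i, dosages in enumerate(genotypes):
        if all(r_squared(dosages, genotypes[j]) < threshold for j in kept):
            kept.append(i)
    return kept


snps = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],  # identical to the first SNP, so it is pruned
    [2, 0, 1, 1, 2, 0],
    [1, 0, 1, 2, 1, 1],
]
print(ld_prune(snps, threshold=0.5))
```

Real tools usually restrict comparisons to a sliding window along the chromosome; this sketch compares against all kept SNPs for brevity.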
Affiliation(s)
- Yu S Huang
- Center for Neurobehavioral Genetics, University of California Los Angeles, Los Angeles, CA, 90095, USA; Present address: 5200 Illumina Way, San Diego, CA, 92122, USA
| | - Vasily Ramensky
- Center for Neurobehavioral Genetics, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Susan K Service
- Center for Neurobehavioral Genetics, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Anna J Jasinska
- Center for Neurobehavioral Genetics, University of California Los Angeles, Los Angeles, CA, 90095, USA; Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, Poland
| | - Yoon Jung
- Center for Neurobehavioral Genetics, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Oi-Wa Choi
- Center for Neurobehavioral Genetics, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Rita M Cantor
- Department of Human Genetics, University of California, Los Angeles, CA, 90095, USA
| | - Nikoleta Juretic
- Department of Human Genetics, McGill University, Montreal, Canada
| | - Jay R Kaplan
- Department of Pathology, Section on Comparative Medicine, Wake Forest School of Medicine, Medical Center Boulevard, Winston-Salem, NC, 27157-1040, USA
| | - Matthew J Jorgensen
- Department of Pathology, Section on Comparative Medicine, Wake Forest School of Medicine, Medical Center Boulevard, Winston-Salem, NC, 27157-1040, USA
| | - Thomas D Dyer
- South Texas Diabetes and Obesity Institute, UTHSCSA/UTRGV, Brownsville, TX, USA
| | - Ken Dewar
- Department of Human Genetics, McGill University, Montreal, Canada
| | - John Blangero
- South Texas Diabetes and Obesity Institute, UTHSCSA/UTRGV, Brownsville, TX, USA
| | - Richard K Wilson
- The Genome Institute, Washington University School of Medicine, Genome Sequencing Center, St. Louis, MO, 63108, USA
| | - Wesley Warren
- The Genome Institute, Washington University School of Medicine, Genome Sequencing Center, St. Louis, MO, 63108, USA
| | - Nelson B Freimer
- Center for Neurobehavioral Genetics, University of California Los Angeles, Los Angeles, CA, 90095, USA.
|
21
|
Leveraging Identity-by-Descent for Accurate Genotype Inference in Family Sequencing Data. PLoS Genet 2015; 11:e1005271. [PMID: 26043085 PMCID: PMC4456389 DOI: 10.1371/journal.pgen.1005271] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/28/2014] [Accepted: 05/12/2015] [Indexed: 12/23/2022]
Abstract
Sequencing family DNA samples provides an attractive alternative to population-based designs for identifying rare variants associated with human disease, owing to the enrichment of causal variants in pedigrees. Previous studies showed that genotype calling accuracy can be improved by modeling family relatedness compared to standard calling algorithms. Current family-based variant calling methods use sequencing data on single variants and ignore the identity-by-descent (IBD) sharing along the genome. In this study we describe a new computational framework to accurately estimate the IBD sharing from the sequencing data, and to utilize the inferred IBD among family members to jointly call genotypes in pedigrees. Through simulations and application to real data, we showed that IBD can be reliably estimated across the genome, even at very low coverage (e.g. 2X), and genotype accuracy can be dramatically improved. Moreover, the improvement is more pronounced for variants with low frequencies, especially at low to intermediate coverage (e.g. 10X to 20X), making our approach effective in studying rare variants in cost-effective whole genome sequencing in pedigrees. We hope that our tool is useful to the research community for identifying rare variants for human disease through family-based sequencing.
To identify disease variants that occur less frequently in the population, sequencing families in which multiple individuals are affected is more powerful due to the enrichment of causal variants. An important step in such studies is to infer individual genotypes from sequencing data. Existing methods do not utilize full familial transmission information and therefore result in reduced accuracy of inferred genotypes. In this study we describe a new method that infers shared genetic material among family members and then incorporates the shared genomic information in a novel algorithm that can accurately infer genotypes. Our method is particularly advantageous when inferring low-frequency variants from fewer sequence data, making it effective in analyzing genome-wide sequence data. We implemented the algorithm in a computationally efficient tool to facilitate cost-effective sequencing in families for identifying disease genetic variants.
|
22
|
Li J, Jiang Y, Wang T, Chen H, Xie Q, Shao Q, Ran X, Xia K, Sun ZS, Wu J. mirTrios: an integrated pipeline for detection of de novo and rare inherited mutations from trios-based next-generation sequencing. J Med Genet 2015; 52:275-81. [PMID: 25596308 DOI: 10.1136/jmedgenet-2014-102656] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Indexed: 12/11/2022]
Abstract
OBJECTIVES Recently, several studies documented that de novo mutations (DNMs) play important roles in the aetiology of sporadic diseases. Next-generation sequencing (NGS) enables variant calling at single-base resolution on a genome-wide scale. However, accurate identification of DNMs from NGS data still remains a major challenge. We developed mirTrios, a web server, to accurately detect DNMs and rare inherited mutations from NGS data in sporadic diseases. METHODS The expectation-maximisation (EM) model was adopted to accurately identify DNMs from variant call files of a trio generated by GATK (Genome Analysis Toolkit). The GATK results, which contain certain basic properties (such as PL, PRT and PART), are iteratively integrated into the EM model to strike a threshold for DNMs detection. Training sets of true and false positive DNMs in the EM model were built from whole genome sequencing data of 64 trios. RESULTS With our in-house whole exome sequencing datasets from 20 trios, mirTrios totally identified 27 DNMs in the coding region, 25 of which (92.6%) are validated as true positives. In addition, to facilitate the interpretation of diverse mutations, mirTrios can also be employed in the identification of rare inherited mutations. Embedded with abundant annotation of DNMs and rare inherited mutations, mirTrios also supports known diagnostic variants and causative gene identification, as well as the prioritisation of novel and promising candidate genes. CONCLUSIONS mirTrios provides an intuitive interface for the general geneticist and clinician, and can be widely used for detection of DNMs and rare inherited mutations, and annotation in sporadic diseases. mirTrios is freely available at http://centre.bioinformatics.zj.cn/mirTrios/.
Affiliation(s)
- Jinchen Li
- Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China; Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, China; State Key Laboratory of Medical Genetics, Central South University, Changsha, China
| | - Yi Jiang
- Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, China
| | - Tao Wang
- Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, China
| | - Huiqian Chen
- Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, China
| | - Qing Xie
- Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, China
| | - Qianzhi Shao
- Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, China
| | - Xia Ran
- Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, China
| | - Kun Xia
- State Key Laboratory of Medical Genetics, Central South University, Changsha, China
| | - Zhong Sheng Sun
- Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China; Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, China
| | - Jinyu Wu
- Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China; Institute of Genomic Medicine, Wenzhou Medical University, Wenzhou, China
|
23
|
Chen R, Wei Q, Zhan X, Zhong X, Sutcliffe JS, Cox NJ, Cook EH, Li C, Chen W, Li B. A haplotype-based framework for group-wise transmission/disequilibrium tests for rare variant association analysis. Bioinformatics 2015; 31:1452-9. [PMID: 25568282 DOI: 10.1093/bioinformatics/btu860] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 09/12/2014] [Accepted: 12/23/2014] [Indexed: 12/30/2022]
Abstract
MOTIVATION A major focus of current sequencing studies for human genetics is to identify rare variants associated with complex diseases. Aside from reduced power of detecting associated rare variants, controlling for population stratification is particularly challenging for rare variants. Transmission/disequilibrium tests (TDT) based on family designs are robust to population stratification and admixture, and therefore provide an effective approach to rare variant association studies to eliminate spurious associations. To increase the power of rare variant association analysis, gene-based collapsing methods have become standard approaches for analyzing rare variants. Existing methods that extend this strategy to rare variants in families usually combine TDT statistics at individual variants and therefore lack the flexibility of incorporating other genetic models. RESULTS In this study, we describe a haplotype-based framework for group-wise TDT (gTDT) that is flexible enough to encompass a variety of genetic models such as additive, dominant and compound heterozygous (CH) (i.e. recessive) models as well as other complex interactions. Unlike existing methods, gTDT constructs haplotypes by transmission when possible and inherently takes into account the linkage disequilibrium among variants. Through extensive simulations we showed that type I error was correctly controlled for rare variants under all models investigated, and this remained true in the presence of population stratification. Under a variety of genetic models, gTDT showed increased power compared with the single marker TDT. Application of gTDT to autism exome sequencing data from 118 trios identified potentially interesting candidate genes with CH rare variants. AVAILABILITY AND IMPLEMENTATION We implemented gTDT in C++ and the source code and the detailed usage are available on the authors' website (https://medschool.vanderbilt.edu/cgg).
CONTACT bingshan.li@vanderbilt.edu or wei.chen@chp.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
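For orientation, the single-marker TDT that the abstract uses as its baseline can be sketched as follows. This is the classic McNemar-type statistic, not the authors' gTDT (which is implemented in C++ and works on haplotypes); the function name `tdt_statistic` and the example counts are hypothetical.

```python
import math


def tdt_statistic(transmitted, untransmitted):
    """Classic single-marker TDT statistic.

    transmitted   -- times heterozygous parents passed the test allele to an affected child
    untransmitted -- times heterozygous parents passed the other allele instead
    Returns (chi-square statistic with 1 d.f., p-value).
    """
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no informative transmissions from heterozygous parents")
    chi2 = (transmitted - untransmitted) ** 2 / informative
    # Upper tail of a chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 30 transmissions versus 10 non-transmissions.
chi2, p_value = tdt_statistic(30, 10)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}")
```

Only heterozygous parents contribute, which is why the test is robust to population stratification: each parent acts as its own control.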
Affiliation(s)
- Rui Chen
- Department of Molecular Physiology and Biophysics, Vanderbilt University, TN, 37221, USA, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Center for Quantitative Sciences, Vanderbilt University, TN, 37221, USA, Department of Medicine, University of Chicago, Chicago, IL, USA, Department of Psychiatry, University of Illinois at Chicago, Chicago, IL, USA, Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA and Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Qiang Wei
- Department of Molecular Physiology and Biophysics, Vanderbilt University, TN, 37221, USA, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Center for Quantitative Sciences, Vanderbilt University, TN, 37221, USA, Department of Medicine, University of Chicago, Chicago, IL, USA, Department of Psychiatry, University of Illinois at Chicago, Chicago, IL, USA, Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA and Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Xiaowei Zhan
- Department of Molecular Physiology and Biophysics, Vanderbilt University, TN, 37221, USA, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Center for Quantitative Sciences, Vanderbilt University, TN, 37221, USA, Department of Medicine, University of Chicago, Chicago, IL, USA, Department of Psychiatry, University of Illinois at Chicago, Chicago, IL, USA, Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA and Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Xue Zhong
- Department of Molecular Physiology and Biophysics, Vanderbilt University, TN, 37221, USA, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Center for Quantitative Sciences, Vanderbilt University, TN, 37221, USA, Department of Medicine, University of Chicago, Chicago, IL, USA, Department of Psychiatry, University of Illinois at Chicago, Chicago, IL, USA, Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA and Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
| | - James S Sutcliffe
- Department of Molecular Physiology and Biophysics, Vanderbilt University, TN, 37221, USA, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Center for Quantitative Sciences, Vanderbilt University, TN, 37221, USA, Department of Medicine, University of Chicago, Chicago, IL, USA, Department of Psychiatry, University of Illinois at Chicago, Chicago, IL, USA, Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA and Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Nancy J Cox
- Department of Molecular Physiology and Biophysics, Vanderbilt University, TN, 37221, USA, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Center for Quantitative Sciences, Vanderbilt University, TN, 37221, USA, Department of Medicine, University of Chicago, Chicago, IL, USA, Department of Psychiatry, University of Illinois at Chicago, Chicago, IL, USA, Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA and Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Edwin H Cook
- Department of Molecular Physiology and Biophysics, Vanderbilt University, TN, 37221, USA, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Center for Quantitative Sciences, Vanderbilt University, TN, 37221, USA, Department of Medicine, University of Chicago, Chicago, IL, USA, Department of Psychiatry, University of Illinois at Chicago, Chicago, IL, USA, Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA and Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Chun Li
- Department of Molecular Physiology and Biophysics, Vanderbilt University, TN, 37221, USA, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Center for Quantitative Sciences, Vanderbilt University, TN, 37221, USA, Department of Medicine, University of Chicago, Chicago, IL, USA, Department of Psychiatry, University of Illinois at Chicago, Chicago, IL, USA, Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA and Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Wei Chen
- Department of Molecular Physiology and Biophysics, Vanderbilt University, TN, 37221, USA, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Center for Quantitative Sciences, Vanderbilt University, TN, 37221, USA, Department of Medicine, University of Chicago, Chicago, IL, USA, Department of Psychiatry, University of Illinois at Chicago, Chicago, IL, USA, Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA and Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Bingshan Li
- Department of Molecular Physiology and Biophysics, Vanderbilt University, TN, 37221, USA, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX, USA, Center for Quantitative Sciences, Vanderbilt University, TN, 37221, USA, Department of Medicine, University of Chicago, Chicago, IL, USA, Department of Psychiatry, University of Illinois at Chicago, Chicago, IL, USA, Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA and Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
|
24
|
Yoshida K, Sasaki E, Kamoun S. Computational analyses of ancient pathogen DNA from herbarium samples: challenges and prospects. FRONTIERS IN PLANT SCIENCE 2015; 6:771. [PMID: 26442080 PMCID: PMC4585160 DOI: 10.3389/fpls.2015.00771] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 06/08/2015] [Accepted: 09/07/2015] [Indexed: 05/20/2023]
Abstract
The application of DNA sequencing technology to the study of ancient DNA has enabled the reconstruction of past epidemics from genomes of historically important plant-associated microbes. Recently, the genome sequences of the potato late blight pathogen Phytophthora infestans were analyzed from 19th century herbarium specimens. These herbarium samples originated from infected potatoes collected during and after the Irish potato famine. Herbaria have therefore great potential to help elucidate past epidemics of crops, date the emergence of pathogens, and inform about past pathogen population dynamics. DNA preservation in herbarium samples was unexpectedly good, raising the possibility of a whole new research area in plant and microbial genomics. However, the recovered DNA can be extremely fragmented resulting in specific challenges in reconstructing genome sequences. Here we review some of the challenges in computational analyses of ancient DNA from herbarium samples. We also applied the recently developed linkage method to haplotype reconstruction of diploid or polyploid genomes from fragmented ancient DNA.
Affiliation(s)
- Kentaro Yoshida
- Laboratory of Plant Genetics, Graduate School of Agricultural Science, Kobe University, Kobe, Japan
- The Sainsbury Laboratory, Norwich Research Park, Norwich, UK
- *Correspondence: Kentaro Yoshida, Laboratory of Plant Genetics, Graduate School of Agricultural Science, Kobe University, 1-1 Rokkodai, Nada-Ku, Kobe, Japan
| | - Eriko Sasaki
- Gregor Mendel Institute, Austrian Academy of Sciences, Vienna, Austria
| | - Sophien Kamoun
- The Sainsbury Laboratory, Norwich Research Park, Norwich, UK
|
25
|
Kumar P, Al-Shafai M, Al Muftah WA, Chalhoub N, Elsaid MF, Aleem AA, Suhre K. Evaluation of SNP calling using single and multiple-sample calling algorithms by validation against array base genotyping and Mendelian inheritance. BMC Res Notes 2014; 7:747. [PMID: 25339461 PMCID: PMC4216909 DOI: 10.1186/1756-0500-7-747] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 05/06/2014] [Accepted: 10/03/2014] [Indexed: 12/30/2022]
Abstract
BACKGROUND With diminishing costs of next generation sequencing (NGS), whole genome analysis becomes a standard tool for identifying genetic causes of inherited diseases. Commercial NGS service providers in general not only provide raw genomic reads, but further deliver SNP calls to their clients. However, the question for the user arises whether to use the SNP data as is, or process the raw sequencing data further through more sophisticated SNP calling pipelines with more advanced algorithms. RESULTS Here we report a detailed comparison of SNPs called using the popular GATK multiple-sample calling protocol to SNPs delivered as part of a 40x whole genome sequencing project by Illumina Inc of 171 human genomes of Arab descent (108 unrelated Qatari genomes, 19 trios, and 2 families with rare diseases) and compare them to variants provided by the Illumina CASAVA pipeline. GATK multi-sample calling identifies more variants than the CASAVA pipeline. The additional variants from GATK are robust for Mendelian consistencies but weak in terms of statistical parameters such as TsTv ratio. However, these additional variants do not make a difference in detecting the causative variants in the studied phenotype. CONCLUSION Both pipelines, GATK multi-sample calling and Illumina CASAVA single sample calling, have highly similar performance in SNP calling at the level of putatively causative variants.
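The TsTv ratio cited above as a quality parameter is the number of transitions (A<->G, C<->T) divided by the number of transversions among called SNPs. A minimal sketch, assuming SNPs are supplied as (ref, alt) pairs; the function name `ts_tv_ratio` and the input format are illustrative and not taken from GATK or CASAVA.

```python
BASES = {"A", "C", "G", "T"}
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Transition/transversion ratio for an iterable of (ref, alt) alleles.

    Entries that are not single-base substitutions (indels, ref == alt) are skipped.
    """
    transitions = transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a biallelic single-nucleotide substitution
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return transitions / transversions


calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T"), ("AT", "A")]
print(ts_tv_ratio(calls))
```

A call set enriched for false positives tends to drift toward the ratio expected from random substitutions, which is why the metric is used as a sanity check on additional variants.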
Affiliation(s)
| | - Karsten Suhre
- Weill Cornell Medical College in Qatar, Education City, Doha, Qatar.
|
26
|
Abstract
Restriction site-associated DNA sequencing or genotyping-by-sequencing (GBS) approaches allow for rapid and cost-effective discovery and genotyping of thousands of single-nucleotide polymorphisms (SNPs) in multiple individuals. However, rigorous quality control practices are needed to avoid high levels of error and bias with these reduced representation methods. We developed a formal statistical framework for filtering spurious loci, using Mendelian inheritance patterns in nuclear families, that accommodates variable-quality genotype calls and missing data--both rampant issues with GBS data--and for identifying sex-linked SNPs. Simulations predict excellent performance of both the Mendelian filter and the sex-linkage assignment under a variety of conditions. We further evaluate our method by applying it to real GBS data and validating a subset of high-quality SNPs. These results demonstrate that our metric of Mendelian inheritance is a powerful quality filter for GBS loci that is complementary to standard coverage and Hardy-Weinberg filters. The described method, implemented in the software MendelChecker, will improve quality control during SNP discovery in nonmodel as well as model organisms.
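The idea of filtering loci by Mendelian inheritance can be shown with a deliberately simplified sketch. MendelChecker itself uses a formal statistical framework that accommodates genotype quality and missing data; the version below only tests hard genotype calls at autosomal loci, and the names `is_mendelian_consistent` and `mendelian_error_rate` are hypothetical.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """Check one autosomal locus in a trio using hard genotype calls.

    Each genotype is a pair of alleles such as ("A", "G"); None marks a missing call.
    Returns True/False, or None when the trio cannot be evaluated.
    """
    if father is None or mother is None or child is None:
        return None
    # Every child genotype obtainable from one paternal and one maternal allele.
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


def mendelian_error_rate(trios):
    """Fraction of evaluable trios at a locus that violate Mendelian inheritance."""
    evaluated = errors = 0
    for father, mother, child in trios:
        consistent = is_mendelian_consistent(father, mother, child)
        if consistent is None:
            continue  # skip trios with missing calls
        evaluated += 1
        if not consistent:
            errors += 1
    return errors / evaluated if evaluated else float("nan")


locus = [
    (("A", "A"), ("A", "G"), ("A", "G")),  # consistent
    (("A", "A"), ("A", "G"), ("G", "G")),  # impossible without a second G allele
    (None, ("A", "G"), ("A", "G")),        # missing father, not evaluated
]
print(mendelian_error_rate(locus))
```

Loci with a high error rate across families would be candidates for removal; sex-linked SNPs need separate handling because they do not follow the autosomal rule assumed here.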
|
27
|
Fine-scale human genetic structure in Western France. Eur J Hum Genet 2014; 23:831-6. [PMID: 25182131 DOI: 10.1038/ejhg.2014.175] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 10/08/2013] [Revised: 07/21/2014] [Accepted: 07/30/2014] [Indexed: 11/08/2022]
Abstract
The difficulties arising from association analysis with rare variants underline the importance of suitable reference population cohorts, which integrate detailed spatial information. We analyzed a sample of 1684 individuals from Western France, who were genotyped at genome-wide level, from two cohorts D.E.S.I.R and CavsGen. We found that fine-scale population structure occurs at the scale of Western France, with distinct admixture proportions for individuals originating from the Brittany Region and the Vendée Department. Genetic differentiation increases with distance at a high rate in these two parts of Northwestern France and linkage disequilibrium is higher in Brittany suggesting a lower effective population size. When looking for genomic regions informative about Breton origin, we found two prominent associated regions that include the lactase region and the HLA complex. For both the lactase and the HLA regions, there is a low differentiation between Bretons and Irish, and this is also found at the genome-wide level. At a more refined scale, and within the Pays de la Loire Region, we also found evidence of fine-scale population structure, although principal component analysis showed that individuals from different departments cannot be confidently discriminated. Because of the evidence for fine-scale genetic structure in Western France, we anticipate that rare and geographically localized variants will be identified in future full-sequence analyses.
|
28
|
Teare MD, Santibañez Koref MF. Linkage analysis and the study of Mendelian disease in the era of whole exome and genome sequencing. Brief Funct Genomics 2014; 13:378-83. [PMID: 25024279 DOI: 10.1093/bfgp/elu024] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 12/25/2022]
Abstract
Whole exome and whole genome sequencing are now routinely used in the study of inherited disease, and some of their major successes have been the identification of genes involved in disease predisposition in pedigrees where disease seems to follow Mendelian inheritance patterns. These successes include scenarios where only a single individual was sequenced and raise the question whether linkage analysis has become superfluous. Linkage analysis requires genome-wide genotyping on family-based data, and traditionally the linkage analysis was performed before the targeting sequencing stage. However, methods are emerging that seek to exploit the capability of linkage analysis to integrate data both across individuals and across pedigrees. This ability has been exploited to select samples used for sequencing studies and to identify among the variants uncovered by sequencing those mapping to regions likely to contain the gene of interest and, more generally, to improve variant detection. So, although the formal isolated linkage analysis stage is less commonly seen, when uncovering the genetic basis of Mendelian disease, methods relying heavily on genetic linkage analysis principles are being integrated directly into the whole mapping process ranging from sample selection to variant calling and filtering.
|
29
|
Liu D, Ma C, Hong W, Huang L, Liu M, Liu H, Zeng H, Deng D, Xin H, Song J, Xu C, Sun X, Hou X, Wang X, Zheng H. Construction and analysis of high-density linkage map using high-throughput sequencing data. PLoS One 2014; 9:e98855. [PMID: 24905985 PMCID: PMC4048240 DOI: 10.1371/journal.pone.0098855] [Citation(s) in RCA: 180] [Impact Index Per Article: 18.0] [Received: 01/06/2014] [Accepted: 05/08/2014] [Indexed: 12/31/2022]
Abstract
Linkage maps enable the study of important biological questions. The construction of high-density linkage maps appears more feasible since the advent of next-generation sequencing (NGS), which eases SNP discovery and high-throughput genotyping of large population. However, the marker number explosion and genotyping errors from NGS data challenge the computational efficiency and linkage map quality of linkage study methods. Here we report the HighMap method for constructing high-density linkage maps from NGS data. HighMap employs an iterative ordering and error correction strategy based on a k-nearest neighbor algorithm and a Monte Carlo multipoint maximum likelihood algorithm. Simulation study shows HighMap can create a linkage map with three times as many markers as ordering-only methods while offering more accurate marker orders and stable genetic distances. Using HighMap, we constructed a common carp linkage map with 10,004 markers. The singleton rate was less than one-ninth of that generated by JoinMap4.1. Its total map distance was 5,908 cM, consistent with reports on low-density maps. HighMap is an efficient method for constructing high-density, high-quality linkage maps from high-throughput population NGS data. It will facilitate genome assembling, comparative genomic analysis, and QTL studies. HighMap is available at http://highmap.biomarker.com.cn/.
Affiliation(s)
- Dongyuan Liu
- Biomarker Technologies Corporation, Beijing, China
- Chouxian Ma
- Biomarker Technologies Corporation, Beijing, China
- Weiguo Hong
- Biomarker Technologies Corporation, Beijing, China
- Long Huang
- Biomarker Technologies Corporation, Beijing, China
- Min Liu
- Biomarker Technologies Corporation, Beijing, China
- Hui Liu
- Biomarker Technologies Corporation, Beijing, China
- Huaping Zeng
- Biomarker Technologies Corporation, Beijing, China
- Dejing Deng
- Biomarker Technologies Corporation, Beijing, China
- Huaigen Xin
- Biomarker Technologies Corporation, Beijing, China
- Jun Song
- Biomarker Technologies Corporation, Beijing, China
- Chunhua Xu
- Biomarker Technologies Corporation, Beijing, China
- Xiaowen Sun
- Heilongjiang River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Harbin, China
- Xilin Hou
- State Key Laboratory of Crop Genetic and Germplasm Enhancement, Key Laboratory of Biology and Germplasm Enhancement of Horticultural Crops in East China, Ministry of Agriculture, Nanjing Agricultural University, Nanjing, China
- Xiaowu Wang
- Biomarker Technologies Corporation, Beijing, China
- Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences (IVF, CAAS), Beijing, China
- Hongkun Zheng
- Biomarker Technologies Corporation, Beijing, China
|
30
|
Genomic and phenotypic characterization of a wild medaka population: towards the establishment of an isogenic population genetic resource in fish. G3-GENES GENOMES GENETICS 2014; 4:433-45. [PMID: 24408034 PMCID: PMC3962483 DOI: 10.1534/g3.113.008722] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Indexed: 01/03/2023]
Abstract
Oryzias latipes (medaka) has been established as a vertebrate genetic model for more than a century and recently has been rediscovered outside its native Japan. The power of new sequencing methods now makes it possible to reinvigorate medaka genetics, in particular by establishing a near-isogenic panel derived from a single wild population. Here we characterize the genomes of wild medaka catches obtained from a single Southern Japanese population in Kiyosu as a precursor for the establishment of a near-isogenic panel of wild lines. The population is free of significant detrimental population structure and has advantageous linkage disequilibrium properties suitable for the establishment of the proposed panel. Analysis of morphometric traits in five representative inbred strains suggests phenotypic mapping will be feasible in the panel. In addition, high-throughput genome sequencing of these medaka strains confirms that their evolutionary relationships follow lines of geographic separation and provides further evidence that there has been little significant interbreeding between the Southern and Northern medaka populations since the Southern/Northern population split. The sequence data suggest that the Southern Japanese medaka existed as a larger, older population that went through a relatively recent bottleneck approximately 10,000 years ago. In addition, we detect patterns of recent positive selection in the Southern population. These data indicate that the genetic structure of the Kiyosu medaka samples is suitable for the establishment of a vertebrate near-isogenic panel, and inbreeding of 200 lines based on this population has therefore commenced. Progress of this project can be tracked at http://www.ebi.ac.uk/birney-srv/medaka-ref-panel.
|
31
|
Li B, Liu DJ, Leal SM. Identifying rare variants associated with complex traits via sequencing. Curr Protoc Hum Genet 2014; Chapter 1:Unit 1.26. [PMID: 23853079 DOI: 10.1002/0471142905.hg0126s78] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Indexed: 12/12/2022]
Abstract
Although genome-wide association studies have been successful in detecting associations with common variants, there is currently an increasing interest in identifying low-frequency and rare variants associated with complex traits. Next-generation sequencing technologies make it feasible to survey the full spectrum of genetic variation in coding regions or the entire genome. However, association analysis for rare variants is challenging, and traditional methods are ineffective because of the low frequency of rare variants, coupled with allelic heterogeneity. Recently a battery of new statistical methods has been proposed for identifying rare variants associated with complex traits. These methods test for associations by aggregating multiple rare variants across a gene or a genomic region or among a group of variants in the genome. In this unit, we describe key concepts for rare variant association for complex traits, survey some of the recent methods, discuss their statistical power under various scenarios, and provide practical guidance on analyzing next-generation sequencing data for identifying rare variants associated with complex traits.
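The aggregation idea in this abstract can be illustrated with a minimal Python sketch. It collapses the rare variants of one gene into a per-individual burden count. The genotype encoding, the frequency threshold, and the function name `rare_variant_burden` are illustrative assumptions and are not taken from the unit itself.

```python
from typing import List


def rare_variant_burden(genotypes: List[List[int]],
                        maf_threshold: float = 0.01) -> List[int]:
    """Return one burden score per individual.

    genotypes[i][j] is the minor-allele count (0, 1, or 2) of individual i
    at variant j. A variant counts as rare when its sample allele frequency
    is above 0 and below maf_threshold (hypothetical cut-off).
    """
    n = len(genotypes)
    n_variants = len(genotypes[0]) if n else 0
    rare = []
    for j in range(n_variants):
        freq = sum(row[j] for row in genotypes) / (2 * n)
        if 0 < freq < maf_threshold:
            rare.append(j)
    # Aggregate: add up minor alleles over the rare variants only.
    return [sum(row[j] for j in rare) for row in genotypes]


# Toy data: 5 individuals x 3 variants; the third variant is common.
toy = [
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 2],
]
print(rare_variant_burden(toy, maf_threshold=0.2))
```

In practice the burden scores would then be tested against the trait, for example by regression. That step is omitted here.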
Affiliation(s)
- Bingshan Li
- Department of Molecular Physiology and Biophysics, Center for Human Genetics Research, Vanderbilt University, Nashville, Tennessee, USA
|
32
|
He Z, O'Roak BJ, Smith JD, Wang G, Hooker S, Santos-Cortez RLP, Li B, Kan M, Krumm N, Nickerson DA, Shendure J, Eichler EE, Leal SM. Rare-variant extensions of the transmission disequilibrium test: application to autism exome sequence data. Am J Hum Genet 2014; 94:33-46. [PMID: 24360806 DOI: 10.1016/j.ajhg.2013.11.021] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Received: 10/08/2013] [Accepted: 11/26/2013] [Indexed: 11/18/2022]
Abstract
Many population-based rare-variant (RV) association tests, which aggregate variants across a region, have been developed to analyze sequence data. A drawback of analyzing population-based data is that it is difficult to adequately control for population substructure and admixture, and spurious associations can occur. For RVs, this problem can be substantial, because the spectrum of rare variation can differ greatly between populations. A solution is to analyze parent-child trio data by using the transmission disequilibrium test (TDT), which is robust to population substructure and admixture. We extended the TDT to test for RV associations using four commonly used methods. We demonstrate that for all RV-TDT methods, using proper analysis strategies, type I error is well controlled even when there are high levels of population substructure or admixture. For trio data, unlike for population-based data, RV allele-counting association methods will lead to inflated type I errors. However, type I errors can be properly controlled by obtaining p values empirically through haplotype permutation. The power of the RV-TDT methods was evaluated and compared to the analysis of case-control data with a number of genetic and disease models. The RV-TDT was also used to analyze exome data from 199 Simons Simplex Collection autism trios, and an association was observed with variants in ABCA7. Given the problem of adequately controlling for population substructure and admixture in RV association studies and the growing number of sequence-based trio studies, the RV-TDT is extremely beneficial for elucidating the involvement of RVs in the etiology of complex traits.
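For orientation, the classical single-variant TDT that these methods extend can be sketched in a few lines of Python. This is not the RV-TDT itself, and it does not include the haplotype-permutation step mentioned in the abstract. The function names and the example counts are illustrative assumptions.

```python
import math


def tdt_statistic(transmitted: int, untransmitted: int) -> float:
    """Classical TDT chi-square with 1 degree of freedom.

    transmitted / untransmitted: how often heterozygous parents passed the
    candidate allele to an affected child, or failed to do so.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no heterozygous parents to inform the test")
    return (transmitted - untransmitted) ** 2 / total


def chi2_1df_pvalue(stat: float) -> float:
    """Upper-tail p value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: 30 transmissions versus 10 non-transmissions.
stat = tdt_statistic(30, 10)
print(stat, chi2_1df_pvalue(stat))
```

Because only transmissions from heterozygous parents enter the statistic, each family serves as its own control, which is why the test is robust to population substructure.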
Affiliation(s)
- Zongxiao He
- Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Brian J O'Roak
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
- Joshua D Smith
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
- Gao Wang
- Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Stanley Hooker
- Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Regie Lyn P Santos-Cortez
- Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Biao Li
- Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Mengyuan Kan
- Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Nik Krumm
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
- Deborah A Nickerson
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
- Jay Shendure
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
- Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
- Suzanne M Leal
- Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
|
33
|
Kojima K, Nariai N, Mimori T, Takahashi M, Yamaguchi-Kabata Y, Sato Y, Nagasaki M. A statistical variant calling approach from pedigree information and local haplotyping with phase informative reads. Bioinformatics 2013; 29:2835-43. [PMID: 24002111 DOI: 10.1093/bioinformatics/btt503] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Variant calling from genome-wide sequencing data is essential for the analysis of disease-causing mutations and elucidation of disease mechanisms. However, variant calling in low coverage regions is difficult due to sequence read errors and mapping errors. Hence, variant calling approaches that are robust to low coverage data are needed. RESULTS: We propose a new variant calling approach that considers pedigree information and haplotyping based on sequence reads spanning two or more heterozygous positions, termed phase informative reads. In our approach, genotyping and haplotyping by the assignment of each read to a haplotype based on phase informative reads are performed simultaneously. Therefore, positions with low evidence for heterozygosity are rescued by phase informative reads, and such rescued positions contribute to haplotyping in a synergistic way. In addition, pedigree information supports more accurate haplotyping as well as genotyping, especially in low coverage regions. Although heterozygous positions are useful for haplotyping, homozygous positions are not informative and weaken the information from heterozygous positions, as the majority of positions are homozygous. Thus, we introduce latent variables that determine zygosity at each position to filter out homozygous positions for haplotyping. In a performance evaluation with parent-offspring trio sequencing data, our approach outperforms existing approaches in accuracy, measured as agreement with single nucleotide polymorphism array genotyping results. Also, performance analysis considering the distance between variants showed that the use of phase informative reads is effective for accurate variant calling, and further performance improvement is expected with longer sequencing reads. CONTACT: kojima@megabank.tohoku.ac.jp
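The definition of a phase informative read given in this abstract (a read spanning two or more heterozygous positions) can be sketched as a simple filter in Python. The inclusive `(start, end)` read representation and the function name are illustrative assumptions. The authors' statistical model, with its latent zygosity variables and pedigree information, is not reproduced here.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def phase_informative_reads(reads: List[Tuple[int, int]],
                            het_positions: List[int]) -> List[Tuple[int, int]]:
    """Keep reads that cover at least two heterozygous positions.

    Each read is an inclusive (start, end) interval on one chromosome.
    Gaps, indels, and read pairs are ignored in this simplified sketch.
    """
    sites = sorted(het_positions)
    informative = []
    for start, end in reads:
        # Number of heterozygous sites s with start <= s <= end.
        covered = bisect_right(sites, end) - bisect_left(sites, start)
        if covered >= 2:
            informative.append((start, end))
    return informative


het_sites = [100, 150, 400]
reads = [(90, 160), (140, 200), (350, 450), (95, 400)]
print(phase_informative_reads(reads, het_sites))
```

Longer reads cover more heterozygous sites on average, which is consistent with the abstract's expectation that performance improves with longer reads.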
Affiliation(s)
- Kaname Kojima
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi 980-8573, Japan
|
34
|
Goldstein DB, Allen A, Keebler J, Margulies EH, Petrou S, Petrovski S, Sunyaev S. Sequencing studies in human genetics: design and interpretation. Nat Rev Genet 2013; 14:460-70. [PMID: 23752795 DOI: 10.1038/nrg3455] [Citation(s) in RCA: 185] [Impact Index Per Article: 16.8] [Indexed: 12/11/2022]
Abstract
Next-generation sequencing is becoming the primary discovery tool in human genetics. There have been many clear successes in identifying genes that are responsible for Mendelian diseases, and sequencing approaches are now poised to identify the mutations that cause undiagnosed childhood genetic diseases and those that predispose individuals to more common complex diseases. There are, however, growing concerns that the complexity and magnitude of complete sequence data could lead to an explosion of weakly justified claims of association between genetic variants and disease. Here, we provide an overview of the basic workflow in next-generation sequencing studies and emphasize, where possible, measures and considerations that facilitate accurate inferences from human sequencing studies.
Affiliation(s)
- David B Goldstein
- Center for Human Genome Variation, Duke University School of Medicine, 308 Research Drive, Box 91009, LSRC B Wing, Room 330, Durham, North Carolina 27708, USA.
|
35
|
Liu X, Ong RTH, Pillai EN, Elzein AM, Small KS, Clark TG, Kwiatkowski DP, Teo YY. Detecting and characterizing genomic signatures of positive selection in global populations. Am J Hum Genet 2013; 92:866-81. [PMID: 23731540 DOI: 10.1016/j.ajhg.2013.04.021] [Citation(s) in RCA: 61] [Impact Index Per Article: 5.5] [Received: 10/20/2012] [Revised: 04/17/2013] [Accepted: 04/24/2013] [Indexed: 12/20/2022]
Abstract
Natural selection is a significant force that shapes the architecture of the human genome and introduces diversity across global populations. The question of whether advantageous mutations have arisen in the human genome as a result of single or multiple mutation events remains unanswered, except for a handful of genes such as those that confer lactase persistence, affect skin pigmentation, or cause sickle cell anemia. We have developed a long-range-haplotype method for identifying genomic signatures of positive selection to complement existing methods, such as the integrated haplotype score (iHS) or cross-population extended haplotype homozygosity (XP-EHH), for locating signals across the entire allele frequency spectrum. Our method also locates the founder haplotypes that carry the advantageous variants and infers their corresponding population frequencies. This presents an opportunity to systematically interrogate, across the whole human genome, whether a selection signal shared across different populations is the consequence of a single mutation process followed by gene flow between populations or of convergent evolution due to the occurrence of multiple independent mutation events either at the same variant or within the same gene. The application of our method to data from 14 populations across the world revealed that positive-selection events tend to cluster in populations of the same ancestry. Comparing the founder haplotypes for events that are present across different populations revealed that convergent evolution is a rare occurrence and that the majority of shared signals stem from the same evolutionary event.
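As background for the existing methods named in this abstract, both iHS and XP-EHH build on extended haplotype homozygosity (EHH). The Python sketch below computes EHH for a set of haplotypes that share a core allele. It is not the authors' new long-range-haplotype method, and the string encoding of haplotypes and the function name `ehh` are illustrative assumptions.

```python
from collections import Counter
from math import comb
from typing import Sequence


def ehh(haplotypes: Sequence[str], core: int, distance: int) -> float:
    """Extended haplotype homozygosity from the core site out to `distance`.

    EHH is the probability that two haplotypes drawn at random (without
    replacement) are identical over the window [core, core + distance].
    All haplotypes passed in are assumed to carry the same core allele.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    windows = Counter(h[core:core + distance + 1] for h in haplotypes)
    identical_pairs = sum(comb(count, 2) for count in windows.values())
    return identical_pairs / comb(n, 2)


# Four toy haplotypes carrying allele '0' at the core site (index 0).
haps = ["0101", "0101", "0111", "0100"]
for d in range(4):
    print(d, ehh(haps, core=0, distance=d))
```

EHH starts at 1 at the core and decays with distance as recombination and mutation break up the shared haplotype. A slow decay around one allele is the kind of signal that haplotype-based selection scans look for.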
Affiliation(s)
- Xuanyao Liu
- NUS Graduate School for Integrative Science and Engineering, National University of Singapore, Singapore 117456, Singapore; Saw Swee Hock School of Public Health, National University of Singapore, Singapore 117597, Singapore
|
36
|
Appels R, Barrero R, Bellgard M. Advances in biotechnology and informatics to link variation in the genome to phenotypes in plants and animals. Funct Integr Genomics 2013; 13:1-9. [PMID: 23494190 PMCID: PMC3605488 DOI: 10.1007/s10142-013-0319-2] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 02/25/2013] [Revised: 03/02/2013] [Accepted: 03/03/2013] [Indexed: 11/27/2022]
Abstract
Advances in our understanding of genome structure provide consistent evidence for the existence of a core genome representing species classically defined by phenotype, as well as conditionally dispensable components of the genome that show extensive variation between individuals of a given species. Generally, conservation of phenotypic features between species reflects conserved features of the genome; however, this is evidently not always the case, as demonstrated by the analysis of the tunicate chordate Oikopleura dioica. In both plants and animals, the methylation activity of DNA and histones continues to present new variables for modifying (eventually) the phenotype of an organism and provides for structural variation that builds on the point mutations, rearrangements, indels, and amplification of retrotransposable elements traditionally considered. The translation of the advances in the structure/function analysis of the genome to industry is facilitated through the capture of research outputs in "toolboxes" that remain accessible in the public domain.
Affiliation(s)
- R. Appels
- Centre for Comparative Genomics, Murdoch University, Perth, WA 6150 Australia
- R. Barrero
- Centre for Comparative Genomics, Murdoch University, Perth, WA 6150 Australia
- M. Bellgard
- Centre for Comparative Genomics, Murdoch University, Perth, WA 6150 Australia
|