1
High-resolution African HLA resource uncovers HLA-DRB1 expression effects underlying vaccine response. Nat Med 2024; 30:1384-1394. [PMID: 38740997 PMCID: PMC11108778 DOI: 10.1038/s41591-024-02944-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Accepted: 03/25/2024] [Indexed: 05/16/2024]
Abstract
How human genetic variation contributes to vaccine effectiveness in infants is unclear, and data are limited on these relationships in populations with African ancestries. We undertook genetic analyses of vaccine antibody responses in infants from Uganda (n = 1391), Burkina Faso (n = 353) and South Africa (n = 755), identifying associations between human leukocyte antigen (HLA) and antibody response for five of eight tested antigens spanning pertussis, diphtheria and hepatitis B vaccines. In addition, through HLA typing 1,702 individuals from 11 populations of African ancestry derived predominantly from the 1000 Genomes Project, we constructed an imputation resource, fine-mapping class II HLA-DR and DQ associations explaining up to 10% of antibody response variance in our infant cohorts. We observed differences in the genetic architecture of pertussis antibody response between the cohorts with African ancestries and an independent cohort with European ancestry, but found no in silico evidence of differences in HLA peptide binding affinity or breadth. Using immune cell expression quantitative trait loci datasets derived from African-ancestry samples from the 1000 Genomes Project, we found evidence of differential HLA-DRB1 expression correlating with inferred protection from pertussis following vaccination. This work suggests that HLA-DRB1 expression may play a role in vaccine response and should be considered alongside peptide selection to improve vaccine design.
2
A common NFKB1 variant detected through antibody analysis in UK Biobank predicts risk of infection and allergy. Am J Hum Genet 2024; 111:295-308. [PMID: 38232728 PMCID: PMC10870136 DOI: 10.1016/j.ajhg.2023.12.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2023] [Revised: 12/07/2023] [Accepted: 12/10/2023] [Indexed: 01/19/2024]
Abstract
Infectious agents contribute significantly to the global burden of diseases through both acute infection and their chronic sequelae. We leveraged the UK Biobank to identify genetic loci that influence humoral immune response to multiple infections. From 45 genome-wide association studies in 9,611 participants from UK Biobank, we identified NFKB1 as a locus associated with quantitative antibody responses to multiple pathogens, including those from the herpes, retro-, and polyoma-virus families. An insertion-deletion variant thought to affect NFKB1 expression (rs28362491) was mapped as the likely causal variant and could play a key role in regulation of the immune response. Using 121 infection- and inflammation-related traits in 487,297 UK Biobank participants, we show that the deletion allele was associated with an increased risk of infection from diverse pathogens but had a protective effect against allergic disease. We propose that altered expression of NFKB1, as a result of the deletion, modulates hematopoietic pathways and likely impacts cell survival, antibody production, and inflammation. Taken together, we show that disruptions to the tightly regulated immune processes may tip the balance between exacerbated immune responses and allergy, or increased risk of infection and impaired resolution of inflammation.
3
Age-dependent topic modeling of comorbidities in UK Biobank identifies disease subtypes with differential genetic risk. Nat Genet 2023; 55:1854-1865. [PMID: 37814053 PMCID: PMC10632146 DOI: 10.1038/s41588-023-01522-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/28/2022] [Accepted: 08/31/2023] [Indexed: 10/11/2023]
Abstract
The analysis of longitudinal data from electronic health records (EHRs) has the potential to improve clinical diagnoses and enable personalized medicine, motivating efforts to identify disease subtypes from patient comorbidity information. Here we introduce an age-dependent topic modeling (ATM) method that provides a low-rank representation of longitudinal records of hundreds of distinct diseases in large EHR datasets. We applied ATM to 282,957 UK Biobank samples, identifying 52 diseases with heterogeneous comorbidity profiles; analyses of 211,908 All of Us samples produced concordant results. We defined subtypes of the 52 heterogeneous diseases based on their comorbidity profiles and compared genetic risk across disease subtypes using polygenic risk scores (PRSs), identifying 18 disease subtypes whose PRS differed significantly from other subtypes of the same disease. We further identified specific genetic variants with subtype-dependent effects on disease risk. In conclusion, ATM identifies disease subtypes with differential genome-wide and locus-specific genetic risk profiles.
4
Topic modeling identifies novel genetic loci associated with multimorbidities in UK Biobank. CELL GENOMICS 2023; 3:100371. [PMID: 37601973 PMCID: PMC10435382 DOI: 10.1016/j.xgen.2023.100371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2022] [Revised: 05/04/2023] [Accepted: 07/07/2023] [Indexed: 08/22/2023]
Abstract
Many diseases show patterns of co-occurrence, possibly driven by systemic dysregulation of underlying processes affecting multiple traits. We have developed a method (treeLFA) for identifying such multimorbidities from routine health-care data, which combines topic modeling with an informative prior derived from medical ontology. We apply treeLFA to UK Biobank data and identify a variety of topics representing multimorbidity clusters, including a healthy topic. We find that loci identified using topic weights as traits in a genome-wide association study (GWAS) analysis, which we validated with a range of approaches, only partially overlap with loci from GWASs on constituent single diseases. We also show that treeLFA improves upon existing methods like latent Dirichlet allocation in various ways. Overall, our findings indicate that topic models can characterize multimorbidity patterns and that genetic analysis of these patterns can provide insight into the etiology of complex traits that cannot be determined from the analysis of constituent traits alone.
5
Optimal strategies for learning multi-ancestry polygenic scores vary across traits. Nat Commun 2023; 14:4023. [PMID: 37419925 PMCID: PMC10328935 DOI: 10.1038/s41467-023-38930-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2022] [Accepted: 05/22/2023] [Indexed: 07/09/2023]
Abstract
Polygenic scores (PGSs) are individual-level measures that aggregate the genome-wide genetic predisposition to a given trait. As PGS have predominantly been developed using European-ancestry samples, trait prediction using such European ancestry-derived PGS is less accurate in non-European ancestry individuals. Although there has been recent progress in combining multiple PGS trained on distinct populations, the problem of how to maximize performance given a multiple-ancestry cohort is largely unexplored. Here, we investigate the effect of sample size and ancestry composition on PGS performance for fifteen traits in UK Biobank. For some traits, PGS estimated using a relatively small African-ancestry training set outperformed, on an African-ancestry test set, PGS estimated using a much larger European-ancestry only training set. We observe similar, but not identical, results when considering other minority-ancestry groups within UK Biobank. Our results emphasise the importance of targeted data collection from underrepresented groups in order to address existing disparities in PGS performance.
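The aggregation described in this abstract is commonly realized as a weighted sum of allele dosages. The sketch below illustrates that general idea only; the function name, effect sizes, and dosages are hypothetical and are not taken from the paper, whose training strategies are more involved.

```python
from typing import Sequence


def polygenic_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted sum of effect-allele dosages for one individual.

    dosages: copies of the effect allele at each variant (0, 1 or 2).
    weights: per-variant effect sizes, e.g. estimated in a training cohort.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical effect sizes for three variants (illustration only).
example_weights = [0.5, -0.25, 1.0]
# One individual's dosages at the same three variants.
example_dosages = [2, 1, 0]
example_score = polygenic_score(example_dosages, example_weights)
```

Under this formulation, changing the ancestry composition of the training set changes only the weights, which is one way to read the comparison the abstract reports.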
6
Mouse fetal growth restriction through parental and fetal immune gene variation and intercellular communications cascade. Nat Commun 2022; 13:4398. [PMID: 35906236 PMCID: PMC9338297 DOI: 10.1038/s41467-022-32171-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/28/2021] [Accepted: 07/18/2022] [Indexed: 11/08/2022]
Abstract
Fetal growth restriction (FGR) affects 5-10% of pregnancies and can have serious consequences for both mother and child. Prevention and treatment are limited because FGR pathogenesis is poorly understood. Genetic studies implicate KIR and HLA genes in FGR; however, linkage disequilibrium, genetic influence from both parents, and challenges with investigating human pregnancies make the risk alleles and their functional effects difficult to map. Here, we demonstrate that the interaction between the maternal KIR2DL1, expressed on uterine natural killer (NK) cells, and the paternally inherited HLA-C*0501, expressed on fetal trophoblast cells, leads to FGR in a humanized mouse model. We show that the KIR2DL1 and C*0501 interaction leads to pathogenic uterine arterial remodeling and modulation of uterine NK cell function. This initial effect cascades to altered transcriptional expression and intercellular communication at the maternal-fetal interface. These findings provide mechanistic insight into specific FGR risk alleles, and provide avenues of prevention and treatment.
7
Abstract
The field of population genomics has grown rapidly in response to the recent advent of affordable, large-scale sequencing technologies. As opposed to the situation during the majority of the 20th century, in which the development of theoretical and statistical population genetic insights outpaced the generation of data to which they could be applied, genomic data are now being produced at a far greater rate than they can be meaningfully analyzed and interpreted. With this wealth of data has come a tendency to focus on fitting specific (and often rather idiosyncratic) models to data, at the expense of a careful exploration of the range of possible underlying evolutionary processes. For example, the approach of directly investigating models of adaptive evolution in each newly sequenced population or species often neglects the fact that a thorough characterization of ubiquitous nonadaptive processes is a prerequisite for accurate inference. We here describe the perils of these tendencies, present our consensus views on current best practices in population genomic data analysis, and highlight areas of statistical inference and theory that are in need of further attention. Thereby, we argue for the importance of defining a biologically relevant baseline model tuned to the details of each new analysis, of skepticism and scrutiny in interpreting model fitting results, and of carefully defining addressable hypotheses and underlying uncertainties.
8
Identification of host-pathogen-disease relationships using a scalable multiplex serology platform in UK Biobank. Nat Commun 2022; 13:1818. [PMID: 35383168 PMCID: PMC8983701 DOI: 10.1038/s41467-022-29307-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 06/16/2020] [Accepted: 03/04/2022] [Indexed: 12/12/2022]
Abstract
Certain infectious agents are recognised causes of cancer and other chronic diseases. To understand the pathological mechanisms underlying such relationships, we designed a Multiplex Serology platform to measure quantitative antibody responses against 45 antigens from 20 infectious agents including human herpes, hepatitis, polyoma, papilloma, and retroviruses, as well as Chlamydia trachomatis, Helicobacter pylori and Toxoplasma gondii, and assayed a random subset of 9695 UK Biobank participants. We find seroprevalence estimates consistent with those expected from prior literature and confirm multiple associations of antibody responses with sociodemographic characteristics (e.g., lifetime sexual partners with C. trachomatis), HLA genetic variants (rs6927022 with Epstein-Barr virus (EBV) EBNA1 antibodies) and disease outcomes (human papillomavirus-16 seropositivity with cervical intraepithelial neoplasia, and EBV responses with multiple sclerosis). Our accessible dataset is one of the largest incorporating diverse infectious agents in a prospective UK cohort, offering opportunities to improve our understanding of host-pathogen-disease relationships with significant clinical and public health implications.
9
Abstract
The sequencing of modern and ancient genomes from around the world has revolutionized our understanding of human history and evolution. However, the problem of how best to characterize ancestral relationships from the totality of human genomic variation remains unsolved. Here, we address this challenge with nonparametric methods that enable us to infer a unified genealogy of modern and ancient humans. This compact representation of multiple datasets explores the challenges of missing and erroneous data and uses ancient samples to constrain and date relationships. We demonstrate the power of the method to recover relationships between individuals and populations as well as to identify descendants of ancient samples. Finally, we introduce a simple nonparametric estimator of the geographical location of ancestors that recapitulates key events in human history.
10
Genome-wide analysis of 53,400 people with irritable bowel syndrome highlights shared genetic pathways with mood and anxiety disorders. Nat Genet 2021; 53:1543-1552. [PMID: 34741163 PMCID: PMC8571093 DOI: 10.1038/s41588-021-00950-8] [Citation(s) in RCA: 74] [Impact Index Per Article: 24.7] [Received: 01/15/2021] [Accepted: 09/08/2021] [Indexed: 12/19/2022]
Abstract
Irritable bowel syndrome (IBS) results from disordered brain-gut interactions. Identifying susceptibility genes could highlight the underlying pathophysiological mechanisms. We designed a digestive health questionnaire for UK Biobank and combined identified cases with IBS with independent cohorts. We conducted a genome-wide association study with 53,400 cases and 433,201 controls and replicated significant associations in a 23andMe panel (205,252 cases and 1,384,055 controls). Our study identified and confirmed six genetic susceptibility loci for IBS. Implicated genes included NCAM1, CADM2, PHF2/FAM120A, DOCK9, CKAP2/TPTE2P3 and BAG6. The first four are associated with mood and anxiety disorders, expressed in the nervous system, or both. Mirroring this, we also found strong genome-wide correlation between the risk of IBS and anxiety, neuroticism and depression (rg > 0.5). Additional analyses suggested this arises due to shared pathogenic pathways rather than, for example, anxiety causing abdominal symptoms. Implicated mechanisms require further exploration to help understand the altered brain-gut interactions underlying IBS.
11
Abstract
Inherited genetic variation contributes to individual risk for many complex diseases and is increasingly being used for predictive patient stratification. Previous work has shown that genetic factors are not equally relevant to human traits across age and other contexts, though the reasons for such variation are not clear. Here, we introduce methods to infer the form of the longitudinal relationship between genetic relative risk for disease and age and to test whether all genetic risk factors behave similarly. We use a proportional hazards model within an interval-based censoring methodology to estimate age-varying individual variant contributions to genetic relative risk for 24 common diseases within the British ancestry subset of UK Biobank, applying a Bayesian clustering approach to group variants by their relative risk profile over age and permutation tests for age dependency and multiplicity of profiles. We find evidence for age-varying relative risk profiles in nine diseases, including hypertension, skin cancer, atherosclerotic heart disease, hypothyroidism and calculus of gallbladder, several of which show evidence, albeit weak, for multiple distinct profiles of genetic relative risk. The predominant pattern shows genetic risk factors having the greatest relative impact on risk of early disease, with a monotonic decrease over time, at least for the majority of variants, although the magnitude and form of the decrease varies among diseases. As a consequence, for diseases where genetic relative risk decreases over age, genetic risk factors have stronger explanatory power among younger populations, compared to older ones. We show that these patterns cannot be explained by a simple model involving the presence of unobserved covariates such as environmental factors. We discuss possible models that can explain our observations and the implications for genetic risk prediction. 
The genes we inherit from our parents influence our risk for almost all diseases, from cancer to severe infections. With the explosion of genomic technologies, we are now able to use an individual’s genome to make useful predictions about future disease risk. However, recent work has shown that the predictive value of genetic information varies by context, including age, sex and ethnicity. In this paper we introduce, validate and apply new statistical methods for investigating the relationship between age and the contributions of genetic risk. These methods allow us to ask questions such as whether relative risk is constant over time, precisely how relative risk changes over time and whether all genetic risk factors have similar age profiles. By applying the methods to data from the UK Biobank, a prospective study of 500,000 people, we show that there is a tendency for genetic relative risk to decline with increasing age. We consider a series of possible explanations for the observation and conclude that there must be processes acting that we are currently unaware of, such as distinct phases of life in which genetic risk manifests itself, or interactions between genes and the environment.
12
Elucidating relationships between P.falciparum prevalence and measures of genetic diversity with a combined genetic-epidemiological model of malaria. PLoS Comput Biol 2021; 17:e1009287. [PMID: 34411093 PMCID: PMC8407561 DOI: 10.1371/journal.pcbi.1009287] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/07/2020] [Revised: 08/31/2021] [Accepted: 07/19/2021] [Indexed: 12/05/2022]
Abstract
There is an abundance of malaria genetic data being collected from the field, yet using these data to understand the drivers of regional epidemiology remains a challenge. A key issue is the lack of models that relate parasite genetic diversity to epidemiological parameters. Classical models in population genetics characterize changes in genetic diversity in relation to demographic parameters, but fail to account for the unique features of the malaria life cycle. In contrast, epidemiological models, such as the Ross-Macdonald model, capture malaria transmission dynamics but do not consider genetics. Here, we have developed an integrated model encompassing both parasite evolution and regional epidemiology. We achieve this by combining the Ross-Macdonald model with an intra-host continuous-time Moran model, thus explicitly representing the evolution of individual parasite genomes in a traditional epidemiological framework. Implemented as a stochastic simulation, we use the model to explore relationships between measures of parasite genetic diversity and parasite prevalence, a widely-used metric of transmission intensity. First, we explore how varying parasite prevalence influences genetic diversity at equilibrium. We find that multiple genetic diversity statistics are correlated with prevalence, but the strength of the relationships depends on whether variation in prevalence is driven by host- or vector-related factors. Next, we assess the responsiveness of a variety of statistics to malaria control interventions, finding that those related to mixed infections respond quickly (∼months) whereas other statistics, such as nucleotide diversity, may take decades to respond. These findings provide insights into the opportunities and challenges associated with using genetic data to monitor malaria epidemiology.
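The intra-host component mentioned in this abstract, a continuous-time Moran model, can be illustrated in isolation. The sketch below is not the authors' implementation: it assumes a fixed parasite population, a unit per-capita replacement rate, no mutation, and no coupling to Ross-Macdonald transmission, and the function and parameter names are hypothetical.

```python
import random
from typing import List, Sequence


def moran_simulate(population: Sequence[str], t_end: float,
                   rng: random.Random) -> List[str]:
    """Run a neutral continuous-time Moran process until time t_end.

    Each individual reproduces at rate 1 (an assumption), so replacement
    events occur at total rate n and waiting times are exponential.
    """
    pop = list(population)
    n = len(pop)
    t = 0.0
    while n > 0:
        t += rng.expovariate(n)      # waiting time to the next event
        if t > t_end:
            break
        parent = rng.randrange(n)    # individual that reproduces
        dead = rng.randrange(n)      # individual that is replaced
        pop[dead] = pop[parent]      # population size stays constant
    return pop


# Usage: four parasite genomes labelled by hypothetical haplotype names.
final_pop = moran_simulate(["A", "B", "A", "B"], 5.0, random.Random(0))
```

Because every event copies an existing genome over another, diversity within the host can only be lost in this neutral sketch; a fuller model would add mutation and new infections.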
13
Validation of an Integrated Risk Tool, Including Polygenic Risk Score, for Atherosclerotic Cardiovascular Disease in Multiple Ethnicities and Ancestries. Am J Cardiol 2021; 148:157-164. [PMID: 33675770 DOI: 10.1016/j.amjcard.2021.02.032] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 12/28/2020] [Revised: 02/12/2021] [Accepted: 02/23/2021] [Indexed: 12/21/2022]
Abstract
The American College of Cardiology / American Heart Association pooled cohort equations tool (ASCVD-PCE) is currently recommended to assess 10-year risk for atherosclerotic cardiovascular disease (ASCVD). ASCVD-PCE does not currently include genetic risk factors. Polygenic risk scores (PRSs) have been shown to offer a powerful new approach to measuring genetic risk for common diseases, including ASCVD, and to enhance risk prediction when combined with ASCVD-PCE. Most work to date, including the assessment of tools, has focused on performance in individuals of European ancestries. Here we present evidence for the clinical validation of a new integrated risk tool (IRT), ASCVD-IRT, which combines ASCVD-PCE with PRS to predict 10-year risk of ASCVD across diverse ethnicity and ancestry groups. We demonstrate improved predictive performance of ASCVD-IRT over ASCVD-PCE, not only in individuals of self-reported White ethnicities (net reclassification improvement [NRI], with 95% confidence interval = 2.7% [1.1 to 4.2]) but also Black / African American / Black Caribbean / Black African (NRI = 2.5% [0.6 to 4.3]) and South Asian (Indian, Bangladeshi or Pakistani) ethnicities (NRI = 8.7% [3.1 to 14.4]). NRI confidence intervals were wider and included zero for ethnicities with smaller sample sizes, including Hispanic (NRI = 7.5% [-1.4 to 16.5]), but PRS effect sizes in these ethnicities were significant and of comparable size to those seen in individuals of White ethnicities. Comparable results were obtained when individuals were analyzed by genetically inferred ancestry. Together, these results validate the performance of ASCVD-IRT in multiple ethnicities and ancestries, and favor their generalization to all ethnicities and ancestries.
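For readers unfamiliar with the metric reported in this abstract, the standard categorical net reclassification improvement can be sketched as follows. The abstract does not state the risk categories or thresholds the authors used, so the ordered-category encoding, the function name, and the toy data here are assumptions for illustration.

```python
from typing import Sequence


def net_reclassification_improvement(old_cat: Sequence[int],
                                     new_cat: Sequence[int],
                                     event: Sequence[bool]) -> float:
    """Categorical NRI for a new model versus an old model.

    old_cat/new_cat: ordered risk categories (higher means higher risk).
    event: True if the individual had the outcome.
    NRI = [P(up|event) - P(down|event)] + [P(down|no event) - P(up|no event)]
    """
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for old, new, had_event in zip(old_cat, new_cat, event):
        if had_event:
            n_e += 1
            up_e += new > old
            down_e += new < old
        else:
            n_n += 1
            up_n += new > old
            down_n += new < old
    if n_e == 0 or n_n == 0:
        raise ValueError("need at least one event and one non-event")
    return (up_e - down_e) / n_e + (down_n - up_n) / n_n


# Toy data (hypothetical): three cases followed by three non-cases.
old = [0, 0, 1, 1, 0, 1]
new = [1, 0, 1, 0, 0, 0]
had_event = [True, True, True, False, False, False]
nri = net_reclassification_improvement(old, new, had_event)
```

In the toy data, one of three cases moves up and two of three non-cases move down, so the new model reclassifies in the correct direction in both groups.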
14
Detection of simple and complex de novo mutations with multiple reference sequences. Genome Res 2020; 30:1154-1169. [PMID: 32817236 PMCID: PMC7462078 DOI: 10.1101/gr.255505.119] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/02/2019] [Accepted: 07/17/2020] [Indexed: 12/25/2022]
Abstract
The characterization of de novo mutations in regions of high sequence and structural diversity from whole-genome sequencing data remains highly challenging. Complex structural variants tend to arise in regions of high repetitiveness and low complexity, challenging both de novo assembly, in which short reads do not capture the long-range context required for resolution, and mapping approaches, in which improper alignment of reads to a reference genome that is highly diverged from that of the sample can lead to false or partial calls. Long-read technologies can potentially solve such problems but are currently unfeasible to use at scale. Here we present Corticall, a graph-based method that combines the advantages of multiple technologies and prior data sources to detect arbitrary classes of genetic variant. We construct multisample, colored de Bruijn graphs from short-read data for all samples, align long-read–derived haplotypes and multiple reference data sources to restore graph connectivity information, and call variants using graph path-finding algorithms and a model for simultaneous alignment and recombination. We validate and evaluate the approach using extensive simulations and use it to characterize the rate and spectrum of de novo mutation events in 119 progeny from four Plasmodium falciparum experimental crosses, using long-read data on the parents to inform reconstructions of the progeny and to detect several known and novel nonallelic homologous recombination events.
15
HLA*LA-HLA typing from linearly projected graph alignments. Bioinformatics 2020; 35:4394-4396. [PMID: 30942877 PMCID: PMC6821427 DOI: 10.1093/bioinformatics/btz235] [Citation(s) in RCA: 67] [Impact Index Per Article: 16.8] [Received: 10/26/2018] [Revised: 02/26/2019] [Accepted: 04/02/2019] [Indexed: 11/13/2022]
Abstract
Summary: HLA*LA implements a new graph alignment model for human leukocyte antigen (HLA) type inference, based on the projection of linear alignments onto a variation graph. It enables accurate HLA type inference from whole-genome (99% accuracy) and whole-exome (93% accuracy) Illumina data; from long-read Oxford Nanopore and Pacific Biosciences data (98% accuracy for whole-genome and targeted data) and from genome assemblies. Computational requirements for a typical sample vary between 0.7 and 14 CPU hours per sample.
Availability and implementation: HLA*LA is implemented in C++ and Perl and freely available as a bioconda package or from https://github.com/DiltheyLab/HLA-LA (GPL v3).
Supplementary information: Supplementary data are available at Bioinformatics online.
16
Accounting for long-range correlations in genome-wide simulations of large cohorts. PLoS Genet 2020; 16:e1008619. [PMID: 32369493 PMCID: PMC7266353 DOI: 10.1371/journal.pgen.1008619] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 05/23/2019] [Revised: 06/02/2020] [Accepted: 01/21/2020] [Indexed: 11/20/2022]
Abstract
Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely-used msprime now provide genome-wide simulations for millions of individuals. However, this software relies on classic coalescent theory and its assumptions that sample sizes are small and that the region being simulated is short. Here we show that coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when the sample size is large. We present a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent. Further, these extensions are more computationally efficient than state-of-the-art coalescent simulations when simulating long regions, including whole-genome data. For shorter regions, efficiency can be maintained via a hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past.
Coalescent theory has provided deep theoretical insight into patterns of human diversity. Implementations of coalescent models in simulation software such as ms have further provided tools to interpret thousands of genomic studies. Recent technical progress has allowed for a dramatic increase in the scale at which genomes can be both measured and simulated, opening up opportunities for a finer understanding of evolutionary biology. However, we show that coalescent simulations of long regions of the genome exhibit large biases in sample relatedness, distorting haplotype sharing and ancestry patterns in simulated cohorts. We trace these biases to basic assumptions of the coalescent model, and show how the assumptions can be relaxed to provide a better description of the observed patterns of genetic polymorphism at a fraction of the computational cost.
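The contrast drawn in this abstract between the coalescent and a Wright-Fisher model can be made concrete with a toy backward-in-time simulation. The sketch below is not msprime's implementation or API: it assumes a haploid population of constant size and a single non-recombining locus, and the function name is hypothetical.

```python
import random


def wright_fisher_tmrca(sample_size: int, pop_size: int,
                        rng: random.Random) -> int:
    """Generations back to the most recent common ancestor of a sample.

    Each generation, every remaining lineage picks a parent uniformly at
    random; lineages that pick the same parent merge. Unlike the standard
    coalescent, several lineages may merge in the same generation.
    """
    lineages = sample_size
    generations = 0
    while lineages > 1:
        parents = {rng.randrange(pop_size) for _ in range(lineages)}
        lineages = len(parents)   # distinct parents = surviving lineages
        generations += 1
    return generations


# Usage: a sample of 10 from a population of 100 (arbitrary toy values).
tmrca = wright_fisher_tmrca(10, 100, random.Random(1))
```

When the sample is large relative to the population, simultaneous and multiple mergers become common in this sketch, which is the regime where the abstract reports that coalescent assumptions break down.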
17
Dating genomic variants and shared ancestry in population-scale sequencing data. PLoS Biol 2020; 18:e3000586. [PMID: 31951611 PMCID: PMC6992231 DOI: 10.1371/journal.pbio.3000586] [Citation(s) in RCA: 78] [Impact Index Per Article: 19.5] [Received: 07/16/2019] [Revised: 01/30/2020] [Accepted: 01/02/2020] [Indexed: 12/31/2022]
Abstract
The origin and fate of new mutations within species is the fundamental process underlying evolution. However, while much attention has been focused on characterizing the presence, frequency, and phenotypic impact of genetic variation, the evolutionary histories of most variants are largely unexplored. We have developed a nonparametric approach for estimating the date of origin of genetic variants in large-scale sequencing data sets. The accuracy and robustness of the approach are demonstrated through simulation. Using data from two publicly available human genomic diversity resources, we estimated the age of more than 45 million single-nucleotide polymorphisms (SNPs) in the human genome and release the Atlas of Variant Age as a public online database. We characterize the relationship between variant age and frequency in different geographical regions and demonstrate the value of age information in interpreting variants of functional and selective importance. Finally, we use allele age estimates to power a rapid approach for inferring the ancestry shared between individual genomes and to quantify genealogical relationships at different points in the past, as well as to describe and explore the evolutionary history of modern human populations.
18
Identifying cross-disease components of genetic risk across hospital data in the UK Biobank. Nat Genet 2019; 52:126-134. [PMID: 31873298 PMCID: PMC6974401 DOI: 10.1038/s41588-019-0550-4] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 03/25/2019] [Accepted: 11/18/2019] [Indexed: 01/06/2023]
Abstract
Genetic risk factors frequently affect multiple common human diseases, providing insight into shared pathophysiological pathways and opportunities for therapeutic development. However, systematic identification of genetic profiles of disease risk is limited both by the availability of comprehensive clinical data on population-scale cohorts and by the lack of suitable statistical methodology that can handle the scale of and differential power inherent in multi-phenotype data. Here, we develop a disease-agnostic approach to cluster genetic risk profiles for 3,025 genome-wide independent loci across 19,155 disease classification codes from 320,644 participants in the UK Biobank, representing a large and heterogeneous population. We identify 339 distinct disease association profiles and use multiple approaches to link clusters to underlying biological pathways. We show how clusters can decompose the variance and covariance in risk for disease, thereby identifying underlying biological processes and their impact. We demonstrate the use of clusters in defining disease relationships and their potential in informing therapeutic strategies.
|
19
|
Genomic Analysis of Plasmodium vivax in Southern Ethiopia Reveals Selective Pressures in Multiple Parasite Mechanisms. J Infect Dis 2019; 220:1738-1749. [PMID: 30668735 PMCID: PMC6804337 DOI: 10.1093/infdis/jiz016] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Received: 10/16/2018] [Accepted: 01/18/2019] [Indexed: 01/12/2023] Open
Abstract
The Horn of Africa harbors the largest reservoir of Plasmodium vivax in the continent. Most of sub-Saharan Africa has remained relatively vivax-free due to a high prevalence of the human Duffy-negative trait, but the emergence of strains able to invade Duffy-negative reticulocytes poses a major public health threat. We undertook the first population genomic investigation of P. vivax from the region, comparing the genomes of 24 Ethiopian isolates against data from Southeast Asia to identify important local adaptations. The prevalence of the Duffy binding protein amplification in Ethiopia was 79%, potentially reflecting adaptation to Duffy negativity. There was also evidence of selection in a region upstream of the chloroquine resistance transporter, a putative chloroquine-resistance determinant. Strong signals of selection were observed in genes involved in immune evasion and regulation of gene expression, highlighting the need for a multifaceted intervention approach to combat P. vivax in the region.
|
20
|
The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria. eLife 2019; 8:e40845. [PMID: 31298657 PMCID: PMC6684230 DOI: 10.7554/elife.40845] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 08/09/2018] [Accepted: 07/10/2019] [Indexed: 02/07/2023] Open
Abstract
Individual malaria infections can carry multiple strains of Plasmodium falciparum with varying levels of relatedness. Yet, how local epidemiology affects the properties of such mixed infections remains unclear. Here, we develop an enhanced method for strain deconvolution from genome sequencing data, which estimates the number of strains, their proportions, identity-by-descent (IBD) profiles and individual haplotypes. Applying it to the Pf3k data set, we find that the rate of mixed infection varies from 29% to 63% across countries and that 51% of mixed infections involve more than two strains. Furthermore, we estimate that 47% of symptomatic dual infections contain sibling strains likely to have been co-transmitted from a single mosquito, and find evidence of mixed infections propagated over successive infection cycles. Finally, leveraging data from the Malaria Atlas Project, we find that prevalence correlates within Africa, but not Asia, with both the rate of mixed infection and the level of IBD.
|
21
|
Mapping the drivers of within-host pathogen evolution using massive data sets. Nat Commun 2019; 10:3017. [PMID: 31289267 PMCID: PMC6616926 DOI: 10.1038/s41467-019-10724-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/11/2017] [Accepted: 05/20/2019] [Indexed: 11/09/2022] Open
Abstract
Differences among hosts, resulting from genetic variation in the immune system or heterogeneity in drug treatment, can impact within-host pathogen evolution. Genetic association studies can potentially identify such interactions. However, extensive and correlated genetic population structure in hosts and pathogens presents a substantial risk of confounding analyses. Moreover, the multiple testing burden of interaction scanning can potentially limit power. We present a Bayesian approach for detecting host influences on pathogen evolution that exploits vast existing data sets of pathogen diversity to improve power and control for stratification. The approach models key processes, including recombination and selection, and identifies regions of the pathogen genome affected by host factors. Our simulations and empirical analysis of drug-induced selection on the HIV-1 genome show that the method recovers known associations and has superior precision-recall characteristics compared to other approaches. We build a high-resolution map of HLA-induced selection in the HIV-1 genome, identifying novel epitope-allele combinations.
|
22
|
Abstract
The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.
|
23
|
Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 2018; 34:2556-2565. [PMID: 29554215 PMCID: PMC6061703 DOI: 10.1093/bioinformatics/bty157] [Citation(s) in RCA: 46] [Impact Index Per Article: 7.7] [Received: 06/08/2017] [Revised: 11/25/2017] [Accepted: 03/14/2018] [Indexed: 12/27/2022] Open
Abstract
Motivation: The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data.
Results: We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes.
Availability and implementation: Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex.
Supplementary information: Supplementary data are available at Bioinformatics online.
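The limitation described in the Motivation can be illustrated with a minimal Python sketch. It assumes one common textbook formulation in which nodes are (k-1)-mers and each k-mer contributes an edge; this is not McCortex's implementation, and the function name `build_de_bruijn` and the two example reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two hypothetical reads that share the 3-mer "ACG" but continue differently.
reads = ["ACGTAC", "GTACGG"]
graph = build_de_bruijn(reads, k=4)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this toy graph the node `ACG` has two successors (`CGT` and `CGG`). The graph alone no longer records which branch each read followed, which is the kind of long-range information the abstract says a plain de Bruijn graph does not retain.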
|
24
|
A point mutation in the ion conduction pore of AMPA receptor GRIA3 causes dramatically perturbed sleep patterns as well as intellectual disability. Hum Mol Genet 2018; 26:3869-3882. [PMID: 29016847 PMCID: PMC5639461 DOI: 10.1093/hmg/ddx270] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 04/27/2017] [Accepted: 07/06/2017] [Indexed: 01/19/2023] Open
Abstract
The discovery of genetic variants influencing sleep patterns can shed light on the physiological processes underlying sleep. As part of a large clinical sequencing project, WGS500, we sequenced a family in which the two male children had severe developmental delay and a dramatically disturbed sleep-wake cycle, with very long wake and sleep durations, reaching up to 106 h awake and 48 h asleep. The most likely causal variant identified was a novel missense variant in the X-linked GRIA3 gene, which has been implicated in intellectual disability. GRIA3 encodes GluA3, a subunit of AMPA-type ionotropic glutamate receptors (AMPARs). The mutation (A653T) falls within the highly conserved transmembrane domain of the ion channel gate, immediately adjacent to the analogous residue in the Grid2 (glutamate receptor) gene, which is mutated in the mouse neurobehavioral mutant, Lurcher. In vitro, the GRIA3(A653T) mutation stabilizes the channel in a closed conformation, in contrast to Lurcher. We introduced the orthologous mutation into a mouse strain by CRISPR-Cas9 mutagenesis and found that hemizygous mutants displayed significant differences in the structure of their activity and sleep compared to wild-type littermates. Typically, mice are polyphasic, exhibiting multiple bouts of sleep several minutes long within a 24-h period. The Gria3A653T mouse showed significantly fewer brief bouts of activity and sleep than the wild-types. Furthermore, Gria3A653T mice showed enhanced period lengthening under constant light compared to wild-type mice, suggesting an increased sensitivity to light. Our results suggest a role for GluA3 channel activity in the regulation of sleep behavior in both mice and humans.
|
25
|
Deconvolution of multiple infections in Plasmodium falciparum from high throughput sequencing data. Bioinformatics 2018; 34:9-15. [PMID: 28961721 PMCID: PMC5870807 DOI: 10.1093/bioinformatics/btx530] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 02/25/2017] [Revised: 07/14/2017] [Accepted: 08/21/2017] [Indexed: 12/20/2022] Open
Abstract
Motivation: The presence of multiple infecting strains of the malarial parasite Plasmodium falciparum affects key phenotypic traits, including drug resistance and risk of severe disease. Advances in protocols and sequencing technology have made it possible to obtain high-coverage genome-wide sequencing data from blood samples and blood spots taken in the field. However, analyzing and interpreting such data is challenging because of the high rate of multiple infections present.
Results: We have developed a statistical method and implementation for deconvolving multiple genome sequences present in an individual with mixed infections. The software package DEploid uses haplotype structure within a reference panel of clonal isolates as a prior for haplotypes present in a given sample. It estimates the number of strains, their relative proportions and the haplotypes presented in a sample, allowing researchers to study multiple infection in malaria with an unprecedented level of detail.
Availability and implementation: The open source implementation DEploid is freely available at https://github.com/mcveanlab/DEploid under the conditions of the GPLv3 license. An R version is available at https://github.com/mcveanlab/DEploid-r.
Contact: joe.zhu@bdi.ox.ac.uk or gil.mcvean@bdi.ox.ac.uk.
Supplementary information: Supplementary data are available at Bioinformatics online.
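To make the quantities in the Results concrete (strain proportions and haplotypes), the sketch below shows how a mixed sample's expected alternative-allele frequency at each site could follow from those two ingredients. It is an illustration under simplifying assumptions (biallelic sites, no sequencing error, no reference panel), not DEploid's actual model or code; the function name and the example values are hypothetical.

```python
def expected_alt_frequency(proportions, haplotypes):
    """Expected alternative-allele frequency per site in a mixed sample.

    proportions: relative abundance of each strain (must sum to 1).
    haplotypes:  one list per strain, 0 = reference allele, 1 = alternative.
    """
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("strain proportions must sum to 1")
    n_sites = len(haplotypes[0])
    # At each site, add up the proportions of the strains carrying the alt allele.
    return [
        sum(p * hap[site] for p, hap in zip(proportions, haplotypes))
        for site in range(n_sites)
    ]


# Hypothetical two-strain infection over four sites.
proportions = [0.75, 0.25]
haplotypes = [
    [0, 1, 1, 0],  # majority strain
    [1, 1, 0, 0],  # minority strain
]
expected = expected_alt_frequency(proportions, haplotypes)
print(expected)
```

Deconvolution is the inverse problem: given observed read counts at each site, recover the number of strains, their proportions and their haplotypes, which per the abstract is where the reference panel prior comes in.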
|
26
|
Resolving TYK2 locus genotype-to-phenotype differences in autoimmunity. Sci Transl Med 2017; 8:363ra149. [PMID: 27807284 DOI: 10.1126/scitranslmed.aag1974] [Citation(s) in RCA: 165] [Impact Index Per Article: 23.6] [Received: 05/22/2016] [Accepted: 10/14/2016] [Indexed: 01/08/2023]
Abstract
Thousands of genetic variants that contribute to the development of complex diseases have been identified, but determining how to elucidate their biological consequences for translation into clinical benefit is challenging. Conflicting evidence regarding the functional impact of genetic variants in the tyrosine kinase 2 (TYK2) gene, which is differentially associated with common autoimmune diseases, currently obscures the potential of TYK2 as a therapeutic target. We aimed to resolve this conflict by performing genetic meta-analysis across disorders; subsequent molecular, cellular, in vivo, and structural functional follow-up; and epidemiological studies. Our data revealed a protective homozygous effect that defined a signaling optimum between autoimmunity and immunodeficiency and identified TYK2 as a potential drug target for certain common autoimmune disorders.
|
27
|
Abstract
Cancer is characterised by complex somatically acquired genetic aberrations that manifest as intra-tumour and inter-tumour genetic heterogeneity and can lead to treatment resistance. In this case study, we characterise the genome-wide somatic mutation dynamics in a metastatic melanoma patient during therapy using low-input (50 ng) PCR-free whole genome sequencing of cell-free DNA from pre-treatment and post-relapse blood samples. We identify de novo tumour-specific somatic mutations from cell-free DNA, while the sequence context of single nucleotide variants showed the characteristic UV-damage mutation signature of melanoma. To investigate the behaviour of individual somatic mutations during proto-oncogene B-Raf-targeted and immune checkpoint inhibition, amplicon-based deep sequencing was used to verify and track frequencies of 212 single nucleotide variants at 10 distinct time points over 13 months of treatment. Under checkpoint inhibition therapy, we observed an increase in mutant allele frequencies indicating progression on therapy 88 days before clinical determination of non-response by positron emission tomography-computed tomography. We also revealed mutations from whole genome sequencing of cell-free DNA that were not present in the tissue biopsy, but that later contributed to relapse. Our findings have potential clinical applications where high quality tumour-tissue derived DNA is not available.
|
28
|
Abstract
Expression of HLA-C varies widely across individuals in an allele-specific manner. This variation in expression can influence efficacy of the immune response, as shown for infectious and autoimmune diseases. MicroRNA binding partially influences differential HLA-C expression, but the additional contributing factors have remained undetermined. Here we use functional and structural analyses to demonstrate that HLA-C expression is modulated not just at the RNA level, but also at the protein level. Specifically, we show that variation in exons 2 and 3, which encode the α1/α2 domains, drives differential expression of HLA-C allomorphs at the cell surface by influencing the structure of the peptide-binding cleft and the diversity of peptides bound by the HLA-C molecules. Together with a phylogenetic analysis, these results highlight the diversity and long-term balancing selection of regulatory factors that modulate HLA-C expression.
|
29
|
A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res 2016; 27:157-164. [PMID: 27903644 PMCID: PMC5204340 DOI: 10.1101/gr.210500.116] [Citation(s) in RCA: 216] [Impact Index Per Article: 27.0] [Received: 05/25/2016] [Accepted: 10/28/2016] [Indexed: 12/30/2022]
Abstract
Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased “Platinum” variant catalog of 4.7 million single-nucleotide variants (SNVs) plus 0.7 million small (1–50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and 11 children of this pedigree. Platinum genotypes are highly concordant with the current catalog of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%) and add a validated truth catalog that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission (“nonplatinum”) revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.
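The idea of validating calls by inheritance can be sketched with a basic Mendelian consistency check on a single parent-parent-child trio. This is a deliberately simpler test than the haplotype-transmission analysis the abstract describes across the whole pedigree, and it assumes unphased diploid autosomal genotypes; the function name is hypothetical.

```python
from itertools import product


def is_mendelian_consistent(mother, father, child):
    """True if the child's genotype can be formed from one allele of each parent.

    Genotypes are unphased allele pairs, e.g. (0, 1) for a heterozygote.
    Assumes a diploid autosomal site.
    """
    # Every combination of one maternal and one paternal allele.
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


# Heterozygous child with one heterozygous parent: consistent.
print(is_mendelian_consistent((0, 0), (0, 1), (0, 1)))
# Heterozygous child with two homozygous-reference parents: inconsistent
# (a genotyping error or a de novo event).
print(is_mendelian_consistent((0, 0), (0, 0), (0, 1)))
```

A call that fails this kind of check is not necessarily wrong. The abstract notes that many variants inconsistent with haplotype transmission were de novo or cell-line mutations, or lay in previously unidentified duplications and deletions.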
|
30
|
Abstract
Neuroinflammation is emerging as a central process in many neurological conditions, either as a causative factor or as a secondary response to nervous system insult. Understanding the causes and consequences of neuroinflammation could, therefore, provide insight that is needed to improve therapeutic interventions across many diseases. However, the complexity of the pathways involved necessitates the use of high-throughput approaches to extensively interrogate the process, and appropriate strategies to translate the data generated into clinical benefit. Use of 'big data' aims to generate, integrate and analyse large, heterogeneous datasets to provide in-depth insights into complex processes, and has the potential to unravel the complexities of neuroinflammation. Limitations in data analysis approaches currently prevent the full potential of big data being reached, but some aspects of big data are already yielding results. The implementation of 'omics' analyses in particular is becoming routine practice in biomedical research, and neuroimaging is producing large sets of complex data. In this Review, we evaluate the impact of the drive to collect and analyse big data on our understanding of neuroinflammation in disease. We describe the breadth of big data that are leading to an evolution in our understanding of this field, exemplify how these data are beginning to be of use in a clinical setting, and consider possible future directions.
|
31
|
High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs. PLoS Comput Biol 2016; 12:e1005151. [PMID: 27792722 PMCID: PMC5085092 DOI: 10.1371/journal.pcbi.1005151] [Citation(s) in RCA: 58] [Impact Index Per Article: 7.3] [Received: 02/29/2016] [Accepted: 09/18/2016] [Indexed: 01/04/2023] Open
Abstract
Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing data would therefore be highly desirable and enable the immunogenetic characterization of samples in currently ongoing population sequencing projects. Extensive hyperpolymorphism and sequence similarity between the HLA genes, however, pose problems for accurate read mapping and make HLA type inference from whole-genome sequencing data a challenging problem. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, we infer the most likely pair of underlying alleles at G group resolution from the IMGT/HLA database at each locus, employing a simple likelihood framework. We show that HLA*PRG, our algorithm, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) and on a set of 14 samples (3 samples with 2 x 100bp, 11 samples with 2 x 250bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 alleles (99.4%). We also identify and re-type two erroneous alleles in the original validation data. 
We conclude that HLA*PRG for the first time achieves accuracies comparable to gold-standard reference methods from standard whole-genome sequencing data, though high computational demands (currently ~30–250 CPU hours per sample) remain a significant challenge to practical application.
Determining an individual’s HLA type (the sequence of the exons of the HLA genes) is important in many areas of biomedical research. For example, HLA types shape immune epitope repertoires, which are relevant in cancer immunotherapy, and influence autoimmune and infectious disease risk. Whole-genome sequencing data, currently being generated for hundreds of thousands of individuals, contains the information necessary for HLA typing, but inferring accurate HLA types from these data is a challenging problem. First, the HLA genes are the most polymorphic genes in the human genome; second, these genes and their variant alleles exhibit high degrees of sequence similarity (due to a shared evolutionary origin). This makes it difficult to establish which specific HLA gene a given observed sequencing read derives from. We show that this problem can be addressed using a Population Reference Graph (PRG): for each gene, the PRG contains not only the reference sequence but also variant alleles, thus enabling, using a novel sequence-to-graph mapping algorithm, the accurate mapping of reads to HLA genes. We also show that HLA*PRG, the algorithm implementing our approach, achieves, based on standard whole-genome sequencing data, accuracies comparable to those of specialized gold-standard methods. HLA*PRG is open source and freely available.
|
32
|
Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome Res 2016; 26:1288-99. [PMID: 27531718 PMCID: PMC5052046 DOI: 10.1101/gr.203711.115] [Citation(s) in RCA: 121] [Impact Index Per Article: 15.1] [Received: 12/22/2015] [Accepted: 06/28/2016] [Indexed: 12/14/2022]
Abstract
The malaria parasite Plasmodium falciparum has a great capacity for evolutionary adaptation to evade host immunity and develop drug resistance. Current understanding of parasite evolution is impeded by the fact that a large fraction of the genome is either highly repetitive or highly variable and thus difficult to analyze using short-read sequencing technologies. Here, we describe a resource of deep sequencing data on parents and progeny from genetic crosses, which has enabled us to perform the first genome-wide, integrated analysis of SNP, indel and complex polymorphisms, using Mendelian error rates as an indicator of genotypic accuracy. These data reveal that indels are exceptionally abundant, being more common than SNPs and thus the dominant mode of polymorphism within the core genome. We use the high density of SNP and indel markers to analyze patterns of meiotic recombination, confirming a high rate of crossover events and providing the first estimates for the rate of non-crossover events and the length of conversion tracts. We observe several instances of meiotic recombination within copy number variants associated with drug resistance, demonstrating a mechanism whereby fitness costs associated with resistance mutations could be compensated and greater phenotypic plasticity could be acquired.
|
33
|
Premalignant SOX2 overexpression in the fallopian tubes of ovarian cancer patients: Discovery and validation studies. EBioMedicine 2016; 10:137-49. [PMID: 27492892 PMCID: PMC5006641 DOI: 10.1016/j.ebiom.2016.06.048] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 05/31/2016] [Revised: 06/30/2016] [Accepted: 06/30/2016] [Indexed: 02/01/2023] Open
Abstract
Current screening methods for ovarian cancer can only detect advanced disease. Earlier detection has proved difficult because the molecular precursors involved in the natural history of the disease are unknown. To identify early driver mutations in ovarian cancer cells, we used dense whole genome sequencing of micrometastases and microscopic residual disease collected at three time points over three years from a single patient during treatment for high-grade serous ovarian cancer (HGSOC). The functional and clinical significance of the identified mutations was examined using a combination of population-based whole genome sequencing, targeted deep sequencing, multi-center analysis of protein expression, loss of function experiments in an in vivo reporter assay and mammalian models, and gain of function experiments in primary cultured fallopian tube epithelial (FTE) cells. We identified frequent mutations involving a 40 kb distal repressor region for the key stem cell differentiation gene SOX2. In the apparently normal FTE, the region was also mutated. This was associated with a profound increase in SOX2 expression (p<2(-16)), which was not found in patients without cancer (n=108). Importantly, we show that SOX2 overexpression in FTE is nearly ubiquitous in patients with HGSOCs (n=100), and common in BRCA1-BRCA2 mutation carriers (n=71) who underwent prophylactic salpingo-oophorectomy. We propose that the finding of SOX2 overexpression in FTE could be exploited to develop biomarkers for detecting disease at a premalignant stage, which would reduce mortality from this devastating disease.
|
34
|
Recombination Rate Heterogeneity within Arabidopsis Disease Resistance Genes. PLoS Genet 2016; 12:e1006179. [PMID: 27415776 PMCID: PMC4945094 DOI: 10.1371/journal.pgen.1006179] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Received: 11/11/2015] [Accepted: 06/15/2016] [Indexed: 12/31/2022] Open
Abstract
Meiotic crossover frequency varies extensively along chromosomes and is typically concentrated in hotspots. As recombination increases genetic diversity, hotspots are predicted to occur at immunity genes, where variation may be beneficial. A major component of plant immunity is recognition of pathogen Avirulence (Avr) effectors by resistance (R) genes that encode NBS-LRR domain proteins. Therefore, we sought to test whether NBS-LRR genes would overlap with meiotic crossover hotspots using experimental genetics in Arabidopsis thaliana. NBS-LRR genes tend to physically cluster in plant genomes; for example, in Arabidopsis most are located in large clusters on the south arms of chromosomes 1 and 5. We experimentally mapped 1,439 crossovers within these clusters and observed NBS-LRR gene associated hotspots, which were also detected as historical hotspots via analysis of linkage disequilibrium. However, we also observed NBS-LRR gene coldspots, which in some cases correlate with structural heterozygosity. To study recombination at the fine-scale we used high-throughput sequencing to analyze ~1,000 crossovers within the RESISTANCE TO ALBUGO CANDIDA1 (RAC1) R gene hotspot. This revealed elevated intragenic crossovers, overlapping nucleosome-occupied exons that encode the TIR, NBS and LRR domains. The highest RAC1 recombination frequency was promoter-proximal and overlapped CTT-repeat DNA sequence motifs, which have previously been associated with plant crossover hotspots. Additionally, we show a significant influence of natural genetic variation on NBS-LRR cluster recombination rates, using crosses between Arabidopsis ecotypes. In conclusion, we show that a subset of NBS-LRR genes are strong hotspots, whereas others are coldspots. This reveals a complex recombination landscape in Arabidopsis NBS-LRR genes, which we propose results from varying coevolutionary pressures exerted by host-pathogen relationships, and is influenced by structural heterozygosity.
|
35
|
A Method to Exploit the Structure of Genetic Ancestry Space to Enhance Case-Control Studies. Am J Hum Genet 2016; 98:857-868. [PMID: 27087321 DOI: 10.1016/j.ajhg.2016.02.025] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 10/07/2015] [Accepted: 02/29/2016] [Indexed: 02/08/2023] Open
Abstract
One goal of human genetics is to understand the genetic basis of disease, a challenge for diseases of complex inheritance because risk alleles are few relative to the vast set of benign variants. Risk variants are often sought by association studies in which allele frequencies in case subjects are contrasted with those from population-based samples used as control subjects. In an ideal world we would know population-level allele frequencies, releasing researchers to focus on case subjects. We argue this ideal is possible, at least theoretically, and we outline a path to achieving it in reality. If such a resource were to exist, it would yield ample savings and would facilitate the effective use of data repositories by removing administrative and technical barriers. We call this concept the Universal Control Repository Network (UNICORN), a means to perform association analyses without necessitating direct access to individual-level control data. Our approach to UNICORN uses existing genetic resources and various statistical tools to analyze these data, including hierarchical clustering with spectral analysis of ancestry; and empirical Bayesian analysis along with Gaussian spatial processes to estimate ancestry-specific allele frequencies. We demonstrate our approach using tens of thousands of control subjects from studies of Crohn disease, showing how it controls false positives, provides power similar to that achieved when all control data are directly accessible, and enhances power when control data are limiting or even imperfectly matched ancestrally. These results highlight how UNICORN can enable reliable, powerful, and convenient genetic association analyses without access to the individual-level data.
|
36
|
Corrigendum: Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat Commun 2016; 7:11465. [PMID: 27095245 PMCID: PMC4843104 DOI: 10.1038/ncomms11465] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/09/2022] Open
|
37
|
A Natural Encoding of Genetic Variation in a Burrows-Wheeler Transform to Enable Mapping and Genome Inference. Lecture Notes in Computer Science 2016. [DOI: 10.1007/978-3-319-43681-4_18] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Indexed: 12/29/2022]
|
38
|
Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat Commun 2015; 6:10063. [PMID: 26686880 PMCID: PMC4703848 DOI: 10.1038/ncomms10063] [Citation(s) in RCA: 353] [Impact Index Per Article: 39.2] [Received: 04/17/2015] [Accepted: 10/28/2015] [Indexed: 01/14/2023] Open
Abstract
The rise of antibiotic-resistant bacteria has led to an urgent need for rapid detection of drug resistance in clinical samples, and improvements in global surveillance. Here we show how de Bruijn graph representation of bacterial diversity can be used to identify species and resistance profiles of clinical isolates. We implement this method for Staphylococcus aureus and Mycobacterium tuberculosis in a software package ('Mykrobe predictor') that takes raw sequence data as input, and generates a clinician-friendly report within 3 minutes on a laptop. For S. aureus, the error rates of our method are comparable to gold-standard phenotypic methods, with sensitivity/specificity of 99.1%/99.6% across 12 antibiotics (using an independent validation set, n=470). For M. tuberculosis, our method predicts resistance with sensitivity/specificity of 82.6%/98.5% (independent validation set, n=1,609); sensitivity is lower here, probably because of limited understanding of the underlying genetic mechanisms. We give evidence that minor alleles improve detection of extremely drug-resistant strains, and demonstrate feasibility of the use of emerging single-molecule nanopore sequencing techniques for these purposes.
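The k-mer idea behind de Bruijn graph genotyping can be illustrated with a minimal sketch. It is not the Mykrobe predictor implementation: the sequences, the value of k, and the function names below are hypothetical, chosen only to show how a graph built from reads can be checked for support of one allele versus another.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def allele_supported(graph, allele, k):
    """True if every k-mer of the allele is present as an edge in the graph."""
    return all(kmer[1:] in graph.get(kmer[:-1], ()) for kmer in kmers(allele, k))

# Hypothetical toy alleles differing by one base (C -> T), not real gene sequences.
susceptible = "ACGTTGCAAGGCT"
resistant = "ACGTTGTAAGGCT"

# Hypothetical error-free reads sampled from an isolate carrying the resistant allele.
reads = ["ACGTTGTAAG", "TTGTAAGGCT"]
graph = build_de_bruijn(reads, k=5)

print(allele_supported(graph, resistant, 5))    # True
print(allele_supported(graph, susceptible, 5))  # False
```

A practical tool would also need to handle sequencing errors, coverage thresholds, and mixed (minor) alleles, none of which this sketch attempts.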
|
39
|
Abstract
The DNA-binding protein PRDM9 has a critical role in specifying meiotic recombination hotspots in mice and apes, but it appears to be absent from other vertebrate species, including birds. To study the evolution and determinants of recombination in species lacking the gene that encodes PRDM9, we inferred fine-scale genetic maps from population resequencing data for two bird species: the zebra finch, Taeniopygia guttata, and the long-tailed finch, Poephila acuticauda. We found that both species have recombination hotspots, which are enriched near functional genomic elements. Unlike in mice and apes, most hotspots are shared between the two species, and their conservation seems to extend over tens of millions of years. These observations suggest that in the absence of PRDM9, recombination targets functional features that both enable access to the genome and constrain its evolution.
|
40
|
Abstract
The last few decades have utterly transformed genetics and genomics, but what might the next ten years bring? PLOS Biology asked eight leaders spanning a range of related areas to give us their predictions. Without exception, the predictions are for more data on a massive scale and of more diverse types. All are optimistic and predict enormous positive impact on scientific understanding, while a recurring theme is the benefit of such data for the transformation and personalization of medicine. Several also point out that the biggest changes will very likely be those that we don’t foresee, even now.
|
41
|
Improved genome inference in the MHC using a population reference graph. Nat Genet 2015; 47:682-8. [PMID: 25915597 PMCID: PMC4449272 DOI: 10.1038/ng.3257] [Citation(s) in RCA: 115] [Impact Index Per Article: 12.8] [Received: 07/02/2014] [Accepted: 03/03/2015] [Indexed: 12/21/2022]
Abstract
Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. By applying the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate using simulations, SNP genotyping, and short-read and long-read data how the method improves the accuracy of genome inference, and we identify regions where the current set of reference sequences is substantially incomplete.
|
42
|
Comprehensive genome-wide evaluation of lapatinib-induced liver injury yields a single genetic signal centered on known risk allele HLA-DRB1*07:01. The Pharmacogenomics Journal 2015; 16:180-5. [PMID: 25987243 PMCID: PMC4819766 DOI: 10.1038/tpj.2015.40] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 11/19/2014] [Revised: 02/13/2015] [Accepted: 03/26/2015] [Indexed: 01/11/2023]
Abstract
Lapatinib is associated with a low incidence of serious liver injury. Previous investigations have identified and confirmed the Class II allele HLA-DRB1*07:01 to be strongly associated with lapatinib-induced liver injury; however, the moderate positive predictive value limits its clinical utility. To assess whether additional genetic variants located within the major histocompatibility complex locus or elsewhere in the genome may influence lapatinib-induced liver injury risk, and potentially lead to a genetic association with improved predictive qualities, we have taken two approaches: a genome-wide association study and a whole-genome sequencing study. This evaluation did not reveal additional associations other than the previously identified association for HLA-DRB1*07:01. The present study represents the most comprehensive genetic evaluation of drug-induced liver injury (DILI) or hypersensitivity, and suggests that investigation of possible human leukocyte antigen associations with DILI and other hypersensitivities represents an important first step in understanding the mechanism of these events.
|
43
|
Genetic characterization of Greek population isolates reveals strong genetic drift at missense and trait-associated variants. Nat Commun 2014; 5:5345. [PMID: 25373335 PMCID: PMC4242463 DOI: 10.1038/ncomms6345] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.2] [Received: 04/22/2014] [Accepted: 09/22/2014] [Indexed: 11/09/2022] Open
Abstract
Isolated populations are emerging as a powerful study design in the search for low-frequency and rare variant associations with complex phenotypes. Here we genotype 2,296 samples from two isolated Greek populations, the Pomak villages (HELIC-Pomak) in the North of Greece and the Mylopotamos villages (HELIC-MANOLIS) in Crete. We compare their genomic characteristics to the general Greek population and establish them as genetic isolates. In the MANOLIS cohort, we observe an enrichment of missense variants among the variants that have drifted up in frequency by more than fivefold. In the Pomak cohort, we find novel associations at variants on chr11p15.4 showing large allele frequency increases (from 0.2% in the general Greek population to 4.6% in the isolate) with haematological traits, for example, with mean corpuscular volume (rs7116019, P = 2.3 × 10⁻²⁶). We replicate this association in a second set of Pomak samples (combined P = 2.0 × 10⁻³⁶). We demonstrate significant power gains in detecting medical trait associations.
|
44
|
Abstract
Large whole-genome sequencing projects have provided access to much rare variation in human populations, which is highly informative about population structure and recent demography. Here, we show how the age of rare variants can be estimated from patterns of haplotype sharing and how these ages can be related to historical relationships between populations. We investigate the distribution of the age of variants occurring exactly twice (f₂ variants) in a worldwide sample sequenced by the 1000 Genomes Project, revealing enormous variation across populations. The median age of haplotypes carrying f₂ variants is 50 to 160 generations across populations within Europe or Asia, and 170 to 320 generations within Africa. Haplotypes shared between continents are much older with median ages for haplotypes shared between Europe and Asia ranging from 320 to 670 generations. The distribution of the ages of f₂ haplotypes is informative about their demography, revealing recent bottlenecks, ancient splits, and more modern connections between populations. We see the effect of selection in the observation that functional variants are significantly younger than nonfunctional variants of the same frequency. This approach is relatively insensitive to mutation rate and complements other nonparametric methods for demographic inference.
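To make the definition of a doubleton concrete, the sketch below finds variants whose derived allele occurs exactly twice in a sample and records which pair of haplotypes shares each one. It covers only this first step, not the age estimation described in the abstract; the toy haplotype matrix and the function names are hypothetical.

```python
from collections import Counter

def f2_pairs(haplotypes):
    """Return {site_index: (hap_a, hap_b)} for sites whose derived allele (coded 1)
    is carried by exactly two haplotypes."""
    n_sites = len(haplotypes[0])
    shared = {}
    for site in range(n_sites):
        carriers = [h for h, hap in enumerate(haplotypes) if hap[site] == 1]
        if len(carriers) == 2:
            shared[site] = tuple(carriers)
    return shared

def pair_counts(shared):
    """Count how many doubleton variants each pair of haplotypes shares."""
    return Counter(shared.values())

# Hypothetical data: rows are haplotypes, columns are sites, 1 = derived allele.
haplotypes = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

shared = f2_pairs(haplotypes)
print(shared)               # {0: (0, 1), 1: (2, 3), 3: (2, 3)}
print(pair_counts(shared))  # Counter({(2, 3): 2, (0, 1): 1})
```

Site 2 is excluded because three haplotypes carry the derived allele. Tallying doubletons per haplotype pair, grouped by population, is one simple way to summarise recent shared ancestry before any dating is attempted.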
|
45
|
Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet 2014; 46:912-918. [PMID: 25017105 DOI: 10.1038/ng.3036] [Citation(s) in RCA: 689] [Impact Index Per Article: 68.9] [Received: 11/22/2013] [Accepted: 06/23/2014] [Indexed: 12/19/2022]
Abstract
High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls.
|
46
|
Abstract
Germline mutation determines rates of molecular evolution, genetic diversity, and fitness load. In humans, the average point mutation rate is 1.2 × 10⁻⁸ per base pair per generation, with every additional year of father's age contributing two mutations across the genome and males contributing three to four times as many mutations as females. To assess whether such patterns are shared with our closest living relatives, we sequenced the genomes of a nine-member pedigree of Western chimpanzees, Pan troglodytes verus. Our results indicate a mutation rate of 1.2 × 10⁻⁸ per base pair per generation, but a male contribution seven to eight times that of females and a paternal age effect of three mutations per year of father's age. Thus, mutation rates and patterns differ between closely related species.
|
47
|
Clinical whole-genome sequencing in severe early-onset epilepsy reveals new genes and improves molecular diagnosis. Hum Mol Genet 2014; 23:3200-11. [PMID: 24463883 PMCID: PMC4030775 DOI: 10.1093/hmg/ddu030] [Citation(s) in RCA: 185] [Impact Index Per Article: 18.5] [Indexed: 11/13/2022] Open
Abstract
In severe early-onset epilepsy, precise clinical and molecular genetic diagnosis is complex, as many metabolic and electro-physiological processes have been implicated in disease causation. The clinical phenotypes share many features such as complex seizure types and developmental delay. Molecular diagnosis has historically been confined to sequential testing of candidate genes known to be associated with specific sub-phenotypes, but the diagnostic yield of this approach can be low. We conducted whole-genome sequencing (WGS) on six patients with severe early-onset epilepsy who had previously been refractory to molecular diagnosis, and their parents. Four of these patients had a clinical diagnosis of Ohtahara Syndrome (OS) and two patients had severe non-syndromic early-onset epilepsy (NSEOE). In two OS cases, we found de novo non-synonymous mutations in the genes KCNQ2 and SCN2A. In a third OS case, WGS revealed paternal isodisomy for chromosome 9, leading to identification of the causal homozygous missense variant in KCNT1, which produced a substantial increase in potassium channel current. The fourth OS patient had a recessive mutation in PIGQ that led to exon skipping and defective glycophosphatidyl inositol biosynthesis. The two patients with NSEOE had likely pathogenic de novo mutations in CBL and CSNK1G1, respectively. Mutations in these genes were not found among 500 additional individuals with epilepsy. This work reveals two novel genes for OS, KCNT1 and PIGQ. It also uncovers unexpected genetic mechanisms and emphasizes the power of WGS as a clinical tool for making molecular diagnoses, particularly for highly heterogeneous disorders.
|
48
|
Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nat Genet 2013; 45:1353-60. [PMID: 24076602 PMCID: PMC3832895 DOI: 10.1038/ng.2770] [Citation(s) in RCA: 980] [Impact Index Per Article: 89.1] [Received: 04/24/2013] [Accepted: 09/03/2013] [Indexed: 12/13/2022]
Abstract
Using the ImmunoChip custom genotyping array, we analyzed 14,498 subjects with multiple sclerosis and 24,091 healthy controls for 161,311 autosomal variants and identified 135 potentially associated regions (P < 1.0 × 10⁻⁴). In a replication phase, we combined these data with previous genome-wide association study (GWAS) data from an independent 14,802 subjects with multiple sclerosis and 26,703 healthy controls. In these 80,094 individuals of European ancestry, we identified 48 new susceptibility variants (P < 5.0 × 10⁻⁸), 3 of which we found after conditioning on previously identified variants. Thus, there are now 110 established multiple sclerosis risk variants at 103 discrete loci outside of the major histocompatibility complex. With high-resolution Bayesian fine mapping, we identified five regions where one variant accounted for more than 50% of the posterior probability of association. This study enhances the catalog of multiple sclerosis risk variants and illustrates the value of fine mapping in the resolution of GWAS signals.
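The combined sample size quoted above can be checked directly from the per-phase counts. The snippet below uses only numbers stated in the abstract; the variable names are illustrative.

```python
# Sample counts as quoted in the abstract.
discovery = {"cases": 14_498, "controls": 24_091}    # ImmunoChip phase
replication = {"cases": 14_802, "controls": 26_703}  # previous GWAS data

total_cases = discovery["cases"] + replication["cases"]
total_controls = discovery["controls"] + replication["controls"]
total = total_cases + total_controls

print(total_cases, total_controls, total)  # 29300 50794 80094
```

The total of 80,094 matches the number of individuals of European ancestry reported for the combined analysis.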
|
49
|
Arabidopsis meiotic crossover hot spots overlap with H2A.Z nucleosomes at gene promoters. Nat Genet 2013; 45:1327-36. [PMID: 24056716 PMCID: PMC3812125 DOI: 10.1038/ng.2766] [Citation(s) in RCA: 247] [Impact Index Per Article: 22.5] [Received: 06/12/2013] [Accepted: 08/26/2013] [Indexed: 12/13/2022]
Abstract
PRDM9 directs human meiotic crossover hot spots to intergenic sequence motifs, whereas budding yeast hot spots overlap regions of low nucleosome density (LND) in gene promoters. To investigate hot spots in plants, which lack PRDM9, we used coalescent analysis of genetic variation in Arabidopsis thaliana. Crossovers increased toward gene promoters and terminators, and hot spots were associated with active chromatin modifications, including H2A.Z, histone H3 Lys4 trimethylation (H3K4me3), LND and low DNA methylation. Hot spot-enriched A-rich and CTT-repeat DNA motifs occurred upstream and downstream, respectively, of transcriptional start sites. Crossovers were asymmetric around promoters and were most frequent over CTT-repeat motifs and H2A.Z nucleosomes. Pollen typing, segregation and cytogenetic analysis showed decreased numbers of crossovers in the arp6 H2A.Z deposition mutant at multiple scales. During meiosis, H2A.Z forms overlapping chromosomal foci with the DMC1 and RAD51 recombinases. As arp6 reduced the number of DMC1 or RAD51 foci, H2A.Z may promote the formation or processing of meiotic DNA double-strand breaks. We propose that gene chromatin ancestrally designates hot spots within eukaryotes and PRDM9 is a derived state within vertebrates.
|
50
|
Hypervariable antigen genes in malaria have ancient roots. BMC Evol Biol 2013; 13:110. [PMID: 23725540 PMCID: PMC3680017 DOI: 10.1186/1471-2148-13-110] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 08/13/2012] [Accepted: 05/06/2013] [Indexed: 01/07/2023] Open
Abstract
Background: The var genes of the human malaria parasite Plasmodium falciparum are highly polymorphic loci coding for the erythrocyte membrane proteins 1 (PfEMP1), which are responsible for the cytoadherence of P. falciparum infected red blood cells to the human vasculature. Cytoadhesion, coupled with differential expression of var genes, contributes to virulence and allows the parasite to establish chronic infections by evading detection by the host’s immune system. Although studying genetic diversity is a major focus of recent work on the var genes, little is known about the gene family's origin and evolutionary history. Results: Using a novel hidden Markov model-based approach and var sequences assembled from additional isolates and species, we reveal elements of both the early evolution of the var genes and recent diversifying events. We compare sequences of the var gene DBLα domains from divergent isolates of P. falciparum (3D7 and HB3) and a closely related species, Plasmodium reichenowi. We find that the gene family is equally large in P. reichenowi and P. falciparum, with a minimum of 51 var genes in the P. reichenowi genome (compared to 61 in 3D7 and a minimum of 48 in HB3). In addition, we are able to define large, continuous blocks of homologous sequence among P. falciparum and P. reichenowi var gene DBLα domains. These results reveal that the contemporary structure of the var gene family was present before the divergence of P. falciparum and P. reichenowi, estimated to be between 2.5 and 6 million years ago. We also reveal that recombination has played an important and traceable role in both the establishment and the maintenance of diversity in the sequences. Conclusions: Despite the remarkable diversity and rapid evolution found in these loci within and among P. falciparum populations, the basic structure of these domains and the gene family is surprisingly old and stable. Revealing a common structure as well as conserved sequence between two species also has implications for developing new primate-parasite models for studying the pathology and immunology of falciparum malaria, and for studying the population genetics of var genes and associated virulence phenotypes.
|