1
Chrisman BS, Paskov KM, He C, Jung JY, Stockham N, Washington PY, Wall DP. A Method for Localizing Non-Reference Sequences to the Human Genome. Pac Symp Biocomput 2022; 27:313-324. [PMID: 34890159 PMCID: PMC8730539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Abstract
As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics' improvements in human health are distributed globally and equitably. An important step toward health equity is to improve the human reference genome so that it captures global diversity by including a wide variety of alternative haplotypes, sequences that are not currently captured on the reference genome. We present a method that localizes 100-base-pair (bp) sequences extracted from short-read sequencing and that can ultimately be used to identify which regions of the human genome non-reference sequences belong to. We extract reads that do not align to the reference genome and compute the population's distribution of 100-mers found within the unmapped reads. We use genetic data from families to identify shared genetic material between siblings and match the distribution of unmapped k-mers to these inheritance patterns to determine the most likely genomic region of a k-mer. We perform this localization with two highly interpretable methods of artificial intelligence: a computationally tractable Hidden Markov Model coupled to a Maximum Likelihood Estimator. Using a set of alternative haplotypes with known locations on the genome, we show that our algorithm is able to localize 96% of k-mers with over 90% accuracy and less than 1 Mb median resolution. As the collection of sequenced human genomes grows larger and more diverse, we hope that this method can be used to improve the human reference genome, a critical step in addressing precision medicine's diversity crisis.
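The k-mer counting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `count_kmers`, the choice to skip k-mers containing the ambiguous base `N`, and the toy reads are all assumptions made for illustration, and the sketch covers only the counting of k-mers within reads, not the Hidden Markov Model or the Maximum Likelihood Estimator.

```python
from collections import Counter
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int = 100) -> Counter:
    """Count every length-k substring (k-mer) across a collection of reads.

    Hypothetical helper: k-mers containing the ambiguous base 'N' are
    skipped, which is an assumption of this sketch, not a detail taken
    from the abstract.
    """
    counts: Counter = Counter()
    for read in reads:
        seq = read.upper()
        # A read of length n yields n - k + 1 overlapping k-mers.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


if __name__ == "__main__":
    # Toy reads with k=4 so the result is easy to inspect by hand.
    toy_reads = ["ACGTAC", "CGTACG", "ACGNAC"]
    for kmer, n in sorted(count_kmers(toy_reads, k=4).items()):
        print(kmer, n)
```

With real data, the reads would be the sequences that failed to align to the reference, and `k` would stay at its default of 100 to match the 100-mers described above.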
Affiliation(s)
- Kelley M Paskov
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Chloe He
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Jae-Yoon Jung
  - Department of Pediatrics, Stanford University, Stanford, CA 94305, USA
- Nate Stockham
  - Department of Neuroscience, Stanford University, Stanford, CA 94305, USA
- Dennis Paul Wall
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
  - Department of Pediatrics, Stanford University, Stanford, CA 94305, USA
2
Chrisman BS, Paskov K, Stockham N, Tabatabaei K, Jung JY, Washington P, Varma M, Sun MW, Maleki S, Wall DP. Indels in SARS-CoV-2 occur at template-switching hotspots. BioData Min 2021; 14:20. [PMID: 33743803 PMCID: PMC7980745 DOI: 10.1186/s13040-021-00251-0] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 11/16/2020] [Accepted: 02/23/2021] [Indexed: 11/10/2022] Open
Abstract
The evolutionary dynamics of SARS-CoV-2 have been carefully monitored since the COVID-19 pandemic began in December 2019. However, analysis has focused primarily on single nucleotide polymorphisms and largely ignored the role of insertions and deletions (indels) as well as recombination in SARS-CoV-2 evolution. Using sequences from the GISAID database, we catalogue over 100 insertions and deletions in the SARS-CoV-2 consensus sequences. We hypothesize that these indels are artifacts of recombination events between SARS-CoV-2 replicates whereby RNA-dependent RNA polymerase (RdRp) re-associates with a homologous template at a different locus ("imperfect homologous recombination"). We provide several independent pieces of evidence that support this hypothesis. (1) The indels from the GISAID consensus sequences are clustered at specific regions of the genome. (2) These regions are also enriched for 5' and 3' breakpoints in the transcription regulatory site (TRS) independent transcriptome, presumably sites of RdRp template-switching. (3) Within raw reads, these indel hotspots have cases of both high intra-host heterogeneity and intra-host homogeneity, suggesting that these indels are both consequences of de novo recombination events within a host and artifacts of previous recombination. We briefly analyze the indels in the context of RNA secondary structure, noting that indels preferentially occur in "arms" and loop structures of the predicted folded RNA, suggesting that secondary structure may be a mechanism for TRS-independent template-switching in SARS-CoV-2 or other coronaviruses. These insights into the relationship between structural variation and recombination in SARS-CoV-2 can improve our reconstructions of the SARS-CoV-2 evolutionary history as well as our understanding of the process of RdRp template-switching in RNA viruses.
Affiliation(s)
- Kelley Paskov
  - Department of Biomedical Data Science, Stanford University, Stanford, USA
- Nate Stockham
  - Department of Neuroscience, Stanford University, Stanford, USA
- Kevin Tabatabaei
  - Faculty of Health Sciences, McMaster University, Hamilton, Canada
- Jae-Yoon Jung
  - Department of Biomedical Data Science, Stanford University, Stanford, USA
- Peter Washington
  - Department of Bioengineering, Stanford University, Stanford, USA
- Maya Varma
  - Department of Computer Science, Stanford University, Stanford, USA
- Min Woo Sun
  - Department of Biomedical Data Science, Stanford University, Stanford, USA
- Sepideh Maleki
  - Department of Computer Science, University of Texas Austin, Austin, USA
- Dennis P Wall
  - Department of Biomedical Data Science, Stanford University, Stanford, USA
  - Department of Pediatrics (Systems Medicine), Stanford University, Stanford, USA
3
Campbell KS, Chrisman BS, Campbell SG. Multiscale Modeling of Cardiovascular Function Predicts That the End-Systolic Pressure Volume Relationship Can Be Targeted via Multiple Therapeutic Strategies. Front Physiol 2020; 11:1043. [PMID: 32973561 PMCID: PMC7466769 DOI: 10.3389/fphys.2020.01043] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/16/2019] [Accepted: 07/29/2020] [Indexed: 01/01/2023] Open
Abstract
Most patients who develop heart failure are unable to elevate their cardiac output on demand due to impaired contractility and/or reduced ventricular filling. Despite decades of research, few effective therapies for heart failure have been developed. In part, this may reflect the difficulty of predicting how perturbations to molecular-level mechanisms that are induced by drugs will scale up to modulate system-level properties such as blood pressure. Computer modeling might help with this process and thereby accelerate the development of better therapies for heart failure. This manuscript presents a new multiscale model that uses a single contractile element to drive an idealized ventricle that pumps blood around a closed circulation. The contractile element was formed by linking an existing model of dynamically coupled myofilaments with a well-established model of myocyte electrophysiology. The resulting framework spans from molecular-level events (including opening of ion channels and transitions between different myosin states) to properties such as ejection fraction that can be measured in patients. Initial calculations showed that the model reproduces many aspects of normal cardiovascular physiology including, for example, pressure-volume loops. Subsequent sensitivity tests then quantified how each model parameter influenced a range of system-level properties. The first key finding was that the End Systolic Pressure Volume Relationship, a classic index of cardiac contractility, was ∼50% more sensitive to parameter changes than any other system-level property. The second important result was that parameters that primarily affect ventricular filling, such as passive stiffness and Ca2+ reuptake via sarco/endoplasmic reticulum Ca2+-ATPase (SERCA), also have a major impact on systolic properties including stroke work, myosin ATPase, and maximum ventricular pressure. These results reinforce the impact of diastolic function on ventricular performance and identify the End Systolic Pressure Volume Relationship as a particularly sensitive system-level property that can be targeted using multiple therapeutic strategies.
Affiliation(s)
- Kenneth S Campbell
  - Division of Cardiovascular Medicine, Department of Physiology, University of Kentucky, Lexington, KY, United States
- Stuart G Campbell
  - Department of Biomedical Engineering, Yale University, New Haven, CT, United States
4
Varma M, Paskov KM, Jung JY, Chrisman BS, Stockham NT, Washington PY, Wall DP. Outgroup Machine Learning Approach Identifies Single Nucleotide Variants in Noncoding DNA Associated with Autism Spectrum Disorder. Pac Symp Biocomput 2019; 24:260-271. [PMID: 30864328 PMCID: PMC6417813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Autism spectrum disorder (ASD) is a heritable neurodevelopmental disorder affecting 1 in 59 children. While noncoding genetic variation has been shown to play a major role in many complex disorders, the contribution of these regions to ASD susceptibility remains unclear. Genetic analyses of ASD typically use unaffected family members as controls; however, we hypothesize that this method does not effectively elevate variant signal in the noncoding region due to family members having subclinical phenotypes arising from common genetic mechanisms. In this study, we use a separate, unrelated outgroup of individuals with progressive supranuclear palsy (PSP), a neurodegenerative condition with no known etiological overlap with ASD, as a control population. We use whole genome sequencing data from a large cohort of 2182 children with ASD and 379 controls with PSP, sequenced at the same facility with the same machines and variant calling pipeline, in order to investigate the role of noncoding variation in the ASD phenotype. We analyze seven major types of noncoding variants: microRNAs, human accelerated regions, hypersensitive sites, transcription factor binding sites, DNA repeat sequences, simple repeat sequences, and CpG islands. After identifying and removing batch effects between the two groups, we trained an ℓ1-regularized logistic regression classifier to predict ASD status from each set of variants. The classifier trained on simple repeat sequences performed well on a held-out test set (AUC-ROC = 0.960); this classifier was also able to differentiate ASD cases from controls when applied to a completely independent dataset (AUC-ROC = 0.960). This suggests that variation in simple repeat regions is predictive of the ASD phenotype and may contribute to ASD risk. Our results show the importance of the noncoding region and the utility of independent control groups in effectively linking genetic variation to disease phenotype for complex disorders.
Affiliation(s)
- Maya Varma
  - Departments of Computer Science, Stanford University, Stanford, CA 94305, USA
- Kelley Marie Paskov
  - Departments of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Jae-Yoon Jung
  - Departments of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
  - Departments of Pediatrics, Stanford University, Stanford, CA 94305, USA
- Dennis Paul Wall
  - Departments of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
  - Departments of Pediatrics, Stanford University, Stanford, CA 94305, USA