1
Hahn G, Lutz SM, Hecker J, Prokopenko D, Cho MH, Silverman EK, Weiss ST, Lange C. Fast computation of the eigensystem of genomic similarity matrices. BMC Bioinformatics 2024; 25:43. [PMID: 38273228 PMCID: PMC10811951 DOI: 10.1186/s12859-024-05650-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 01/11/2024] [Indexed: 01/27/2024] Open
Abstract
The computation of a similarity measure for genomic data is a standard tool in computational genetics. The principal components of such matrices are routinely used to correct for biases due to confounding by population stratification, for instance in linear regressions. However, the calculation of both a similarity matrix and its singular value decomposition (SVD) is computationally intensive. The contribution of this article is threefold. First, we demonstrate that the calculation of three matrices (called the covariance matrix, the weighted Jaccard matrix, and the genomic relationship matrix) can be reformulated in a unified way which allows for the application of a randomized SVD algorithm, which is faster than the traditional computation. The fast SVD algorithm we present is adapted from an existing randomized SVD algorithm and ensures that all computations are carried out in sparse matrix algebra. The algorithm only assumes that row-wise and column-wise subtraction and multiplication of a vector with a sparse matrix are available, operations that are efficiently implemented in common sparse matrix packages. An exception is the so-called Jaccard matrix, which does not have a structure applicable for the fast SVD algorithm. Second, an approximate Jaccard matrix is introduced to which the fast SVD computation is applicable. Third, we establish guaranteed theoretical bounds on the accuracy (in [Formula: see text] norm and angle) between the principal components of the Jaccard matrix and the ones of our proposed approximation, thus putting the proposed Jaccard approximation on a solid mathematical foundation, and derive the theoretical runtime of our algorithm. We illustrate that the approximation error is low in practice and empirically verify the theoretical runtime scalings on both simulated data and data of the 1000 Genomes Project.
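As a rough illustration of the idea in this abstract, the following is a minimal sketch of a generic randomized SVD applied to a column-centered matrix without ever forming the centered matrix. It is not the authors' implementation: the function name `randomized_centered_svd` and its parameters (`oversample`, `n_iter`, `seed`) are assumptions made for this example, and only the covariance-type case (centering by column means) is shown.

```python
import numpy as np

def randomized_centered_svd(X, k, oversample=10, n_iter=2, seed=0):
    """Approximate top-k SVD of the column-centered matrix X - 1 mu^T.

    X is only touched through the products X @ B and X.T @ B, so it could
    also be a sparse matrix; the dense centered matrix is never built.
    (Hypothetical helper for illustration, not taken from the paper.)
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    mu = np.asarray(X.mean(axis=0)).ravel()    # column means, length m
    ones = np.ones(n)

    def mul(B):
        # (X - 1 mu^T) @ B for an m x r matrix B
        return X @ B - np.outer(ones, mu @ B)

    def mul_t(B):
        # (X - 1 mu^T).T @ B for an n x r matrix B
        return X.T @ B - np.outer(mu, ones @ B)

    # Random range finder followed by a few power iterations.
    Q, _ = np.linalg.qr(mul(rng.standard_normal((m, k + oversample))))
    for _ in range(n_iter):
        Z, _ = np.linalg.qr(mul_t(Q))
        Q, _ = np.linalg.qr(mul(Z))

    # Small projected problem: B = Q.T @ (X - 1 mu^T).
    B = mul_t(Q).T
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]
```

The columns of the returned `U` play the role of approximate principal components of the samples. The design point the sketch tries to show is that centering is applied as a rank-one correction to each matrix product, so sparsity of `X` is preserved throughout.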
Affiliation(s)
- Georg Hahn
  - T.H. Chan School of Public Health, Harvard University, Boston, MA, 02115, USA
- Sharon M Lutz
  - T.H. Chan School of Public Health, Harvard University, Boston, MA, 02115, USA
- Julian Hecker
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, 02115, USA
- Dmitry Prokopenko
  - Massachusetts General Hospital, Harvard University, Boston, MA, 02114, USA
- Michael H Cho
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, 02115, USA
- Edwin K Silverman
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, 02115, USA
- Scott T Weiss
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, 02115, USA
- Christoph Lange
  - T.H. Chan School of Public Health, Harvard University, Boston, MA, 02115, USA
2
Lee S, Hahn G, Hecker J, Lutz SM, Mullin K, Alzheimer’s Disease Neuroimaging Initiative (ADNI), Hide W, Bertram L, DeMeo DL, Tanzi RE, Lange C, Prokopenko D. A comparison between similarity matrices for principal component analysis to assess population stratification in sequenced genetic data sets. Brief Bioinform 2023; 24:bbac611. [PMID: 36585781 PMCID: PMC9851291 DOI: 10.1093/bib/bbac611] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/13/2022] [Revised: 12/07/2022] [Accepted: 12/11/2022] [Indexed: 01/01/2023] Open
Abstract
Genetic similarity matrices are commonly used to assess population substructure (PS) in genetic studies. Through simulation studies and by the application to whole-genome sequencing (WGS) data, we evaluate the performance of three genetic similarity matrices: the unweighted and weighted Jaccard similarity matrices and the genetic relationship matrix. We describe different scenarios that can create numerical pitfalls and lead to incorrect conclusions in some instances. We consider scenarios in which PS is assessed based on loci that are located across the genome ('globally') and based on loci from a specific genomic region ('locally'). We also compare scenarios in which PS is evaluated based on loci from different minor allele frequency bins: common (>5%), low-frequency (5-0.5%) and rare (<0.5%) single-nucleotide variations (SNVs). Overall, we observe that all approaches provide the best clustering performance when computed based on rare SNVs. The performance of the similarity matrices is very similar for common and low-frequency variants, but for rare variants, the unweighted Jaccard matrix provides preferable clustering features. Based on visual inspection and in terms of standard clustering metrics, its clusters are the densest and the best separated in the principal component analysis of variants with rare SNVs compared with the other methods and different allele frequency cutoffs. In an application, we assessed the role of rare variants on local and global PS, using WGS data from multiethnic Alzheimer's disease data sets and European or East Asian populations from the 1000 Genomes Project.
Affiliation(s)
- Sanghun Lee
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
  - Department of Medical Consilience, Division of Medicine, Graduate school, Dankook University, South Korea
  - NH Institute for Natural Product Research, Myungji Hospital, South Korea
- Georg Hahn
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
- Julian Hecker
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
- Sharon M Lutz
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
  - Department of Population Medicine, Harvard Pilgrim Health Care Institute, Boston, MA, USA
- Kristina Mullin
  - Genetics and Aging Unit and McCance Center for Brain Health, Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
- Winston Hide
  - Harvard Medical School, Boston, MA, USA
  - Department of Pathology, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Lars Bertram
  - Lübeck Interdisciplinary Platform for Genome Analytics, University of Lübeck, Lübeck, Germany
  - Department of Psychology, University of Oslo, Oslo, Norway
- Dawn L DeMeo
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
- Rudolph E Tanzi
  - Harvard Medical School, Boston, MA, USA
  - Genetics and Aging Unit and McCance Center for Brain Health, Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
- Christoph Lange
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
- Dmitry Prokopenko
  - Harvard Medical School, Boston, MA, USA
  - Genetics and Aging Unit and McCance Center for Brain Health, Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
3
Hahn G, Lee S, Prokopenko D, Abraham J, Novak T, Hecker J, Cho M, Khurana S, Baden LR, Randolph AG, Weiss ST, Lange C. Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest. BMC Bioinformatics 2022; 23:547. [PMID: 36536276 PMCID: PMC9761049 DOI: 10.1186/s12859-022-05105-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/26/2022] [Accepted: 12/07/2022] [Indexed: 12/23/2022] Open
Abstract
As of June 2022, the GISAID database contains more than 11 million SARS-CoV-2 genomes, including several thousand nucleotide sequences for the most common variants such as delta or omicron. These SARS-CoV-2 strains have been collected from patients around the world since the beginning of the pandemic. We start by assessing the similarity of all pairs of nucleotide sequences using the Jaccard index and principal component analysis. As shown previously in the literature, an unsupervised cluster analysis applied to the SARS-CoV-2 genomes results in clusters of sequences according to certain characteristics such as their strain or their clade. Importantly, we observe that nucleotide sequences of common variants are often outliers in clusters of sequences stemming from variants identified earlier on during the pandemic. Motivated by this finding, we are interested in applying outlier detection to nucleotide sequences. We demonstrate that nucleotide sequences of common variants (such as alpha, delta, or omicron) can be identified solely based on a statistical outlier criterion. We argue that outlier detection might be a useful surveillance tool to identify emerging variants in real time as the pandemic progresses.
Affiliation(s)
- Georg Hahn
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, 02115, USA
- Sanghun Lee
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, 02115, USA
  - Department of Medical Consilience, Graduate School, Dankook University, Yongin, South Korea
- Dmitry Prokopenko
  - Genetics and Aging Research Unit, Department of Neurology, McCance Center for Brain Health, Massachusetts General Hospital, Boston, MA, 02114, USA
- Jonathan Abraham
  - Department of Microbiology, Harvard Medical School, Blavatnik Institute, 77 Avenue Louis Pasteur, Boston, MA, 02115, USA
- Tanya Novak
  - Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA, 02115, USA
- Julian Hecker
  - Harvard Medical School, Harvard University, Boston, MA, 02115, USA
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, MA, 02115, USA
- Michael Cho
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, MA, 02115, USA
- Lindsey R Baden
  - Division of Infectious Diseases, Harvard Medical School, Brigham and Women's Hospital, Boston, MA, 02115, USA
- Adrienne G Randolph
  - Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, MA, 02115, USA
  - Harvard Medical School, Harvard University, Boston, MA, 02115, USA
- Scott T Weiss
  - Harvard Medical School, Harvard University, Boston, MA, 02115, USA
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, MA, 02115, USA
- Christoph Lange
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, 02115, USA
  - Harvard Medical School, Harvard University, Boston, MA, 02115, USA
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, MA, 02115, USA
4
Hahn G, Wu CM, Lee S, Lutz SM, Khurana S, Baden LR, Haneuse S, Qiao D, Hecker J, DeMeo DL, Tanzi RE, Choudhary MC, Etemad B, Mohammadi A, Esmaeilzadeh E, Cho MH, Li JZ, Randolph AG, Laird NM, Weiss ST, Silverman EK, Ribbeck K, Lange C. Genome-wide association analysis of COVID-19 mortality risk in SARS-CoV-2 genomes identifies mutation in the SARS-CoV-2 spike protein that colocalizes with P.1 of the Brazilian strain. Genet Epidemiol 2021; 45:685-693. [PMID: 34159627 PMCID: PMC8426743 DOI: 10.1002/gepi.22421] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 02/10/2021] [Revised: 05/10/2021] [Accepted: 05/17/2021] [Indexed: 12/05/2022]
Abstract
SARS‐CoV‐2 mortality has been extensively studied in relation to host susceptibility. How sequence variations in the SARS‐CoV‐2 genome affect pathogenicity is poorly understood. Starting in October 2020, using the methodology of genome‐wide association studies (GWAS), we looked at the association between whole‐genome sequencing (WGS) data of the virus and COVID‐19 mortality as a potential method of early identification of highly pathogenic strains to target for containment. While continuously updating our analysis, in December 2020 we analyzed 7548 single‐stranded SARS‐CoV‐2 genomes of COVID‐19 patients in the GISAID database and associated variants with mortality using a logistic regression. In total, evaluating 29,891 sequenced loci of the viral genome for association with patient/host mortality, two loci, at 12,053 and 25,088 bp, achieved genome‐wide significance (p values of 4.09e−09 and 4.41e−23, respectively), though only 25,088 bp remained significant in follow‐up analyses. Our association findings were exclusively driven by the samples that were submitted from Brazil (p value of 4.90e−13 for 25,088 bp). The mutation frequency of 25,088 bp in the Brazilian samples on GISAID has rapidly increased from about 0.4 in October/December 2020 to 0.77 in March 2021. Although GWAS methodology is suitable for samples in which mutation frequencies vary between geographical regions, it cannot account for mutation frequencies that change rapidly over time, rendering a GWAS follow‐up analysis of the GISAID samples that have been submitted after December 2020 invalid. The locus at 25,088 bp is located in the P.1 strain, which later (April 2021) became one of the distinguishing loci (precisely, substitution V1176F) of the Brazilian strain as defined by the Centers for Disease Control. Specifically, the mutations at 25,088 bp occur in the S2 subunit of the SARS‐CoV‐2 spike protein, which plays a key role in viral entry of target host cells. Since the mutations alter amino acid coding sequences, they potentially impose structural changes that could enhance viral infectivity and symptom severity. Our analysis suggests that GWAS methodology can provide suitable analysis tools for the real‐time detection of new, more transmissible and pathogenic viral strains in databases such as GISAID, though new approaches are needed to accommodate rapidly changing mutation frequencies over time, in the presence of simultaneously changing case/control ratios. Improvements of the associated metadata/patient information in terms of quality and availability will also be important to fully utilize the potential of GWAS methodology in this field.
Affiliation(s)
- Georg Hahn
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
- Chloe M Wu
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Sanghun Lee
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
  - Department of Medical Consilience, Graduate School, Dankook University, Yongin, South Korea
- Sharon M Lutz
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
  - PRecisiOn Medicine Translational Research (PROMoTeR) Center, Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA, USA
- Lindsey R Baden
  - Division of Infectious Diseases, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Sebastien Haneuse
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
- Dandi Qiao
  - Harvard Medical School, Harvard University, Boston, Massachusetts, USA
  - Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Julian Hecker
  - PRecisiOn Medicine Translational Research (PROMoTeR) Center, Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA, USA
  - Harvard Medical School, Harvard University, Boston, Massachusetts, USA
- Dawn L DeMeo
  - Harvard Medical School, Harvard University, Boston, Massachusetts, USA
  - Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Rudolph E Tanzi
  - Genetics and Aging Research Unit, McCance Center for Brain Health, Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
- Behzad Etemad
  - Harvard Medical School, Harvard University, Boston, Massachusetts, USA
- Abbas Mohammadi
  - Harvard Medical School, Harvard University, Boston, Massachusetts, USA
- Michael H Cho
  - Harvard Medical School, Harvard University, Boston, Massachusetts, USA
  - Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Jonathan Z Li
  - Harvard Medical School, Harvard University, Boston, Massachusetts, USA
- Adrienne G Randolph
  - Harvard Medical School, Harvard University, Boston, Massachusetts, USA
  - Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children's Hospital, Boston, Massachusetts, USA
- Nan M Laird
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
- Scott T Weiss
  - Harvard Medical School, Harvard University, Boston, Massachusetts, USA
  - Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Edwin K Silverman
  - Harvard Medical School, Harvard University, Boston, Massachusetts, USA
  - Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Katharina Ribbeck
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Christoph Lange
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
  - Harvard Medical School, Harvard University, Boston, Massachusetts, USA
  - Department of Medicine, Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
5
Hahn G, Lee S, Weiss ST, Lange C. Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus. Genet Epidemiol 2021; 45:316-323. [PMID: 33415739 PMCID: PMC8005425 DOI: 10.1002/gepi.22373] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/03/2020] [Revised: 11/19/2020] [Accepted: 11/20/2020] [Indexed: 11/11/2022]
Abstract
Over 10,000 viral genome sequences of the SARS-CoV-2 virus have been made readily available during the ongoing coronavirus pandemic since the initial genome sequence of the virus was released on the open access Virological website (http://virological.org/) early on January 11. We utilize the published data on the single stranded RNAs of 11,132 SARS-CoV-2 patients in the GISAID database, which contains fully or partially sequenced SARS-CoV-2 samples from laboratories around the world. Among many important research questions which are currently being investigated, one aspect pertains to the genetic characterization/classification of the virus. We analyze data on the nucleotide sequencing of the virus and geographic information of a subset of 7640 SARS-CoV-2 patients without missing entries that are available in the GISAID database. Instead of modeling the mutation rate, applying phylogenetic tree approaches, and so forth, we here utilize a model-free clustering approach that compares the viruses at a genome-wide level. We apply principal component analysis to a similarity matrix that compares all pairs of these SARS-CoV-2 nucleotide sequences at all loci simultaneously, using the Jaccard index. Our analysis results of the SARS-CoV-2 genome data illustrate the geographic and chronological progression of the virus, starting from the first cases that were observed in China to the current wave of cases in Europe and North America. This is in line with a phylogenetic analysis which we use to contrast our results. We also observe that, based on their sequence data, the SARS-CoV-2 viruses cluster in distinct genetic subgroups. It is the subject of ongoing research to examine whether the genetic subgroup could be related to disease outcome and its potential implications for vaccine development.
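The pipeline described in this abstract (a pairwise Jaccard similarity matrix followed by principal component analysis) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the sequences have already been encoded as a binary samples-by-loci matrix (for example, 1 where a sample differs from a reference), and the function names `jaccard_matrix` and `top_components` are hypothetical.

```python
import numpy as np

def jaccard_matrix(G):
    """Pairwise Jaccard index between the rows of a binary matrix G.

    Entry (i, j) is |A_i & A_j| / |A_i | A_j|, where A_i is the set of
    loci at which sample i carries a 1.
    """
    G = np.asarray(G, dtype=float)
    inter = G @ G.T                              # shared 1-entries per pair
    counts = G.sum(axis=1)                       # number of 1-entries per sample
    union = counts[:, None] + counts[None, :] - inter
    # Assumed convention: pairs with an empty union get similarity 0.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

def top_components(S, k=2):
    """Leading k eigenpairs of a symmetric similarity matrix S."""
    vals, vecs = np.linalg.eigh(S)               # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]
    return vals[order], vecs[:, order]
```

Plotting the first two columns returned by `top_components(jaccard_matrix(G))` against each other gives the kind of cluster plot the abstract refers to; how the binary encoding is built from the raw nucleotide sequences is left open here.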
Affiliation(s)
- Georg Hahn
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA
- Sanghun Lee
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA
  - Department of Medical Consilience, Graduate School, Dankook University, South Korea
- Scott T. Weiss
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, and Harvard Medical School, Boston, MA 02115
- Christoph Lange
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA
6
Hahn G, Lutz SM, Hecker J, Prokopenko D, Cho MH, Silverman EK, Weiss ST, Lange C, The NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium. locStra: Fast analysis of regional/global stratification in whole-genome sequencing studies. Genet Epidemiol 2021; 45:82-98. [PMID: 32929743 PMCID: PMC7856019 DOI: 10.1002/gepi.22356] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/13/2020] [Revised: 08/05/2020] [Accepted: 08/24/2020] [Indexed: 01/08/2023]
Abstract
locStra is an R package for the analysis of regional and global population stratification in whole-genome sequencing (WGS) studies, where regional stratification refers to the substructure defined by the loci in a particular region on the genome. Population substructure can be assessed based on the genetic covariance matrix, the genomic relationship matrix, and the unweighted/weighted genetic Jaccard similarity matrix. Using a sliding window approach, the regional similarity matrices are compared with the global ones, based on user-defined window sizes and metrics, for example, the correlation between regional and global eigenvectors. An algorithm for the specification of the window size is provided. As the implementation fully exploits sparse matrix algebra and is written in C++, the analysis is highly efficient. Even on single cores, for realistic study sizes (several thousand subjects, several million rare variants per subject), the runtime for the genome-wide computation of all regional similarity matrices typically does not exceed one hour, enabling an unprecedented investigation of regional stratification across the entire genome. The package is applied to three WGS studies, illustrating the varying patterns of regional substructure across the genome and its beneficial effects on association testing.
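To make the sliding window comparison concrete, here is a minimal Python sketch of one possible metric, the absolute correlation between the regional and the global first eigenvector. It does not reproduce the locStra interface or its C++ implementation: the function names, the `window` and `step` parameters, and the use of an unnormalized covariance-type similarity matrix are all assumptions made for this example.

```python
import numpy as np

def first_eigenvector(X):
    """First eigenvector of the sample-by-sample similarity X_c X_c^T,
    where X_c is X with column means removed (covariance-type matrix)."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(Xc @ Xc.T)       # ascending eigenvalues
    return vecs[:, -1]

def sliding_window_correlation(X, window, step=None):
    """Absolute correlation between global and regional first eigenvectors.

    X is a samples x loci matrix; each window covers `window` consecutive
    loci and windows advance by `step` loci (default: non-overlapping).
    """
    step = window if step is None else step
    g = first_eigenvector(X)
    out = []
    for start in range(0, X.shape[1] - window + 1, step):
        r = first_eigenvector(X[:, start:start + window])
        # Eigenvectors are only defined up to sign, hence the absolute value.
        out.append(abs(np.corrcoef(g, r)[0, 1]))
    return np.array(out)
```

Windows whose correlation with the global eigenvector is low would be candidates for regional substructure that differs from the genome-wide pattern. A dense eigendecomposition per window is used here only for brevity.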
Affiliation(s)
- Georg Hahn
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
- Sharon M. Lutz
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
- Julian Hecker
  - Department of Medicine, Brigham and Women's Hospital, Harvard University, Boston, Massachusetts, USA
- Dmitry Prokopenko
  - Massachusetts General Hospital, Harvard University, Boston, Massachusetts, USA
- Michael H. Cho
  - Department of Medicine, Brigham and Women's Hospital, Harvard University, Boston, Massachusetts, USA
- Edwin K. Silverman
  - Department of Medicine, Brigham and Women's Hospital, Harvard University, Boston, Massachusetts, USA
- Scott T. Weiss
  - Department of Medicine, Brigham and Women's Hospital, Harvard University, Boston, Massachusetts, USA
- Christoph Lange
  - Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA
7
Hahn G, Lee S, Weiss ST, Lange C. Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus. bioRxiv: The Preprint Server for Biology 2020. [PMID: 32637949 DOI: 10.1101/2020.05.05.079061] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/24/2022]
Abstract
Over 10,000 viral genome sequences of the SARS-CoV-2 virus have been made readily available during the ongoing coronavirus pandemic since the initial genome sequence of the virus was released on the open access Virological website (http://virological.org/) early on January 11. We utilize the published data on the single stranded RNAs of 11,132 SARS-CoV-2 patients in the GISAID (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017) database, which contains fully or partially sequenced SARS-CoV-2 samples from laboratories around the world. Among many important research questions which are currently being investigated, one aspect pertains to the genetic characterization/classification of the virus. We analyze data on the nucleotide sequencing of the virus and geographic information of a subset of 7,640 SARS-CoV-2 patients without missing entries that are available in the GISAID database. Instead of modelling the mutation rate, applying phylogenetic tree approaches, etc., we here utilize a model-free clustering approach that compares the viruses at a genome-wide level. We apply principal component analysis to a similarity matrix that compares all pairs of these SARS-CoV-2 nucleotide sequences at all loci simultaneously, using the Jaccard index (Jaccard, 1901; Tan et al., 2005; Prokopenko et al., 2016; Schlauch et al., 2017). Our analysis results of the SARS-CoV-2 genome data illustrate the geographic and chronological progression of the virus, starting from the first cases that were observed in China to the current wave of cases in Europe and North America. This is in line with a phylogenetic analysis which we use to contrast our results. We also observe that, based on their sequence data, the SARS-CoV-2 viruses cluster in distinct genetic subgroups. It is the subject of ongoing research to examine whether the genetic subgroup could be related to disease outcome and its potential implications for vaccine development.
8
Hahn G, Cho MH, Weiss ST, Silverman EK, Lange C. Unsupervised cluster analysis of SARS-CoV-2 genomes indicates that recent (June 2020) cases in Beijing are from a genetic subgroup that consists of mostly European and South(east) Asian samples, of which the latter are the most recent. bioRxiv: The Preprint Server for Biology 2020. [PMID: 32637951 DOI: 10.1101/2020.06.22.165936] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/25/2022]
Abstract
Research efforts during the ongoing SARS-CoV-2 pandemic have focused on viral genome sequence analysis to understand how the virus spread across the globe. Here, we assess three recently identified SARS-CoV-2 genomes in Beijing from June 2020 and attempt to determine the origin of these genomes, made available in the GISAID database. The database contains fully or partially sequenced SARS-CoV-2 samples from laboratories around the world. Including the three new samples and excluding samples with missing annotations, we analyzed 7,643 SARS-CoV-2 genomes. Using principal component analysis computed on a similarity matrix that compares all pairs of the SARS-CoV-2 nucleotide sequences at all loci simultaneously, using the Jaccard index, we find that the newly discovered virus genomes from Beijing are in a genetic cluster that consists mostly of cases from Europe and South(east) Asia. The sequences of the new cases are most related to virus genomes from a small number of cases from China (March 2020), cases from Europe (February to early May 2020), and cases from South(east) Asia (May to June 2020). These findings could suggest that the original cases of this genetic cluster originated from China in March 2020 and were re-introduced to China by transmissions from samples from South(east) Asia between April and June 2020.
9
An J, Won S, Lutz SM, Hecker J, Lange C. Effect of population stratification on SNP-by-environment interaction. Genet Epidemiol 2019; 43:1046-1055. [PMID: 31429121 DOI: 10.1002/gepi.22250] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/21/2019] [Revised: 06/04/2019] [Accepted: 07/11/2019] [Indexed: 11/10/2022]
Abstract
False-positive rates in genome-wide association analyses are affected by population stratification, and if stratification is not correctly adjusted for, the statistical analysis can also produce many false-negative findings. Various approaches have therefore been proposed to adjust for this problem in genome-wide association studies. However, despite its importance, few studies have addressed it in genome-wide single nucleotide polymorphism (SNP)-by-environment interaction analyses. In this report, we illustrate which scenarios can lead to inflated false-positive rates in association mapping and describe an approach for maintaining the overall type-1 error rate.
Affiliation(s)
- Jaehoon An
  - Department of Public Health Sciences, Graduate School of Public Health, Seoul National University, Seoul, South Korea
- Sungho Won
  - Department of Public Health Sciences, Graduate School of Public Health, Seoul National University, Seoul, South Korea
  - Interdisciplinary Program for Bioinformatics, College of Natural Science, Seoul National University, Seoul, South Korea
  - Institute of Health and Environment, Seoul National University, Seoul, South Korea
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts
- Sharon M Lutz
  - Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, Massachusetts
- Julian Hecker
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts
- Christoph Lange
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts