Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

Total Articles: 32 (from Reference Citation Analysis)
Article PDFs: 12
Cited by > 0: 24
Searched Name: Petros Drineas

Number	Citation Analysis
1	Multiomic approach and Mendelian randomization analysis identify causal associations between blood biomarkers and subcortical brain structure volumes. Neuroimage 2023;284:120466. [PMID: 37995919 DOI: 10.1016/j.neuroimage.2023.120466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Revised: 10/17/2023] [Accepted: 11/20/2023] [Indexed: 11/25/2023]
Abstract: Alterations in subcortical brain structure volumes have been found to be associated with several neurodegenerative and psychiatric disorders. At the same time, genome-wide association studies (GWAS) have identified numerous common variants associated with brain structure. In this study, we integrate these findings, aiming to identify proteins, metabolites, or microbes that have a putative causal association with subcortical brain structure volumes via a two-sample Mendelian randomization approach. This method uses genetic variants as instrument variables to identify potentially causal associations between an exposure and an outcome. The exposure data that we analyzed comprised genetic associations for 2994 plasma proteins, 237 metabolites, and 103 microbial genera. The outcome data included GWAS data for seven subcortical brain structure volumes including accumbens, amygdala, caudate, hippocampus, pallidum, putamen, and thalamus. Eleven proteins and six metabolites were found to have a significant association with subcortical structure volumes, with nine proteins and five metabolites replicated using independent exposure data. We found causal associations between accumbens volume and plasma protease c1 inhibitor as well as strong association between putamen volume and Agouti signaling protein. Among metabolites, urate had the strongest association with thalamic volume. No significant associations were detected between the microbial genera and subcortical brain structure volumes. We also observed significant enrichment for biological processes such as proteolysis, regulation of the endoplasmic reticulum apoptotic signaling pathway, and negative regulation of DNA binding. Our findings provide insights to the mechanisms through which brain volumes may be affected in the pathogenesis of neurodevelopmental and psychiatric disorders and point to potential treatment targets for disorders that are associated with subcortical brain structure volumes.
Key Words: Mendelian randomization; Metabolites; Neurological disorders; Proteome; Subcortical brain volume
MESH Headings: Humans; Mendelian Randomization Analysis; Genome-Wide Association Study/methods; Multiomics; Brain/diagnostic imaging; Brain/pathology; Biomarkers; Magnetic Resonance Imaging/methods
2	Can polygenic risk scores help explain disease prevalence differences around the world? A worldwide investigation. BMC Genom Data 2023;24:70. [PMID: 37986041 PMCID: PMC10662565 DOI: 10.1186/s12863-023-01168-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Accepted: 10/20/2023] [Indexed: 11/22/2023]
Abstract: Complex disorders are caused by a combination of genetic, environmental and lifestyle factors, and their prevalence can vary greatly across different populations. The extent to which genetic risk, as identified by Genome Wide Association Study (GWAS), correlates to disease prevalence in different populations has not been investigated systematically. Here, we studied 14 different complex disorders and explored whether polygenic risk scores (PRS) based on current GWAS correlate to disease prevalence within Europe and around the world. A clear variation in GWAS-based genetic risk was observed based on ancestry and we identified populations that have a higher genetic liability for developing certain disorders. We found that for four out of the 14 studied disorders, PRS significantly correlates to disease prevalence within Europe. We also found significant correlations between worldwide disease prevalence and PRS for eight of the studied disorders with Multiple Sclerosis genetic risk having the highest correlation to disease prevalence. Based on current GWAS results, the across population differences in genetic risk for certain disorders can potentially be used to understand differences in disease prevalence and identify populations with the highest genetic liability. The study highlights both the limitations of PRS based on current GWAS but also the fact that in some cases, PRS may already have high predictive power. This could be due to the genetic architecture of specific disorders or increased GWAS power in some cases.
Key Words: Ancestry; Disease prevalence; GWAS; PRS; Polygenic risk score
MESH Headings: Humans; Genetic Predisposition to Disease/genetics; Genome-Wide Association Study/methods; Prevalence; Risk Factors; Multifactorial Inheritance/genetics
Grants: National Science Foundation
3	Structure-informed clustering for population stratification in association studies. BMC Bioinformatics 2023;24:411. [PMID: 37907836 PMCID: PMC10619291 DOI: 10.1186/s12859-023-05511-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2023] [Accepted: 10/02/2023] [Indexed: 11/02/2023]
Abstract: BACKGROUND: Identifying variants associated with complex traits is a challenging task in genetic association studies due to linkage disequilibrium (LD) between genetic variants and population stratification, unrelated to the disease risk. Existing methods of population structure correction use principal component analysis or linear mixed models with a random effect when modeling associations between a trait of interest and genetic markers. However, due to stringent significance thresholds and latent interactions between the markers, these methods often fail to detect genuinely associated variants. RESULTS: To overcome this, we propose CluStrat, which corrects for complex arbitrarily structured populations while leveraging the linkage disequilibrium induced distances between genetic markers. It performs an agglomerative hierarchical clustering using the Mahalanobis distance covariance matrix of the markers. In simulation studies, we show that our method outperforms existing methods in detecting true causal variants. Applying CluStrat on WTCCC2 and UK Biobank cohorts, we found biologically relevant associations in Schizophrenia and Myocardial Infarction. CluStrat was also able to correct for population structure in polygenic adaptation of height in Europeans. CONCLUSIONS: CluStrat highlights the advantages of biologically relevant distance metrics, such as the Mahalanobis distance, which captures the cryptic interactions within populations in the presence of LD better than the Euclidean distance.
Key Words: Association studies; Clustering; Populations structure
MESH Headings: Humans; Genetic Markers; Polymorphism, Single Nucleotide; Linkage Disequilibrium; Phenotype; Cluster Analysis
Grants: 1319280 Division of Information and Intelligent Systems; International Business Machines Corporation
4	PheWAS and cross-disorder analysis reveal genetic architecture, pleiotropic loci and phenotypic correlations across 11 autoimmune disorders. Front Immunol 2023;14:1147573. [PMID: 37809097 PMCID: PMC10552152 DOI: 10.3389/fimmu.2023.1147573] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/18/2023] [Accepted: 09/04/2023] [Indexed: 10/10/2023]
Abstract: Introduction: Autoimmune disorders (ADs) are a group of about 80 disorders that occur when self-attacking autoantibodies are produced due to failure in the self-tolerance mechanisms. ADs are polygenic disorders and associations with genes both in the human leukocyte antigen (HLA) region and outside of it have been described. Previous studies have shown that they are highly comorbid with shared genetic risk factors, while epidemiological studies revealed associations between various lifestyle and health-related phenotypes and ADs. Methods: Here, for the first time, we performed a comparative polygenic risk score (PRS) - Phenome Wide Association Study (PheWAS) for 11 different ADs (Juvenile Idiopathic Arthritis, Primary Sclerosing Cholangitis, Celiac Disease, Multiple Sclerosis, Rheumatoid Arthritis, Psoriasis, Myasthenia Gravis, Type 1 Diabetes, Systemic Lupus Erythematosus, Vitiligo Late Onset, Vitiligo Early Onset) and 3,254 phenotypes available in the UK Biobank that include a wide range of socio-demographic, lifestyle and health-related outcomes. Additionally, we investigated the genetic relationships of the studied ADs, calculating their genetic correlation and conducting cross-disorder GWAS meta-analyses for the observed AD clusters. Results: In total, we identified 508 phenotypes significantly associated with at least one AD PRS. 272 phenotypes were significantly associated after excluding variants in the HLA region from the PRS estimation. Through genetic correlation and genetic factor analyses, we identified four genetic factors that run across studied ADs. Cross-trait meta-analyses within each factor revealed pleiotropic genome-wide significant loci. Discussion: Overall, our study confirms the association of different factors with genetic susceptibility for ADs and reveals novel observations that need to be further explored.
Key Words: GWAS; PRS; PheWAS; autoimmune disorders; cross-disorder meta-analysis
MESH Headings: Humans; Autoimmune Diseases/genetics; Diabetes Mellitus, Type 1/genetics; HLA Antigens; Phenotype; Polymorphism, Single Nucleotide; Vitiligo
5	Multiomic approach and Mendelian randomization analysis identify causal associations between blood biomarkers and subcortical brain structure volumes. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.03.30.23287968. [PMID: 37066330 PMCID: PMC10104218 DOI: 10.1101/2023.03.30.23287968] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2023]
Abstract: Alterations in subcortical brain structure volumes have been found to be associated with several neurodegenerative and psychiatric disorders. At the same time, genome-wide association studies (GWAS) have identified numerous common variants associated with brain structure. In this study, we integrate these findings, aiming to identify proteins, metabolites, or microbes that have a putative causal association with subcortical brain structure volumes via a two-sample Mendelian randomization approach. This method uses genetic variants as instrument variables to identify potentially causal associations between an exposure and an outcome. The exposure data that we analyzed comprised genetic associations for 2,994 plasma proteins, 237 metabolites, and 103 microbial genera. The outcome data included GWAS data for seven subcortical brain structure volumes including accumbens, amygdala, caudate, hippocampus, pallidum, putamen, and thalamus. Eleven proteins and six metabolites were found to have a significant association with subcortical structure volumes. We found causal associations between amygdala volume and granzyme A as well as association between accumbens volume and plasma protease c1 inhibitor. Among metabolites, urate had the strongest association with thalamic volume. No significant associations were detected between the microbial genera and subcortical brain structure volumes. We also observed significant enrichment for biological processes such as proteolysis, regulation of the endoplasmic reticulum apoptotic signaling pathway, and negative regulation of DNA binding. Our findings provide insights to the mechanisms through which brain volumes may be affected in the pathogenesis of neurodevelopmental and psychiatric disorders and point to potential treatment targets for disorders that are associated with subcortical brain structure volumes.
Grants: R01 MH126213 NIMH NIH HHS; K23 MH085057 NIMH NIH HHS; R01 NS105746 NINDS NIH HHS; U01 NS040024 NINDS NIH HHS; K02 NS085048 NINDS NIH HHS
6	Polygenic risk score-based phenome-wide association study identifies novel associations for Tourette syndrome. Transl Psychiatry 2023;13:69. [PMID: 36823209 PMCID: PMC9950421 DOI: 10.1038/s41398-023-02341-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2022] [Revised: 01/23/2023] [Accepted: 01/27/2023] [Indexed: 02/25/2023]
Abstract: Tourette Syndrome (TS) is a complex neurodevelopmental disorder characterized by vocal and motor tics lasting more than a year. It is highly polygenic in nature with both rare and common previously associated variants. Epidemiological studies have shown TS to be correlated with other phenotypes, but large-scale phenome wide analyses in biobank level data have not been performed to date. In this study, we used the summary statistics from the latest meta-analysis of TS to calculate the polygenic risk score (PRS) of individuals in the UK Biobank data and applied a Phenome Wide Association Study (PheWAS) approach to determine the association of disease risk with a wide range of phenotypes. A total of 57 traits were found to be significantly associated with TS polygenic risk, including multiple psychosocial factors and mental health conditions such as anxiety disorder and depression. Additional associations were observed with complex non-psychiatric disorders such as Type 2 diabetes, heart palpitations, and respiratory conditions. Cross-disorder comparisons of phenotypic associations with genetic risk for other childhood-onset disorders (e.g.: attention deficit hyperactivity disorder [ADHD], autism spectrum disorder [ASD], and obsessive-compulsive disorder [OCD]) indicated an overlap in associations between TS and these disorders. ADHD and ASD had a similar direction of effect with TS while OCD had an opposite direction of effect for all traits except mental health factors. Sex-specific PheWAS analysis identified differences in the associations with TS genetic risk between males and females. Type 2 diabetes and heart palpitations were significantly associated with TS risk in males but not in females, whereas diseases of the respiratory system were associated with TS risk in females but not in males. This analysis provides further evidence of shared genetic and phenotypic architecture of different complex disorders.
Key Words: genomics; psychiatric disorders
MESH Headings: Male; Female; Humans; Tourette Syndrome/genetics; Diabetes Mellitus, Type 2; Autism Spectrum Disorder/genetics; Attention Deficit Disorder with Hyperactivity/genetics; Risk Factors
Grants: R01 NS102371 NINDS NIH HHS; Department of Health; R01 MH124679 NIMH NIH HHS; P30 AG038072 NIA NIH HHS; R01 NS105746 NINDS NIH HHS; R01 MH126213 NIMH NIH HHS; P50 HD103537 NICHD NIH HHS; National Science Foundation (NSF); U.S. Department of Health & Human Services | NIH | National Institute of Neurological Disorders and Stroke (NINDS); U.S. Department of Health & Human Services | NIH | National Institute of Mental Health (NIMH); Deutsche Forschungsgemeinschaft (German Research Foundation); KNAW Academy Professor Award (PAH/6635); Narodowe Centrum Nauki (National Science Centre); Employee of Boehringer Ingelheim Pharma
7	Reconstructing SNP allele and genotype frequencies from GWAS summary statistics. Sci Rep 2022;12:8242. [PMID: 35581276 PMCID: PMC9114146 DOI: 10.1038/s41598-022-12185-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2021] [Accepted: 04/27/2022] [Indexed: 11/24/2022]
Abstract: The emergence of genome-wide association studies (GWAS) has led to the creation of large repositories of human genetic variation, creating enormous opportunities for genetic research and worldwide collaboration. Methods that are based on GWAS summary statistics seek to leverage such records, overcoming barriers that often exist in individual-level data access while also offering significant computational savings. Such summary-statistics-based applications include GWAS meta-analysis, with and without sample overlap, and case-case GWAS. We compare performance of leading methods for summary-statistics-based genomic analysis and also introduce a novel framework that can unify usual summary-statistics-based implementations via the reconstruction of allelic and genotypic frequencies and counts (ReACt). First, we evaluate ASSET, METAL, and ReACt using both synthetic and real data for GWAS meta-analysis (with and without sample overlap) and find that, while all three methods are comparable in terms of power and error control, ReACt and METAL are faster than ASSET by a factor of at least hundred. We then proceed to evaluate performance of ReACt vs an existing method for case-case GWAS and show comparable performance, with ReACt requiring minimal underlying assumptions and being more user-friendly. Finally, ReACt allows us to evaluate, for the first time, an implementation for calculating polygenic risk score (PRS) for groups of cases and controls based on summary statistics. Our work demonstrates the power of GWAS summary-statistics-based methodologies and the proposed novel method provides a unifying framework and allows further extension of possibilities for researchers seeking to understand the genetics of complex disease.
8	Enhancing neuroimaging genetics through meta-analysis for Tourette syndrome (ENIGMA-TS): A worldwide platform for collaboration. Front Psychiatry 2022;13:958688. [PMID: 36072455 PMCID: PMC9443935 DOI: 10.3389/fpsyt.2022.958688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2022] [Accepted: 07/18/2022] [Indexed: 11/13/2022]
Abstract: Tourette syndrome (TS) is characterized by multiple motor and vocal tics, and high-comorbidity rates with other neuropsychiatric disorders. Obsessive compulsive disorder (OCD), attention deficit hyperactivity disorder (ADHD), autism spectrum disorders (ASDs), major depressive disorder (MDD), and anxiety disorders (AXDs) are among the most prevalent TS comorbidities. To date, studies on TS brain structure and function have been limited in size with efforts mostly fragmented. This leads to low-statistical power, discordant results due to differences in approaches, and hinders the ability to stratify patients according to clinical parameters and investigate comorbidity patterns. Here, we present the scientific premise, perspectives, and key goals that have motivated the establishment of the Enhancing Neuroimaging Genetics through Meta-Analysis for TS (ENIGMA-TS) working group. The ENIGMA-TS working group is an international collaborative effort bringing together a large network of investigators who aim to understand brain structure and function in TS and dissect the underlying neurobiology that leads to observed comorbidity patterns and clinical heterogeneity. Previously collected TS neuroimaging data will be analyzed jointly and integrated with TS genomic data, as well as equivalently large and already existing studies of highly comorbid OCD, ADHD, ASD, MDD, and AXD. Our work highlights the power of collaborative efforts and transdiagnostic approaches, and points to the existence of different TS subtypes. ENIGMA-TS will offer large-scale, high-powered studies that will lead to important insights toward understanding brain structure and function and genetic effects in TS and related disorders, and the identification of biomarkers that could help inform improved clinical practice.
Key Words: ENIGMA; Tourette syndrome; brain MRI; genetics; neuroimaging
9	Integrating Linguistics, Social Structure, and Geography to Model Genetic Diversity within India. Mol Biol Evol 2021;38:1809-1819. [PMID: 33481022 PMCID: PMC8097304 DOI: 10.1093/molbev/msaa321] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 01/04/2023]
Abstract: India represents an intricate tapestry of population substructure shaped by geography, language, culture, and social stratification. Although geography closely correlates with genetic structure in other parts of the world, the strict endogamy imposed by the Indian caste system and the large number of spoken languages add further levels of complexity to understand Indian population structure. To date, no study has attempted to model and evaluate how these factors have interacted to shape the patterns of genetic diversity within India. We merged all publicly available data from the Indian subcontinent into a data set of 891 individuals from 90 well-defined groups. Bringing together geography, genetics, and demographic factors, we developed Correlation Optimization of Genetics and Geodemographics to build a model that explains the observed population genetic substructure. We show that shared language along with social structure have been the most powerful forces in creating paths of gene flow in the subcontinent. Furthermore, we discover the ethnic groups that best capture the diverse genetic substructure using a ridge leverage score statistic. Integrating data from India with a data set of additional 1,323 individuals from 50 Eurasian populations, we find that Indo-European and Dravidian speakers of India show shared genetic drift with Europeans, whereas the Tibeto-Burman speaking tribal groups have maximum shared genetic drift with East Asians.
Key Words: India; South Asia; algorithms; data mining; genomics; population structure
MESH Headings: Ethnicity/genetics; Genetic Variation; Geography; Humans; India; Language; Models, Genetic; Sociological Factors
Grants: National Science Foundation
10	Constructing Compact Signatures for Individual Fingerprinting of Brain Connectomes. Front Neurosci 2021;15:549322. [PMID: 33889066 PMCID: PMC8055927 DOI: 10.3389/fnins.2021.549322] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/06/2020] [Accepted: 03/08/2021] [Indexed: 11/13/2022]
Abstract: Recent neuroimaging studies have shown that functional connectomes are unique to individuals, i.e., two distinct fMRIs taken over different sessions of the same subject are more similar in terms of their connectomes than those from two different subjects. In this study, we present new results that identify specific parts of resting state and task-specific connectomes that are responsible for the unique signatures. We show that a very small part of the connectome can be used to derive features for discriminating between individuals. A network of these features is shown to achieve excellent training and test accuracy in matching imaging datasets. We show that these features are statistically significant, robust to perturbations, invariant across populations, and are localized to a small number of structural regions of the brain. Furthermore, we show that for task-specific connectomes, the regions identified by our method are consistent with their known functional characterization. We present a new matrix sampling technique to derive computationally efficient and accurate methods for identifying the discriminating sub-connectome and support all of our claims using state-of-the-art statistical tests and computational techniques.
Key Words: dimensionality reduction; fingerprinting; functional connectomics; matrix sampling; randomized numerical linear algebra
11	TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics 2020;35:3679-3683. [PMID: 30957838 DOI: 10.1093/bioinformatics/btz157] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 08/08/2018] [Revised: 02/26/2019] [Accepted: 04/04/2019] [Indexed: 11/12/2022]
Abstract: MOTIVATION: Principal Component Analysis is a key tool in the study of population structure in human genetics. As modern datasets become increasingly larger in size, traditional approaches based on loading the entire dataset in the system memory (Random Access Memory) become impractical and out-of-core implementations are the only viable alternative. RESULTS: We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform Principal Component Analysis of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and is able to successfully operate even on commodity hardware with a system memory of just a few gigabytes. Moreover, TeraPCA has minimal dependencies on external libraries and only requires a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires <5 h (in multi-threaded mode) to accurately compute the 10 leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task. AVAILABILITY AND IMPLEMENTATION: Source code and documentation are both available at https://github.com/aritra90/TeraPCA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
12	Near Optimal Linear Algebra in the Online and Sliding Window Models. PROCEEDINGS ... ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE. SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE 2020;1:517-528. [PMID: 34421392 PMCID: PMC8375632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract: We initiate the study of numerical linear algebra in the sliding window model, where only the most recent W updates in a stream form the underlying data set. Although many existing algorithms in the sliding window model use or borrow elements from the smooth histogram framework (Braverman and Ostrovsky, FOCS 2007), we show that many interesting linear-algebraic problems, including spectral and vector induced matrix norms, generalized regression, and low-rank approximation, are not amenable to this approach in the row-arrival model. To overcome this challenge, we first introduce a unified row-sampling based framework that gives randomized algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and ℓ₁-subspace embeddings in the sliding window model, which often use nearly optimal space and achieve nearly input sparsity runtime. Our algorithms are based on "reverse online" versions of offline sampling distributions such as (ridge) leverage scores, ℓ₁ sensitivities, and Lewis weights to quantify both the importance and the recency of a row; our structural results on these distributions may be of independent interest for future algorithmic design. Although our techniques initially address numerical linear algebra in the sliding window model, our row-sampling framework rather surprisingly implies connections to the well-studied online model; our structural results also give the first sample optimal (up to lower order terms) online algorithm for low-rank approximation/projection-cost preservation. Using this powerful primitive, we give online algorithms for column/row subset selection and principal component analysis that resolves the main open question of Bhaskara et al. (FOCS 2019). We also give the first online algorithm for ℓ₁-subspace embeddings. We further formalize the connection between the online model and the sliding window model by introducing an additional unified framework for deterministic algorithms using a merge and reduce paradigm and the concept of online coresets, which we define as a weighted subset of rows of the input matrix that can be used to compute a good approximation to some given function on all of its prefixes. Our sampling based algorithms in the row-arrival online model yield online coresets, giving deterministic algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and ℓ₁-subspace embeddings in the sliding window model that use nearly optimal space.
Key Words: numerical linear algebra; online algorithms; sliding window model; streaming algorithms
Grants: R01 HG010798 NHGRI NIH HHS
13	Genetic history of the population of Crete. Ann Hum Genet 2019;83:373-388. [PMID: 31192450 PMCID: PMC6851683 DOI: 10.1111/ahg.12328] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 05/03/2018] [Revised: 02/10/2019] [Accepted: 05/01/2019] [Indexed: 01/10/2023]
Abstract: The medieval history of several populations often suffers from scarcity of contemporary records resulting in contradictory and sometimes biased interpretations by historians. This is the situation with the population of the island of Crete, which remained relatively undisturbed until the Middle Ages when multiple wars, invasions, and occupations by foreigners took place. Historians have considered the effects of the occupation of Crete by the Arabs (in the 9th and 10th centuries C.E.) and the Venetians (in the 13th to the 17th centuries C.E.) to the local population. To obtain insights on such effects from a genetic perspective, we studied representative samples from 17 Cretan districts using the Illumina 1 million or 2.5 million arrays and compared the Cretans to the populations of origin of the medieval conquerors and settlers. Highlights of our findings include (1) small genetic contributions from the Arab occupation to the extant Cretan population, (2) low genetic contribution of the Venetians to the extant Cretan population, and (3) evidence of a genetic relationship among the Cretans and Central, Northern, and Eastern Europeans, which could be explained by the settlement in the island of northern origin tribes during the medieval period. Our results show how the interaction between genetics and the historical record can help shed light on the historical record.
Key Words: crete; greece; historical genetics; medieval history; population genetics; whole-genome
14	Randomized Linear Algebra Approaches to Estimate the Von Neumann Entropy of Density Matrices. IEEE Transactions on Information Theory 2018;66:5003-5021. [PMID: 33746243 PMCID: PMC7971349 DOI: 10.1109/tit.2020.2971991] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023] Abstract: The von Neumann entropy, named after John von Neumann, is an extension of the classical concept of entropy to the field of quantum mechanics. From a numerical perspective, von Neumann entropy can be computed simply by computing all eigenvalues of a density matrix, an operation that could be prohibitively expensive for large-scale density matrices. We present and analyze three randomized algorithms to approximate von Neumann entropy of real density matrices: our algorithms leverage recent developments in the Randomized Numerical Linear Algebra (RandNLA) literature, such as randomized trace estimators, provable bounds for the power method, and the use of random projections to approximate the eigenvalues of a matrix. All three algorithms come with provable accuracy guarantees and our experimental evaluations support our theoretical findings showing considerable speedup with small loss in accuracy. Key Words: Chebyshev polynomials; Taylor polynomials; randNLA; random projections; randomized algorithms; von Neumann entropy. Grants: U01 CA198941 NCI NIH HHS.
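As a rough illustration of the randomized trace estimation mentioned in the abstract above, the following Python sketch combines a truncated Taylor expansion of the logarithm with a Hutchinson-style estimator built from random ±1 probe vectors. It is a minimal sketch under stated assumptions, not the authors' algorithms: the names `matvec` and `estimate_entropy`, the number of Taylor terms, and the number of probe vectors are hypothetical choices made here for illustration. It relies on the identity -Tr(ρ ln ρ) = Σ_{k≥1} Tr(ρ (I - ρ)^k) / k, which holds when the eigenvalues of ρ lie in [0, 1] but converges slowly when some of them are close to zero.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a dense matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def estimate_entropy(rho, num_terms=50, num_probes=20, seed=0):
    """Estimate -Tr(rho ln rho) for a real symmetric density matrix rho.

    Uses the series -rho ln(rho) = sum_{k>=1} rho (I - rho)^k / k and
    averages g^T (...) g over random +/-1 probe vectors g (Hutchinson).
    All parameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(rho)
    total = 0.0
    for _ in range(num_probes):
        g = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        v = g[:]              # v holds (I - rho)^k g, starting at k = 0
        w = matvec(rho, v)    # w holds rho (I - rho)^k g
        estimate = 0.0
        for k in range(1, num_terms + 1):
            v = [vi - wi for vi, wi in zip(v, w)]  # apply (I - rho) once more
            w = matvec(rho, v)
            estimate += sum(gi * wi for gi, wi in zip(g, w)) / k
        total += estimate
    return total / num_probes


if __name__ == "__main__":
    # Maximally mixed single-qubit state: the exact entropy is ln 2.
    rho = [[0.5, 0.0], [0.0, 0.5]]
    print(estimate_entropy(rho), math.log(2))
```

Each probe needs only matrix-vector products, so no eigendecomposition is required; the price is a truncation error from the series and a sampling error from the probes.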
15	Variant Ranker: a web-tool to rank genomic data according to functional significance. BMC Bioinformatics 2017;18:341. [PMID: 28716001 PMCID: PMC5514526 DOI: 10.1186/s12859-017-1752-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 02/20/2017] [Accepted: 07/05/2017] [Indexed: 04/09/2023] Abstract: BACKGROUND: The increasing volume and complexity of high-throughput genomic data make analysis and prioritization of variants difficult for researchers with limited bioinformatics skills. Variant Ranker allows researchers to rank identified variants and determine the most confident variants for experimental validation. RESULTS: We describe Variant Ranker, a user-friendly simple web-based tool for ranking, filtering and annotation of coding and non-coding variants. Variant Ranker facilitates the identification of causal variants based on novelty, effect and annotation information. The algorithm implements and aggregates multiple prediction algorithm scores, conservation scores, allelic frequencies, clinical information and additional open-source annotations using accessible databases via ANNOVAR. The available information for a variant is transformed into user-specified weights, which are in turn encoded into the ranking algorithm. Through its different modules, users can (i) rank a list of variants, (ii) perform genotype filtering for case-control samples, (iii) filter large amounts of high-throughput data based on user custom filter requirements and apply different models of inheritance, and (iv) perform downstream functional enrichment analysis through network visualization. Using networks, users can identify clusters of genes that belong to multiple ontology categories (like pathways, gene ontology, disease categories) and therefore expedite scientific discoveries. 
We demonstrate the utility of Variant Ranker to identify causal genes using real and synthetic datasets. Our results indicate that Variant Ranker exhibits excellent performance by correctly identifying and ranking the candidate genes. CONCLUSIONS: Variant Ranker is a freely available web server at http://paschou-lab.mbg.duth.gr/Software.html. This tool will enable users to prioritise potentially causal variants and is applicable to a wide range of sequencing data. Key Words: Next-generation sequencing; Prioritisation; Ranking. MESH Headings: Algorithms; Gene Frequency; Gene Ontology; Genetic Variation; Genomics/methods; Genotype; Humans; Internet; Sequence Analysis, DNA; Software. Grants: FP7-People-2012-ITN FP7 project EMTICS.
16	Targeted Re-Sequencing Approach of Candidate Genes Implicates Rare Potentially Functional Variants in Tourette Syndrome Etiology. Front Neurosci 2016;10:428. [PMID: 27708560 PMCID: PMC5030307 DOI: 10.3389/fnins.2016.00428] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 03/31/2016] [Accepted: 09/02/2016] [Indexed: 12/13/2022] Abstract: Although the genetic basis of Tourette Syndrome (TS) remains unclear, several candidate genes have been implicated. Using a set of 382 TS individuals of European ancestry we investigated four candidate genes for TS (HDC, SLITRK1, BTBD9, and SLC6A4) in an effort to identify possibly causal variants using a targeted re-sequencing approach by next generation sequencing technology. Identification of possible disease causing variants under different modes of inheritance was performed using the algorithms implemented in VAAST. We prioritized variants using Variant Ranker and validated five rare variants via Sanger sequencing in HDC and SLITRK1, all of which are predicted to be deleterious. Intriguingly, one of the identified variants is in linkage disequilibrium with a variant that is included among the top hits of a genome-wide association study for response to citalopram treatment, an antidepressant drug with off-label use also in obsessive compulsive disorder. Our findings provide additional evidence for the implication of these two genes in TS susceptibility and the possible role of these proteins in the pathobiology of TS should be revisited. Key Words: HDC; SLITRK1; TS candidate genes; genetic susceptibility; next generation sequencing; rare variants; targeted re-sequencing.
17	TS-EUROTRAIN: A European-Wide Investigation and Training Network on the Etiology and Pathophysiology of Gilles de la Tourette Syndrome. Front Neurosci 2016;10:384. [PMID: 27601976 PMCID: PMC4994475 DOI: 10.3389/fnins.2016.00384] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 03/16/2016] [Accepted: 08/08/2016] [Indexed: 11/26/2022] Abstract: Gilles de la Tourette Syndrome (GTS) is characterized by the presence of multiple motor and phonic tics with a fluctuating course of intensity, frequency, and severity. Up to 90% of patients with GTS present with comorbid conditions, most commonly attention-deficit/hyperactivity disorder (ADHD), and obsessive-compulsive disorder (OCD), thus providing an excellent model for the exploration of shared etiology across disorders. TS-EUROTRAIN (FP7-PEOPLE-2012-ITN, Grant Agr.No. 316978) is a Marie Curie Initial Training Network (http://ts-eurotrain.eu) that aims to elucidate the complex etiology of the onset and clinical course of GTS, investigate the neurobiological underpinnings of GTS and related disorders, translate research findings into clinical applications, and establish a pan-European infrastructure for the study of GTS. 
This includes the challenges of (i) assembling a large genetic database for the evaluation of the genetic architecture with high statistical power; (ii) exploring the role of gene-environment interactions including the effects of epigenetic phenomena; (iii) employing endophenotype-based approaches to understand the shared etiology between GTS, OCD, and ADHD; (iv) establishing a developmental animal model for GTS; (v) gaining new insights into the neurobiological mechanisms of GTS via cross-sectional and longitudinal neuroimaging studies; and (vi) partaking in outreach activities including the dissemination of scientific knowledge about GTS to the public. Fifteen partners from academia and industry and 12 PhD candidates pursue the project. Here, we aim to share the design of an interdisciplinary project, showcasing the potential of large-scale collaborative efforts in the field of GTS. Our ultimate aims are to elucidate the complex etiology and neurobiological underpinnings of GTS, translate research findings into clinical applications, and establish Pan-European infrastructure for the study of GTS and associated disorders. Key Words: Gilles de la Tourette Syndrome; Initial Training Network; animal models; etiology; genetics; neuroimaging; tourette disorder.
18	Meta-Analysis of Tourette Syndrome and Attention Deficit Hyperactivity Disorder Provides Support for a Shared Genetic Basis. Front Neurosci 2016;10:340. [PMID: 27499730 PMCID: PMC4956656 DOI: 10.3389/fnins.2016.00340] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 03/31/2016] [Accepted: 07/06/2016] [Indexed: 12/21/2022] Abstract: Gilles de la Tourette Syndrome (TS) is a childhood onset neurodevelopmental disorder, characterized phenotypically by the presence of multiple motor and vocal tics. It is often accompanied by multiple psychiatric comorbidities, with Attention Deficit/Hyperactivity Disorder (ADHD) among the most common. The extensive co-occurrence of the two disorders suggests a shared genetic background. A major step toward the elucidation of the genetic architecture of TS was undertaken by the first TS Genome-wide Association Study (GWAS) reporting 552 SNPs that were moderately associated with TS (p < 1E-3). Similarly, initial ADHD GWAS attempts and meta-analysis were not able to produce genome-wide significant findings, but have provided insight to the genetic basis of the disorder. Here, we examine the common genetic background of the two neuropsychiatric phenotypes, by meta-analyzing the 552 top hits in the TS GWAS with the results of the first ADHD GWASs. We identify 19 significant SNPs, with the top four implicated genes being TBC1D7, GUCY1A3, RAP1GDS1, and CHST11. TBC1D7 harbors the top scoring SNP, rs1866863 (p = 3.23E-07), located in a regulatory region downstream of the gene, and the third best-scoring SNP, rs2458304 (p = 2.54E-06), located within an intron of the gene. Both variants were in linkage disequilibrium with eQTL rs499818, indicating a role in the expression levels of the gene. 
TBC1D7 is the third subunit of the TSC1/TSC2 complex, an inhibitor of the mTOR signaling pathway, with a central role in cell growth and autophagy. The top genes implicated by our study indicate a complex and intricate interplay between them, warranting further investigation into a possibly shared etiological mechanism for TS and ADHD. Key Words: ADHD; CHST11; GUCY1A3; RAP1GDS1; TBC1D7; Tourette Syndrome; cross-disorder meta-analysis.
19	Feature Selection for Ridge Regression with Provable Guarantees. Neural Comput 2016;28:716-42. [DOI: 10.1162/neco_a_00816] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 11/04/2022] Abstract: We introduce single-set spectral sparsification as a deterministic sampling-based feature selection technique for regularized least-squares classification, which is the classification analog to ridge regression. The method is unsupervised and gives worst-case guarantees of the generalization power of the classification function after feature selection with respect to the classification function obtained using all features. We also introduce leverage-score sampling as an unsupervised randomized feature selection method for ridge regression. We provide risk bounds for both single-set spectral sparsification and leverage-score sampling on ridge regression in the fixed design setting and show that the risk in the sampled space is comparable to the risk in the full-feature space. We perform experiments on synthetic and real-world data sets (a subset of the TechTC-300 data sets) to support our theory. Experimental results indicate that the proposed methods perform better than the existing feature selection methods.
20	Familial early-onset dementia with complex neuropathologic phenotype and genomic background. Neurobiol Aging 2016;42:199-204. [PMID: 27143436 DOI: 10.1016/j.neurobiolaging.2016.03.012] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 10/16/2015] [Revised: 02/20/2016] [Accepted: 03/13/2016] [Indexed: 12/18/2022] Abstract: Despite significant progress in our understanding of hereditary neurodegenerative diseases, the list of genes associated with early-onset dementia is not yet complete. In the present study, we describe a familial neurodegenerative disorder characterized clinically as the behavioral and/or dysexecutive variant of Alzheimer's disease with neuroradiologic features of Alzheimer's disease, however, lacking amyloid-β deposits in the brain. Instead, we observed a complex, 4 repeat predominant, tauopathy, together with a TAR DNA-binding protein of 43 kDa proteinopathy. Whole-exome sequencing on 2 affected siblings and 1 unaffected aunt uncovered a large number of candidate genes, including LRRK2 and SYNE2. In addition, DDI1, KRBA1, and TOR1A genes possessed novel stop-gain mutations only in the patients. Pathway, gene ontology, and network interaction analysis indicated the involvement of pathways related to neurodegeneration but revealed novel aspects also. This condition does not fit into any well-characterized category of neurodegenerative disorders. Exome sequencing did not disclose a single disease-specific gene mutation suggesting that a set of genes working together in different pathways may contribute to the etiology of the complex phenotype. Key Words: Alzheimer disease; Early-onset dementia; Exome sequencing; LRRK2; TDP-43; Tau.
21	Exploring genomic structure differences and similarities between the Greek and European HapMap populations: implications for association studies. Ann Hum Genet 2013;76:472-83. [PMID: 23061745 DOI: 10.1111/j.1469-1809.2012.00730.x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/20/2023] Abstract: Studies of the genomic structure of the Greek population and Southeastern Europe are limited, despite the central position of the area as a gateway for human migrations into Europe. HapMap has provided a unique tool for the analysis of human genetic variation. Europe is represented by the CEU (Northwestern Europe) and the TSI populations (Tuscan Italians from Southern Europe), which serve as reference for the design of genetic association studies. Furthermore, genetic association findings are often transferred to unstudied populations. Although initial studies support the fact that the CEU can, in general, be used as reference for the selection of tagging SNPs in European populations, this has not been extensively studied across Europe. We set out to explore the genomic structure of the Greek population (56 individuals) and compare it to the HapMap TSI and CEU populations. We studied 1112 SNPs (27 regions, 13 chromosomes). Although the HapMap European populations are, in general, a good reference for the Greek population, regions of population differentiation do exist and results should not be light-heartedly generalized. We conclude that, perhaps due to the individual evolutionary history of each genomic region, geographic proximity is not always a perfect guide for selecting a reference population for an unstudied population.
22	Efficient genomewide selection of PCA-correlated tSNPs for genotype imputation. Ann Hum Genet 2011;75:707-22. [PMID: 21902678 DOI: 10.1111/j.1469-1809.2011.00673.x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/29/2022] Abstract: The linkage disequilibrium structure of the human genome allows identification of small sets of single nucleotide polymorphisms (SNPs) (tSNPs) that efficiently represent dense sets of markers. This structure can be translated into linear algebraic terms as evidenced by the well documented principal components analysis (PCA)-based methods. Here we apply, for the first time, PCA-based methodology for efficient genomewide tSNP selection; and explore the linear algebraic structure of the human genome. Our algorithm divides the genome into contiguous nonoverlapping windows of high linear structure. Coupling this novel window definition with a PCA-based tSNP selection method, we analyze 2.5 million SNPs from the HapMap phase 2 dataset. We show that 10-25% of these SNPs suffice to predict the remaining genotypes with over 95% accuracy. A comparison with other popular methods in the ENCODE regions indicates significant genotyping savings. We evaluate the portability of genome-wide tSNPs across a diverse set of populations (HapMap phase 3 dataset). Interestingly, African populations are good reference populations for the rest of the world. Finally, we demonstrate the applicability of our approach in a real genome-wide disease association study. The chosen tSNP panels can be used toward genotype imputation using either a simple regression-based algorithm or more sophisticated genotype imputation methods.
23	Tracing cattle breeds with principal components analysis ancestry informative SNPs. PLoS One 2011;6:e18007. [PMID: 21490966 PMCID: PMC3072384 DOI: 10.1371/journal.pone.0018007] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Received: 10/23/2010] [Accepted: 02/18/2011] [Indexed: 01/09/2023] Abstract: The recent release of the Bovine HapMap dataset represents the most detailed survey of bovine genetic diversity to date, providing an important resource for the design and development of livestock production. We studied this dataset, comprising more than 30,000 Single Nucleotide Polymorphisms (SNPs) for 19 breeds (13 taurine, three zebu, and three hybrid breeds), seeking to identify small panels of genetic markers that can be used to trace the breed of unknown cattle samples. Taking advantage of the power of Principal Components Analysis and algorithms that we have recently described for the selection of Ancestry Informative Markers from genomewide datasets, we present a decision-tree which can be used to accurately infer the origin of individual cattle. In doing so, we present a thorough examination of population genetic structure in modern bovine breeds. Performing extensive cross-validation experiments, we demonstrate that 250-500 carefully selected SNPs suffice in order to achieve close to 100% prediction accuracy of individual ancestry, when this particular set of 19 breeds is considered. Our methods, coupled with the dense genotypic data that is becoming increasingly available, have the potential to become a valuable tool and have considerable impact in worldwide livestock production. They can be used to inform the design of studies of the genetic basis of economically important traits in cattle, as well as breeding programs and efforts to conserve biodiversity. 
Furthermore, the SNPs that we have identified can provide a reliable solution for the traceability of breed-specific branded products. MESH Headings: Algorithms; Animals; Breeding; Cattle; Genotype; Phylogeny; Polymorphism, Single Nucleotide/genetics; Principal Component Analysis/methods.
24	A note on element-wise matrix sparsification via a matrix-valued Bernstein inequality. Inform Process Lett 2011. [DOI: 10.1016/j.ipl.2011.01.010] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 11/25/2022]
25	Atomic-level characterization of the ensemble of the Aβ(1-42) monomer in water using unbiased molecular dynamics simulations and spectral algorithms. J Mol Biol 2010;405:570-83. [PMID: 21056574 DOI: 10.1016/j.jmb.2010.10.015] [Citation(s) in RCA: 186] [Impact Index Per Article: 13.3] [Received: 07/05/2010] [Revised: 10/06/2010] [Accepted: 10/13/2010] [Indexed: 01/05/2023] Abstract: Aβ(1-42) is the highly pathologic isoform of amyloid-β, the peptide constituent of fibrils and neurotoxic oligomers involved in Alzheimer's disease. Recent studies on the structural features of Aβ in water have suggested that the system can be described as an ensemble of distinct conformational species in fast exchange. Here, we use replica exchange molecular dynamics (REMD) simulations to characterize the conformations accessible to Aβ42 in explicit water solvent, under the ff99SB force field. Monitoring the correlation between J-coupling (³J(HN,Hα)) and residual dipolar coupling (RDC) data calculated from the REMD trajectories to their experimental values, as determined by NMR, indicates that the simulations converge towards sampling an ensemble that is representative of the experimental data after 60 ns/replica of simulation time. We further validate the converged MD-derived ensemble through direct comparison with ³J(HN,Hα) and RDC experimental data. Our analysis indicates that the ff99SB-derived REMD ensemble can reproduce the experimental J-coupling values with high accuracy and further provide good agreement with the RDC data. Our results indicate that the peptide is sampling a highly diverse range of conformations: by implementing statistical learning techniques (Laplacian eigenmaps, spectral clustering, and Laplacian scores), we are able to obtain an otherwise hidden structure in the complex conformational space of the peptide. 
Using these methods, we characterize the peptide conformations and extract their intrinsic characteristics, identify a small number of different conformations that characterize the whole ensemble, and identify a small number of protein interactions (such as contacts between the peptide termini) that are the most discriminative of the different conformations and thus can be used in designing experimental probes of transitions between such molecular states. This is a study of an important intrinsically disordered peptide system that provides an atomic-level description of structural features and interactions that are relevant during the early stages of the oligomerization and fibril nucleation pathways.
26	Ancestry informative markers for fine-scale individual assignment to worldwide populations. J Med Genet 2010;47:835-47. [PMID: 20921023 DOI: 10.1136/jmg.2010.078212] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.9] [Indexed: 11/03/2022] Abstract: BACKGROUND AND AIMS: The analysis of large-scale genetic data from thousands of individuals has revealed the fact that subtle population genetic structure can be detected at levels that were previously unimaginable. Using the Human Genome Diversity Panel as reference (51 populations - 650,000 SNPs), this work describes a systematic evaluation of the resolution that can be achieved for the inference of genetic ancestry, even when small panels of genetic markers are used. METHODS AND RESULTS: A comprehensive investigation of human population structure around the world is undertaken by leveraging the power of Principal Components Analysis (PCA). The problem is dissected into hierarchical steps and a decision tree for the prediction of individual ancestry is proposed. A complete leave-one-out validation experiment demonstrates that, using all available SNPs, assignment of individuals to their self-reported populations of origin is essentially perfect. Ancestry informative genetic markers are selected using two different metrics (I_n and correlation with PCA scores). A thorough cross-validation experiment indicates that, in most cases here, the number of SNPs needed for ancestry inference can be successfully reduced to less than 0.1% of the original 650,000 while retaining close to 100% accuracy. This reduction can be achieved using a novel clustering-based redundancy removal algorithm that is also introduced here. Finally, the applicability of our suggested SNP panels is tested on HapMap Phase 3 populations. 
CONCLUSION: The proposed methods and ancestry informative marker panels, in combination with the increasingly more comprehensive databases of human genetic variation, open new horizons in a variety of fields, ranging from the study of human evolution and population history, to medical genetics and forensics.
27	Inferring geographic coordinates of origin for Europeans using small panels of ancestry informative markers. PLoS One 2010;5:e11892. [PMID: 20805874 PMCID: PMC2923600 DOI: 10.1371/journal.pone.0011892] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.3] [Received: 02/26/2010] [Accepted: 06/14/2010] [Indexed: 12/31/2022] Abstract: Recent large-scale studies of European populations have demonstrated the existence of population genetic structure within Europe and the potential to accurately infer individual ancestry when information from hundreds of thousands of genetic markers is used. In fact, when genomewide genetic variation of European populations is projected down to a two-dimensional Principal Components Analysis plot, a surprising correlation with actual geographic coordinates of self-reported ancestry has been reported. This substructure can hamper the search of susceptibility genes for common complex disorders leading to spurious correlations. The identification of genetic markers that can correct for population stratification becomes therefore of paramount importance. Analyzing 1,200 individuals from 11 populations genotyped for more than 500,000 SNPs (Population Reference Sample), we present a systematic exploration of the extent to which geographic coordinates of origin within Europe can be predicted, with small panels of SNPs. Markers are selected to correlate with the top principal components of the dataset, as we have previously demonstrated. Performing thorough cross-validation experiments we show that it is indeed possible to predict individual ancestry within Europe down to a few hundred kilometers from actual individual origin, using information from carefully selected panels of 500 or 1,000 SNPs. 
Furthermore, we show that these panels can be used to correctly assign the HapMap Phase 3 European populations to their geographic origin. The SNPs that we propose can prove extremely useful in a variety of different settings, such as stratification correction or genetic ancestry testing, and the study of the history of European populations. MESH Headings: Algorithms; Databases, Factual; Europe; Genetic Markers/genetics; Geography; Humans; Polymorphism, Single Nucleotide/genetics; Principal Component Analysis; Reproducibility of Results; White People/genetics.
28	PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genet 2007;3:1672-86. [PMID: 17892327 PMCID: PMC1988848 DOI: 10.1371/journal.pgen.0030160] [Citation(s) in RCA: 163] [Impact Index Per Article: 9.6] [Received: 04/04/2007] [Accepted: 08/01/2007] [Indexed: 12/12/2022] Abstract: Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied on genomewide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using less than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. 
We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset. The algorithm that we introduce runs in seconds and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations. Genetic markers can be used to infer population structure, a task that remains a central challenge in many areas of genetics such as population genetics, and the search for susceptibility genes for common disorders. In such settings, it is often desirable to reduce the number of markers needed for structure identification. Existing methods to identify structure informative markers demand prior knowledge of the membership of the studied individuals to predefined populations. In this paper, based on the properties of a powerful dimensionality reduction technique (Principal Components Analysis), we develop a novel algorithm that does not depend on any prior assumptions and can be used to identify a small set of structure informative markers. Our method is very fast even when applied to datasets of hundreds of individuals and millions of markers. We evaluate this method on a large dataset of 11 populations from around the world, as well as data from the HapMap project. We show that, in most cases, we can achieve 99% genotyping savings while at the same time recovering the structure of the studied populations. 
Finally, we show that our algorithm can also be successfully applied for the identification of structure informative markers when studying populations of complex ancestry. MESH Headings: Algorithms; Genetics, Population; Humans; Polymorphism, Single Nucleotide; Principal Component Analysis. Grants: K22CA109351 NCI NIH HHS; U19 AG023122 NIA NIH HHS; R01 HL078885 NHLBI NIH HHS; HL078885 NHLBI NIH HHS; K22 CA109351 NCI NIH HHS; U19 AG23122 NIA NIH HHS.
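The marker-selection idea described in the abstract of entry 28 can be illustrated with a short sketch. This is not the authors' code. It assumes that each SNP is scored by its summed squared loadings on the top principal components and that the highest-scoring SNPs are kept; the abstract does not state the exact criterion. The function name `select_pca_correlated_snps`, its parameters, and the simulated genotype matrix are all hypothetical.

```python
import numpy as np

def select_pca_correlated_snps(genotypes, n_components, n_select):
    """Return the indices of the n_select SNP columns with the largest
    summed squared loadings on the top n_components principal components.

    Assumption: this scoring rule is an illustrative choice, not a
    criterion quoted from the abstract.
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                      # center each SNP column
    # Rows of vt are the principal axes in SNP space.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    scores = (vt[:n_components] ** 2).sum(axis=0)
    top = np.argsort(scores)[::-1][:n_select]
    return np.sort(top), scores

# Simulated data (made up): two groups of 100 individuals, 50 SNPs coded 0/1/2.
rng = np.random.default_rng(0)
n_per_group, n_snps = 100, 50
informative = [3, 17, 28, 35, 42]               # SNPs that differ between groups
freqs_a = np.full(n_snps, 0.5)
freqs_b = np.full(n_snps, 0.5)
freqs_a[informative] = 0.05
freqs_b[informative] = 0.95
group_a = rng.binomial(2, freqs_a, size=(n_per_group, n_snps))
group_b = rng.binomial(2, freqs_b, size=(n_per_group, n_snps))
genotypes = np.vstack([group_a, group_b])

selected, scores = select_pca_correlated_snps(
    genotypes, n_components=1, n_select=len(informative)
)
print("selected SNP indices:", selected)
```

No ancestry labels are passed to the function, which mirrors the unsupervised setting the abstract describes. On real data the number of components and the number of SNPs to keep would need to be chosen with care.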
29	Intra- and interpopulation genotype reconstruction from tagging SNPs. Genome Res 2006;17:96-107. [PMID: 17151345 PMCID: PMC1716273 DOI: 10.1101/gr.5741407] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Indexed: 11/24/2022] Abstract The optimal method to be used for tSNP selection, the applicability of a reference LD map to unassayed populations, and the scalability of these methods to genome-wide analysis, all remain subjects of debate. We propose novel, scalable matrix algorithms that address these issues and we evaluate them on genotypic data from 38 populations and four genomic regions (248 SNPs typed for approximately 2000 individuals). We also evaluate these algorithms on a second data set consisting of genotypes available from the HapMap database (1336 SNPs for four populations) over the same genomic regions. Furthermore, we test these methods in the setting of a real association study using a publicly available family data set. The algorithms we use for tSNP selection and unassayed SNP reconstruction do not require haplotype inference and they are, in principle, scalable even to genome-wide analysis. Moreover, they are greedy variants of recently developed matrix algorithms with provable performance guarantees. Using a small set of carefully selected tSNPs, we achieve very good reconstruction accuracy of "untyped" genotypes for most of the populations studied. Additionally, we demonstrate in a quantitative manner that the chosen tSNPs exhibit substantial transferability, both within and across different geographic regions. Finally, we show that reconstruction can be applied to retrieve significant SNP associations with disease, with important genotyping savings.
MESH Headings: Algorithms; Chromosomes, Human, Pair 17/genetics; Databases, Nucleic Acid; Genotype; Homeodomain Proteins/genetics; Humans; Linkage Disequilibrium; Nerve Tissue Proteins; Organic Anion Transport Protein 1/genetics; Polymorphism, Single Nucleotide; Receptors, Cell Surface; Receptors, Neuropeptide/genetics. Grants: P01 GM057672 NIGMS NIH HHS; GM 57672 NIGMS NIH HHS; NS 40025 NINDS NIH HHS.
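The two steps named in the abstract of entry 29, tSNP selection and reconstruction of untyped SNPs, can be sketched on toy data. This is an illustration under stated assumptions, not the published algorithm: the greedy rule (pick the column with the largest residual norm, then project it out) and the least-squares reconstruction are stand-ins chosen for brevity, and the helper names and the simulated data with perfect linkage inside each block are hypothetical.

```python
import numpy as np

def greedy_select_columns(A, k):
    """Greedily pick up to k columns: take the column with the largest
    residual norm, then project it out of every column (assumed rule)."""
    R = np.asarray(A, dtype=float).copy()
    chosen = []
    for _ in range(k):
        norms = (R ** 2).sum(axis=0)
        j = int(np.argmax(norms))
        if norms[j] <= 1e-12:                   # nothing left to explain
            break
        chosen.append(j)
        q = R[:, j] / np.sqrt(norms[j])
        R -= np.outer(q, q @ R)
    return chosen

def fit_reconstruction(A_train, tsnps):
    """Least-squares map from the tSNP columns to all SNP columns."""
    C = np.asarray(A_train, dtype=float)[:, tsnps]
    coeffs, *_ = np.linalg.lstsq(C, np.asarray(A_train, dtype=float), rcond=None)
    return coeffs

def reconstruct(C_new, coeffs):
    """Predict every SNP from the tSNPs, rounded to genotype codes 0/1/2."""
    return np.clip(np.rint(np.asarray(C_new, dtype=float) @ coeffs), 0, 2).astype(int)

# Toy data (made up): 9 SNPs in 3 blocks; SNPs within a block are identical,
# an unrealistically strong form of linkage disequilibrium.
rng = np.random.default_rng(1)
block_of_snp = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])

def simulate(n_individuals):
    founders = rng.integers(0, 3, size=(n_individuals, 3))
    return founders[:, block_of_snp]

train = simulate(30)    # reference panel typed at all SNPs
test = simulate(10)     # new individuals, typed only at the tSNPs below

tsnps = greedy_select_columns(train, k=3)
coeffs = fit_reconstruction(train, tsnps)
predicted = reconstruct(test[:, tsnps], coeffs)
print("tSNPs:", tsnps, "accuracy:", (predicted == test).mean())
```

Neither step uses haplotype inference, which matches the property the abstract highlights. Real genotype data would not reconstruct perfectly, so accuracy there has to be measured on held-out individuals.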
30	Subspace Sampling and Relative-Error Matrix Approximation: Column-Row-Based Methods. LECTURE NOTES IN COMPUTER SCIENCE 2006. [DOI: 10.1007/11841036_29] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 12/12/2022]
31	Approximating a Gram Matrix for Improved Kernel-Based Learning. LEARNING THEORY 2005. [DOI: 10.1007/11503415_22] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.5] [Indexed: 12/12/2022]
32	Clustering Large Graphs via the Singular Value Decomposition. Mach Learn 2004. [DOI: 10.1023/b:mach.0000033113.59016.96] [Citation(s) in RCA: 249] [Impact Index Per Article: 12.5] [Indexed: 11/12/2022]