1
Hong SC, Muyas F, Cortés-Ciriano I, Hormoz S. scAI-SNP: a method for inferring ancestry from single-cell data. BMC Methods 2025; 2:10. [PMID: 40401145 PMCID: PMC12089154 DOI: 10.1186/s44330-025-00029-4] [Received: 09/26/2024] [Accepted: 05/01/2025] [Indexed: 05/23/2025]
Abstract
Background: Collaborative efforts, such as the Human Cell Atlas, are rapidly accumulating large amounts of single-cell data. To ensure that single-cell atlases are representative of human genetic diversity, we need to determine the ancestry of the donors from whom single-cell data are generated. Self-reporting of race and ethnicity, although important, can be biased and is not always available for the datasets already collected.
Methods: Here, we introduce scAI-SNP, a tool to infer ancestry directly from single-cell genomics data. To train scAI-SNP, we identified 4.5 million ancestry-informative single-nucleotide polymorphisms (SNPs) in the 1000 Genomes Project dataset across 3201 individuals from 26 population groups. For a query single-cell dataset, scAI-SNP uses these ancestry-informative SNPs to compute the contribution of each of the 26 population groups to the ancestry of the donor from whom the cells were obtained.
Results: Using diverse single-cell datasets with matched whole-genome sequencing data, we show that scAI-SNP is robust to the sparsity of single-cell data, can accurately and consistently infer ancestry from samples derived from diverse types of tissues and cancer cells, and can be applied to different modalities of single-cell profiling assays, such as single-cell RNA-seq and single-cell ATAC-seq.
Discussion: Finally, we argue that ensuring that single-cell atlases represent diverse ancestry, ideally alongside race and ethnicity, is ultimately important for improved and equitable health outcomes by accounting for human diversity.
Supplementary Information: The online version contains supplementary material available at 10.1186/s44330-025-00029-4.
Affiliation(s)
- Sung Chul Hong
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215 USA
- Francesc Muyas
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD UK
- Isidro Cortés-Ciriano
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD UK
- Sahand Hormoz
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215 USA
- Department of Systems Biology, Harvard Medical School, Boston, MA 02115 USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
2
Lehmann B, Bräuninger L, Cho Y, Falck F, Jayadeva S, Katell M, Nguyen T, Perini A, Tallman S, Mackintosh M, Silver M, Kuchenbäcker K, Leslie D, Chatterjee N, Holmes C. Methodological opportunities in genomic data analysis to advance health equity. Nat Rev Genet 2025:10.1038/s41576-025-00839-w. [PMID: 40369311 DOI: 10.1038/s41576-025-00839-w] [Accepted: 03/27/2025] [Indexed: 05/16/2025]
Abstract
The causes and consequences of inequities in genomic research and medicine are complex and widespread. However, it is widely acknowledged that underrepresentation of diverse populations in human genetics research risks exacerbating existing health disparities. Efforts to improve diversity are ongoing, but an often-overlooked source of inequity is the choice of analytical methods used to process, analyse and interpret genomic data. This choice can influence all areas of genomic research, from genome-wide association studies and polygenic score development to variant prioritization and functional genomics. New statistical and machine learning techniques to understand, quantify and correct for the impact of biases in genomic data are emerging within the wider genomic research and genomic medicine ecosystems. At this crucial time point, it is important to clarify where improvements in methods and practices can, or cannot, have a role in improving equity in genomics. Here, we review existing approaches to promote equity and fairness in statistical analysis for genomics, and propose future methodological developments that are likely to yield the most impact for equity.
Affiliation(s)
- Brieuc Lehmann
- Department of Statistical Science, University College London, London, UK
- Leandra Bräuninger
- Department of Statistical Science, University College London, London, UK
- The Alan Turing Institute, London, UK
- Yoonsu Cho
- Genomics England, London, UK
- Medical Research Council Integrative Epidemiology Unit, University of Bristol, Bristol, UK
- Fabian Falck
- The Alan Turing Institute, London, UK
- Department of Statistics, University of Oxford, Oxford, UK
- Matt Silver
- Genomics England, London, UK
- Medical Research Council Unit The Gambia at the London School of Hygiene & Tropical Medicine, Banjul, The Gambia
- Karoline Kuchenbäcker
- Genomics England, London, UK
- Division of Psychiatry, University College London, London, UK
- Nilanjan Chatterjee
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
- Chris Holmes
- Department of Statistics, University of Oxford, Oxford, UK
- Nuffield Department of Medicine, University of Oxford, Oxford, UK
3
Smith LA, Cahill JA, Lee JH, Graim K. Equitable machine learning counteracts ancestral bias in precision medicine. Nat Commun 2025; 16:2144. [PMID: 40064867 PMCID: PMC11894161 DOI: 10.1038/s41467-025-57216-8] [Received: 02/12/2024] [Accepted: 02/05/2025] [Indexed: 03/14/2025]
Abstract
Gold standard genomic datasets severely under-represent non-European populations, leading to inequities and a limited understanding of human disease. Therapeutics and outcomes remain hidden because we lack insights that could be gained from analyzing ancestrally diverse genomic data. To address this significant gap, we present PhyloFrame, a machine learning method for equitable genomic precision medicine. PhyloFrame corrects for ancestral bias by integrating functional interaction networks and population genomics data with transcriptomic training data. Application of PhyloFrame to breast, thyroid, and uterine cancers shows marked improvements in predictive power across all ancestries, less model overfitting, and a higher likelihood of identifying known cancer-related genes. Validation in fourteen ancestrally diverse datasets demonstrates that PhyloFrame is better able to adjust for ancestry bias across all populations. The ability to provide accurate predictions for underrepresented groups, in particular, is substantially increased. Analysis of performance in the most diverse continental ancestry group, African, illustrates how phylogenetic distance from training data negatively impacts model performance, as well as PhyloFrame's capacity to mitigate these effects. These results demonstrate how equitable artificial intelligence (AI) approaches can mitigate ancestral bias in training data and contribute to equitable representation in medical research.
Affiliation(s)
- Leslie A Smith
- Department of Computer & Information Science & Engineering, University of Florida, 1889 Museum Rd, Gainesville, 32611, FL, USA
- James A Cahill
- Environmental Engineering Sciences Department, University of Florida, 365 Weil Hall, Gainesville, 32611, FL, USA
- UF Genetics Institute, University of Florida, 2033 Mowry Rd, Gainesville, 32610, FL, USA
- Ji-Hyun Lee
- Department of Biostatistics, University of Florida, 2004 Mowry Rd, Gainesville, 32603, FL, USA
- UF Health Cancer Center, University of Florida, 2033 Mowry Rd, Gainesville, 32610, FL, USA
- Kiley Graim
- Department of Computer & Information Science & Engineering, University of Florida, 1889 Museum Rd, Gainesville, 32611, FL, USA
- UF Genetics Institute, University of Florida, 2033 Mowry Rd, Gainesville, 32610, FL, USA
- UF Health Cancer Center, University of Florida, 2033 Mowry Rd, Gainesville, 32610, FL, USA
4
Stoneman HR, Price AM, Trout NS, Lamont R, Tifour S, Pozdeyev N, Crooks K, Lin M, Rafaels N, Gignoux CR, Marker KM, Hendricks AE. Characterizing substructure via mixture modeling in large-scale genetic summary statistics. Am J Hum Genet 2025; 112:235-253. [PMID: 39824191 PMCID: PMC11866976 DOI: 10.1016/j.ajhg.2024.12.007] [Received: 06/12/2024] [Revised: 12/09/2024] [Accepted: 12/09/2024] [Indexed: 01/20/2025]
Abstract
Genetic summary data are broadly accessible and highly useful, including for risk prediction, causal inference, fine mapping, and incorporation of external controls. However, collapsing individual-level data into summary data, such as allele frequencies, masks intra- and inter-sample heterogeneity, leading to confounding, reduced power, and bias. Ultimately, unaccounted-for substructure limits summary data usability, especially for understudied or admixed populations. There is a need for methods to enable the harmonization of summary data where the underlying substructure is matched between datasets. Here, we present Summix2, a comprehensive set of methods and software based on a computationally efficient mixture model to enable the harmonization of genetic summary data by estimating and adjusting for substructure. In extensive simulations and application to public data, we show that Summix2 characterizes finer-scale population structure, identifies ascertainment bias, and scans for potential regions of selection due to local substructure deviation. Summix2 increases the robust use of diverse, publicly available summary data, resulting in improved and more equitable research.
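As an illustration of the general idea of estimating substructure from summary data, the sketch below fits observed allele frequencies as a mixture of reference-group allele frequencies, with proportions constrained to be non-negative and to sum to one. This is not the Summix2 implementation: the least-squares objective, the projected-gradient solver, and the names `project_to_simplex`, `estimate_proportions`, `reference_af`, and `observed_af` are all assumptions made for this example, and the data are simulated.

```python
import random


def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) == 1}."""
    u = sorted(v, reverse=True)
    running_sum = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        running_sum += u_j
        candidate = (running_sum - 1.0) / j
        if u_j - candidate > 0:
            theta = candidate  # the last j that passes gives the final shift
    return [max(x - theta, 0.0) for x in v]


def estimate_proportions(observed_af, reference_af, steps=1000):
    """Least-squares mixture proportions via projected gradient descent.

    reference_af[i][k] is the allele frequency of SNP i in reference group k
    (hypothetical layout chosen for this sketch).
    """
    n_snps = len(observed_af)
    n_groups = len(reference_af[0])
    # Allele frequencies lie in [0, 1], so the gradient's Lipschitz constant
    # is at most 2 * n_groups; this step size is therefore conservative.
    step_size = 1.0 / (2.0 * n_groups)
    p = [1.0 / n_groups] * n_groups  # start from equal proportions
    for _ in range(steps):
        residual = [
            sum(reference_af[i][k] * p[k] for k in range(n_groups)) - observed_af[i]
            for i in range(n_snps)
        ]
        gradient = [
            2.0 * sum(reference_af[i][k] * residual[i] for i in range(n_snps)) / n_snps
            for k in range(n_groups)
        ]
        p = project_to_simplex(
            [p[k] - step_size * gradient[k] for k in range(n_groups)]
        )
    return p


# Simulated, noise-free toy data: 100 SNPs, 3 reference groups.
rng = random.Random(0)
reference_af = [[rng.random() for _ in range(3)] for _ in range(100)]
true_p = [0.6, 0.3, 0.1]
observed_af = [sum(a * w for a, w in zip(row, true_p)) for row in reference_af]

estimate = estimate_proportions(observed_af, reference_af)
print([round(x, 3) for x in estimate])  # should be close to true_p
```

With real summary statistics the observed frequencies are noisy and the reference panel may not contain every contributing group, so the recovered proportions would only approximate the true mixture.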
Affiliation(s)
- Hayley R Stoneman
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Adelle M Price
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
- Nikole Scribner Trout
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
- Riley Lamont
- Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
- Souha Tifour
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
- Nikita Pozdeyev
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Division of Endocrinology, Diabetes and Metabolism, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Kristy Crooks
- Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Department of Pathology, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Meng Lin
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Nicholas Rafaels
- Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Christopher R Gignoux
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Katie M Marker
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Audrey E Hendricks
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA; Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA; Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
5
Avila MN, Jung S, Satterstrom FK, Fu JM, Levy T, Sloofman LG, Klei L, Pichardo T, Stevens CR, Cusick CM, Ames JL, Campos GS, Cerros H, Chaskel R, Costa CIS, Cuccaro ML, Del Pilar Lopez A, Fernandez M, Ferro E, Galeano L, Girardi ACDES, Griswold AJ, Hernandez LC, Lourenço N, Ludena Y, Nuñez DL, Oyama R, Peña KP, Pessah I, Schmidt R, Sweeney HM, Tolentino L, Wang JYT, Albores-Gallo L, Croen LA, Cruz-Fuentes CS, Hertz-Picciotto I, Kolevzon A, Lattig MC, Mayo L, Passos-Bueno MR, Pericak-Vance MA, Siper PM, Tassone F, Trelles MP, Talkowski ME, Daly MJ, Mahjani B, De Rubeis S, Cook EH, Roeder K, Betancur C, Devlin B, Buxbaum JD. Deleterious coding variation associated with autism is consistent across populations, as exemplified by admixed Latin American populations. medRxiv 2025:2024.12.27.24319460. [PMID: 39830258 PMCID: PMC11741445 DOI: 10.1101/2024.12.27.24319460] [Indexed: 01/22/2025]
Abstract
The past decade has seen remarkable progress in identifying genes that, when impacted by deleterious coding variation, confer high risk for autism spectrum disorder (ASD), intellectual disability, and other developmental disorders. However, most underlying gene discovery efforts have focused on individuals of European ancestry, limiting insights into genetic risks across diverse populations. To help address this, the Genomics of Autism in Latin American Ancestries Consortium (GALA) was formed, presenting here the largest sequencing study of ASD in Latin American individuals (n>15,000). We identified 35 genome-wide significant (FDR < 0.05) ASD risk genes, with substantial overlap with findings from European cohorts, and highly constrained genes showing consistent signal across populations. The results provide support for emerging (e.g., MARK2, YWHAG, PACS1, RERE, SPEN, GSE1, GLS, TNPO3, ANKRD17) and established ASD genes, and for the utility of genetic testing approaches for deleterious variants in diverse populations, while also demonstrating the ongoing need for more inclusive genetic research and testing. We conclude that the biology of ASD is universal and not impacted to any detectable degree by ancestry.
Affiliation(s)
- Marina Natividad Avila
- Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- The Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Seulgi Jung
- Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- The Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- F Kyle Satterstrom
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
- Jack M Fu
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Center for Genomic Medicine, Department of Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
- Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
- Tess Levy
- Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- The Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Laura G Sloofman
- Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- The Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Lambertus Klei
- Department of Psychiatry, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, USA
- Thariana Pichardo
- Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- The Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Christine R Stevens
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
- Caroline M Cusick
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Jennifer L Ames
- Division of Research, Kaiser Permanente Northern, Pleasanton, California, USA
- Gabriele S Campos
- Centro de Estudos do Genoma Humano e Celulas-Tronco, Departamento de Genetica e Biologia Evolutiva, Biociência, Universidade de São Paulo, São Paulo, Brasil
- Hilda Cerros
- Division of Research, Kaiser Permanente Northern, Pleasanton, California, USA
- Roberto Chaskel
- Facultad de Medicina, Universidad de los Andes, Bogota, Colombia
- Instituto Colombiano del Sistema Nervioso, Clinica Montserrat, Bogota, Colombia
- Claudia I S Costa
- Centro de Estudos do Genoma Humano e Celulas-Tronco, Departamento de Genetica e Biologia Evolutiva, Biociência, Universidade de São Paulo, São Paulo, Brasil
- Michael L Cuccaro
- John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, Florida, USA
- The Dr. John T. Macdonald Foundation Department of Human Genetics, University of Miami Miller School of Medicine, Miami, Florida, USA
- Magdalena Fernandez
- Instituto Colombiano del Sistema Nervioso, Clinica Montserrat, Bogota, Colombia
- Eugenio Ferro
- Instituto Colombiano del Sistema Nervioso, Clinica Montserrat, Bogota, Colombia
- Liliana Galeano
- Facultad de Ciencias, Universidad de los Andes, Bogotá, Colombia
- Ana Cristina D E S Girardi
- Centro de Estudos do Genoma Humano e Celulas-Tronco, Departamento de Genetica e Biologia Evolutiva, Biociência, Universidade de São Paulo, São Paulo, Brasil
- Anthony J Griswold
- John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, Florida, USA
- The Dr. John T. Macdonald Foundation Department of Human Genetics, University of Miami Miller School of Medicine, Miami, Florida, USA
- Luis C Hernandez
- Facultad de Ciencias, Universidad de los Andes, Bogotá, Colombia
- Naila Lourenço
- Centro de Estudos do Genoma Humano e Celulas-Tronco, Departamento de Genetica e Biologia Evolutiva, Biociência, Universidade de São Paulo, São Paulo, Brasil
- Yunin Ludena
- MIND (Medical Investigation of Neurodevelopmental Disorders) Institute, University of California Davis, Davis, California, USA
- Diana L Nuñez
- Department of Psychiatry, Yale University School of Medicine, New Haven, Connecticut, USA
- National Center of Posttraumatic Stress Disorders, VA CT Healthcare Center, West Haven, Connecticut, USA
- Rosa Oyama
- Centro Ann Sullivan del Peru, Lima, Peru
- Katherine P Peña
- Facultad de Ciencias, Universidad de los Andes, Bogotá, Colombia
- Isaac Pessah
- MIND (Medical Investigation of Neurodevelopmental Disorders) Institute, University of California Davis, Davis, California, USA
- Rebecca Schmidt
- MIND (Medical Investigation of Neurodevelopmental Disorders) Institute, University of California Davis, Davis, California, USA
- Jaqueline Y T Wang
- Centro de Estudos do Genoma Humano e Celulas-Tronco, Departamento de Genetica e Biologia Evolutiva, Biociência, Universidade de São Paulo, São Paulo, Brasil
- Lilia Albores-Gallo
- Hospital Psiquiátrico Infantil Dr. Juan N. Navarro, Ciudad de México, Mexico
- Universidad Nacional Autónoma de México, Ciudad de México, Mexico
- Lisa A Croen
- Division of Research, Kaiser Permanente Northern, Pleasanton, California, USA
- Kaiser Permanente School of Medicine, Pasadena, California, USA
- Carlos S Cruz-Fuentes
- Departamento de Genética, Subdirección de Investigaciones Clínicas, Instituto Nacional de Psiquiatría Ramón de la Fuente Muñiz México, Ciudad de México, Mexico
- Irva Hertz-Picciotto
- MIND (Medical Investigation of Neurodevelopmental Disorders) Institute, University of California Davis, Davis, California, USA
- Alexander Kolevzon
- Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Pediatrics, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Maria C Lattig
- Facultad de Ciencias, Universidad de los Andes, Bogotá, Colombia
- Maria Rita Passos-Bueno
- Centro de Estudos do Genoma Humano e Celulas-Tronco, Departamento de Genetica e Biologia Evolutiva, Biociência, Universidade de São Paulo, São Paulo, Brasil
- Margaret A Pericak-Vance
- John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, Florida, USA
- The Dr. John T. Macdonald Foundation Department of Human Genetics, University of Miami Miller School of Medicine, Miami, Florida, USA
- Paige M Siper
- Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Flora Tassone
- MIND (Medical Investigation of Neurodevelopmental Disorders) Institute, University of California Davis, Davis, California, USA
- Department of Biochemistry and Molecular Medicine, University of California Davis, School of Medicine, Davis, California, USA
- M Pilar Trelles
- Psychiatry and Behavioral Sciences, Boston Children's Hospital, Boston, Massachusetts, USA
- Michael E Talkowski
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Center for Genomic Medicine, Department of Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
- Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
- Program in Bioinformatics and Integrative Genomics, Harvard Medical School, Boston, Massachusetts, USA
- Mark J Daly
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
- Center for Genomic Medicine, Department of Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
- Department of Medicine, Harvard Medical School, Boston, Massachusetts, USA
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Behrang Mahjani
- Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- The Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm
- Silvia De Rubeis
- Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- The Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- The Alper Center for Neural Development and Regeneration, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Edwin H Cook
- Department of Psychiatry, University of Illinois Chicago, Chicago, Illinois, USA
- Kathryn Roeder
- Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Catalina Betancur
- Sorbonne Université, INSERM, CNRS, Neuroscience Paris Seine, Institut de Biologie Paris Seine, Paris, France
- Bernie Devlin
- Department of Psychiatry, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, USA
- Joseph D Buxbaum
- Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- The Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
6
Schneider K, Chowdhury M, Tepper M, Khan J, Shortt JA, Gignoux C, Layer R. GenoSiS: A Biobank-Scale Genotype Similarity Search Architecture for Creating Dynamic Patient-Match Cohorts. bioRxiv 2024:2024.11.02.621671. [PMID: 39554195 PMCID: PMC11565994 DOI: 10.1101/2024.11.02.621671] [Indexed: 11/19/2024]
Abstract
Many patients do not experience optimal benefits from medical advances because clinical research does not adequately represent them. While the diversity of biomedical research cohorts is improving, ensuring that individual patients are adequately represented remains challenging. We propose a new approach, GenoSiS, which leverages machine learning-based similarity search to dynamically find patient-matched cohorts across different populations quickly. These cohorts could serve as reference cohorts to improve a range of clinical analyses, including disease risk score calculations and dosage decisions. While GenoSiS focuses on finding genetic similarity within a biobank, our similarity search architecture can be extended to represent other medically relevant patient characteristics and search other biobanks.
7
Dennis T, Lee D. ZMIX: estimating ancestry proportions using GWAS association Z-scores. Bioinformatics Advances 2024; 4:vbae128. [PMID: 39664860 PMCID: PMC11632184 DOI: 10.1093/bioadv/vbae128] [Received: 03/14/2024] [Revised: 07/19/2024] [Accepted: 08/27/2024] [Indexed: 12/13/2024]
Abstract
Motivation: With larger and more diverse studies becoming the standard in genome-wide association studies (GWAS), accurate estimation of ancestral proportions is increasingly important for summary-statistics-based methods such as those for imputing association summary statistics, adjusting allele frequencies (AFs) for ancestry, and prioritizing disease candidate variants or genes. Existing methods for estimating ancestral proportions in GWAS rely on the availability of study reference AFs, which are often inaccessible in current GWAS due to privacy concerns.
Results: In this study, we propose ZMIX (Z-score-based estimation of ethnic MIXing proportions), a novel method for estimating ethnic mixing proportions in GWAS using only association Z-scores, and we compare its performance to existing reference AF-based methods in both real-world and simulated GWAS settings. We found that ZMIX offered comparable results to the reference AF-based methods in simulation and real-world studies. When applied to summary-statistics imputation, all three methods produced high-quality imputations with almost identical results.
Availability and implementation: https://github.com/statsleelab/gauss.
Affiliation(s)
- Trent Dennis
  - Department of Statistics, Miami University, Oxford, OH 45056, United States
  - Winton Hill Business Center, P&G, Cincinnati, OH 45232, United States
- Donghyung Lee
  - Department of Statistics, Miami University, Oxford, OH 45056, United States
|
8
|
Xu H, Zhang G, Chen J. A novel method for cell deconvolution using DNA methylation in PCA space. BMC Genomics 2024; 25:798. [PMID: 39179972 PMCID: PMC11344294 DOI: 10.1186/s12864-024-10652-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2023] [Accepted: 07/22/2024] [Indexed: 08/26/2024]
Abstract
BACKGROUND: In this study, we present a novel method for reference-based cell deconvolution using data from DNA methylation arrays. Unlike existing methods such as IDOL-Ext, which operate on probe-level data, our approach represents features in the principal component analysis (PCA) space for cell type deconvolution. RESULTS: Our method's accuracy in estimating cell compositions is validated across various public datasets, including blood samples from glioma patients. It demonstrates precision comparable to IDOL-Ext, with R² values ranging from 0.73 to 0.99 for most cell types, while offering improved discrimination between similar cell types, particularly T cell subtypes in glioma patient samples (R² 0.42-0.75 vs. 0.36-0.66 for IDOL-Ext). However, both methods showed lower accuracy for certain cell types, such as memory CD8 T cells in glioma patients (R² 0.42 vs. 0.36 for IDOL-Ext), highlighting the challenges in distinguishing closely related cell populations. We have made this method available as an R package "BloodCellDecon" on GitHub. CONCLUSIONS: Our study confirms the efficacy of cell type deconvolution in PCA space. The results indicate wide-ranging applicability and potential for adaptation to other forms of genomic data.
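To make the idea of deconvolution in PCA space concrete, the following is a minimal sketch and not the BloodCellDecon implementation. It assumes a reference matrix of mean methylation values per cell type and a bulk sample that is an exact mixture of those profiles; the function name `deconvolve_in_pca_space` and its parameters are hypothetical, and plain least squares with clipping stands in for whatever solver the authors use.

```python
import numpy as np

def deconvolve_in_pca_space(reference, sample, n_components=None):
    """Estimate cell-type proportions after projecting onto principal axes.

    reference: (n_probes, n_cell_types) mean methylation per cell type.
    sample:    (n_probes,) bulk methylation profile.
    This is an illustrative sketch, not the published method.
    """
    n_types = reference.shape[1]
    # Center each probe on its mean across the reference cell types.
    mean = reference.mean(axis=1, keepdims=True)
    centered = reference - mean
    # Left singular vectors are the principal axes in probe space.
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    k = n_components or (n_types - 1)
    axes = u[:, :k]
    # Coordinates of the reference profiles and the sample in PCA space.
    ref_scores = axes.T @ centered
    sample_scores = axes.T @ (sample - mean[:, 0])
    # An extra row of ones asks the proportions to sum to one.
    design = np.vstack([ref_scores, np.ones(n_types)])
    target = np.append(sample_scores, 1.0)
    props, *_ = np.linalg.lstsq(design, target, rcond=None)
    # Clip small negative values and renormalize.
    props = np.clip(props, 0.0, None)
    return props / props.sum()

# Usage on synthetic data: a noise-free mixture of four reference profiles.
rng = np.random.default_rng(1)
reference = rng.beta(0.5, 0.5, size=(500, 4))
truth = np.array([0.5, 0.2, 0.2, 0.1])
estimate = deconvolve_in_pca_space(reference, reference @ truth)
```

Working in the projected space reduces hundreds of probes to a handful of coordinates, so the final solve is a small linear system; with real, noisy data a properly constrained solver would likely be preferable to clipping.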
Affiliation(s)
- Huan Xu
  - Division of Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
- Ge Zhang
  - Division of Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
  - Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH, USA
  - Center for Prevention of Preterm Birth, Perinatal Institute, Cincinnati Children's Hospital Medical Center and March of Dimes Prematurity Research Center Ohio Collaborative, Cincinnati, OH, USA
- Jing Chen
  - Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH, USA
  - Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
|
9
|
Hong SC, Muyas F, Cortés-Ciriano I, Hormoz S. scAI-SNP: a method for inferring ancestry from single-cell data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.14.594208. [PMID: 38798590 PMCID: PMC11118306 DOI: 10.1101/2024.05.14.594208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Collaborative efforts, such as the Human Cell Atlas, are rapidly accumulating large amounts of single-cell data. To ensure that single-cell atlases are representative of human genetic diversity, we need to determine the ancestry of the donors from whom single-cell data are generated. Self-reporting of race and ethnicity, although important, can be biased and is not always available for the datasets already collected. Here, we introduce scAI-SNP, a tool to infer ancestry directly from single-cell genomics data. To train scAI-SNP, we identified 4.5 million ancestry-informative single-nucleotide polymorphisms (SNPs) in the 1000 Genomes Project dataset across 3201 individuals from 26 population groups. For a query single-cell data set, scAI-SNP uses these ancestry-informative SNPs to compute the contribution of each of the 26 population groups to the ancestry of the donor from whom the cells were obtained. Using diverse single-cell data sets with matched whole-genome sequencing data, we show that scAI-SNP is robust to the sparsity of single-cell data, can accurately and consistently infer ancestry from samples derived from diverse types of tissues and cancer cells, and can be applied to different modalities of single-cell profiling assays, such as single-cell RNA-seq and single-cell ATAC-seq. Finally, we argue that ensuring that single-cell atlases represent diverse ancestry, ideally alongside race and ethnicity, is ultimately important for improved and equitable health outcomes by accounting for human diversity.
Affiliation(s)
- Sung Chul Hong
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Francesc Muyas
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
- Isidro Cortés-Ciriano
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
- Sahand Hormoz
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
|
10
|
Stoneman HR, Price A, Trout NS, Lamont R, Tifour S, Pozdeyev N, Crooks K, Lin M, Rafaels N, Gignoux CR, Marker KM, Hendricks AE. Characterizing substructure via mixture modeling in large-scale genetic summary statistics. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.29.577805. [PMID: 38766180 PMCID: PMC11100604 DOI: 10.1101/2024.01.29.577805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
Genetic summary data are broadly accessible and highly useful, including for risk prediction, causal inference, fine mapping, and incorporation of external controls. However, collapsing individual-level data into groups masks intra- and inter-sample heterogeneity, leading to confounding, reduced power, and bias. Ultimately, unaccounted substructure limits summary data usability, especially for understudied or admixed populations. Here, we present Summix2, a comprehensive set of methods and software based on a computationally efficient mixture model to estimate and adjust for substructure in genetic summary data. In extensive simulations and application to public data, Summix2 characterizes finer-scale population structure, identifies ascertainment bias, and identifies potential regions of selection due to local substructure deviation. Summix2 increases the robust use of diverse publicly available summary data, resulting in improved and more equitable research.
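As an illustration of what a mixture model over summary data can look like, the sketch below treats observed allele frequencies as a weighted sum of reference-group allele frequencies and estimates the weights by least squares on the probability simplex. This is a generic formulation under stated assumptions, not the Summix2 code: the function names `project_to_simplex` and `estimate_proportions` are hypothetical, and projected gradient descent is used here only because it needs nothing beyond NumPy.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def estimate_proportions(ref_af, obs_af, n_iter=2000):
    """Estimate mixing proportions of reference groups.

    ref_af: (n_snps, n_groups) reference allele frequencies.
    obs_af: (n_snps,) allele frequencies observed in the summary data.
    Minimizes 0.5 * ||ref_af @ pi - obs_af||^2 subject to pi on the simplex.
    """
    n_groups = ref_af.shape[1]
    pi = np.full(n_groups, 1.0 / n_groups)
    # Step size 1/L, where L is the squared largest singular value.
    step = 1.0 / np.linalg.norm(ref_af, 2) ** 2
    for _ in range(n_iter):
        grad = ref_af.T @ (ref_af @ pi - obs_af)
        pi = project_to_simplex(pi - step * grad)
    return pi

# Usage on synthetic data: three reference groups, noise-free mixture.
rng = np.random.default_rng(0)
ref = rng.uniform(0.05, 0.95, size=(2000, 3))
truth = np.array([0.6, 0.3, 0.1])
estimate = estimate_proportions(ref, ref @ truth)
```

The simplex constraint keeps every estimate interpretable as a proportion; with real data, sampling noise in the allele frequencies means the recovered weights are approximate rather than exact.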
Affiliation(s)
- Hayley R Stoneman
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Adelle Price
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
- Nikole Scribner Trout
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
- Riley Lamont
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
- Souha Tifour
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
- Nikita Pozdeyev
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Division of Endocrinology, Diabetes and Metabolism, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Kristy Crooks
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Department of Pathology, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Meng Lin
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Nicholas Rafaels
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Christopher R Gignoux
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Katie M Marker
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Audrey E Hendricks
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO 80204, USA
|
11
|
Lee D, Bacanu SA. GAUSS: a summary-statistics-based R package for accurate estimation of linkage disequilibrium for variants, Gaussian imputation, and TWAS analysis of cosmopolitan cohorts. Bioinformatics 2024; 40:btae203. [PMID: 38632050 PMCID: PMC11052653 DOI: 10.1093/bioinformatics/btae203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 02/25/2024] [Accepted: 04/16/2024] [Indexed: 04/19/2024]
Abstract
MOTIVATION: As the availability of larger and more ethnically diverse reference panels grows, there is an increase in demand for ancestry-informed imputation of genome-wide association studies (GWAS), and other downstream analyses, e.g. fine-mapping. Performing such analyses at the genotype level is computationally challenging and necessitates, at best, a laborious process to access individual-level genotype and phenotype data. Summary-statistics-based tools, not requiring individual-level data, provide an efficient alternative that streamlines computational requirements and promotes open science by simplifying the re-analysis and downstream analysis of existing GWAS summary data. However, existing tools perform only disparate parts of needed analysis, have only command-line interfaces, and are difficult to extend/link by applied researchers. RESULTS: To address these challenges, we present Genome Analysis Using Summary Statistics (GAUSS), a comprehensive and user-friendly R package designed to facilitate the re-analysis/downstream analysis of GWAS summary statistics. GAUSS offers an integrated toolkit for a range of functionalities, including (i) estimating ancestry proportion of study cohorts, (ii) calculating ancestry-informed linkage disequilibrium, (iii) imputing summary statistics of unobserved variants, (iv) conducting transcriptome-wide association studies, and (v) correcting for "Winner's Curse" biases. Notably, GAUSS utilizes an expansive, multi-ethnic reference panel consisting of 32 953 genomes from 29 ethnic groups. This panel enhances the range and accuracy of imputable variants, including the ability to impute summary statistics of rarer variants. As a result, GAUSS elevates the quality and applicability of existing GWAS analyses without requiring access to subject-level genotypic and phenotypic information.
AVAILABILITY AND IMPLEMENTATION: The GAUSS R package, complete with its source code, is readily accessible to the public via our GitHub repository at https://github.com/statsleelab/gauss. To further assist users, we provided illustrative use-case scenarios that are conveniently found at https://statsleelab.github.io/gauss/, along with a comprehensive user guide detailed in Supplementary Text S1.
Affiliation(s)
- Donghyung Lee
  - Department of Statistics, Miami University, Oxford, OH 45056, United States
- Silviu-Alin Bacanu
  - Department of Psychiatry, Virginia Commonwealth University, Richmond, VA 23298, United States
|
12
|
Artomov M, Loboda AA, Artyomov MN, Daly MJ. Public platform with 39,472 exome control samples enables association studies without genotype sharing. Nat Genet 2024; 56:327-335. [PMID: 38200129 PMCID: PMC10864173 DOI: 10.1038/s41588-023-01637-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2018] [Accepted: 12/01/2023] [Indexed: 01/12/2024]
Abstract
Acquiring a sufficiently powered cohort of control samples matched to a case sample can be time-consuming or, in some cases, impossible. Accordingly, an ability to leverage genetic data from control samples that were already collected elsewhere could dramatically improve power in genetic association studies. Sharing of control samples can pose significant challenges, since most human genetic data are subject to strict sharing regulations. Here, using the properties of singular value decomposition and a subsampling algorithm, we developed a method allowing selection of the best-matching controls in an external pool of samples compliant with personal data protection and eliminating the need for genotype sharing. We provide access to a library of 39,472 exome sequencing controls at http://dnascore.net, enabling association studies for case cohorts lacking control subjects. Using this approach, control sets can be selected from this online library with a prespecified matching accuracy, ensuring well-calibrated association analysis for both rare and common variants.
Affiliation(s)
- Mykyta Artomov
  - Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA
  - Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Broad Institute, Cambridge, MA, USA
- Alexander A Loboda
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Broad Institute, Cambridge, MA, USA
  - ITMO University, St. Petersburg, Russia
  - Almazov National Medical Research Center, St. Petersburg, Russia
- Maxim N Artyomov
  - Department of Immunology and Pathology, Washington University in St. Louis, St. Louis, MO, USA
- Mark J Daly
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Broad Institute, Cambridge, MA, USA
  - Institute for Molecular Medicine Finland, Helsinki, Finland
|
13
|
Smith LA, Cahill JA, Graim K. Equitable machine learning counteracts ancestral bias in precision medicine, improving outcomes for all. RESEARCH SQUARE 2023:rs.3.rs-3168446. [PMID: 37546907 PMCID: PMC10402189 DOI: 10.21203/rs.3.rs-3168446/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
Gold standard genomic datasets severely under-represent non-European populations, leading to inequities and a limited understanding of human disease [1-8]. Therapeutics and outcomes remain hidden because we lack insights that we could gain from analyzing ancestry-unbiased genomic data. To address this significant gap, we present PhyloFrame, the first-ever machine learning method for equitable genomic precision medicine. PhyloFrame corrects for ancestral bias by integrating big data tissue-specific functional interaction networks, global population variation data, and disease-relevant transcriptomic data. Application of PhyloFrame to breast, thyroid, and uterine cancers shows marked improvements in predictive power across all ancestries, less model overfitting, and a higher likelihood of identifying known cancer-related genes. The ability to provide accurate predictions for underrepresented groups, in particular, is substantially increased. These results demonstrate how AI can mitigate ancestral bias in training data and contribute to equitable representation in medical research.
Affiliation(s)
- Leslie A Smith
  - Department of Computer & Information Science & Engineering, University of Florida, 432 Newell Dr, Gainesville, 32611, FL, USA
- James A Cahill
  - Environmental Engineering Sciences Department, University of Florida, 432 Newell Dr, Gainesville, 32611, FL, USA
- Kiley Graim
  - Department of Computer & Information Science & Engineering, University of Florida, 432 Newell Dr, Gainesville, 32611, FL, USA
|
14
|
Wojcik GL, Murphy J, Edelson JL, Gignoux CR, Ioannidis AG, Manning A, Rivas MA, Buyske S, Hendricks AE. Opportunities and challenges for the use of common controls in sequencing studies. Nat Rev Genet 2022; 23:665-679. [PMID: 35581355 PMCID: PMC9765323 DOI: 10.1038/s41576-022-00487-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Accepted: 03/22/2022] [Indexed: 01/02/2023]
Abstract
Genome-wide association studies using large-scale genome and exome sequencing data have become increasingly valuable in identifying associations between genetic variants and disease, transforming basic research and translational medicine. However, this progress has not been equally shared across all people and conditions, in part due to limited resources. Leveraging publicly available sequencing data as external common controls, rather than sequencing new controls for every study, can better allocate resources by augmenting control sample sizes or providing controls where none existed. However, common control studies must be carefully planned and executed as even small differences in sample ascertainment and processing can result in substantial bias. Here, we discuss challenges and opportunities for the robust use of common controls in high-throughput sequencing studies, including study design, quality control and statistical approaches. Thoughtful generation and use of large and valuable genetic sequencing data sets will enable investigation of a broader and more representative set of conditions, environments and genetic ancestries than otherwise possible.
Affiliation(s)
- Genevieve L Wojcik
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Jessica Murphy
  - Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, USA
- Jacob L Edelson
  - Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, USA
- Christopher R Gignoux
  - Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA
  - Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Alexander G Ioannidis
  - Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, USA
  - Clinical and Translational Epidemiology Unit, Massachusetts General Hospital, Boston, MA, USA
- Alisa Manning
  - Metabolism Program, Broad Institute, Cambridge, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
- Manuel A Rivas
  - Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, USA
- Steven Buyske
  - Department of Statistics, Rutgers University, Piscataway, NJ, USA
- Audrey E Hendricks
  - Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA
  - Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, USA
  - Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
|
15
|
Privé F. Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics. Bioinformatics 2022; 38:3477-3480. [PMID: 35604078 PMCID: PMC9237724 DOI: 10.1093/bioinformatics/btac348] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 01/03/2022] [Revised: 05/11/2022] [Accepted: 05/18/2022] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Measuring genetic diversity is an important problem because increasing genetic diversity is a key to making new genetic discoveries, while also being a major source of confounding to be aware of in genetics studies. RESULTS: Using the UK Biobank data, a prospective cohort study with deep genetic and phenotypic data collected on almost 500 000 individuals from across the UK, we carefully define 21 distinct ancestry groups from all four corners of the world. These ancestry groups can serve as a global reference of worldwide populations, with a handful of applications. Here, we develop a method that uses allele frequencies and principal components derived from these ancestry groups to effectively measure ancestry proportions from allele frequencies of any genetic dataset. AVAILABILITY AND IMPLEMENTATION: This method is implemented in function snp_ancestry_summary of R package bigsnpr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Florian Privé
  - National Centre for Register-based Research, Aarhus University, Aarhus 8210, Denmark
|
16
|
Balagué-Dobón L, Cáceres A, González JR. Fully exploiting SNP arrays: a systematic review on the tools to extract underlying genomic structure. Brief Bioinform 2022; 23:bbac043. [PMID: 35211719 PMCID: PMC8921734 DOI: 10.1093/bib/bbac043] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 11/19/2021] [Revised: 01/25/2022] [Accepted: 01/28/2022] [Indexed: 12/12/2022]
Abstract
Single nucleotide polymorphisms (SNPs) are the most abundant type of genomic variation and the most accessible to genotype in large cohorts. However, they individually explain a small proportion of phenotypic differences between individuals. Ancestry, collective SNP effects, structural variants, somatic mutations or even differences in historic recombination can potentially explain a high percentage of genomic divergence. These genetic differences can be infrequent or laborious to characterize; however, many of them leave distinctive marks on the SNPs across the genome allowing their study in large population samples. Consequently, several methods have been developed over the last decade to detect and analyze different genomic structures using SNP arrays, to complement genome-wide association studies and determine the contribution of these structures to explain the phenotypic differences between individuals. We present an up-to-date collection of available bioinformatics tools that can be used to extract relevant genomic information from SNP array data including population structure and ancestry; polygenic risk scores; identity-by-descent fragments; linkage disequilibrium; heritability and structural variants such as inversions, copy number variants, genetic mosaicisms and recombination histories. From a systematic review of recently published applications of the methods, we describe the main characteristics of R packages, command-line tools and desktop applications, both free and commercial, to help make the most of a large amount of publicly available SNP data.
|