1
|
Abstract
Amelogenin plays a crucial role in tooth enamel formation, and mutations on X-chromosomal amelogenin cause X-linked amelogenesis imperfecta (AI). Amelogenin pre-messenger RNA (mRNA) is highly alternatively spliced, and during alternative splicing, exon4 is mostly skipped, leading to the formation of a microRNA (miR-exon4) that has been suggested to function in enamel and bone formation. While delivering the functional variation of amelogenin proteins, alternative splicing of exon4 is the decisive first step to producing miR-exon4. However, the factors that regulate the splicing of exon4 are not well understood. This study aimed to investigate the association between known mutations in exon4 and exon5 of X chromosome amelogenin that causes X-linked AI, the splicing of exon4, and miR-exon4 formation. Our results showed mutations in exon4 and exon5 of the amelogenin gene, including c.120T>C, c.152C>T, c.155C>G, and c.155delC, significantly affected the splicing of exon4 and subsequent miR-exon4 production. Using an amelogenin minigene transfected in HEK-293 cells, we observed increased inclusion of exon4 in amelogenin mRNA and reduced miR-exon4 production with these mutations. In silico analysis predicted that Ser/Arg-rich RNA splicing factor (SRSF) 2 and SRSF5 were the regulatory factors for exon4 and exon5 splicing, respectively. Electrophoretic mobility shift assay confirmed that SRSF2 binds to exon4 and SRSF5 binds to exon5, and mutations in each exon can alter SRSF binding. Transfection of the amelogenin minigene to LS8 ameloblastic cells suppressed expression of the known miR-exon4 direct targets, Nfia and Prkch, related to multiple pathways. Given the mutations on the minigene, the expression of Prkch has been significantly upregulated with c.155C>G and c.155delC mutations. Together, we confirmed that exon4 splicing is critical for miR-exon4 production, and mutations causing X-linked AI in exon4 and exon5 significantly affect exon4 splicing and following miR-exon4 production. The change in miR-exon4 would be an additional etiology of enamel defects seen in some X-linked AI.
Collapse
Affiliation(s)
- R. Shemirani
- Department of Orofacial Sciences, School of Dentistry, University of California, San Francisco, CA, USA
- Oral and Craniofacial Science, Graduate Division, University of California, San Francisco, CA, USA
| | - M.H. Le
- Oral and Craniofacial Science, Graduate Division, University of California, San Francisco, CA, USA
- Department of Preventive and Restorative Dental Sciences, University of California, San Francisco, CA, USA
- College of Dental Medicine, California Northstate University, Elk Grove, CA, USA
| | - Y. Nakano
- Department of Orofacial Sciences, School of Dentistry, University of California, San Francisco, CA, USA
- Center for Children’s Oral Health Research, School of Dentistry, University of California, San Francisco, CA, USA
| |
Collapse
|
2
|
Cullina S, Wojcik GL, Shemirani R, Klarin D, Gorman BR, Sorokin EP, Gignoux CR, Belbin GM, Pyarajan S, Asgari S, Tsao PS, Damrauer SM, Abul-Husn NS, Kenny EE. Admixture mapping of peripheral artery disease in a Dominican population reveals a putative risk locus on 2q35. Front Genet 2023; 14:1181167. [PMID: 37600667 PMCID: PMC10432698 DOI: 10.3389/fgene.2023.1181167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2023] [Accepted: 07/10/2023] [Indexed: 08/22/2023] Open
Abstract
Peripheral artery disease (PAD) is a form of atherosclerotic cardiovascular disease, affecting ∼8 million Americans, and is known to have racial and ethnic disparities. PAD has been reported to have a significantly higher prevalence in African Americans (AAs) compared to non-Hispanic European Americans (EAs). Hispanic/Latinos (HLs) have been reported to have lower or similar rates of PAD compared to EAs, despite having a paradoxically high burden of PAD risk factors; however, recent work suggests prevalence may differ between sub-groups. Here, we examined a large cohort of diverse adults in the BioMe biobank in New York City. We observed the prevalence of PAD at 1.7% in EAs vs. 8.5% and 9.4% in AAs and HLs, respectively, and among HL sub-groups, the prevalence was found at 11.4% and 11.5% in Puerto Rican and Dominican populations, respectively. Follow-up analysis that adjusted for common risk factors demonstrated that Dominicans had the highest increased risk for PAD relative to EAs [OR = 3.15 (95% CI 2.33-4.25), p < 6.44 × 10-14]. To investigate whether genetic factors may explain this increased risk, we performed admixture mapping by testing the association between local ancestry and PAD in Dominican BioMe participants (N = 1,813) separately from European, African, and Native American (NAT) continental ancestry tracts. The top association with PAD was an NAT ancestry tract at chromosome 2q35 [OR = 1.96 (SE = 0.16), p < 2.75 × 10-05) with 22.6% vs. 12.9% PAD prevalence in heterozygous NAT tract carriers versus non-carriers, respectively. Fine-mapping at this locus implicated tag SNP rs78529201 located within a long intergenic non-coding RNA (lincRNA) LINC00607, a gene expression regulator of key genes related to thrombosis and extracellular remodeling of endothelial cells, suggesting a putative link of the 2q35 locus to PAD etiology. Efforts to reproduce the signal in other Hispanic cohorts were unsuccessful. In summary, we showed how leveraging health system data helped understand nuances of PAD risk across HL sub-groups and admixture mapping approaches elucidated a putative risk locus in a Dominican population.
Collapse
Affiliation(s)
- Sinead Cullina
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Genevieve L. Wojcik
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States
| | - Ruhollah Shemirani
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Derek Klarin
- VA Palo Alto Healthcare System, Palo Alto, CA, United States
- Division of Vascular Surgery, Stanford University School of Medicine, Palo Alto, CA, United States
| | - Bryan R. Gorman
- Center for Data and Computational Sciences (C-DACS), VA Boston Healthcare System, Boston, MA, United States
- Booz Allen Hamilton, McLean, VA, United States
| | - Elena P. Sorokin
- Department of Genetics, Stanford University, Stanford, CA, United States
| | - Christopher R. Gignoux
- Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO, United States
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, United States
- Colorado Center for Personalized Medicine, Aurora, CO, United States
| | - Gillian M. Belbin
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Division of General Internal Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Saiju Pyarajan
- Center for Data and Computational Sciences (C-DACS), VA Boston Healthcare System, Boston, MA, United States
- Department of Medicine, Brigham Women’s Hospital, Harvard Medical School, Boston, MA, United States
| | - Samira Asgari
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Philip S. Tsao
- VA Palo Alto Healthcare System, Palo Alto, CA, United States
| | - Scott M. Damrauer
- Corporal Michael J. Crescenz VA Medical Center, Philadelphia, PA, United States
- Department of Surgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Department of Genetics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, United States
| | - Noura S. Abul-Husn
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Division of Genomic Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Eimear E. Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Division of General Internal Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Division of Genomic Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| |
Collapse
|
3
|
Caggiano C, Boudaie A, Shemirani R, Mefford J, Petter E, Chiu A, Ercelen D, He R, Tward D, Paul KC, Chang TS, Pasaniuc B, Kenny EE, Shortt JA, Gignoux CR, Balliu B, Arboleda VA, Belbin G, Zaitlen N. Disease risk and healthcare utilization among ancestrally diverse groups in the Los Angeles region. Nat Med 2023; 29:1845-1856. [PMID: 37464048 DOI: 10.1038/s41591-023-02425-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2022] [Accepted: 05/30/2023] [Indexed: 07/20/2023]
Abstract
An individual's disease risk is affected by the populations that they belong to, due to shared genetics and environmental factors. The study of fine-scale populations in clinical care is important for identifying and reducing health disparities and for developing personalized interventions. To assess patterns of clinical diagnoses and healthcare utilization by fine-scale populations, we leveraged genetic data and electronic medical records from 35,968 patients as part of the UCLA ATLAS Community Health Initiative. We defined clusters of individuals using identity by descent, a form of genetic relatedness that utilizes shared genomic segments arising due to a common ancestor. In total, we identified 376 clusters, including clusters with patients of Afro-Caribbean, Puerto Rican, Lebanese Christian, Iranian Jewish and Gujarati ancestry. Our analysis uncovered 1,218 significant associations between disease diagnoses and clusters and 124 significant associations with specialty visits. We also examined the distribution of pathogenic alleles and found 189 significant alleles at elevated frequency in particular clusters, including many that are not regularly included in population screening efforts. Overall, this work progresses the understanding of health in understudied communities and can provide the foundation for further study into health inequities.
Collapse
Affiliation(s)
- Christa Caggiano
- Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Neurology, University of California, Los Angeles, Los Angeles, CA, USA
| | | | - Ruhollah Shemirani
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Joel Mefford
- Semel Institute for Neuroscience and Human Behavior, University of California, Los Angeles, Los Angeles, CA, USA
| | - Ella Petter
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
| | - Alec Chiu
- Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, CA, USA
| | - Defne Ercelen
- Computational and Systems Biology Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, USA
| | - Rosemary He
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
| | - Daniel Tward
- Department of Neurology, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
| | - Kimberly C Paul
- Department of Neurology, University of California, Los Angeles, Los Angeles, CA, USA
| | - Timothy S Chang
- Department of Neurology, University of California, Los Angeles, Los Angeles, CA, USA
| | - Bogdan Pasaniuc
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Institute of Precision Health, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Pathology and Laboratory Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Jonathan A Shortt
- Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Division of Bioinformatics and Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
| | - Christopher R Gignoux
- Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Division of Bioinformatics and Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
| | - Brunilda Balliu
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
| | - Valerie A Arboleda
- Department of Pathology and Laboratory Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA
| | - Gillian Belbin
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Noah Zaitlen
- Department of Neurology, University of California, Los Angeles, Los Angeles, CA, USA.
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA.
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA.
| |
Collapse
|
4
|
Cullina S, Wojcik GL, Shemirani R, Klarin D, Gorman BR, Sorokin EP, Gignoux CR, Belbin GM, Pyarajan S, Asgari S, Tsao PS, Damrauer SM, Abul-Husn NS, Kenny EE. Admixture Mapping of Peripheral Artery Disease in a Dominican Population Reveals a Novel Risk Locus on 2q35. medRxiv 2023:2023.03.27.23287788. [PMID: 37034679 PMCID: PMC10081406 DOI: 10.1101/2023.03.27.23287788] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/29/2023]
Abstract
Peripheral artery disease (PAD) is a form of atherosclerotic cardiovascular disease, affecting ∼8 million Americans, and is known to have racial and ethnic disparities. PAD has been reported to have significantly higher prevalence in African Americans (AAs) compared to non-Hispanic European Americans (EAs). Hispanic/Latinos (HLs) have been reported to have lower or similar rates of PAD compared to EAs, despite having a paradoxically high burden of PAD risk factors, however recent work suggests prevalence may differ between sub-groups. Here we examined a large cohort of diverse adults in the Bio Me biobank in New York City (NYC). We observed the prevalence of PAD at 1.7% in EAs vs 8.5% and 9.4% in AAs and HLs, respectively; and among HL sub-groups, at 11.4% and 11.5% in Puerto Rican and Dominican populations, respectively. Follow-up analysis that adjusted for common risk factors demonstrated that Dominicans had the highest increased risk for PAD relative to EAs (OR=3.15 (95% CI 2.33-4.25), P <6.44×10 -14 ). To investigate whether genetic factors may explain this increased risk, we performed admixture mapping by testing the association between local ancestry (LA) and PAD in Dominican Bio Me participants (N=1,940) separately for European (EUR), African (AFR) and Native American (NAT) continental ancestry tracts. We identified a NAT ancestry tract at chromosome 2q35 that was significantly associated with PAD (OR=2.05 (95% CI 1.51-2.78), P <4.06×10 -6 ) with 22.5% vs 12.5% PAD prevalence in heterozygous NAT tract carriers versus non-carriers, respectively. Fine-mapping at this locus implicated tag SNP rs78529201 located within a long intergenic non-coding RNA (lincRNA) LINC00607 , a gene expression regulator of key genes related to thrombosis and extracellular remodeling of endothelial cells, suggesting a putative link of the 2q35 locus to PAD etiology. In summary, we showed how leveraging health systems data helped understand nuances of PAD risk across HL sub-groups and admixture mapping approaches elucidated a novel risk locus in a Dominican population.
Collapse
|
5
|
Shemirani R, Belbin GM, Burghardt K, Lerman K, Avery CL, Kenny EE, Gignoux CR, Ambite JL. Selecting Clustering Algorithms for Identity-By-Descent Mapping. Pac Symp Biocomput 2023; 28:121-132. [PMID: 36540970 PMCID: PMC9782725] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
Abstract
Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks using IBD mapping. Clustering algorithms play an important role in finding these groups accurately and at scale. We set out to analyze the fitness of commonly used, fast and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and utilized it to compare the statistical power of clustering algorithms via simulating 2.3 million clusters across 850 experiments. We found Infomap and Markov Clustering (MCL) community detection methods to have high statistical power in most of the scenarios. They yield a 30% increase in power compared to the current state-of-art approach, with a 3 orders of magnitude lower runtime. We also found that standard clustering metrics, such as modularity, cannot predict statistical power of algorithms in IBD mapping applications. We extend our findings to real datasets by analyzing the Population Architecture using Genomics and Epidemiology (PAGE) Study dataset with 51,000 samples and 2 million shared segments on Chromosome 1, resulting in the extraction of 39 million local IBD clusters. We demonstrate the power of our approach by recovering signals of rare genetic variation in the Whole-Exome Sequence data of 200,000 individuals in the UK Biobank. We provide an efficient implementation to enable clustering at scale for IBD mapping for various populations and scenarios.Supplementary Information: The code, along with supplementary methods and figures are available at https://github.com/roohy/localIBDClustering.
Collapse
Affiliation(s)
- Ruhollah Shemirani
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA,
| | | | | | | | | | | | | | | |
Collapse
|
6
|
Belbin GM, Cullina S, Wenric S, Soper ER, Glicksberg BS, Torre D, Moscati A, Wojcik GL, Shemirani R, Beckmann ND, Cohain A, Sorokin EP, Park DS, Ambite JL, Ellis S, Auton A, Bottinger EP, Cho JH, Loos RJF, Abul-Husn NS, Zaitlen NA, Gignoux CR, Kenny EE. Toward a fine-scale population health monitoring system. Cell 2021; 184:2068-2083.e11. [PMID: 33861964 DOI: 10.1016/j.cell.2021.03.034] [Citation(s) in RCA: 57] [Impact Index Per Article: 19.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2019] [Revised: 11/18/2020] [Accepted: 03/12/2021] [Indexed: 12/22/2022]
Abstract
Understanding population health disparities is an essential component of equitable precision health efforts. Epidemiology research often relies on definitions of race and ethnicity, but these population labels may not adequately capture disease burdens and environmental factors impacting specific sub-populations. Here, we propose a framework for repurposing data from electronic health records (EHRs) in concert with genomic data to explore the demographic ties that can impact disease burdens. Using data from a diverse biobank in New York City, we identified 17 communities sharing recent genetic ancestry. We observed 1,177 health outcomes that were statistically associated with a specific group and demonstrated significant differences in the segregation of genetic variants contributing to Mendelian diseases. We also demonstrated that fine-scale population structure can impact the prediction of complex disease risk within groups. This work reinforces the utility of linking genomic data to EHRs and provides a framework toward fine-scale monitoring of population health.
Collapse
Affiliation(s)
- Gillian M Belbin
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Sinead Cullina
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Stephane Wenric
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Emily R Soper
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Benjamin S Glicksberg
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Denis Torre
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Arden Moscati
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Genevieve L Wojcik
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
| | - Ruhollah Shemirani
- Information Science Institute, University of Southern California, Marina del Rey, CA 90089, USA
| | - Noam D Beckmann
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Ariella Cohain
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Elena P Sorokin
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
| | - Danny S Park
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
| | - Jose-Luis Ambite
- Information Science Institute, University of Southern California, Marina del Rey, CA 90089, USA
| | - Steve Ellis
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Adam Auton
- Department of Genetics, Albert Einstein College of Medicine, New York, NY 10461, USA
| | -
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | -
- Regeneron Genetics Center, Tarrytown, New York, NY 10591, USA
| | - Erwin P Bottinger
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Judy H Cho
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Ruth J F Loos
- Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Noura S Abul-Husn
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Noah A Zaitlen
- Department of Neurology, University of California, Los Angeles, Los Angeles, CA 90033, USA
| | - Christopher R Gignoux
- Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.
| |
Collapse
|
7
|
Shemirani R, Wenric S, Kenny E, Ambite JL. EPS: Automated Feature Selection in Case-Control Studies using Extreme Pseudo-Sampling. Bioinformatics 2021; 37:3372-3373. [PMID: 33774671 DOI: 10.1093/bioinformatics/btab214] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2020] [Revised: 03/24/2021] [Accepted: 03/26/2021] [Indexed: 11/14/2022] Open
Abstract
SUMMARY Finding informative predictive features in high dimensional biological case-control datasets is challenging. The Extreme Pseudo-Sampling (EPS) algorithm offers a solution to the challenge of feature selection via a combination of deep learning and linear regression models. First, using a variational autoencoder, it generates complex latent representations for the samples. Second, it classifies the latent representations of cases and controls via logistic regression. Third, it generates new samples (pseudo-samples) around the extreme cases and controls in the regression model. Finally, it trains a new regression model over the upsampled space. The most significant variables in this regression are selected. We present an open-source implementation of the algorithm that is easy to set up, use, and customize. Our package enhances the original algorithm by providing new features and customizability for data preparation, model training and classification functionalities. We believe the new features will enable the adoption of the algorithm for a diverse range of datasets. AVAILABILITY The software package for Python is available online at https://github.com/roohy/eps.
Collapse
Affiliation(s)
- Ruhollah Shemirani
- Information Sciences Institute, University of Southern California, Marina del Rey, US
| | - Stephane Wenric
- Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, US
| | - Eimear Kenny
- Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, US
| | - José Luis Ambite
- Information Sciences Institute, University of Southern California, Marina del Rey, US
| |
Collapse
|
8
|
Abstract
Whole transcriptome studies typically yield large amounts of data, with expression values for all genes or transcripts of the genome. The search for genes of interest in a particular study setting can thus be a daunting task, usually relying on automated computational methods. Moreover, most biological questions imply that such a search should be performed in a multivariate setting, to take into account the inter-genes relationships. Differential expression analysis commonly yields large lists of genes deemed significant, even after adjustment for multiple testing, making the subsequent study possibilities extensive. Here, we explore the use of supervised learning methods to rank large ensembles of genes defined by their expression values measured with RNA-Seq in a typical 2 classes sample set. First, we use one of the variable importance measures generated by the random forests classification algorithm as a metric to rank genes. Second, we define the EPS (extreme pseudo-samples) pipeline, making use of VAEs (Variational Autoencoders) and regressors to extract a ranking of genes while leveraging the feature space of both virtual and comparable samples. We show that, on 12 cancer RNA-Seq data sets ranging from 323 to 1,210 samples, using either a random forests-based gene selection method or the EPS pipeline outperforms differential expression analysis for 9 and 8 out of the 12 datasets respectively, in terms of identifying subsets of genes associated with survival. These results demonstrate the potential of supervised learning-based gene selection methods in RNA-Seq studies and highlight the need to use such multivariate gene selection methods alongside the widely used differential expression analysis.
Collapse
Affiliation(s)
- Stephane Wenric
- Laboratory of Human Genetics, GIGA-Research, University of Liège, Liège, Belgium.,Department of Genetics and Genomic Sciences, The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai Hospital, New York, NY, United States
| | - Ruhollah Shemirani
- Department of Computer Science, Information Sciences Institute, University of Southern California, Marina del Rey, CA, United States
| |
Collapse
|