1
|
Analysis of 3760 hematologic malignancies reveals rare transcriptomic aberrations of driver genes. Genome Med 2024; 16:70. [PMID: 38769532 PMCID: PMC11103968 DOI: 10.1186/s13073-024-01331-6] [Received: 09/29/2023] [Accepted: 04/04/2024] [Indexed: 05/22/2024]
Abstract
BACKGROUND Rare oncogenic driver events, particularly affecting the expression or splicing of driver genes, are suspected to substantially contribute to the large heterogeneity of hematologic malignancies. However, their identification remains challenging.
METHODS To address this issue, we generated the largest dataset to date of matched whole genome sequencing and total RNA sequencing of hematologic malignancies from 3760 patients spanning 24 disease entities. Taking advantage of our dataset size, we focused on discovering rare regulatory aberrations. Therefore, we called expression and splicing outliers using an extension of the workflow DROP (Detection of RNA Outliers Pipeline) and AbSplice, a variant effect predictor that identifies genetic variants causing aberrant splicing. We next trained a machine learning model integrating these results to prioritize new candidate disease-specific driver genes.
RESULTS We found a median of seven expression outlier genes, two splicing outlier genes, and two rare splice-affecting variants per sample. Each category showed significant enrichment for already well-characterized driver genes, with odds ratios exceeding three among genes called in more than five samples. On held-out data, our integrative modeling significantly outperformed modeling based solely on genomic data and revealed promising novel candidate driver genes. Remarkably, we found a truncated form of the low density lipoprotein receptor LRP1B transcript to be aberrantly overexpressed in about half of hairy cell leukemia variant (HCL-V) samples and, to a lesser extent, in closely related B-cell neoplasms. This observation, which was confirmed in an independent cohort, suggests LRP1B as a novel marker for a HCL-V subclass and a yet unreported functional role of LRP1B within these rare entities.
CONCLUSIONS Altogether, our census of expression and splicing outliers for 24 hematologic malignancy entities and the companion computational workflow constitute unique resources to deepen our understanding of rare oncogenic events in hematologic cancers.
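The enrichment statement in the RESULTS (odds ratios exceeding three for known driver genes) rests on a 2x2 contingency table. The following is a minimal sketch of that calculation, not the authors' code; the function name and the counts in the usage example are hypothetical and chosen only for illustration.

```python
def odds_ratio(outlier_driver, outlier_other, normal_driver, normal_other):
    """Odds ratio of a 2x2 table.

    Rows: genes called as outliers vs. genes not called.
    Columns: known driver genes vs. all other genes.
    """
    if outlier_other == 0 or normal_driver == 0:
        raise ValueError("odds ratio is undefined when an off-diagonal cell is zero")
    return (outlier_driver * normal_other) / (outlier_other * normal_driver)


# Hypothetical counts, not taken from the study.
example = odds_ratio(outlier_driver=30, outlier_other=270,
                     normal_driver=170, normal_other=9530)
print(f"odds ratio = {example:.2f}")
```

A value above 1 indicates that outlier genes are drivers more often than expected from the background; a significance test (for example Fisher's exact test) would be needed on top of this point estimate.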
|
2
|
Unravelling undiagnosed rare disease cases by HiFi long-read genome sequencing. medRxiv 2024:2024.05.03.24305331. [PMID: 38746462 PMCID: PMC11092722 DOI: 10.1101/2024.05.03.24305331] [Indexed: 05/16/2024]
Abstract
Solve-RD is a pan-European rare disease (RD) research program that aims to identify disease-causing genetic variants in previously undiagnosed RD families. We utilised 10-fold coverage HiFi long-read sequencing (LRS) for detecting causative structural variants (SVs), single nucleotide variants (SNVs), insertion-deletions (InDels), and short tandem repeat (STR) expansions in extensively studied RD families without clear molecular diagnoses. Our cohort includes 293 individuals from 114 genetically undiagnosed RD families selected by European Rare Disease Network (ERN) experts. Of these, 21 families were affected by so-called 'unsolvable' syndromes for which genetic causes remain unknown, and 93 families with at least one individual affected by a rare neurological, neuromuscular, or epilepsy disorder without genetic diagnosis despite extensive prior testing. Clinical interpretation and orthogonal validation of variants in known disease genes yielded thirteen novel genetic diagnoses due to de novo and rare inherited SNVs, InDels, SVs, and STR expansions. In an additional four families, we identified a candidate disease-causing SV affecting several genes including an MCF2 / FGF13 fusion and PSMA3 deletion. However, no common genetic cause was identified in any of the 'unsolvable' syndromes. Taken together, we found (likely) disease-causing genetic variants in 13.0% of previously unsolved families and additional candidate disease-causing SVs in another 4.3% of these families. In conclusion, our results demonstrate the added value of HiFi long-read genome sequencing in undiagnosed rare diseases.
|
3
|
Cellular energy regulates mRNA degradation in a codon-specific manner. Mol Syst Biol 2024; 20:506-520. [PMID: 38491213 PMCID: PMC11066088 DOI: 10.1038/s44320-024-00026-9] [Received: 07/04/2023] [Revised: 02/19/2024] [Accepted: 02/20/2024] [Indexed: 03/18/2024]
Abstract
Codon optimality is a major determinant of mRNA translation and degradation rates. However, whether and through which mechanisms its effects are regulated remains poorly understood. Here we show that codon optimality associates with up to 2-fold change in mRNA stability variations between human tissues, and that its effect is attenuated in tissues with high energy metabolism and amplifies with age. Mathematical modeling and perturbation data through oxygen deprivation and ATP synthesis inhibition reveal that cellular energy variations non-uniformly alter the effect of codon usage. This new mode of codon effect regulation, independent of tRNA regulation, provides a fundamental mechanistic link between cellular energy metabolism and eukaryotic gene expression.
|
4
|
Species-aware DNA language models capture regulatory elements and their evolution. Genome Biol 2024; 25:83. [PMID: 38566111 PMCID: PMC10985990 DOI: 10.1186/s13059-024-03221-x] [Received: 08/14/2023] [Accepted: 03/20/2024] [Indexed: 04/04/2024]
Abstract
BACKGROUND The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, new algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution.
RESULTS Here, we introduce species-aware DNA language models, which we trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, we show that DNA language models distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, DNA language models capture conserved regulatory elements over much further evolutionary distances than sequence alignment would allow. Remarkably, DNA language models reconstruct motif instances bound in vivo better than unbound ones and account for the evolution of motif sequences and their positional constraints, showing that these models capture functional high-order sequence and evolutionary context. We further show that species-aware training yields improved sequence representations for endogenous and MPRA-based gene expression prediction, as well as motif discovery.
CONCLUSIONS Collectively, these results demonstrate that species-aware DNA language models are a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes.
|
5
|
Viral genome sequencing to decipher in-hospital SARS-CoV-2 transmission events. Sci Rep 2024; 14:5768. [PMID: 38459123 PMCID: PMC10923895 DOI: 10.1038/s41598-024-56162-7] [Received: 08/30/2023] [Accepted: 03/02/2024] [Indexed: 03/10/2024]
Abstract
The SARS-CoV-2 pandemic has highlighted the need to better define in-hospital transmissions, a need that extends to all other common infectious diseases encountered in clinical settings. To evaluate how whole viral genome sequencing can contribute to deciphering nosocomial SARS-CoV-2 transmission, 926 SARS-CoV-2 viral genomes from 622 staff members and patients were collected between February 2020 and January 2021 at a university hospital in Munich, Germany, and analysed along with the place of work, duration of hospital stay, and ward transfers. Bioinformatically defined transmission clusters inferred from viral genome sequencing were compared to those inferred from interview-based contact tracing. An additional dataset collected at the same time at another university hospital in the same city was used to account for multiple independent introductions. Clustering analysis of 619 viral genomes generated 19 clusters ranging from 3 to 31 individuals. Sequencing-based transmission clusters showed little overlap with those based on contact tracing data. The viral genomes were significantly more closely related to each other than comparable genomes collected simultaneously at other hospitals in the same city (n = 829), suggesting nosocomial transmission. Longitudinal sampling from individual patients suggested possible cross-infection events during the hospital stay in 19.2% of individuals (14 of 73 individuals). Clustering analysis of SARS-CoV-2 whole genome sequences can reveal cryptic transmission events missed by classical, interview-based contact tracing, helping to decipher in-hospital transmissions. These results, in line with other studies, advocate for viral genome sequencing as a pathogen transmission surveillance tool in hospitals.
|
6
|
CAGI, the Critical Assessment of Genome Interpretation, establishes progress and prospects for computational genetic variant interpretation methods. Genome Biol 2024; 25:53. [PMID: 38389099 PMCID: PMC10882881 DOI: 10.1186/s13059-023-03113-6] [Received: 01/21/2023] [Accepted: 11/17/2023] [Indexed: 02/24/2024]
Abstract
BACKGROUND The Critical Assessment of Genome Interpretation (CAGI) aims to advance the state-of-the-art for computational prediction of genetic variant impact, particularly where relevant to disease. The five complete editions of the CAGI community experiment comprised 50 challenges, in which participants made blind predictions of phenotypes from genetic data, and these were evaluated by independent assessors.
RESULTS Performance was particularly strong for clinical pathogenic variants, including some difficult-to-diagnose cases, and extends to interpretation of cancer-related variants. Missense variant interpretation methods were able to estimate biochemical effects with increasing accuracy. Assessment of methods for regulatory variants and complex trait disease risk was less definitive and indicates performance potentially suitable for auxiliary use in the clinic.
CONCLUSIONS Results show that while current methods are imperfect, they have major utility for research and clinical applications. Emerging methods and increasingly large, robust datasets for training and assessment promise further progress ahead.
|
7
|
Impaired biogenesis of basic proteins impacts multiple hallmarks of the aging brain. bioRxiv 2024:2023.07.20.549210. [PMID: 38260253 PMCID: PMC10802395 DOI: 10.1101/2023.07.20.549210] [Indexed: 01/24/2024]
Abstract
Aging and neurodegeneration entail diverse cellular and molecular hallmarks. Here, we studied the effects of aging on the transcriptome, translatome, and multiple layers of the proteome in the brain of a short-lived killifish. We reveal that aging causes widespread reduction of proteins enriched in basic amino acids that is independent of mRNA regulation, and it is not due to impaired proteasome activity. Instead, we identify a cascade of events where aberrant translation pausing leads to reduced ribosome availability resulting in proteome remodeling independently of transcriptional regulation. Our research uncovers a vulnerable point in the aging brain's biology - the biogenesis of basic DNA/RNA binding proteins. This vulnerability may represent a unifying principle that connects various aging hallmarks, encompassing genome integrity and the biosynthesis of macromolecules.
|
8
|
Deep learning-driven fragment ion series classification enables highly precise and sensitive de novo peptide sequencing. Nat Commun 2024; 15:151. [PMID: 38167372 PMCID: PMC10762064 DOI: 10.1038/s41467-023-44323-7] [Received: 01/16/2023] [Accepted: 12/08/2023] [Indexed: 01/05/2024]
Abstract
Unlike for DNA and RNA, accurate and high-throughput sequencing methods for proteins are lacking, hindering the utility of proteomics in applications where the sequences are unknown including variant calling, neoepitope identification, and metaproteomics. We introduce Spectralis, a de novo peptide sequencing method for tandem mass spectrometry. Spectralis leverages several innovations including a convolutional neural network layer connecting peaks in spectra spaced by amino acid masses, proposing fragment ion series classification as a pivotal task for de novo peptide sequencing, and a peptide-spectrum confidence score. On spectra for which database search provided a ground truth, Spectralis surpassed 40% sensitivity at 90% precision, nearly doubling state-of-the-art sensitivity. Application to unidentified spectra confirmed its superiority and showcased its applicability to variant calling. Altogether, these algorithmic innovations and the substantial sensitivity increase in the high-precision range constitute an important step toward broadly applicable peptide sequencing.
|
9
|
Abstract
Single-cell ATAC sequencing coverage in regulatory regions is typically binarized as an indicator of open chromatin. Here we show that binarization is an unnecessary step that neither improves goodness of fit, clustering, cell type identification nor batch integration. Fragment counts, but not read counts, should instead be modeled, which preserves quantitative regulatory information. These results have immediate implications for single-cell ATAC sequencing analysis.
|
10
|
Improved detection of aberrant splicing with FRASER 2.0 and the intron Jaccard index. Am J Hum Genet 2023; 110:2056-2067. [PMID: 38006880 PMCID: PMC10716352 DOI: 10.1016/j.ajhg.2023.10.014] [Received: 05/10/2023] [Revised: 10/20/2023] [Accepted: 10/26/2023] [Indexed: 11/27/2023]
Abstract
Detection of aberrantly spliced genes is an important step in RNA-seq-based rare-disease diagnostics. We recently developed FRASER, a denoising autoencoder-based method that outperformed alternative methods of detecting aberrant splicing. However, because FRASER's three splice metrics are partially redundant and tend to be sensitive to sequencing depth, we introduce here a more robust intron-excision metric, the intron Jaccard index, that combines the alternative donor, alternative acceptor, and intron-retention signal into a single value. Moreover, we optimized model parameters and filter cutoffs by using candidate rare-splice-disrupting variants as independent evidence. On 16,213 GTEx samples, our improved algorithm, FRASER 2.0, called typically 10 times fewer splicing outliers while increasing the proportion of candidate rare-splice-disrupting variants by 10-fold and substantially decreasing the effect of sequencing depth on the number of reported outliers. To lower the multiple-testing correction burden, we introduce an option to select the genes to be tested for each sample instead of a transcriptome-wide approach. This option can be particularly useful when prior information, such as candidate variants or genes, is available. Application on 303 rare-disease samples confirmed the relative reduction in the number of outlier calls for a slight loss of sensitivity; FRASER 2.0 recovered 22 out of 26 previously identified pathogenic splicing cases with default cutoffs and 24 when multiple-testing correction was limited to OMIM genes containing rare variants. Altogether, these methodological improvements contribute to more effective RNA-seq-based rare diagnostics by drastically reducing the amount of splicing outlier calls per sample at minimal loss of sensitivity.
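To make the idea of a single intron-excision metric concrete, here is a minimal sketch of a Jaccard-style index for one intron. It assumes the index is the number of split reads supporting the intron divided by all reads that touch either its donor or its acceptor site (split reads using the same donor with another acceptor, split reads using the same acceptor with another donor, and non-split reads overlapping either splice site). The abstract does not give the formula, so this formulation, the function name, and the argument names are assumptions rather than the FRASER 2.0 implementation.

```python
import math


def intron_jaccard(intron_reads, other_donor_reads, other_acceptor_reads,
                   nonsplit_donor_reads, nonsplit_acceptor_reads):
    """Jaccard-style intron usage: reads supporting the intron over all
    reads observed at its donor or acceptor site (assumed formulation)."""
    union = (intron_reads + other_donor_reads + other_acceptor_reads
             + nonsplit_donor_reads + nonsplit_acceptor_reads)
    if union == 0:
        # No coverage at either splice site: the index is undefined.
        return math.nan
    return intron_reads / union


# Illustrative read counts, not taken from the study.
print(intron_jaccard(80, 10, 5, 3, 2))
```

Under this formulation, alternative donor usage, alternative acceptor usage, and intron retention all lower the same value, which is how one number can capture the three signals.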
|
11
|
Evaluation of input data modality choices on functional gene embeddings. NAR Genom Bioinform 2023; 5:lqad095. [PMID: 37942285 PMCID: PMC10629286 DOI: 10.1093/nargab/lqad095] [Received: 05/16/2023] [Revised: 09/07/2023] [Accepted: 09/28/2023] [Indexed: 11/10/2023]
Abstract
Functional gene embeddings, numerical vectors capturing gene function, provide a promising way to integrate functional gene information into machine learning models. These embeddings are learnt by applying self-supervised machine-learning algorithms on various data types including quantitative omics measurements, protein-protein interaction networks and literature. However, downstream evaluations comparing alternative data modalities used to construct functional gene embeddings have been lacking. Here we benchmarked functional gene embeddings obtained from various data modalities for predicting disease-gene lists, cancer drivers, phenotype-gene associations and scores from genome-wide association studies. Off-the-shelf predictors trained on precomputed embeddings matched or outperformed dedicated state-of-the-art predictors, demonstrating their high utility. Embeddings based on literature and protein-protein interactions inferred from low-throughput experiments outperformed embeddings derived from genome-wide experimental data (transcriptomics, deletion screens and protein sequence) when predicting curated gene lists. In contrast, they did not perform better when predicting genome-wide association signals and were biased towards highly-studied genes. These results indicate that embeddings derived from literature and low-throughput experiments appear favourable in many existing benchmarks because they are biased towards well-studied genes and should therefore be considered with caution. Altogether, our study and precomputed embeddings will facilitate the development of machine-learning models in genetics and related fields.
|
12
|
Epicardioid single-cell genomics uncovers principles of human epicardium biology in heart development and disease. Nat Biotechnol 2023; 41:1787-1800. [PMID: 37012447 PMCID: PMC10713454 DOI: 10.1038/s41587-023-01718-7] [Received: 03/14/2022] [Accepted: 02/22/2023] [Indexed: 04/05/2023]
Abstract
The epicardium, the mesothelial envelope of the vertebrate heart, is the source of multiple cardiac cell lineages during embryonic development and provides signals that are essential to myocardial growth and repair. Here we generate self-organizing human pluripotent stem cell-derived epicardioids that display retinoic acid-dependent morphological, molecular and functional patterning of the epicardium and myocardium typical of the left ventricular wall. By combining lineage tracing, single-cell transcriptomics and chromatin accessibility profiling, we describe the specification and differentiation process of different cell lineages in epicardioids and draw comparisons to human fetal development at the transcriptional and morphological levels. We then use epicardioids to investigate the functional cross-talk between cardiac cell types, gaining new insights into the role of IGF2/IGF1R and NRP2 signaling in human cardiogenesis. Finally, we show that epicardioids mimic the multicellular pathogenesis of congenital or stress-induced hypertrophy and fibrotic remodeling. As such, epicardioids offer a unique testing ground of epicardial activity in heart development, disease and regeneration.
|
13
|
Towards in silico CLIP-seq: predicting protein-RNA interaction via sequence-to-signal learning. Genome Biol 2023; 24:180. [PMID: 37542318 PMCID: PMC10403857 DOI: 10.1186/s13059-023-03015-7] [Received: 09/23/2022] [Accepted: 07/17/2023] [Indexed: 08/06/2023]
Abstract
We present RBPNet, a novel deep learning method, which predicts CLIP-seq crosslink count distribution from RNA sequence at single-nucleotide resolution. By training on up to a million regions, RBPNet achieves high generalization on eCLIP, iCLIP and miCLIP assays, outperforming state-of-the-art classifiers. RBPNet performs bias correction by modeling the raw signal as a mixture of the protein-specific and background signal. Through model interrogation via Integrated Gradients, RBPNet identifies predictive sub-sequences that correspond to known and novel binding motifs and enables variant-impact scoring via in silico mutagenesis. Together, RBPNet improves imputation of protein-RNA interactions, as well as mechanistic interpretation of predictions.
|
14
|
Distinct genetic liability profiles define clinically relevant patient strata across common diseases. medRxiv 2023:2023.05.10.23289788. [PMID: 37214898 PMCID: PMC10197798 DOI: 10.1101/2023.05.10.23289788] [Indexed: 05/24/2023]
Abstract
Genome-wide association studies have unearthed a wealth of genetic associations across many complex diseases. However, translating these associations into biological mechanisms contributing to disease etiology and heterogeneity has been challenging. Here, we hypothesize that the effects of disease-associated genetic variants converge onto distinct cell type specific molecular pathways within distinct subgroups of patients. In order to test this hypothesis, we develop the CASTom-iGEx pipeline to operationalize individual level genotype data to interpret personal polygenic risk and identify the genetic basis of clinical heterogeneity. The paradigmatic application of this approach to coronary artery disease and schizophrenia reveals a convergence of disease associated variant effects onto known and novel genes, pathways, and biological processes. The biological process specific genetic liabilities are not equally distributed across patients. Instead, they defined genetically distinct groups of patients, characterized by different profiles across pathways, endophenotypes, and disease severity. These results provide further evidence for a genetic contribution to clinical heterogeneity and point to the existence of partially distinct pathomechanisms across patient subgroups. Thus, the universally applicable approach presented here has the potential to constitute an important component of future personalized medicine concepts.
|
15
|
Aberrant splicing prediction across human tissues. Nat Genet 2023; 55:861-870. [PMID: 37142848 DOI: 10.1038/s41588-023-01373-3] [Received: 04/05/2022] [Accepted: 03/14/2023] [Indexed: 05/06/2023]
Abstract
Aberrant splicing is a major cause of genetic disorders but its direct detection in transcriptomes is limited to clinically accessible tissues such as skin or body fluids. While DNA-based machine learning models can prioritize rare variants for affecting splicing, their performance in predicting tissue-specific aberrant splicing remains unassessed. Here we generated an aberrant splicing benchmark dataset, spanning over 8.8 million rare variants in 49 human tissues from the Genotype-Tissue Expression (GTEx) dataset. At 20% recall, state-of-the-art DNA-based models achieve maximum 12% precision. By mapping and quantifying tissue-specific splice site usage transcriptome-wide and modeling isoform competition, we increased precision by threefold at the same recall. Integrating RNA-sequencing data of clinically accessible tissues into our model, AbSplice, brought precision to 60%. These results, replicated in two independent cohorts, substantially contribute to noncoding loss-of-function variant identification and to genetic diagnostics design and analytics.
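The benchmark figures above are reported as precision at a fixed recall. The sketch below shows one common way to compute that quantity from ranked predictions: walk down the score-sorted list and report precision at the first cutoff whose recall reaches the target. It is an illustration under that assumption, not the authors' evaluation code; the function name and the toy scores and labels are hypothetical.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the first score cutoff whose recall reaches target_recall.

    scores: predicted scores, higher meaning more likely positive.
    labels: 1 for a true positive case, 0 otherwise.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    for n_called, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives / total_positives >= target_recall:
            return true_positives / n_called
    raise ValueError("target_recall must not exceed 1")


# Toy data: ten variants, five of which truly cause aberrant splicing.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
print(precision_at_recall(scores, labels, target_recall=0.5))
```

Tied scores are taken in input order here; a production evaluation would usually treat ties as a single cutoff.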
|
16
|
Improved detection of aberrant splicing using the Intron Jaccard Index. medRxiv 2023:2023.03.31.23287997. [PMID: 37066374 PMCID: PMC10104204 DOI: 10.1101/2023.03.31.23287997] [Indexed: 04/29/2023]
Abstract
Detection of aberrantly spliced genes is an important step in RNA-seq-based rare disease diagnostics. We recently developed FRASER, a denoising autoencoder-based method for aberrant splicing detection that outperformed alternative approaches. However, as FRASER's three splice metrics are partially redundant and tend to be sensitive to sequencing depth, we introduce here a more robust intron excision metric, the Intron Jaccard Index, that combines alternative donor, alternative acceptor, and intron retention signal into a single value. Moreover, we optimized model parameters and filter cutoffs using candidate rare splice-disrupting variants as independent evidence. On 16,213 GTEx samples, our improved algorithm called typically 10 times fewer splicing outliers while increasing the proportion of candidate rare splice-disrupting variants by 10 fold and substantially decreasing the effect of sequencing depth on the number of reported outliers. Application on 303 rare disease samples confirmed the reduction fold-change of the number of outlier calls for a slight loss of sensitivity (only 2 out of 22 previously identified pathogenic splicing cases not recovered). Altogether, these methodological improvements contribute to more effective RNA-seq-based rare diagnostics by a drastic reduction of the amount of splicing outlier calls per sample at minimal loss of sensitivity.
|
17
|
Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol 2023; 24:56. [PMID: 36973806 PMCID: PMC10045630 DOI: 10.1186/s13059-023-02899-9] [Received: 09/14/2022] [Accepted: 03/16/2023] [Indexed: 03/29/2023]
Abstract
BACKGROUND The largest sequence-based models of transcription control to date are obtained by predicting genome-wide gene regulatory assays across the human genome. This setting is fundamentally correlative, as those models are exposed during training solely to the sequence variation between human genes that arose through evolution, questioning the extent to which those models capture genuine causal signals.
RESULTS Here we confront predictions of state-of-the-art models of transcription regulation against data from two large-scale observational studies and five deep perturbation assays. The most advanced of these sequence-based models, Enformer, by and large, captures causal determinants of human promoters. However, models fail to capture the causal effects of enhancers on expression, notably in medium to long distances and particularly for highly expressed promoters. More generally, the predicted impact of distal elements on gene expression predictions is small and the ability to correctly integrate long-range information is significantly more limited than the receptive fields of the models suggest. This is likely caused by the escalating class imbalance between actual and candidate regulatory elements as distance increases.
CONCLUSIONS Our results suggest that sequence-based models have advanced to the point that in silico study of promoter regions and promoter variants can provide meaningful insights and we provide practical guidance on how to use them. Moreover, we foresee that it will require significantly more and particularly new kinds of data to train models accurately accounting for distal elements.
|
18
|
Transmicron: accurate prediction of insertion probabilities improves detection of cancer driver genes from transposon mutagenesis screens. Nucleic Acids Res 2023; 51:e21. [PMID: 36617985 PMCID: PMC9976929 DOI: 10.1093/nar/gkac1215] [Received: 10/02/2022] [Revised: 11/06/2022] [Accepted: 12/17/2022] [Indexed: 01/10/2023]
Abstract
Transposon screens are powerful in vivo assays used to identify loci driving carcinogenesis. These loci are identified as Common Insertion Sites (CISs), i.e. regions with more transposon insertions than expected by chance. However, the identification of CISs is affected by biases in the insertion behaviour of transposon systems. Here, we introduce Transmicron, a novel method that differs from previous methods by (i) modelling neutral insertion rates based on chromatin accessibility, transcriptional activity and sequence context and (ii) estimating oncogenic selection for each genomic region using Poisson regression to model insertion counts while controlling for neutral insertion rates. To assess the benefits of our approach, we generated a dataset applying two different transposon systems under comparable conditions. Benchmarking for enrichment of known cancer genes showed improved performance of Transmicron against state-of-the-art methods. Modelling neutral insertion rates allowed for better control of false positives and stronger agreement of the results between transposon systems. Moreover, using Poisson regression to consider intra-sample and inter-sample information proved beneficial in small and moderately-sized datasets. Transmicron is open-source and freely available. Overall, this study contributes to the understanding of transposon biology and introduces a novel approach to use this knowledge for discovering cancer driver genes.
|
19
|
The adapted Activity-By-Contact model for enhancer-gene assignment and its application to single-cell data. Bioinformatics 2023; 39:7008325. [PMID: 36708003 PMCID: PMC9931646 DOI: 10.1093/bioinformatics/btad062] [Received: 06/23/2022] [Revised: 12/05/2022] [Accepted: 01/26/2023] [Indexed: 01/29/2023]
Abstract
MOTIVATION Identifying regulatory regions in the genome is of great interest for understanding the epigenomic landscape in cells. One fundamental challenge in this context is to find the target genes whose expression is affected by the regulatory regions. A recent successful method is the Activity-By-Contact (ABC) model which scores enhancer-gene interactions based on enhancer activity and the contact frequency of an enhancer to its target gene. However, it describes regulatory interactions entirely from a gene's perspective, and does not account for all the candidate target genes of an enhancer. In addition, the ABC model requires two types of assays to measure enhancer activity, which limits the applicability. Moreover, there is neither implementation available that could allow for an integration with transcription factor (TF) binding information nor an efficient analysis of single-cell data. RESULTS We demonstrate that the ABC score can yield a higher accuracy by adapting the enhancer activity according to the number of contacts the enhancer has to its candidate target genes and also by considering all annotated transcription start sites of a gene. Further, we show that the model is comparably accurate with only one assay to measure enhancer activity. We combined our generalized ABC model with TF binding information and illustrated an analysis of a single-cell ATAC-seq dataset of the human heart, where we were able to characterize cell type-specific regulatory interactions and predict gene expression based on TF affinities. All executed processing steps are incorporated into our new computational pipeline STARE. AVAILABILITY AND IMPLEMENTATION The software is available at https://github.com/schulzlab/STARE. CONTACT marcel.schulz@em.uni-frankfurt.de. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
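The scoring idea can be sketched as follows. The plain score divides each enhancer's activity times contact by the sum of that product over all candidate enhancers of the gene. The adapted variant shown here is an assumption about how "adapting the enhancer activity according to the number of contacts" might be realized: each enhancer's activity is shared among its candidate genes in proportion to contact. It is not the STARE code, and the function and argument names are hypothetical.

```python
from collections import defaultdict


def abc_scores(activity, contact, adapt_activity=False):
    """Score enhancer-gene pairs in the spirit of the Activity-By-Contact model.

    activity: dict mapping enhancer -> activity signal
    contact:  dict mapping (enhancer, gene) -> contact frequency
    adapt_activity: if True, split each enhancer's activity across its
        candidate genes in proportion to contact (illustrative assumption).
    """
    if adapt_activity:
        contact_per_enhancer = defaultdict(float)
        for (enh, _gene), c in contact.items():
            contact_per_enhancer[enh] += c
        act = {
            (enh, gene): activity[enh] * c / contact_per_enhancer[enh]
            for (enh, gene), c in contact.items()
            if contact_per_enhancer[enh] > 0
        }
    else:
        act = {(enh, gene): activity[enh] for (enh, gene) in contact}

    # Normalize by the summed activity * contact of all enhancers of each gene.
    total_per_gene = defaultdict(float)
    for (enh, gene), c in contact.items():
        total_per_gene[gene] += act.get((enh, gene), 0.0) * c

    return {
        (enh, gene): act[(enh, gene)] * c / total_per_gene[gene]
        for (enh, gene), c in contact.items()
        if (enh, gene) in act and total_per_gene[gene] > 0
    }


if __name__ == "__main__":
    activity = {"e1": 10.0, "e2": 5.0}
    contact = {("e1", "gA"): 0.5, ("e2", "gA"): 0.5, ("e1", "gB"): 0.5}
    print(abc_scores(activity, contact))
    print(abc_scores(activity, contact, adapt_activity=True))
```

In the toy input, enhancer `e1` contacts two genes, so the adapted variant lowers its share for gene `gA` relative to `e2`, which contacts only `gA`.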
|
20
|
Leigh syndrome is the main clinical characteristic of PTCD3 deficiency. Brain Pathol 2022; 33:e13134. [PMID: 36450274 PMCID: PMC10154364 DOI: 10.1111/bpa.13134] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/22/2022] [Accepted: 11/16/2022] [Indexed: 12/05/2022] Open
Abstract
Mitochondrial translation defects are a continuously growing group of disorders showing a large variety of clinical symptoms including a wide range of neurological abnormalities. To date, mutations in PTCD3, encoding a component of the mitochondrial ribosome, have only been reported in a single individual with clinical evidence of Leigh syndrome. Here, we describe three additional PTCD3 individuals from two unrelated families, broadening the genetic and phenotypic spectrum of this disorder, and provide definitive evidence that PTCD3 deficiency is associated with Leigh syndrome. The patients presented in the first months of life with psychomotor delay, respiratory insufficiency and feeding difficulties. The neurologic phenotype included dystonia, optic atrophy, nystagmus and tonic-clonic seizures. Brain MRI showed optic nerve atrophy and thalamic changes, consistent with Leigh syndrome. WES and RNA-seq identified compound heterozygous variants in PTCD3 in both families: c.[1453-1G>C];[1918C>G] and c.[710del];[902C>T]. The functional consequences of the identified variants were determined by a comprehensive characterization of the mitochondrial function. PTCD3 protein levels were significantly reduced in patient fibroblasts and, consistent with a mitochondrial translation defect, a severe reduction in the steady state levels of complexes I and IV subunits was detected. Accordingly, the activity of these complexes was also low, and high-resolution respirometry showed a significant decrease in the mitochondrial respiratory capacity. Functional complementation studies demonstrated the pathogenic effect of the identified variants since the expression of wild-type PTCD3 in immortalized fibroblasts restored the steady-state levels of complexes I and IV subunits as well as the mitochondrial respiratory capacity. Additionally, minigene assays demonstrated that three of the identified variants were pathogenic by altering PTCD3 mRNA processing. 
The fourth variant was a frameshift leading to a truncated protein. In summary, we provide evidence of PTCD3 involvement in human disease confirming that PTCD3 deficiency is definitively associated with Leigh syndrome.
|
21
|
Abstract P2061: The Long Non-coding RNA Schlafenlnc As A Regulator Of Cardiac Resident Macrophage Function. Circ Res 2022. [DOI: 10.1161/res.131.suppl_1.p2061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Introduction: Cardiac resident macrophages (crMΦs) constitute up to 5% of cells in the murine heart and were shown to play key roles in cardiac homeostasis and disease. Long non-coding RNAs (lncRNAs) are regulatory molecules that impact characteristics such as cell identity, proliferation or migration. However, the function of lncRNAs in crMΦ remains enigmatic.
Objective: We sought to identify crMΦ-specific lncRNAs and analyze their function in vitro and in vivo to understand their role during health and cardiac disease.
Methods and Results: Using RNASeq (>100 million reads/sample) of purified murine crMΦs and single cell Seq of total murine myocardium in health and disease, we could identify the lncRNA Schlafenlnc as a highly enriched and abundant lncRNA in crMΦs. Employing the CRISPR-Cas system we successfully deleted the full Schlafenlnc locus in a macrophage progenitor cell line. Next, we performed RNASeq of Schlafenlnc-/- macrophages and could observe 2,660 significantly deregulated genes that were enriched in genes associated with chemotaxis and migration. In line with these findings, Schlafenlnc-/- macrophages displayed decreased chemotaxis as well as adhesion using cell-based assays. Furthermore, using RNA-pulldown experiments followed by mass spectrometry analysis, we could identify 27 interaction partners of Schlafenlnc, which are involved in processes such as mRNA processing, transcriptional regulation and alternative splicing. Finally, we are currently using cardiac functional measurements, macrophage stainings as well as single cell Seq during health and cardiac disease to analyze the function of Schlafenlnc in vivo.
Conclusion: In this study, we could identify the crMΦ-specific lncRNA Schlafenlnc as a critical regulator of macrophage migratory functions. Therapeutic targeting of the evolutionarily conserved lncRNA Schlafenlnc might therefore be beneficial in the treatment of inflammatory cardiac diseases.
|
22
|
Author Correction: Detection of aberrant splicing events in RNA-seq data using FRASER. Nat Commun 2022; 13:3474. [PMID: 35710804 PMCID: PMC9203570 DOI: 10.1038/s41467-022-31242-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2022] Open
|
23
|
Abstract
BACKGROUND Lack of functional evidence hampers variant interpretation, leaving a large proportion of individuals with a suspected Mendelian disorder without genetic diagnosis after whole genome or whole exome sequencing (WES). Research studies advocate to further sequence transcriptomes to directly and systematically probe gene expression defects. However, collection of additional biopsies and establishment of lab workflows, analytical pipelines, and defined concepts in clinical interpretation of aberrant gene expression are still needed for adopting RNA sequencing (RNA-seq) in routine diagnostics. METHODS We implemented an automated RNA-seq protocol and a computational workflow with which we analyzed skin fibroblasts of 303 individuals with a suspected mitochondrial disease that previously underwent WES. We also assessed through simulations how aberrant expression and mono-allelic expression tests depend on RNA-seq coverage. RESULTS We detected on average 12,500 genes per sample including around 60% of all disease genes, a coverage substantially higher than with whole blood, supporting the use of skin biopsies. We prioritized genes demonstrating aberrant expression, aberrant splicing, or mono-allelic expression. The pipeline required less than 1 week from sample preparation to result reporting and provided a median of eight disease-associated genes per patient for inspection. A genetic diagnosis was established for 16% of the 205 WES-inconclusive cases. Detection of aberrant expression was a major contributor to diagnosis including instances of 50% reduction, which, together with mono-allelic expression, allowed for the diagnosis of dominant disorders caused by haploinsufficiency. Moreover, calling aberrant splicing and variants from RNA-seq data enabled detecting and validating splice-disrupting variants, of which the majority fell outside WES-covered regions.
CONCLUSION Together, these results show that streamlined experimental and computational processes can accelerate the implementation of RNA-seq in routine diagnostics.
|
24
|
Transcriptome-wide association study of coronary artery disease identifies novel susceptibility genes. Basic Res Cardiol 2022; 117:6. [PMID: 35175464 PMCID: PMC8852935 DOI: 10.1007/s00395-022-00917-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/02/2021] [Revised: 01/18/2022] [Accepted: 02/01/2022] [Indexed: 01/31/2023]
Abstract
The majority of risk loci identified by genome-wide association studies (GWAS) are in non-coding regions, hampering their functional interpretation. Instead, transcriptome-wide association studies (TWAS) identify gene-trait associations, which can be used to prioritize candidate genes in disease-relevant tissue(s). Here, we aimed to systematically identify susceptibility genes for coronary artery disease (CAD) by TWAS. We trained prediction models of nine CAD-relevant tissues using EpiXcan based on two genetics-of-gene-expression panels, the Stockholm-Tartu Atherosclerosis Reverse Network Engineering Task (STARNET) and the Genotype-Tissue Expression (GTEx). Based on these prediction models, we imputed gene expression of respective tissues from individual-level genotype data on 37,997 CAD cases and 42,854 controls for the subsequent gene-trait association analysis. Transcriptome-wide significant association (i.e. P < 3.85e-6) was observed for 114 genes. Of these, 96 resided within previously identified GWAS risk loci and 18 were novel. Stepwise analyses were performed to study their plausibility, biological function, and pathogenicity in CAD, including analyses for colocalization, damaging mutations, pathway enrichment, phenome-wide associations with human data and expression-traits correlations using mouse data. Finally, CRISPR/Cas9-based gene knockdown of two newly identified TWAS genes, RGS19 and KPTN, in a human hepatocyte cell line resulted in reduced secretion of APOB100 and lipids in the cell culture medium. Our CAD TWAS work (i) prioritized candidate causal genes at known GWAS loci, (ii) identified 18 novel genes to be associated with CAD, and (iii) suggested potential tissues and pathways of action for these TWAS CAD genes.
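The significance threshold can be related to a number of tests with a short calculation. The abstract does not say how P < 3.85e-6 was derived; the sketch below assumes, purely for illustration, a Bonferroni correction at a family-wise alpha of 0.05, and both function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def implied_number_of_tests(alpha, threshold):
    """Number of tests that would yield the given Bonferroni threshold."""
    return round(alpha / threshold)


if __name__ == "__main__":
    # Assumption: alpha = 0.05; the reported threshold is 3.85e-6.
    n_tests = implied_number_of_tests(0.05, 3.85e-6)
    print(n_tests)
    print(bonferroni_threshold(0.05, n_tests))
```

Under that assumption the threshold would correspond to roughly thirteen thousand tested genes.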
|
25
|
PO-0968 ongoing head and neck contour peer review improves quality of radiotherapy targets. Radiother Oncol 2021. [DOI: 10.1016/s0167-8140(21)07419-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
26
|
Predicting mean ribosome load for 5'UTR of any length using deep learning. PLoS Comput Biol 2021; 17:e1008982. [PMID: 33970899 PMCID: PMC8136849 DOI: 10.1371/journal.pcbi.1008982] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 07/14/2020] [Revised: 05/20/2021] [Accepted: 04/19/2021] [Indexed: 01/07/2023] Open
Abstract
The 5’ untranslated region plays a key role in regulating mRNA translation and consequently protein abundance. Therefore, accurate modeling of 5’UTR regulatory sequences shall provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict mean ribosome load (MRL)—a proxy for translation rate—directly from 5’UTR sequence with a high degree of accuracy. However, this model is restricted to sequence lengths investigated in the reporter assay and therefore cannot be applied to the majority of human sequences without a substantial loss of information. Here, we introduced frame pooling, a novel neural network operation that enabled the development of an MRL prediction model for 5’UTRs of any length. Our model shows state-of-the-art performance on fixed length randomized sequences, while offering better generalization performance on longer sequences and on a variety of translation-related genome-wide datasets. Variant interpretation is demonstrated on a 5’UTR variant of the gene HBB associated with beta-thalassemia. Frame pooling could find applications in other bioinformatics predictive tasks. Moreover, our model, released open source, could help pinpoint pathogenic genetic variants. The human genome carries a complex code. It consists of genes, which provide blueprints to assemble proteins, and regulatory elements, which control when, where, and how often particular genes are transcribed and translated into protein. To read the genome correctly and specifically to find the causes of inherited diseases, we need to be able to find and interpret these regulatory elements. Here, we focus on particular regions of the genome, the so-called 5’ untranslated regions, which play an important role in determining how often a transcribed gene is translated into protein. 
We develop deep learning models which can quantitatively interpret regulatory elements in human 5’ untranslated regions and use this information to predict a proxy of the translation efficiency. Our model generalizes a previous model to 5’ untranslated regions of any length, just as they are encountered in natural human genes. Because this model requires only the sequence as input, it can give estimates for the impact of mutations in the sequence, even if these particular mutations are very rare or entirely novel. Such estimates could help pinpoint mutations that disrupt the normal functioning of gene regulation, which could be used to better diagnose patients suffering from rare genetic disorders.
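A length-independent pooling step of the kind described above can be sketched in plain Python. This is an illustration of the general idea rather than the published layer: it assumes that per-position feature vectors (for example, convolution outputs) are grouped by reading frame relative to the 3' end of the 5'UTR, and that each frame is summarized by its maximum and its mean. The function name and the zero fill for empty frames are assumptions.

```python
def frame_pool(features):
    """Pool per-position features into a fixed-size vector, frame by frame.

    features: list of per-position feature vectors (all of equal length),
        ordered 5' to 3', with the last position adjacent to the start codon.
    Returns a flat list of length 3 * 2 * n_channels: for each of the three
    frames and each channel, the maximum followed by the mean.
    """
    if not features:
        raise ValueError("features must contain at least one position")
    length = len(features)
    n_channels = len(features[0])
    pooled = []
    for frame in range(3):
        # Positions whose distance to the 3' end falls in this frame.
        rows = [features[i] for i in range(length) if (length - i) % 3 == frame]
        for ch in range(n_channels):
            values = [row[ch] for row in rows]
            # Frames with no positions (very short inputs) are filled with 0.
            pooled.append(max(values) if values else 0.0)
            pooled.append(sum(values) / len(values) if values else 0.0)
    return pooled


if __name__ == "__main__":
    short = [[1.0], [2.0], [3.0], [4.0]]
    long = [[float(i)] for i in range(50)]
    # Both inputs yield vectors of the same size, whatever their length.
    print(len(frame_pool(short)), len(frame_pool(long)))
```

Because the output size depends only on the number of channels, downstream dense layers can accept 5'UTRs of any length, while the per-frame grouping keeps information about whether a signal is in frame with the coding sequence.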
|
27
|
MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome Biol 2021; 22:94. [PMID: 33789710 PMCID: PMC8011109 DOI: 10.1186/s13059-021-02273-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 06/06/2020] [Accepted: 01/14/2021] [Indexed: 12/20/2022] Open
Abstract
We develop the free and open-source model Multi-tissue Splicing (MTSplice) to predict the effects of genetic variants on splicing of cassette exons in 56 human tissues. MTSplice combines MMSplice, which models constitutive regulatory sequences, with a new neural network that models tissue-specific regulatory sequences. MTSplice outperforms MMSplice on predicting tissue-specific variations associated with genetic variants in most tissues of the GTEx dataset, with largest improvements on brain tissues. Furthermore, MTSplice predicts that autism-associated de novo mutations are enriched for variants affecting splicing specifically in the brain. We foresee that MTSplice will aid interpreting variants associated with tissue-specific disorders.
|
28
|
Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 2021; 53:354-366. [PMID: 33603233 PMCID: PMC8812996 DOI: 10.1038/s41588-021-00782-6] [Citation(s) in RCA: 203] [Impact Index Per Article: 67.7] [Received: 07/19/2020] [Accepted: 01/07/2021] [Indexed: 01/30/2023]
Abstract
The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.
|
29
|
Transcriptome-directed analysis for Mendelian disease diagnosis overcomes limitations of conventional genomic testing. J Clin Invest 2021; 131:141500. [PMID: 33001864 PMCID: PMC7773386 DOI: 10.1172/jci141500] [Citation(s) in RCA: 73] [Impact Index Per Article: 24.3] [Received: 06/22/2020] [Accepted: 09/24/2020] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND Transcriptome sequencing (RNA-seq) improves diagnostic rates in individuals with suspected Mendelian conditions to varying degrees, primarily by directing the prioritization of candidate DNA variants identified on exome or genome sequencing (ES/GS). Here we implemented an RNA-seq-guided method to diagnose individuals across a wide range of ages and clinical phenotypes. METHODS One hundred fifteen undiagnosed adult and pediatric patients with diverse phenotypes and 67 family members (182 total individuals) underwent RNA-seq from whole blood and skin fibroblasts at the Baylor College of Medicine (BCM) Undiagnosed Diseases Network clinical site from 2014 to 2020. We implemented a workflow to detect outliers in gene expression and splicing for cases that remained undiagnosed despite standard genomic and transcriptomic analysis. RESULTS The transcriptome-directed approach resulted in a diagnostic rate of 12% across the entire cohort, or 17% after excluding cases solved on ES/GS alone. Newly diagnosed conditions included Koolen-de Vries syndrome (KANSL1), Renpenning syndrome (PQBP1), TBCK-associated encephalopathy, NSD2- and CLTC-related intellectual disability, and others, all with negative conventional genomic testing, including ES and chromosomal microarray (CMA). Skin fibroblasts exhibited higher and more consistent expression of clinically relevant genes than whole blood. In solved cases with RNA-seq from both tissues, the causative defect was missed in blood in half the cases but none from fibroblasts. CONCLUSIONS For our cohort of undiagnosed individuals with suspected Mendelian conditions, transcriptome-directed genomic analysis facilitated diagnoses, primarily through the identification of variants missed on ES and CMA. TRIAL REGISTRATION Not applicable. FUNDING NIH Common Fund, BCM Intellectual and Developmental Disabilities Research Center, Eunice Kennedy Shriver National Institute of Child Health & Human Development.
|
30
|
Quantification of Proteins and Histone Marks in Drosophila Embryos Reveals Stoichiometric Relationships Impacting Chromatin Regulation. Dev Cell 2019; 51:632-644.e6. [PMID: 31630981 DOI: 10.1016/j.devcel.2019.09.011] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 03/05/2019] [Revised: 06/09/2019] [Accepted: 09/16/2019] [Indexed: 10/25/2022]
Abstract
Gene transcription in eukaryotes is regulated through dynamic interactions of a variety of different proteins with DNA in the context of chromatin. Here, we used mass spectrometry for absolute quantification of the nuclear proteome and methyl marks on selected lysine residues in histone H3 during two stages of Drosophila embryogenesis. These analyses provide comprehensive information about the absolute copy number of several thousand proteins and reveal unexpected relationships between the abundance of histone-modifying and -binding proteins and the chromatin landscape that they generate and interact with. For some histone modifications, the levels in Drosophila embryos are substantially different from those previously reported in tissue culture cells. Genome-wide profiling of H3K27 methylation during developmental progression and in animals with reduced PRC2 levels illustrates how mass spectrometry can be used for quantitatively describing and comparing chromatin states. Together, these data provide a foundation toward a quantitative understanding of gene regulation in Drosophila.
|
31
|
Clinical Outcomes of Single or Multi-Fractionated, Single-Isocenter, Multi-Arc Volumetric Modulated Radiotherapy (VMAT) for Stereotactic Radiosurgery (SRS) for Palliation of Multiple Brain Metastases. Int J Radiat Oncol Biol Phys 2019. [DOI: 10.1016/j.ijrobp.2019.06.1313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
32
|
Assessing predictions of the impact of variants on splicing in CAGI5. Hum Mutat 2019; 40:1215-1224. [PMID: 31301154 DOI: 10.1002/humu.23869] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 01/28/2019] [Revised: 06/20/2019] [Accepted: 07/10/2019] [Indexed: 12/28/2022]
Abstract
Precision medicine and sequence-based clinical diagnostics seek to predict disease risk or to identify causative variants from sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype-phenotype prediction challenges; participants build models, undergo assessment, and share key findings. In the past, few CAGI challenges have addressed the impact of sequence variants on splicing. In CAGI5, two challenges (Vex-seq and MaPSy) involved prediction of the effect of variants, primarily single-nucleotide changes, on splicing. Although there are significant differences between these two challenges, both involved prediction of results from high-throughput exon inclusion assays. Here, we discuss the methods used to predict the impact of these variants on splicing, their performance, strengths, and weaknesses, and prospects for predicting the impact of sequence variation on splicing and disease phenotypes.
|
33
|
CAGI 5 splicing challenge: Improved exon skipping and intron retention predictions with MMSplice. Hum Mutat 2019; 40:1243-1251. [PMID: 31070280 DOI: 10.1002/humu.23788] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/15/2019] [Revised: 04/17/2019] [Accepted: 05/06/2019] [Indexed: 11/10/2022]
Abstract
Pathogenic genetic variants often primarily affect splicing. However, it remains difficult to quantitatively predict whether and how genetic variants affect splicing. In 2018, the fifth edition of the Critical Assessment of Genome Interpretation proposed two splicing prediction challenges based on experimental perturbation assays: Vex-seq, assessing exon skipping, and MaPSy, assessing splicing efficiency. We developed a modular modeling framework, MMSplice, the performance of which was among the best on both challenges. Here we provide insights into the modeling assumptions of MMSplice and its individual modules. We furthermore illustrate how MMSplice can be applied in practice for individual genome interpretation, using the MMSplice VEP plugin and the Kipoi variant interpretation plugin, which are directly applicable to VCF files.
|
34
|
Abstract
As a data-driven science, genomics largely utilizes machine learning to capture dependencies in data and derive novel biological hypotheses. However, the ability to extract new insights from the exponentially increasing volume of genomics data requires more expressive machine learning models. By effectively leveraging large data sets, deep learning has transformed fields such as computer vision and natural language processing. Now, it is becoming the method of choice for many genomics modelling tasks, including predicting the impact of genetic variation on gene regulatory mechanisms such as DNA accessibility and splicing.
|
35
|
Global donor and acceptor splicing site kinetics in human cells. eLife 2019; 8:45056. [PMID: 31025937 PMCID: PMC6548502 DOI: 10.7554/elife.45056] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 01/20/2019] [Accepted: 04/25/2019] [Indexed: 11/13/2022] Open
Abstract
RNA splicing is an essential part of eukaryotic gene expression. Although the mechanism of splicing has been extensively studied in vitro, in vivo kinetics for the two-step splicing reaction remain poorly understood. Here, we combine transient transcriptome sequencing (TT-seq) and mathematical modeling to quantify RNA metabolic rates at donor and acceptor splice sites across the human genome. Splicing occurs in the range of minutes and is limited by the speed of RNA polymerase elongation. Splicing kinetics strongly depends on the position and nature of nucleotides flanking splice sites, and on structural interactions between unspliced RNA and small nuclear RNAs in spliceosomal intermediates. Finally, we introduce the 'yield' of splicing as the efficiency of converting unspliced to spliced RNA and show that it is highest for mRNAs and independent of splicing kinetics. These results lead to quantitative models describing how splicing rates and yield are encoded in the human genome.
|
36
|
PO-0719 head and neck contour peer review improves quality of radiotherapy targets. Radiother Oncol 2019. [DOI: 10.1016/s0167-8140(19)31139-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
|
37
|
MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biol 2019; 20:48. [PMID: 30823901 PMCID: PMC6396468 DOI: 10.1186/s13059-019-1653-z] [Citation(s) in RCA: 107] [Impact Index Per Article: 21.4] [Received: 10/02/2018] [Accepted: 02/12/2019] [Indexed: 12/15/2022] Open
Abstract
Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge. The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with matched or higher performance than state-of-the-art. Our models, available in the repository Kipoi, apply to variants including indels directly from VCF files.
|
38
|
A deep proteome and transcriptome abundance atlas of 29 healthy human tissues. Mol Syst Biol 2019; 15:e8503. [PMID: 30777892 PMCID: PMC6379049 DOI: 10.15252/msb.20188503] [Citation(s) in RCA: 394] [Impact Index Per Article: 78.8] [Received: 06/20/2018] [Revised: 01/01/2019] [Accepted: 01/08/2019] [Indexed: 11/28/2022] Open
Abstract
Genome-, transcriptome- and proteome-wide measurements provide insights into how biological systems are regulated. However, fundamental aspects relating to which human proteins exist, where they are expressed and in which quantities are not fully understood. Therefore, we generated a quantitative proteome and transcriptome abundance atlas of 29 paired healthy human tissues from the Human Protein Atlas project representing human genes by 18,072 transcripts and 13,640 proteins including 37 without prior protein-level evidence. The analysis revealed that hundreds of proteins, particularly in testis, could not be detected even for highly expressed mRNAs, that few proteins show tissue-specific expression, that strong differences between mRNA and protein quantities within and across tissues exist and that protein expression is often more stable across tissues than that of transcripts. Only 238 of 9,848 amino acid variants found by exome sequencing could be confidently detected at the protein level showing that proteogenomics remains challenging, needs better computational methods and requires rigorous validation. Many uses of this resource can be envisaged including the study of gene/protein expression regulation and biomarker specificity evaluation.
|
39
|
Quantification and discovery of sequence determinants of protein-per-mRNA amount in 29 human tissues. Mol Syst Biol 2019; 15:e8513. [PMID: 30777893 PMCID: PMC6379048 DOI: 10.15252/msb.20188513] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Received: 06/21/2018] [Revised: 01/22/2019] [Accepted: 01/23/2019] [Indexed: 12/15/2022] Open
Abstract
Despite their importance in determining protein abundance, a comprehensive catalogue of sequence features controlling protein-to-mRNA (PTR) ratios and a quantification of their effects are still lacking. Here, we quantified PTR ratios for 11,575 proteins across 29 human tissues using matched transcriptomes and proteomes. We estimated by regression the contribution of known sequence determinants of protein synthesis and degradation in addition to 45 mRNA and 3 protein sequence motifs that we found by association testing. While PTR ratios span more than 2 orders of magnitude, our integrative model predicts PTR ratios at a median precision of 3.2-fold. A reporter assay provided functional support for two novel UTR motifs, and an immobilized mRNA affinity competition-binding assay identified motif-specific bound proteins for one motif. Moreover, our integrative model led to a new metric of codon optimality that captures the effects of codon frequency on protein synthesis and degradation. Altogether, this study shows that a large fraction of PTR ratio variation in human tissues can be predicted from sequence, and it identifies many new candidate post-transcriptional regulatory elements.
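To make a statement such as "a median precision of 3.2-fold" concrete, the sketch below computes a median fold error between observed and predicted ratios. Whether the study defines its precision in exactly this way is an assumption; the function name and the toy values are hypothetical.

```python
import math
import statistics


def median_fold_error(observed, predicted):
    """Median of the fold differences between paired positive values.

    A fold error of 2 means the prediction is off by a factor of two,
    regardless of whether it is too high or too low.
    """
    fold_errors = [
        10 ** abs(math.log10(pred) - math.log10(obs))
        for obs, pred in zip(observed, predicted)
    ]
    return statistics.median(fold_errors)


if __name__ == "__main__":
    observed_ptr = [10.0, 100.0, 1000.0]   # toy protein-to-mRNA ratios
    predicted_ptr = [20.0, 100.0, 250.0]   # toy model predictions
    print(median_fold_error(observed_ptr, predicted_ptr))
```

Working on the log scale treats over- and under-prediction symmetrically, which suits quantities that span several orders of magnitude.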
|
40
|
OUTRIDER: A Statistical Method for Detecting Aberrantly Expressed Genes in RNA Sequencing Data. Am J Hum Genet 2018; 103:907-917. [PMID: 30503520 PMCID: PMC6288422 DOI: 10.1016/j.ajhg.2018.10.025] [Citation(s) in RCA: 76] [Impact Index Per Article: 12.7] [Received: 05/24/2018] [Accepted: 10/25/2018] [Indexed: 11/16/2022] Open
Abstract
RNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. A powerful approach is to identify aberrant gene expression levels as potential pathogenic events. However, existing methods for detecting aberrant read counts in RNA-seq data either lack assessments of statistical significance, so that establishing cutoffs is arbitrary, or rely on subjective manual corrections for confounders. Here, we describe OUTRIDER (Outlier in RNA-Seq Finder), an algorithm developed to address these issues. The algorithm uses an autoencoder to model read-count expectations according to the gene covariation resulting from technical, environmental, or common genetic variations. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. The model is automatically fitted to achieve the best recall of artificially corrupted data. Precision-recall analyses using simulated outlier read counts demonstrated the importance of controlling for covariation and significance-based thresholds. OUTRIDER is open source and includes functions for filtering out genes not expressed in a dataset, for identifying outlier samples with too many aberrantly expressed genes, and for detecting aberrant gene expression on the basis of false-discovery-rate-adjusted p values. Overall, OUTRIDER provides an end-to-end solution for identifying aberrantly expressed genes and is suitable for use by rare-disease diagnostic platforms.
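The statistical core described above (a negative binomial count model followed by false-discovery-rate adjustment) can be sketched as follows. This is not the OUTRIDER implementation: the expected counts, which OUTRIDER derives from an autoencoder, are simply given as inputs here, and the function names, the two-sided p-value definition, and the toy numbers are assumptions made for illustration.

```python
import math

def nb_pmf(k, mu, theta):
    """Negative binomial pmf with mean mu and dispersion (size) theta."""
    log_p = (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
             + theta * math.log(theta / (theta + mu))
             + k * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def two_sided_pvalue(k, mu, theta):
    """Two-sided p-value of an observed count k under NB(mu, theta)."""
    lower = sum(nb_pmf(i, mu, theta) for i in range(k + 1))  # P(X <= k)
    upper = 1.0 - lower + nb_pmf(k, mu, theta)               # P(X >= k)
    return min(1.0, 2.0 * min(lower, upper))

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = n - rank_from_top  # 1-based rank in ascending p-value order
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Toy data (hypothetical): observed counts of four genes in one sample,
# with an expected count of 100 and a dispersion of 20 for every gene.
counts = [100, 95, 2, 110]
pvals = [two_sided_pvalue(k, mu=100.0, theta=20.0) for k in counts]
padj = benjamini_hochberg(pvals)
outliers = [p < 0.05 for p in padj]
print(outliers)
```

Only the third gene, whose count is far below its expectation, remains significant after adjustment, which illustrates why a significance-based threshold is preferable to an arbitrary cutoff on fold change.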
|
41
|
OCR-Stats: Robust estimation and statistical testing of mitochondrial respiration activities using Seahorse XF Analyzer. PLoS One 2018; 13:e0199938. [PMID: 29995917 PMCID: PMC6040740 DOI: 10.1371/journal.pone.0199938] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Received: 03/09/2018] [Accepted: 06/16/2018] [Indexed: 12/02/2022] Open
Abstract
The accurate quantification of cellular and mitochondrial bioenergetic activity is of great interest in medicine and biology. Mitochondrial stress tests performed with Seahorse Bioscience XF Analyzers allow the estimation of different bioenergetic measures by monitoring the oxygen consumption rates (OCR) of living cells in multi-well plates. However, studies of the statistical best practices for determining aggregated OCR measurements and comparisons have been lacking. Therefore, to understand how OCR behaves across different biological samples, wells, and plates, we performed mitochondrial stress tests in 126 96-well plates involving 203 fibroblast cell lines. We show that the noise of OCR is multiplicative, that outlier data points can concern individual measurements or all measurements of a well, and that the inter-plate variation is greater than the intra-plate variation. Based on these insights, we developed a novel statistical method, OCR-Stats, that: i) robustly estimates OCR levels by modeling multiplicative noise and automatically identifying outlier data points and outlier wells; and ii) performs statistical testing between samples, taking into account the different magnitudes of the between- and within-plate variations. This led to a significant reduction of the coefficient of variation across plates of basal respiration by 45% and of maximal respiration by 29%. Moreover, using positive and negative controls, we show that our statistical test outperforms the existing methods, which suffer from an excess of either false positives (within-plate methods) or false negatives (between-plate methods). Altogether, this study provides good statistical practices to support experimentalists in designing, analyzing, testing, and reporting the results of mitochondrial stress tests using this high-throughput platform.
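Multiplicative noise becomes additive after a log transformation, which suggests estimating OCR levels on the log scale. The sketch below illustrates that idea with a median-based outlier rule. It is not the OCR-Stats algorithm: the function name, the median-absolute-deviation threshold, and the toy readings are assumptions chosen for illustration.

```python
import math
from statistics import median

def robust_log_level(ocr_values, n_mads=5.0):
    """Estimate a typical OCR level on the log scale and flag outlier readings.

    Readings whose log value lies more than n_mads median absolute deviations
    from the median log value are flagged and excluded from the estimate.
    """
    logs = [math.log(v) for v in ocr_values]
    center = median(logs)
    mad = median(abs(x - center) for x in logs)
    flags = [abs(x - center) > n_mads * mad for x in logs]
    kept = [x for x, is_outlier in zip(logs, flags) if not is_outlier]
    # Back-transform the mean of the logs, i.e. report a geometric mean.
    return math.exp(sum(kept) / len(kept)), flags

# Toy OCR readings (hypothetical) from one well, with one aberrant measurement.
readings = [100.0, 110.0, 95.0, 105.0, 400.0]
level, flags = robust_log_level(readings)
print(round(level, 1), flags)
```

Averaging on the log scale and back-transforming yields a geometric mean, which is the natural summary when errors scale with the signal.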
|
42
|
GenoGAM 2.0: scalable and efficient implementation of genome-wide generalized additive models for gigabase-scale genomes. BMC Bioinformatics 2018; 19:247. [PMID: 29945559 PMCID: PMC6020310 DOI: 10.1186/s12859-018-2238-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 02/09/2018] [Accepted: 06/12/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND GenoGAM (Genome-wide generalized additive models) is a powerful statistical modeling tool for the analysis of ChIP-Seq data with flexible factorial design experiments. However, the large runtime and memory requirements of its current implementation prohibit its application to gigabase-scale genomes such as mammalian genomes. RESULTS Here we present GenoGAM 2.0, a scalable and efficient implementation that is 2 to 3 orders of magnitude faster than the previous version. This is achieved by exploiting the sparsity of the model using the SuperLU direct solver for parameter fitting, and sparse Cholesky factorization together with the sparse inverse subset algorithm for computing standard errors. Furthermore, the HDF5 library is employed to store data efficiently on the hard drive, reducing the memory footprint while keeping I/O low. Whole-genome fits for human ChIP-seq datasets (ca. 300 million parameters) could be obtained in less than 9 hours on a standard 60-core server. GenoGAM 2.0 is implemented as an open source R package and is currently available on GitHub. A Bioconductor release of the new version is in preparation. CONCLUSIONS We have vastly improved the performance of the GenoGAM framework, opening up its application to all types of organisms. Moreover, our algorithmic improvements for fitting large GAMs could be of interest to the statistical community beyond the genomics field.
|
43
|
Modeling positional effects of regulatory sequences with spline transformations increases prediction accuracy of deep neural networks. Bioinformatics 2018; 34:1261-1269. [PMID: 29155928 PMCID: PMC5905632 DOI: 10.1093/bioinformatics/btx727] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 07/18/2017] [Revised: 10/16/2017] [Accepted: 11/15/2017] [Indexed: 12/01/2022] Open
Abstract
Motivation Regulatory sequences are not solely defined by their nucleic acid sequence but also by their relative distances to genomic landmarks such as transcription start site, exon boundaries or polyadenylation site. Deep learning has become the approach of choice for modeling regulatory sequences because of its strength to learn complex sequence features. However, modeling relative distances to genomic landmarks in deep neural networks has not been addressed. Results Here we developed spline transformation, a neural network module based on splines to flexibly and robustly model distances. Modeling distances to various genomic landmarks with spline transformations significantly increased state-of-the-art prediction accuracy of in vivo RNA-binding protein binding sites for 120 out of 123 proteins. We also developed a deep neural network for human splice branchpoint based on spline transformations that outperformed the current best, already distance-based, machine learning model. Compared to piecewise linear transformation, as obtained by composition of rectified linear units, spline transformation yields higher prediction accuracy as well as faster and more robust training. As spline transformation can be applied to further quantities beyond distances, such as methylation or conservation, we foresee it as a versatile component in the genomics deep learning toolbox. Availability and implementation Spline transformation is implemented as a Keras layer in the CONCISE python package: https://github.com/gagneurlab/concise. Analysis code is available at https://github.com/gagneurlab/Manuscript_Avsec_Bioinformatics_2017. Contact avsec@in.tum.de or gagneur@in.tum.de. Supplementary information Supplementary data are available at Bioinformatics online.
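The idea of modeling a distance through a spline can be sketched as a B-spline basis expansion followed by a weighted sum, where the weights would be the trainable parameters. The abstract states that the actual module is a Keras layer in the CONCISE package; the plain-Python version below is only an illustration, and the function names, knot placement, and weights are hypothetical.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x with the Cox-de Boor recursion."""
    # Degree 0: indicator of the half-open knot interval that contains x.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        new_basis = []
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            new_basis.append(left + right)
        basis = new_basis
    return basis

def spline_transform(x, knots, degree, weights):
    """Smooth function of x: weighted sum of B-spline basis functions."""
    return sum(w * b for w, b in zip(weights, bspline_basis(x, knots, degree)))

# Uniform knots (hypothetical) giving 7 cubic basis functions that sum to 1 on [0, 4).
knots = [float(k) for k in range(-3, 8)]
weights = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0]  # stand-ins for learned weights

print(spline_transform(1.5, knots, 3, weights))
```

Because each basis function is nonzero only over a few knot intervals, changing one weight alters the curve only locally, which is one reason spline parameterizations tend to train stably.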
|
44
|
GenoGAM: genome-wide generalized additive models for ChIP-Seq analysis. Bioinformatics 2017; 33:2258-2265. [PMID: 28369277 DOI: 10.1093/bioinformatics/btx150] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 04/22/2016] [Accepted: 03/20/2017] [Indexed: 11/14/2022] Open
Abstract
Motivation Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) is a widely used approach to study protein-DNA interactions. Often, the quantities of interest are the differential occupancies relative to controls, between genetic backgrounds, treatments, or combinations thereof. However, current methods for differential occupancy of ChIP-Seq data rely on binning or sliding window techniques, for which the choice of window and bin sizes is subjective. Results Here, we present GenoGAM (Genome-wide Generalized Additive Model), which brings the well-established and flexible generalized additive models framework to genomic applications using a data parallelism strategy. We model ChIP-Seq read count frequencies as products of smooth functions along chromosomes. Smoothing parameters are objectively estimated from the data by cross-validation, eliminating the ad hoc binning and windowing needed by current approaches. GenoGAM provides base-level and region-level significance testing for full factorial designs. Application to a ChIP-Seq dataset in yeast showed increased sensitivity over existing differential occupancy methods while controlling for type I error rate. By analyzing a set of DNA methylation data and illustrating an extension to a peak caller, we further demonstrate the potential of GenoGAM as a generic statistical modeling tool for genome-wide assays. Availability and Implementation Software is available from Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/GenoGAM.html. Contact gagneur@in.tum.de. Supplementary information Supplementary information is available at Bioinformatics online.
|
45
|
Inhibition of oxidative stress in cholinergic projection neurons fully rescues aging-associated olfactory circuit degeneration in Drosophila. eLife 2018; 7:32018. [PMID: 29345616 PMCID: PMC5790380 DOI: 10.7554/elife.32018] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 09/14/2017] [Accepted: 01/16/2018] [Indexed: 12/03/2022] Open
Abstract
Loss of the sense of smell is among the first signs of natural aging and neurodegenerative diseases such as Alzheimer’s and Parkinson’s. Cellular and molecular mechanisms promoting this smell loss are not understood. Here, we show that Drosophila melanogaster also loses olfaction before vision with age. Within the olfactory circuit, cholinergic projection neurons show a reduced odor response accompanied by a defect in axonal integrity and a reduction in synaptic marker proteins. Using behavioral functional screening, we pinpoint that expression of the mitochondrial reactive oxygen scavenger SOD2 in cholinergic projection neurons is necessary and sufficient to prevent smell degeneration in aging flies. Together, our data suggest that oxidative stress-induced axonal degeneration in a single class of neurons drives the functional decline of an entire neural network and the behavior it controls. Given the important role of the cholinergic system in neurodegeneration, the fly olfactory system could be a useful model for the identification of drug targets.
|
46
|
Cis-regulatory elements explain most of the mRNA stability variation across genes in yeast. RNA (NEW YORK, N.Y.) 2017; 23:1648-1659. [PMID: 28802259 PMCID: PMC5648033 DOI: 10.1261/rna.062224.117] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 05/24/2017] [Accepted: 07/31/2017] [Indexed: 05/09/2023]
Abstract
The stability of mRNA is one of the major determinants of gene expression. Although a wealth of sequence elements regulating mRNA stability has been described, their quantitative contributions to half-life are unknown. Here, we built a quantitative model for Saccharomyces cerevisiae based on functional mRNA sequence features that explains 59% of the half-life variation between genes and predicts half-life at a median relative error of 30%. The model revealed a new destabilizing 3' UTR motif, ATATTC, which we functionally validated. Codon usage proves to be the major determinant of mRNA stability. Nonetheless, single-nucleotide variations have the largest effect when occurring on 3' UTR motifs or upstream AUGs. Analyzing mRNA half-life data of 34 knockout strains showed that the effect of codon usage requires not only functional decapping and deadenylation but also the 5'-to-3' exonuclease Xrn1 and the nonsense-mediated decay genes, whereas it does not depend on no-go decay. Altogether, this study quantitatively delineates the contributions of mRNA sequence features to stability in yeast, reveals their functional dependencies on degradation pathways, and allows accurate prediction of half-life from mRNA sequence.
|
47
|
Caenorhabditis elegans CES-1 Snail Represses pig-1 MELK Expression To Control Asymmetric Cell Division. Genetics 2017; 206:2069-2084. [PMID: 28652378 PMCID: PMC5560807 DOI: 10.1534/genetics.117.202754] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 04/10/2017] [Accepted: 06/16/2017] [Indexed: 02/07/2023] Open
Abstract
Snail-like transcription factors affect stem cell function through mechanisms that are incompletely understood. In the Caenorhabditis elegans neurosecretory motor neuron (NSM) neuroblast lineage, CES-1 Snail coordinates cell cycle progression and cell polarity to ensure the asymmetric division of the NSM neuroblast and the generation of two daughter cells of different sizes and fates. We have previously shown that CES-1 Snail controls cell cycle progression by repressing the expression of cdc-25.2 CDC25. However, the mechanism through which CES-1 Snail affects cell polarity has been elusive. Here, we systematically searched for direct targets of CES-1 Snail by genome-wide profiling of CES-1 Snail binding sites and identified >3000 potential CES-1 Snail target genes, including pig-1, the ortholog of the oncogene maternal embryonic leucine zipper kinase (MELK). Furthermore, we show that CES-1 Snail represses pig-1 MELK transcription in the NSM neuroblast lineage and that pig-1 MELK acts downstream of ces-1 Snail to cause the NSM neuroblast to divide asymmetrically by size and along the correct cell division axis. Based on our results we propose that by regulating the expression of the MELK gene, Snail-like transcription factors affect the ability of stem cells to divide asymmetrically and, hence, to self-renew. Furthermore, we speculate that the deregulation of MELK contributes to tumorigenesis by causing cells that normally divide asymmetrically to divide symmetrically instead.
|
48
|
Abstract
Across a variety of Mendelian disorders, ∼50-75% of patients do not receive a genetic diagnosis by exome sequencing, indicating disease-causing variants in non-coding regions. Although genome sequencing in principle reveals all genetic variants, their sizeable number and poorer annotation make prioritization challenging. Here, we demonstrate the power of transcriptome sequencing to molecularly diagnose 10% (5 of 48) of mitochondriopathy patients and identify candidate genes for the remainder. We find a median of one aberrantly expressed gene, five aberrant splicing events, and six mono-allelically expressed rare variants in patient-derived fibroblasts and establish disease-causing roles for each kind. Private exons often arise from cryptic splice sites, providing an important clue for variant prioritization. One such event is found in the complex I assembly factor TIMMDC1, establishing a novel disease-associated gene. In conclusion, our study expands the diagnostic tools for detecting non-exonic variants and provides examples of intronic loss-of-function variants with pathological relevance.
|
49
|
Chromatin-remodeling factor SMARCD2 regulates transcriptional networks controlling differentiation of neutrophil granulocytes. Nat Genet 2017; 49:742-752. [PMID: 28369036 DOI: 10.1038/ng.3833] [Citation(s) in RCA: 68] [Impact Index Per Article: 9.7] [Received: 08/16/2016] [Accepted: 03/10/2017] [Indexed: 02/06/2023]
Abstract
We identify SMARCD2 (SWI/SNF-related, matrix-associated, actin-dependent regulator of chromatin, subfamily D, member 2), also known as BAF60b (BRG1/Brahma-associated factor 60b), as a critical regulator of myeloid differentiation in humans, mice, and zebrafish. Studying patients from three unrelated pedigrees characterized by neutropenia, specific granule deficiency, myelodysplasia with excess of blast cells, and various developmental aberrations, we identified three homozygous loss-of-function mutations in SMARCD2. Using mice and zebrafish as model systems, we showed that SMARCD2 controls early steps in the differentiation of myeloid-erythroid progenitor cells. In vitro, SMARCD2 interacts with the transcription factor CEBPɛ and controls expression of neutrophil proteins stored in specific granules. Defective expression of SMARCD2 leads to transcriptional and chromatin changes in acute myeloid leukemia (AML) human promyelocytic cells. In summary, SMARCD2 is a key factor controlling myelopoiesis and is a potential tumor suppressor in leukemia.
|
50
|
Abstract
To monitor transcriptional regulation in human cells, rapid changes in enhancer and promoter activity must be captured with high sensitivity and temporal resolution. Here, we show that the recently established protocol TT-seq ("transient transcriptome sequencing") can monitor rapid changes in transcription from enhancers and promoters during the immediate response of T cells to ionomycin and phorbol 12-myristate 13-acetate (PMA). TT-seq maps eRNAs and mRNAs every 5 min after T-cell stimulation with high sensitivity and identifies many new primary response genes. TT-seq reveals that the synthesis of 1,601 eRNAs and 650 mRNAs changes significantly within only 15 min after stimulation, when standard RNA-seq does not detect differentially expressed genes. Transcription of enhancers that are primed for activation by nucleosome depletion can occur immediately and simultaneously with transcription of target gene promoters. Our results indicate that enhancer transcription is a good proxy for enhancer regulatory activity in target gene activation, and establish TT-seq as a tool for monitoring the dynamics of enhancer landscapes and transcription programs during cellular responses and differentiation.
|