1
Bakhtiar D, Vondraskova K, Pengelly RJ, Chivers M, Kralovicova J, Vorechovsky I. Exonic splicing code and coordination of divalent metals in proteins. Nucleic Acids Res 2024; 52:1090-1106. [PMID: 38055834 PMCID: PMC10853796 DOI: 10.1093/nar/gkad1161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 11/15/2023] [Accepted: 11/17/2023] [Indexed: 12/08/2023] Open
Abstract
Exonic sequences contain both protein-coding and RNA splicing information but the interplay of the protein and splicing code is complex and poorly understood. Here, we have studied traditional and auxiliary splicing codes of human exons that encode residues coordinating two essential divalent metals at the opposite ends of the Irving-Williams series, a universal order of relative stabilities of metal-organic complexes. We show that exons encoding Zn2+-coordinating amino acids are supported much less by the auxiliary splicing motifs than exons coordinating Ca2+. The handicap of the former is compensated by stronger splice sites and uridine-richer polypyrimidine tracts, except for position -3 relative to 3' splice junctions. However, both Ca2+ and Zn2+ exons exhibit close-to-constitutive splicing in multiple tissues, consistent with their critical importance for metalloprotein function and a relatively small fraction of expendable, alternatively spliced exons. These results indicate that constraints imposed by metal coordination spheres on RNA splicing have been efficiently overcome by the plasticity of exon-intron architecture to ensure adequate metalloprotein expression.
Affiliation(s)
- Dara Bakhtiar
  - University of Southampton, Faculty of Medicine, Southampton SO16 6YD, UK
- Katarina Vondraskova
  - Slovak Academy of Sciences, Centre of Biosciences, 840 05 Bratislava, Slovak Republic
- Reuben J Pengelly
  - University of Southampton, Faculty of Medicine, Southampton SO16 6YD, UK
- Martin Chivers
  - University of Southampton, Faculty of Medicine, Southampton SO16 6YD, UK
- Jana Kralovicova
  - University of Southampton, Faculty of Medicine, Southampton SO16 6YD, UK
  - Slovak Academy of Sciences, Centre of Biosciences, 840 05 Bratislava, Slovak Republic
- Igor Vorechovsky
  - University of Southampton, Faculty of Medicine, Southampton SO16 6YD, UK
2
Ferese R, Scala S, Suppa A, Campopiano R, Asci F, Zampogna A, Chiaravalloti MA, Griguoli A, Storto M, Pardo AD, Giardina E, Zampatti S, Fornai F, Novelli G, Fanelli M, Zecca C, Logroscino G, Centonze D, Gambardella S. Cohort analysis of novel SPAST variants in SPG4 patients and implementation of in vitro and in vivo studies to identify the pathogenic mechanism caused by splicing mutations. Front Neurol 2023; 14:1296924. [PMID: 38145127 PMCID: PMC10748595 DOI: 10.3389/fneur.2023.1296924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Accepted: 11/14/2023] [Indexed: 12/26/2023] Open
Abstract
Introduction: Pure hereditary spastic paraplegia (SPG) type 4 (SPG4) is caused by mutations of the SPAST gene. This study aimed to analyze SPAST variants in SPG4 patients to highlight the occurrence of splicing mutations and combine functional studies to assess the relevance of these variants in the molecular mechanisms of the disease. Methods: We performed an NGS panel in 105 patients, in silico analysis for splicing mutations, and in vitro minigene assay. Results and discussion: The NGS panel was applied to screen 105 patients carrying a clinical phenotype corresponding to upper motor neuron syndrome (UMNS), selectively affecting motor control of lower limbs. Pathogenic mutations in SPAST were identified in 12 patients (11.42%): 5 missense, 3 frameshift, and 4 splicing variants. Then, we focused on the patients carrying splicing variants using a combined approach of in silico and in vitro analysis through minigene assay and RNA, if available. For two splicing variants (i.e., c.1245+1G>A and c.1414-2A>T), functional assays confirmed the types of molecular alterations suggested by the in silico analysis (loss of exon 9 and exon 12). In contrast, the effect of the splicing variant c.1005-1delG differed from what was predicted (skipping of exon 7), and the functional study indicated a loss of frame and formation of a premature stop codon. The present study evidenced the high occurrence of splice variants in SPG4 patients and indicated the relevance of functional assays added to in silico analysis to decipher the pathogenic mechanism.
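The three splicing variants named in this abstract are written in HGVS coding-DNA notation, where a signed offset after the exon-boundary position places the change in the adjacent intron: `+1`/`+2` fall in the canonical donor (5') dinucleotide and `-1`/`-2` in the canonical acceptor (3') dinucleotide. The following is a minimal sketch of that reading, not the authors' in silico pipeline; the function name `classify_splice_site` and the regular expression are illustrative assumptions and cover only this simple intronic form of the notation.

```python
import re

# Illustrative pattern for intronic HGVS c. variants such as c.1245+1G>A
# or c.1005-1delG. It is an assumption for this sketch and does not
# implement the full HGVS grammar.
HGVS_INTRONIC = re.compile(r"^c\.(\d+)([+-])(\d+)(.+)$")


def classify_splice_site(variant):
    """Return which canonical splice-site dinucleotide a variant touches."""
    match = HGVS_INTRONIC.match(variant)
    if match is None:
        return "not an intronic c. variant"
    sign, offset = match.group(2), int(match.group(3))
    if offset > 2:
        return "intronic, outside canonical dinucleotide"
    # '+' offsets count into the intron from the 5' (donor) end,
    # '-' offsets count back from the 3' (acceptor) end.
    return "canonical donor site" if sign == "+" else "canonical acceptor site"


for v in ("c.1245+1G>A", "c.1414-2A>T", "c.1005-1delG"):
    print(v, "->", classify_splice_site(v))
```

Position alone only says which splice site is hit; as the abstract notes for c.1005-1delG, the actual transcript outcome still has to be checked experimentally.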
Affiliation(s)
- Antonio Suppa
  - IRCCS Neuromed, Pozzilli, Italy
  - Department of Human Neurosciences, Sapienza University of Rome, Rome, Italy
- Emiliano Giardina
  - Genomic Medicine Laboratory, IRCCS Fondazione Santa Lucia, Rome, Italy
- Stefania Zampatti
  - Genomic Medicine Laboratory, IRCCS Fondazione Santa Lucia, Rome, Italy
- Francesco Fornai
  - IRCCS Neuromed, Pozzilli, Italy
  - Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, Pisa, Italy
- Giuseppe Novelli
  - IRCCS Neuromed, Pozzilli, Italy
  - Department of Biomedicine and Prevention, University of Rome “Tor Vergata”, Rome, Italy
- Mirco Fanelli
  - Department of Biomolecular Sciences, University of Urbino “Carlo Bo”, Urbino, Italy
- Chiara Zecca
  - Center for Neurodegenerative Diseases and the Aging Brain, Department of Clinical Research in Neurology of the University of Bari “Aldo Moro” at “Pia Fondazione Card G. Panico” Hospital Tricase, Lecce, Italy
- Giancarlo Logroscino
  - Center for Neurodegenerative Diseases and the Aging Brain, Department of Clinical Research in Neurology of the University of Bari “Aldo Moro” at “Pia Fondazione Card G. Panico” Hospital Tricase, Lecce, Italy
- Diego Centonze
  - IRCCS Neuromed, Pozzilli, Italy
  - Department of Systems Medicine, Tor Vergata University, Rome, Italy
- Stefano Gambardella
  - IRCCS Neuromed, Pozzilli, Italy
  - Department of Biomolecular Sciences, University of Urbino “Carlo Bo”, Urbino, Italy
3
Mir BA, Rehman MU, Tayara H, Chong KT. Improving Enhancer Identification with a Multi-Classifier Stacked Ensemble Model. J Mol Biol 2023; 435:168314. [PMID: 37852600 DOI: 10.1016/j.jmb.2023.168314] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/13/2023] [Revised: 10/06/2023] [Accepted: 10/11/2023] [Indexed: 10/20/2023]
Abstract
Enhancers are DNA regions that are responsible for controlling the expression of genes. Enhancers are usually found upstream or downstream of a gene, or even inside a gene's intron region, but are normally located at a distant location from the genes they control. By integrating experimental and computational approaches, it is possible to uncover enhancers within DNA sequences, which possess regulatory properties. Experimental techniques such as ChIP-seq and ATAC-seq can identify genomic regions that are associated with transcription factors or accessible to regulatory proteins. On the other hand, computational techniques can predict enhancers based on sequence features and epigenetic modifications. In our study, we have developed a multi-classifier stacked ensemble (MCSE-enhancer) model that can accurately identify enhancers. We utilized feature descriptors from various physicochemical properties as input for our six baseline classifiers and built a stacked classifier, which outperformed previous enhancer classification techniques in terms of accuracy, specificity, sensitivity, and Matthews correlation coefficient. Our model achieved an accuracy of 81.5%, representing a 2-3% improvement over existing models.
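The four evaluation measures named in this abstract (accuracy, sensitivity, specificity and Matthews correlation coefficient) are all derived from the same binary confusion matrix. The sketch below shows the standard definitions; the function name `binary_metrics` and the example counts are invented for illustration and are not results from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts: 100 enhancers and 100 non-enhancers, 80 of each
# classified correctly.
print(binary_metrics(tp=80, fp=20, tn=80, fn=20))
```

Unlike accuracy, the Matthews coefficient uses all four cells of the matrix, which is why enhancer-prediction papers commonly report it alongside the other three.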
Affiliation(s)
- Bilal Ahmad Mir
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Mobeen Ur Rehman
  - Khalifa University Center for Autonomous Robotic Systems (KUCARS), Khalifa University, Abu Dhabi 127788, United Arab Emirates
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
  - Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
4
Yang Z, Ye Z, Qiu J, Feng R, Li D, Hsieh C, Allcock J, Zhang S. A mutation-induced drug resistance database (MdrDB). Commun Chem 2023; 6:123. [PMID: 37316673 DOI: 10.1038/s42004-023-00920-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 03/12/2023] [Accepted: 06/02/2023] [Indexed: 06/16/2023] Open
Abstract
Mutation-induced drug resistance is a significant challenge to the clinical treatment of many diseases, as structural changes in proteins can diminish drug efficacy. Understanding how mutations affect protein-ligand binding affinities is crucial for developing new drugs and therapies. However, the lack of a large-scale and high-quality database has hindered research progress in this area. To address this issue, we have developed MdrDB, a database that integrates data from seven publicly available datasets, which is the largest database of its kind. By integrating information on drug sensitivity and cell line mutations from Genomics of Drug Sensitivity in Cancer and DepMap, MdrDB has substantially expanded the existing drug resistance data. MdrDB comprises 100,537 samples of 240 proteins (which encompass 5119 total PDB structures), 2503 mutations, and 440 drugs. Each sample brings together 3D structures of wild type and mutant protein-ligand complexes, binding affinity changes upon mutation (ΔΔG), and biochemical features. Experimental results with MdrDB demonstrate its effectiveness in significantly enhancing the performance of commonly used machine learning models when predicting ΔΔG in three standard benchmarking scenarios. In conclusion, MdrDB is a comprehensive database that can advance the understanding of mutation-induced drug resistance, and accelerate the discovery of novel chemicals.
Affiliation(s)
- Ziyi Yang
  - Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
- Zhaofeng Ye
  - Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
- Jiezhong Qiu
  - Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
- Rongjun Feng
  - Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
- Danyu Li
  - Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
- Changyu Hsieh
  - Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
- Shengyu Zhang
  - Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
5
Rogalska ME, Vivori C, Valcárcel J. Regulation of pre-mRNA splicing: roles in physiology and disease, and therapeutic prospects. Nat Rev Genet 2023; 24:251-269. [PMID: 36526860 DOI: 10.1038/s41576-022-00556-8] [Citation(s) in RCA: 109] [Impact Index Per Article: 54.5] [Accepted: 11/10/2022] [Indexed: 12/23/2022]
Abstract
The removal of introns from mRNA precursors and its regulation by alternative splicing are key for eukaryotic gene expression and cellular function, as evidenced by the numerous pathologies induced or modified by splicing alterations. Major recent advances have been made in understanding the structures and functions of the splicing machinery, in the description and classification of physiological and pathological isoforms and in the development of the first therapies for genetic diseases based on modulation of splicing. Here, we review this progress and discuss important remaining challenges, including predicting splice sites from genomic sequences, understanding the variety of molecular mechanisms and logic of splicing regulation, and harnessing this knowledge for probing gene function and disease aetiology and for the design of novel therapeutic approaches.
Affiliation(s)
- Malgorzata Ewa Rogalska
  - Genome Biology Program, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Claudia Vivori
  - Genome Biology Program, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - The Francis Crick Institute, London, UK
- Juan Valcárcel
  - Genome Biology Program, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
6
Li J, Wu Z, Lin W, Luo J, Zhang J, Chen Q, Chen J. iEnhancer-ELM: improve enhancer identification by extracting position-related multiscale contextual information based on enhancer language models. BIOINFORMATICS ADVANCES 2023; 3:vbad043. [PMID: 37113248 PMCID: PMC10125906 DOI: 10.1093/bioadv/vbad043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2022] [Revised: 02/04/2023] [Accepted: 03/24/2023] [Indexed: 04/29/2023]
Abstract
Motivation: Enhancers are important cis-regulatory elements that regulate a wide range of biological functions and enhance the transcription of target genes. Although many feature extraction methods have been proposed to improve the performance of enhancer identification, they cannot learn position-related multiscale contextual information from raw DNA sequences. Results: In this article, we propose a novel enhancer identification method (iEnhancer-ELM) based on BERT-like enhancer language models. iEnhancer-ELM tokenizes DNA sequences with multi-scale k-mers and extracts contextual information of different scale k-mers related to their positions via a multi-head attention mechanism. We first evaluate the performance of different scale k-mers, then ensemble them to improve the performance of enhancer identification. The experimental results on two popular benchmark datasets show that our model outperforms state-of-the-art methods. We further illustrate the interpretability of iEnhancer-ELM. For a case study, we discover 30 enhancer motifs via a 3-mer-based model, where 12 of the motifs are verified by STREME and JASPAR, demonstrating that our model has the potential to unveil the biological mechanism of enhancers. Availability and implementation: The models and associated code are available at https://github.com/chen-bioinfo/iEnhancer-ELM. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
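Tokenizing a DNA sequence into k-mers at several scales, the first step described in this abstract, can be sketched in a few lines. This is an illustration only and is not taken from the iEnhancer-ELM repository: the function names, the use of overlapping windows with a stride of 1, and the particular scales 3 to 6 are all assumptions.

```python
def kmer_tokens(seq, k):
    """Split a DNA sequence into overlapping k-mers (stride 1, an assumption)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def multiscale_tokens(seq, scales=(3, 4, 5, 6)):
    """Tokenize the same sequence at several k-mer scales.

    The default scales are hypothetical; each scale would feed its own
    language model before the per-scale predictions are ensembled.
    """
    return {k: kmer_tokens(seq, k) for k in scales}


example = "ACGTACGA"  # invented sequence
for k, tokens in multiscale_tokens(example).items():
    print(k, tokens)
```

Each position in the token list corresponds to a position in the sequence, which is what lets an attention mechanism relate a k-mer to where it occurs.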
Affiliation(s)
- Wenhao Lin
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Jiawei Luo
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Jun Zhang
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
- Qingcai Chen
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
  - Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
7
Vořechovský I. Selection of Olduvai Domains during Evolution: A Role for Primate-Specific Splicing Super-Enhancer and RNA Guanine Quadruplex in Bipartite NBPF Exons. Brain Sci 2022; 12:874. [PMID: 35884681 PMCID: PMC9313022 DOI: 10.3390/brainsci12070874] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2022] [Revised: 06/23/2022] [Accepted: 06/27/2022] [Indexed: 02/04/2023] Open
Abstract
Olduvai protein domains (also known as DUF1220 or NBPF) have undergone the greatest human-specific increase in the copy number of any coding region in the genome. Their repeat number was strongly associated with the evolutionary expansion of brain volumes, neuron counts and cognitive abilities, as well as with disorders of the autistic spectrum. Nevertheless, the domain function and cellular mechanisms underlying the positive selection of Olduvai DNA sequences in higher primates remain obscure. Here, I show that the inclusion of Olduvai exon doublets in mature transcripts is facilitated by a potent splicing enhancer that was created through duplication within the first exon. The enhancer is the strongest among the NBPF transcripts and further promotes the already high splicing activity of the unexpanded first exons of the two-exon domains, safeguarding the expanded Olduvai exon doublets in the mature transcriptome. The duplication also creates a predicted RNA guanine quadruplex that may regulate the access to spliceosomal components of the super-enhancer and influence the splicing of adjacent exons. Thus, positive Olduvai selection during primate evolution is likely to result from a combination of multiple targets in gene expression pathways, including RNA splicing.
Affiliation(s)
- Igor Vořechovský
  - Faculty of Medicine, University of Southampton, HDH, MP808, Southampton SO16 6YD, UK
8
Pengelly RJ, Bakhtiar D, Borovská I, Královičová J, Vořechovský I. Exonic splicing code and protein binding sites for calcium. Nucleic Acids Res 2022; 50:5493-5512. [PMID: 35474482 PMCID: PMC9177970 DOI: 10.1093/nar/gkac270] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/26/2022] [Revised: 04/01/2022] [Accepted: 04/05/2022] [Indexed: 11/12/2022] Open
Abstract
Auxiliary splicing sequences in exons, known as enhancers (ESEs) and silencers (ESSs), have been subject to strong selection pressures at the RNA and protein level. The protein component of this splicing code is substantial, recently estimated at ∼50% of the total information within ESEs, but remains poorly understood. The ESE/ESS profiles were previously associated with the Irving-Williams (I-W) stability series for divalent metals, suggesting that the ESE/ESS evolution was shaped by metal binding sites. Here, we have examined splicing activities of exonic sequences that encode protein binding sites for Ca2+, a weak binder in the I-W affinity order. We found that predicted exon inclusion levels for the EF-hand motifs and for Ca2+-binding residues in non-EF-hand proteins were higher than for average exons. For canonical EF-hands, the increase was centred on the EF-hand chelation loop and, in particular, on Ca2+-coordinating residues, with a 1>12>3∼5>9 hierarchy in the 12-codon loop consensus and usage bias at codons 1 and 12. The same hierarchy but a lower increase was observed for noncanonical EF-hands, except for S100 proteins. EF-hand loops preferentially accumulated exon splits in two clusters, one located in their N-terminal halves and the other around codon 12. Using splicing assays and published crosslinking and immunoprecipitation data, we identify candidate trans-acting factors that preferentially bind conserved GA-rich motifs encoding negatively charged amino acids in the loops. Together, these data provide evidence for the high capacity of codons for Ca2+-coordinating residues to be retained in mature transcripts, facilitating their exon-level expansion during eukaryotic evolution.
Affiliation(s)
- Reuben J Pengelly
  - University of Southampton, Faculty of Medicine, Southampton SO16 6YD, UK
- Dara Bakhtiar
  - University of Southampton, Faculty of Medicine, Southampton SO16 6YD, UK
- Ivana Borovská
  - Slovak Academy of Sciences, Centre of Biosciences, 840 05 Bratislava, Slovak Republic
- Jana Královičová
  - University of Southampton, Faculty of Medicine, Southampton SO16 6YD, UK
  - Slovak Academy of Sciences, Centre of Biosciences, 840 05 Bratislava, Slovak Republic
  - Slovak Academy of Sciences, Institute of Zoology, 845 06 Bratislava, Slovak Republic
- Igor Vořechovský
  - University of Southampton, Faculty of Medicine, Southampton SO16 6YD, UK
9
Oman M, Alam A, Ness RW. How sequence context-dependent mutability drives mutation rate variation in the genome. Genome Biol Evol 2022; 14:6537538. [PMID: 35218359 PMCID: PMC8920511 DOI: 10.1093/gbe/evac032] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Accepted: 02/21/2022] [Indexed: 11/12/2022] Open
Abstract
The rate of mutations varies >100-fold across the genome, altering the rate of evolution and susceptibility to genetic diseases. The strongest predictor of mutation rate is the sequence itself, varying 75-fold between trinucleotides. The fact that DNA sequence drives its own mutation rate raises a simple but important prediction: highly mutable sequences will mutate more frequently and eliminate themselves in favor of sequences with lower mutability, leading to a lower equilibrium mutation rate. However, purifying selection constrains changes in mutable sequences, causing higher rates of mutation. We conduct a simulation using real human mutation data to test if 1) DNA evolves to a low equilibrium mutation rate and 2) purifying selection causes a higher equilibrium mutation rate in the genome’s most important regions. We explore how this simple process affects sequence evolution in the genome, and discuss the implications for modeling evolution and susceptibility to DNA damage.
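The two predictions in this abstract can be illustrated with a deliberately simple toy model, which is not the authors' simulation and uses no real mutation data. Sites belong to either a high-mutability or a low-mutability class; every mutation is assumed to flip a site to the other class; a `constrained` fraction of sites is assumed to be held at its starting composition by purifying selection. All names and parameter values are hypothetical, and the rates are exaggerated (keeping only the 75-fold ratio) so the loop converges quickly.

```python
def mean_mutation_rate(mu_high, mu_low, start_high=0.5,
                       constrained=0.0, generations=200_000):
    """Toy two-class model of mutability-driven sequence evolution.

    Returns the mean per-site mutation rate after `generations` steps.
    """
    free_high = start_high  # fraction of unconstrained sites in the high class
    for _ in range(generations):
        # Low sites mutate into the high class, high sites mutate out of it.
        free_high += (1.0 - free_high) * mu_low - free_high * mu_high
    # Constrained sites keep the starting composition (toy selection).
    high = constrained * start_high + (1.0 - constrained) * free_high
    return high * mu_high + (1.0 - high) * mu_low


MU_HIGH, MU_LOW = 7.5e-4, 1e-5  # hypothetical rates, 75-fold apart

start = 0.5 * MU_HIGH + 0.5 * MU_LOW
neutral = mean_mutation_rate(MU_HIGH, MU_LOW)
selected = mean_mutation_rate(MU_HIGH, MU_LOW, constrained=0.5)
print(f"start {start:.3e}  neutral {neutral:.3e}  constrained {selected:.3e}")
```

In this toy setting the unconstrained sequence settles at a mean rate of 2·mu_high·mu_low/(mu_high + mu_low), well below the starting value, while the partly constrained sequence stays more mutable, which mirrors the direction of both predictions.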
Affiliation(s)
- Madeleine Oman
  - Dept of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
  - Dept of Biology, University of Toronto, Mississauga, Canada
- Aqsa Alam
  - Dept of Cell and Systems Biology, University of Toronto, Toronto, Canada
- Rob W Ness
  - Dept of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
  - Dept of Biology, University of Toronto, Mississauga, Canada
  - Dept of Cell and Systems Biology, University of Toronto, Toronto, Canada
10
Abrahams L, Savisaar R, Mordstein C, Young B, Kudla G, Hurst LD. Evidence in disease and non-disease contexts that nonsense mutations cause altered splicing via motif disruption. Nucleic Acids Res 2021; 49:9665-9685. [PMID: 34469537 PMCID: PMC8464065 DOI: 10.1093/nar/gkab750] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 02/26/2021] [Revised: 08/17/2021] [Accepted: 08/19/2021] [Indexed: 12/21/2022] Open
Abstract
Transcripts containing premature termination codons (PTCs) can be subject to nonsense-associated alternative splicing (NAS). Two models have been evoked to explain this, scanning and splice motif disruption. The latter postulates that exonic cis motifs, such as exonic splice enhancers (ESEs), are disrupted by nonsense mutations. We employ genome-wide transcriptomic and k-mer enrichment methods to scrutinize this model. First, we show that ESEs are prone to disruptive nonsense mutations owing to their purine richness and paucity of TGA, TAA and TAG. The motif model correctly predicts that NAS rates should be low (we estimate 5–30%) and approximately in line with estimates for the rate at which random point mutations disrupt splicing (8–20%). Further, we find that, as expected, NAS-associated PTCs are predictable from nucleotide-based machine learning approaches to predict splice disruption and, at least for pathogenic variants, are enriched in ESEs. Finally, we find that both in and out of frame mutations to TAA, TGA or TAG are associated with exon skipping. While a higher relative frequency of such skip-inducing mutations in-frame than out of frame lends some credence to the scanning model, these results reinforce the importance of considering splice motif modulation to understand the etiology of PTC-associated disease.
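The observation that purine-rich motifs lacking TGA, TAA and TAG are nonetheless prone to nonsense mutations can be illustrated by enumerating every single-nucleotide substitution that turns an in-frame codon into a stop codon. This is a minimal sketch, not the authors' k-mer enrichment method; the function name and both example sequences are invented, and reading frame 0 is assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def nonsense_substitutions(seq):
    """List (position, ref, alt) substitutions that create a stop codon.

    The sequence is read in frame 0 (an assumption); codons that are
    incomplete or already stop codons are skipped.
    """
    seq = seq.upper()
    hits = []
    for i, ref in enumerate(seq):
        start = i - i % 3
        codon = seq[start:start + 3]
        if len(codon) < 3 or codon in STOP_CODONS:
            continue
        for alt in "ACGT":
            if alt == ref:
                continue
            offset = i % 3
            mutated = codon[:offset] + alt + codon[offset + 1:]
            if mutated in STOP_CODONS:
                hits.append((i, ref, alt))
    return hits


purine_rich = "GAAGAAGAA"      # invented GAA-repeat, ESE-like sequence
pyrimidine_rich = "CTCCTCCTC"  # invented comparison sequence
print(len(nonsense_substitutions(purine_rich)),
      len(nonsense_substitutions(pyrimidine_rich)))
```

Each GAA codon is one G>T change away from TAA, whereas no single substitution converts CTC into a stop codon, so the purine-rich example has more routes to a premature termination codon.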
Affiliation(s)
- Liam Abrahams
  - The Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, UK
- Rosina Savisaar
  - The Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, UK
  - Instituto de Medicina Molecular João Lobo Antunes, Faculdade de Medicina, Universidade de Lisboa, 1649-028 Lisboa, Portugal
- Christine Mordstein
  - The Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, UK
  - MRC Human Genetics Unit, The University of Edinburgh, Crewe Road, Edinburgh EH4 2XU, UK
  - Aarhus University, Department of Molecular Biology and Genetics, C F Møllers Allé 3, 8000 Aarhus, Denmark
- Bethan Young
  - MRC Human Genetics Unit, The University of Edinburgh, Crewe Road, Edinburgh EH4 2XU, UK
- Grzegorz Kudla
  - MRC Human Genetics Unit, The University of Edinburgh, Crewe Road, Edinburgh EH4 2XU, UK
- Laurence D Hurst
  - The Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, UK
11
Basith S, Hasan MM, Lee G, Wei L, Manavalan B. Integrative machine learning framework for the identification of cell-specific enhancers from the human genome. Brief Bioinform 2021; 22:6315815. [PMID: 34226917 DOI: 10.1093/bib/bbab252] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 03/16/2021] [Revised: 06/08/2021] [Accepted: 06/14/2021] [Indexed: 02/06/2023] Open
Abstract
Enhancers are deoxyribonucleic acid (DNA) fragments which when bound by transcription factors enhance the transcription of related genes. Due to their sporadic distribution and similar fractions, identification of enhancers from the human genome seems a daunting task. Compared to the traditional experimental approaches, computational methods with easy-to-use platforms could be efficiently applied to annotate enhancers' functions and physiological roles. In this aspect, several bioinformatics tools have been developed to identify enhancers. Despite their spectacular performances, existing methods have certain drawbacks and limitations, including fixed length of sequences being utilized for model development and cell-specificity negligence. A novel predictor would be beneficial in the context of genome-wide enhancer prediction by addressing the above-mentioned issues. In this study, we constructed new datasets for eight different cell types. Utilizing these data, we proposed an integrative machine learning (ML)-based framework called Enhancer-IF for identifying cell-specific enhancers. Enhancer-IF comprehensively explores a wide range of heterogeneous features with five commonly used ML methods (random forest, extremely randomized tree, multilayer perceptron, support vector machine and extreme gradient boosting). Specifically, these five classifiers were trained with seven encodings and obtained 35 baseline models. The output of these baseline models was integrated and again inputted to five classifiers for the construction of five meta-models. Finally, the integration of five meta-models through ensemble learning improved the model robustness. Our proposed approach showed an excellent prediction performance compared to the baseline models on both training and independent datasets in different cell types, thus highlighting the superiority of our approach in the identification of the enhancers. We assume that Enhancer-IF will be a valuable tool for screening and identifying potential enhancers from the human DNA sequences.
Affiliation(s)
- Shaherin Basith
  - Department of Physiology, Ajou University School of Medicine, Republic of Korea
- Md Mehedi Hasan
  - Tulane University, USA
  - Kyushu Institute of Technology, Japan
- Gwang Lee
  - Department of Physiology, Ajou University School of Medicine, Republic of Korea
- Leyi Wei
  - Xiamen University, China
  - Shandong University, China