1
Tenekeci S, Tekir S. Identifying promoter and enhancer sequences by graph convolutional networks. Comput Biol Chem 2024; 110:108040. [PMID: 38430611 DOI: 10.1016/j.compbiolchem.2024.108040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 01/09/2024] [Accepted: 02/27/2024] [Indexed: 03/05/2024]
Abstract
Identification of promoters, enhancers, and their interactions helps understand genetic regulation. This study proposes a graph-based semi-supervised learning model (GCN4EPI) for the enhancer-promoter classification problem. We adopt a graph convolutional network (GCN) architecture to integrate interaction information with sequence features. Nodes of the constructed graph hold word embeddings of DNA sequences, while edges hold the Enhancer-Promoter Interaction (EPI) information. By means of semi-supervised learning, much less data (16%) and time are needed in model training. Comparisons on a benchmark dataset of six human cell lines show that the proposed approach outperforms the state-of-the-art methods by a large margin (10% higher F1 score) and trains the fastest (up to 3 times faster). Moreover, GCN4EPI's performance on cross-cell line data is also better than the baselines (3% higher F1 score). Our qualitative analyses with graph explainability models indicate that GCN4EPI learns from both text and graph structure. The results suggest that integrating interaction information with sequence features improves predictive performance and compensates for a smaller number of training instances.
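To make the graph-convolution idea concrete, the following is a minimal sketch of a single GCN propagation step using the widely used symmetric normalisation, ReLU(D^-1/2 (A + I) D^-1/2 H W). It is not the GCN4EPI implementation: the toy graph, the two-dimensional node features standing in for sequence embeddings, and the identity weight matrix are all assumptions made for illustration.

```python
import math

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so every node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalisation of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    n_in, n_out = len(weight), len(weight[0])
    # Aggregate neighbour features.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n)) for f in range(n_in)]
           for i in range(n)]
    # Linear map followed by ReLU.
    return [[max(0.0, sum(agg[i][f] * weight[f][o] for f in range(n_in)))
             for o in range(n_out)] for i in range(n)]

# Hypothetical toy graph: node 0 (enhancer) interacts with node 1 (promoter);
# node 2 is a promoter with no recorded interaction.
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # stand-ins for sequence embeddings
weight = [[1.0, 0.0], [0.0, 1.0]]              # identity, to expose the aggregation
hidden = gcn_layer(adj, feats, weight)
```

In this toy case the two interacting nodes end up with the same mixed representation, while the isolated node keeps its original features, which is the sense in which edges let interaction information flow into sequence features.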
Affiliation(s)
- Samet Tenekeci
  - Department of Computer Engineering, Izmir Institute of Technology, Izmir, 35430, Turkiye
- Selma Tekir
  - Department of Computer Engineering, Izmir Institute of Technology, Izmir, 35430, Turkiye
2
Li Y, Wei X, Yang Q, Xiong A, Li X, Zou Q, Cui F, Zhang Z. msBERT-Promoter: a multi-scale ensemble predictor based on BERT pre-trained model for the two-stage prediction of DNA promoters and their strengths. BMC Biol 2024; 22:126. [PMID: 38816885 DOI: 10.1186/s12915-024-01923-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2024] [Accepted: 05/21/2024] [Indexed: 06/01/2024]
Abstract
BACKGROUND: A promoter is a specific sequence in DNA that has transcriptional regulatory functions, playing a role in initiating gene expression. Identifying promoters and their strengths can provide valuable information related to human diseases. In recent years, computational methods have gained prominence as an effective means for identifying promoters, offering a more efficient alternative to labor-intensive biological approaches. RESULTS: In this study, a two-stage integrated predictor called "msBERT-Promoter" is proposed for identifying promoters and predicting their strengths. The model incorporates multi-scale sequence information through a tokenization strategy and fine-tunes the DNABERT model. Soft voting is then used to fuse the multi-scale information, effectively addressing the issue of insufficient DNA sequence information extraction in traditional models. To the best of our knowledge, this is the first time an integrated approach has been used in the DNABERT model for promoter identification and strength prediction. Our model achieves accuracy rates of 96.2% for promoter identification and 79.8% for promoter strength prediction, significantly outperforming existing methods. Furthermore, through attention mechanism analysis, we demonstrate that our model can effectively combine local and global sequence information, enhancing its interpretability. CONCLUSIONS: msBERT-Promoter provides an effective tool that successfully captures sequence-related attributes of DNA promoters and can accurately identify promoters and predict their strengths. This work paves a new path for the application of artificial intelligence in traditional biology.
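The two ideas named in the abstract, tokenizing at several scales and fusing the per-scale predictions by soft voting, can be sketched as follows. This is an illustration only: the choice of k = 3 to 6, the stride-1 overlapping k-mers, and the probability values are assumptions, and no DNABERT model is actually called.

```python
def kmer_tokens(seq, k):
    """Overlapping k-mers of a DNA sequence (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def soft_vote(prob_lists):
    """Average class-probability vectors from several models.

    Returns the index of the winning class and the averaged probabilities.
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: avg[c])
    return label, avg

# One tokenization per scale; each would feed its own fine-tuned model.
tokens_by_k = {k: kmer_tokens("ACGTAC", k) for k in (3, 4, 5, 6)}

# Hypothetical per-scale outputs in the form [P(non-promoter), P(promoter)].
probs = [[0.4, 0.6], [0.7, 0.3], [0.2, 0.8], [0.3, 0.7]]
label, avg = soft_vote(probs)
```

Soft voting averages probabilities rather than counting hard labels, so a confident model at one scale can outweigh a weakly dissenting model at another; here one of the four scales disagrees, yet the averaged probability still favours the promoter class.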
Affiliation(s)
- Yazi Li
  - School of Mathematics and Statistics, Hainan University, Haikou, 570228, China
- Xiaoman Wei
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- Qinglin Yang
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- An Xiong
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- Xingfeng Li
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
3
DeGroat W, Inoue F, Ashuach T, Yosef N, Ahituv N, Kreimer A. Comprehensive network modeling approaches unravel dynamic enhancer-promoter interactions across neural differentiation. bioRxiv 2024:2024.05.22.595375. [PMID: 38826254 PMCID: PMC11142193 DOI: 10.1101/2024.05.22.595375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
Background: Increasing evidence suggests that a substantial proportion of disease-associated mutations occur in enhancers, regions of non-coding DNA essential to gene regulation. Understanding the structures and mechanisms of regulatory programs this variation affects can shed light on the apparatuses of human diseases. Results: We collected epigenetic and gene expression datasets from seven early time points during neural differentiation. Focusing on this model system, we constructed networks of enhancer-promoter interactions, each at an individual stage of neural induction. These networks served as the base for a rich series of analyses, through which we demonstrated their temporal dynamics and enrichment for various disease-associated variants. We applied the Girvan-Newman clustering algorithm to these networks to reveal biologically relevant substructures of regulation. Additionally, we demonstrated methods to validate predicted enhancer-promoter interactions using transcription factor overexpression and massively parallel reporter assays. Conclusions: Our findings suggest a generalizable framework for exploring gene regulatory programs and their dynamics across developmental processes. This includes a comprehensive approach to studying the effects of disease-associated variation on transcriptional networks. The techniques applied to our networks have been published alongside our findings as a computational tool, E-P-INAnalyzer. Our procedure can be utilized across different cellular contexts and disorders.
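The Girvan-Newman step mentioned above works by repeatedly removing the edge with the highest betweenness until the network falls apart into modules. The sketch below shows one such split on a toy enhancer-promoter graph; it is a plain-Python illustration, not code from E-P-INAnalyzer, and the node names and edges are hypothetical.

```python
from collections import deque

def components(adj):
    """Connected components of an undirected graph {node: set(neighbours)}."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps

def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    score = {}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:                      # BFS counting shortest paths from s
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {u: 0.0 for u in order}
        for w in reversed(order):         # accumulate dependencies back to s
            for u in preds[w]:
                c = sigma[u] / sigma[w] * (1.0 + delta[w])
                edge = frozenset((u, w))
                score[edge] = score.get(edge, 0.0) + c
                delta[u] += c
    return score

def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the component count increases."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        score = edge_betweenness(adj)
        if not score:
            break
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Hypothetical network: two tightly connected groups joined by one bridge edge.
edges = [("E1", "P1"), ("E1", "P2"), ("P1", "P2"), ("P2", "E2"),
         ("E2", "P3"), ("E2", "P4"), ("P3", "P4")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)
modules = girvan_newman_split(graph)
```

The full algorithm keeps splitting to build a hierarchy of communities; this sketch stops after the first split, which is enough to show that the bridge between the two groups is the edge removed.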
Affiliation(s)
- William DeGroat
  - Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, 679 Hoes Lane West, Piscataway, NJ 08854, USA
- Fumitaka Inoue
  - Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
- Tal Ashuach
  - Department of Electrical Engineering and Computer Sciences and Center for Computational Biology, University of California, Berkeley, 387 Soda Hall, Berkeley, CA 94720, USA
- Nir Yosef
  - Department of Systems Immunology, Weizmann Institute of Science, 234 Herzl Street, Rehovot 7610001, Israel
  - Chan-Zuckerberg Biohub, 499 Illinois St, San Francisco, CA 94158, USA
  - Department of Systems Immunology, Ragon Institute of MGH, MIT, and Harvard Institute of Science, 400 Technology Square, Cambridge, MA 02139, USA
- Nadav Ahituv
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, 513 Parnassus Ave, CA 94143, USA
  - Institute for Human Genetics, University of California, San Francisco, 513 Parnassus Ave, CA 94143, USA
- Anat Kreimer
  - Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, 679 Hoes Lane West, Piscataway, NJ 08854, USA
  - Department of Biochemistry and Molecular Biology, Rutgers, The State University of New Jersey, 604 Allison Road, Piscataway, NJ 08854, USA
4
Semenov GA, Sonnenberg BR, Branch CL, Heinen VK, Welklin JF, Padula SR, Patel AM, Bridge ES, Pravosudov VV, Taylor SA. Genes and gene networks underlying spatial cognition in food-caching chickadees. Curr Biol 2024; 34:1930-1939.e4. [PMID: 38636515 DOI: 10.1016/j.cub.2024.03.058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 12/06/2023] [Accepted: 03/26/2024] [Indexed: 04/20/2024]
Abstract
Substantial progress has been made in understanding the genetic architecture of phenotypes involved in a variety of evolutionary processes. Behavioral genetics remains, however, among the least understood. We explore the genetic architecture of spatial cognitive abilities in a wild passerine bird, the mountain chickadee (Poecile gambeli). Mountain chickadees cache thousands of seeds in the fall and require specialized spatial memory to recover these caches throughout the winter. We previously showed that variation in spatial cognition has a direct effect on fitness and has a genetic basis. It remains unknown which specific genes and developmental pathways are particularly important for shaping spatial cognition. To further dissect the genetic basis of spatial cognitive abilities, we combine experimental quantification of spatial cognition in wild chickadees with whole-genome sequencing of 162 individuals, a new chromosome-scale reference genome, and species-specific gene annotation. We have identified a set of genes and developmental pathways that play a key role in creating variation in spatial cognition and found that the mechanism shaping cognitive variation is consistent with selection against mildly deleterious non-coding mutations. Although some candidate genes were organized into connected gene networks, about half do not have shared regulation, highlighting that multiple independent developmental or physiological mechanisms contribute to variation in spatial cognitive abilities. A large proportion of the candidate genes we found are associated with synaptic plasticity, an intriguing result that leads to the hypothesis that certain genetic variants create antagonism between behavioral plasticity and long-term memory, each providing distinct benefits depending on ecological context.
Affiliation(s)
- Georgy A Semenov
  - Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, Boulder, CO 80309, USA
- Benjamin R Sonnenberg
  - Department of Biology and Evolution, Ecology Evolution and Conservation Biology Graduate Program, University of Nevada, Reno, NV 89557, USA
- Carrie L Branch
  - Department of Psychology, The University of Western Ontario, London, ON N6A 3K7, Canada
- Virginia K Heinen
  - Department of Biology and Evolution, Ecology Evolution and Conservation Biology Graduate Program, University of Nevada, Reno, NV 89557, USA
- Joseph F Welklin
  - Department of Biology and Evolution, Ecology Evolution and Conservation Biology Graduate Program, University of Nevada, Reno, NV 89557, USA
- Sara R Padula
  - Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, Boulder, CO 80309, USA
- Ajay M Patel
  - Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, Boulder, CO 80309, USA
- Eli S Bridge
  - Oklahoma Biological Survey, University of Oklahoma, Norman, OK 73019, USA
- Vladimir V Pravosudov
  - Department of Biology and Evolution, Ecology Evolution and Conservation Biology Graduate Program, University of Nevada, Reno, NV 89557, USA
- Scott A Taylor
  - Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, Boulder, CO 80309, USA
5
Lei R, Jia J, Qin L, Wei X. iPro2L-DG: Hybrid network based on improved densenet and global attention mechanism for identifying promoter sequences. Heliyon 2024; 10:e27364. [PMID: 38510021 PMCID: PMC10950492 DOI: 10.1016/j.heliyon.2024.e27364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 02/24/2024] [Accepted: 02/28/2024] [Indexed: 03/22/2024]
Abstract
The promoter is a key DNA sequence whose primary function is to control the initiation time and the degree of expression of gene transcription. Accurate identification of promoters is essential for understanding gene expression. Traditional sequencing techniques for identifying promoters are costly and time-consuming. Therefore, the development of computational methods to identify promoters has become critical. Since deep learning methods show great potential in identifying promoters, this study proposes a new promoter prediction model, called iPro2L-DG. The iPro2L-DG predictor, based on an improved Densely Connected Convolutional Network (DenseNet) and a Global Attention Mechanism (GAM), is constructed to achieve the prediction of promoters. The promoter sequences are encoded by combining C2 encoding with nucleotide chemical property (NCP) encoding. An improved DenseNet extracts advanced feature information from the combined feature encoding. GAM evaluates the importance of advanced feature information in terms of channel and spatial dimensions, and a fully connected neural network (FNN) then derives the prediction probabilities. The experimental results showed that the accuracy of iPro2L-DG in the first layer (promoter identification) was 94.10% with a Matthews correlation coefficient of 0.8833. In the second layer (promoter strength prediction), the accuracy was 89.42% with a Matthews correlation coefficient of 0.7915. The iPro2L-DG predictor significantly outperforms other existing predictors in promoter identification and promoter strength prediction. Therefore, our proposed model iPro2L-DG is the most advanced promoter prediction tool. The source code of the iPro2L-DG model can be found at https://github.com/leirufeng/iPro2L-DG.
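As a rough illustration of the combined feature encoding, the sketch below concatenates a 2-bit code with the three-bit NCP code for each nucleotide. The NCP values follow the commonly used convention (ring structure, functional group, hydrogen bonding). The 2-bit table is an assumed stand-in for "C2 encoding", since the abstract does not give the mapping, and the function name is hypothetical.

```python
# Nucleotide chemical property (NCP) code:
# (ring structure: purine = 1, functional group: amino = 1, hydrogen bonding: weak = 1).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}

# Assumed 2-bit code standing in for "C2 encoding"; the paper's mapping may differ.
C2 = {"A": (0, 0), "C": (1, 1), "G": (1, 0), "T": (0, 1)}

def encode(seq):
    """Concatenate the C2 and NCP codes per nucleotide into 5-dimensional rows."""
    return [list(C2[base] + NCP[base]) for base in seq.upper()]

matrix = encode("ACGT")
for base, row in zip("ACGT", matrix):
    print(base, row)
```

A sequence of length L therefore becomes an L x 5 matrix, which is the kind of input a convolutional network such as DenseNet can consume directly.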
Affiliation(s)
- Rufeng Lei
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Jianhua Jia
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Lulu Qin
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Xin Wei
  - Business School, Jiangxi Institute of Fashion Technology, Nanchang, 330044, China
6
Ramakrishnan A, Wangensteen G, Kim S, Nestler EJ, Shen L. DeepRegFinder: deep learning-based regulatory elements finder. Bioinform Adv 2024; 4:vbae007. [PMID: 38343388 PMCID: PMC10858349 DOI: 10.1093/bioadv/vbae007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 12/06/2023] [Accepted: 01/12/2024] [Indexed: 06/15/2024]
Abstract
Summary: Enhancers and promoters are important classes of DNA regulatory elements (DREs) that govern gene expression. Identifying them at a genomic scale is a critical task in bioinformatics. The DREs often exhibit unique histone mark binding patterns, which can be captured by high-throughput ChIP-seq experiments. To account for the variation and noise among the binding sites, machine learning models are trained on known enhancer/promoter sites using histone mark ChIP-seq data and predict enhancers/promoters at other genomic regions. To this end, we have developed a highly customizable program named DeepRegFinder, which automates the entire process of data processing, model training, and prediction. We have employed convolutional and recurrent neural networks for model training and prediction. DeepRegFinder further categorizes enhancers and promoters into active and poised states, making it a unique and valuable feature for researchers. Our method demonstrates improved precision and recall in comparison to existing algorithms for enhancer prediction across multiple cell types. Moreover, our pipeline is modular and eliminates the tedious steps involved in preprocessing, making it easier for users to apply on their data quickly. Availability and implementation: https://github.com/shenlab-sinai/DeepRegFinder.
Affiliation(s)
- Aarthi Ramakrishnan
  - Friedman Brain Institute and Nash Family Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- George Wangensteen
  - Department of Computer Science, Brown University, Providence, RI 02912, United States
- Sarah Kim
  - Cancer Program, Broad Institute, Cambridge, MA 02142, United States
- Eric J Nestler
  - Friedman Brain Institute and Nash Family Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Li Shen
  - Friedman Brain Institute and Nash Family Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
7
Betti MJ, Aldrich MC, Gamazon ER. Minimum entropy framework identifies a novel class of genomic functional elements and reveals regulatory mechanisms at human disease loci. bioRxiv 2023:2023.06.11.544507. [PMID: 37398170 PMCID: PMC10312628 DOI: 10.1101/2023.06.11.544507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
We introduce CoRE-BED, a framework trained using 19 epigenomic features in 33 major cell and tissue types to predict cell-type-specific regulatory function. CoRE-BED identifies nine functional classes de novo, capturing both known and new regulatory categories. Notably, we describe a previously undercharacterized class that we term Development Associated Elements (DAEs), which are highly enriched in cell types with elevated regenerative potential and distinguished by the dual presence of either H3K4me2 and H3K9ac (an epigenetic signature associated with kinetochore assembly) or H3K79me3 and H4K20me1 (a signature associated with transcriptional pause release). Unlike bivalent promoters, which represent a transitory state between active and silenced promoters, DAEs transition directly to or from a non-functional state during stem cell differentiation and are proximal to highly expressed genes. CoRE-BED's interpretability facilitates causal inference and functional prioritization. Across 70 complex traits, distal insulators account for the largest mean proportion of SNP heritability (~49%) captured by the GWAS. Collectively, our results demonstrate the value of exploring non-conventional ways of regulatory classification that enrich for trait heritability, to complement existing approaches for cis-regulatory prediction.
Affiliation(s)
- Eric R Gamazon
  - Vanderbilt University Medical Center, Nashville, TN
  - Clare Hall, University of Cambridge, Cambridge, England
8
He S, Gao B, Sabnis R, Sun Q. Nucleic Transformer: Classifying DNA Sequences with Self-Attention and Convolutions. ACS Synth Biol 2023; 12:3205-3214. [PMID: 37916871 PMCID: PMC10863451 DOI: 10.1021/acssynbio.3c00154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 10/04/2023] [Accepted: 10/06/2023] [Indexed: 11/03/2023]
Abstract
Much work has been done to apply machine learning and deep learning to genomics tasks, but these applications usually require extensive domain knowledge, and the resulting models provide very limited interpretability. Here, we present the Nucleic Transformer, a conceptually simple but effective and interpretable model architecture that excels in the classification of DNA sequences. The Nucleic Transformer employs self-attention and convolutions on nucleic acid sequences, leveraging two prominent deep learning strategies commonly used in computer vision and natural language analysis. We demonstrate that the Nucleic Transformer can be trained without much domain knowledge to achieve high performance in Escherichia coli promoter classification, viral genome identification, enhancer classification, and chromatin profile predictions.
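The combination of convolutions and self-attention on a nucleotide sequence can be sketched in a few lines. In the sketch below, a 1D convolution over one-hot DNA acts as a motif (k-mer) detector, and scaled dot-product self-attention then lets each position weigh every other position. The ordering of the two operations, the motif kernels, and the use of identity query/key/value projections are simplifying assumptions for illustration, not the published architecture.

```python
import math

def one_hot(seq):
    """One-hot encode DNA as rows of [A, C, G, T]."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    return [[1.0 if idx[b] == j else 0.0 for j in range(4)] for b in seq]

def conv1d(x, kernels):
    """Valid 1D convolution; kernels[o][t][c] spans positions t and channels c."""
    width, channels = len(kernels[0]), len(x[0])
    return [[sum(k[t][c] * x[i + t][c] for t in range(width) for c in range(channels))
             for k in kernels]
            for i in range(len(x) - width + 1)]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(h):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(h[0])
    weights = [softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in h])
               for q in h]
    out = [[sum(w[j] * h[j][f] for j in range(len(h))) for f in range(d)]
           for w in weights]
    return out, weights

# Hypothetical motif detectors: the one-hot of a motif is a kernel that scores
# 3.0 on an exact match of that 3-mer.
kernels = [one_hot("ACG"), one_hot("CGT")]
h = conv1d(one_hot("ACGTACGT"), kernels)   # 6 positions x 2 motif channels
out, weights = self_attention(h)
```

Inspecting `weights` shows the interpretability angle: positions that contain the same motif attend strongly to one another, so the attention map can be read as a map of related k-mers.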
Affiliation(s)
- Shujun He
  - Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
- Baizhen Gao
  - Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
- Rushant Sabnis
  - Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
- Qing Sun
  - Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
9
Pan D, Su M, Xu D, Wang Y, Gao H, Smith JD, Sun J, Wang X, Yan Q, Song G, Lu Y, Feng W, Wang S, Sun G. Exploring the Interplay Between Vitamin B12-related Biomarkers, DNA Methylation, and Gene-Nutrition Interaction in Esophageal Precancerous Lesions. Arch Med Res 2023; 54:102889. [PMID: 37738887 DOI: 10.1016/j.arcmed.2023.102889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 06/29/2023] [Accepted: 08/31/2023] [Indexed: 09/24/2023]
Abstract
BACKGROUND: Vitamin B12 depletion has been suggested to be associated with esophageal precancerous lesions (EPL). However, the potential mechanisms remain unclear. AIMS: This study aims to evaluate the role of vitamin B12 and its regulated epigenetic modification in EPL and provide preliminary information on the identification of potential molecular biomarkers for the early prediction of EPL. METHODS: We collected information and samples from the Early Diagnosis and Early Treatment Project of Esophageal Cancer database from 200 EPL cases and 200 matched controls. Vitamin B12, one-carbon metabolism biomarkers, genetic polymorphism of TCN2 C776G, and DNA methylation were compared. Preliminarily identified candidate promoters of differentially methylated CpG positions were further verified by targeted bisulfite sequencing. RESULTS: EPL cases had significantly lower serum levels of vitamin B12 and transcobalamin II, and higher serum levels of homocysteine and 5-methyltetrahydrofolate than controls. The TCN2 C776G polymorphism was found to be associated with susceptibility to EPL and may interact with vitamin B12 nutritional status to influence the risk of EPL in male subjects. In addition, global hypomethylation related to vitamin B12 depletion was observed in EPL cases, along with region-specific hypermethylation of UGT2B15 and FGFR2 promoters. CONCLUSIONS: This study suggests that vitamin B12 depletion may be associated with aberrant DNA methylation and increased risk of EPL through the one-carbon metabolism pathway, indicates that the TCN2 C776G polymorphism may interact with vitamin B12 nutritional status to affect EPL risk in males, and identifies specific locations in the UGT2B15 and FGFR2 promoters with potential as molecular biomarkers.
Affiliation(s)
- Da Pan
  - Key Laboratory of Environmental Medicine and Engineering of the Ministry of Education, and Department of Nutrition and Food Hygiene, School of Public Health, Southeast University, Nanjing, PR China
- Ming Su
  - Huai'an District Center for Disease Control and Prevention, Huai'an, PR China
- Dengfeng Xu
  - Key Laboratory of Environmental Medicine and Engineering of the Ministry of Education, and Department of Nutrition and Food Hygiene, School of Public Health, Southeast University, Nanjing, PR China
- Yuanyuan Wang
  - Key Laboratory of Environmental Medicine and Engineering of the Ministry of Education, and Department of Nutrition and Food Hygiene, School of Public Health, Southeast University, Nanjing, PR China
- Han Gao
  - Department of Biomedical Engineering, University Medical Center Groningen/University of Groningen, The Netherlands; Drug Research Program, Division of Pharmaceutical Chemistry and Technology, Faculty of Pharmacy, University of Helsinki, Helsinki, Finland
- Jihan Sun
  - Key Laboratory of Environmental Medicine and Engineering of the Ministry of Education, and Department of Nutrition and Food Hygiene, School of Public Health, Southeast University, Nanjing, PR China
- Xin Wang
  - Huai'an District Center for Disease Control and Prevention, Huai'an, PR China
- Qingyang Yan
  - Huai'an District Center for Disease Control and Prevention, Huai'an, PR China
- Guang Song
  - Huai'an District Center for Disease Control and Prevention, Huai'an, PR China
- Yifei Lu
  - Key Laboratory of Environmental Medicine and Engineering of the Ministry of Education, and Department of Nutrition and Food Hygiene, School of Public Health, Southeast University, Nanjing, PR China
- Wuqiong Feng
  - Huai'an District Center for Disease Control and Prevention, Huai'an, PR China
- Shaokang Wang
  - Key Laboratory of Environmental Medicine and Engineering of the Ministry of Education, and Department of Nutrition and Food Hygiene, School of Public Health, Southeast University, Nanjing, PR China; School of Medicine, Xizang Minzu University, Xianyang, PR China
- Guiju Sun
  - Key Laboratory of Environmental Medicine and Engineering of the Ministry of Education, and Department of Nutrition and Food Hygiene, School of Public Health, Southeast University, Nanjing, PR China
10
Kwon MJ, Kim JH, Kim KJ, Ko EJ, Lee JY, Ryu CS, Ha YH, Kim YR, Kim NK. Genetic Association between Inflammatory-Related Polymorphism in STAT3, IL-1β, IL-6, TNF-α and Idiopathic Recurrent Implantation Failure. Genes (Basel) 2023; 14:1588. [PMID: 37628639 PMCID: PMC10454471 DOI: 10.3390/genes14081588] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/29/2023] [Revised: 08/03/2023] [Accepted: 08/04/2023] [Indexed: 08/27/2023]
Abstract
Recurrent implantation failure (RIF) is defined as a failure to achieve pregnancy after multiple embryo transfers. Implantation is closely related to inflammatory gradients, and interleukin-1beta (IL-1β), IL-6, and tumor necrosis factor-alpha (TNF-α) play a key role in maternal and trophoblast inflammation during implantation. Signal transducer and activator of transcription 3 (STAT3) interacts with cytokines and plays a critical role in implantation through involvement in the inflammation of the embryo and placenta. Therefore, we investigated 151 RIF patients and 321 healthy controls in Korea and analyzed the association between the polymorphisms (STAT3 rs1053004, IL-1β rs16944, IL-6 rs1800796, and TNF-α rs1800629 and rs1800630) and RIF prevalence. We found that STAT3 rs1053004 (AG, adjusted odds ratio [AOR] = 0.623, p = 0.027; GG, AOR = 0.513, p = 0.043; Dominant, AOR = 0.601, p = 0.011), IL-6 rs1800796 (GG, AOR = 2.472, p = 0.032; Recessive, AOR = 2.374, p = 0.037), and TNF-α rs1800629 (GA, AOR = 2.127, p = 0.010; Dominant, AOR = 2.198, p = 0.007) have a significant association with RIF prevalence. This study is the first to investigate the association of each polymorphism with RIF prevalence in Korea and to compare their effects based on their function in inflammation.
Affiliation(s)
- Min Jung Kwon
  - Department of Biomedical Science, College of Life Science, CHA University, Seongnam 13496, Republic of Korea
- Ji Hyang Kim
  - Department of Obstetrics and Gynecology, CHA Bundang Medical Center, School of Medicine, CHA University, Seongnam 13496, Republic of Korea
- Kyu Jae Kim
  - Department of Biomedical Science, College of Life Science, CHA University, Seongnam 13496, Republic of Korea
- Eun Ju Ko
  - Department of Biomedical Science, College of Life Science, CHA University, Seongnam 13496, Republic of Korea
- Jeong Yong Lee
  - Department of Biomedical Science, College of Life Science, CHA University, Seongnam 13496, Republic of Korea
- Chang Su Ryu
  - Department of Biomedical Science, College of Life Science, CHA University, Seongnam 13496, Republic of Korea
- Yong Hyun Ha
  - Department of Biomedical Science, College of Life Science, CHA University, Seongnam 13496, Republic of Korea
- Young Ran Kim
  - Department of Obstetrics and Gynecology, CHA Bundang Medical Center, School of Medicine, CHA University, Seongnam 13496, Republic of Korea
- Nam Keun Kim
  - Department of Biomedical Science, College of Life Science, CHA University, Seongnam 13496, Republic of Korea
11
Milito A, Aschern M, McQuillan JL, Yang JS. Challenges and advances towards the rational design of microalgal synthetic promoters in Chlamydomonas reinhardtii. J Exp Bot 2023; 74:3833-3850. [PMID: 37025006 DOI: 10.1093/jxb/erad100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Accepted: 03/24/2023] [Indexed: 06/19/2023]
Abstract
Microalgae hold enormous potential to provide a safe and sustainable source of high-value compounds, acting as carbon-fixing biofactories that could help to mitigate rapidly progressing climate change. Bioengineering microalgal strains will be key to optimizing and modifying their metabolic outputs, and to render them competitive with established industrial biotechnology hosts, such as bacteria or yeast. To achieve this, precise and tuneable control over transgene expression will be essential, which would require the development and rational design of synthetic promoters as a key strategy. Among green microalgae, Chlamydomonas reinhardtii represents the reference species for bioengineering and synthetic biology; however, the repertoire of functional synthetic promoters for this species, and for microalgae generally, is limited in comparison to other commercial chassis, emphasizing the need to expand the current microalgal gene expression toolbox. Here, we discuss state-of-the-art promoter analyses, and highlight areas of research required to advance synthetic promoter development in C. reinhardtii. In particular, we exemplify high-throughput studies performed in other model systems that could be applicable to microalgae, and propose novel approaches to interrogating algal promoters. We lastly outline the major limitations hindering microalgal promoter development, while providing novel suggestions and perspectives for how to overcome them.
Affiliation(s)
- Alfonsina Milito
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, Campus UAB, Bellaterra, Barcelona, Spain
| | - Moritz Aschern
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, Campus UAB, Bellaterra, Barcelona, Spain
| | - Josie L McQuillan
- Department of Chemical and Biological Engineering, University of Sheffield, Mappin Street, Sheffield, S1 3JD, UK
| | - Jae-Seong Yang
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, Campus UAB, Bellaterra, Barcelona, Spain
| |
|
12
|
He S, Gao B, Sabnis R, Sun Q. RNAdegformer: accurate prediction of mRNA degradation at nucleotide resolution with deep learning. Brief Bioinform 2023; 24:bbac581. [PMID: 36633966 PMCID: PMC9851316 DOI: 10.1093/bib/bbac581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Revised: 11/14/2022] [Accepted: 11/28/2022] [Indexed: 01/13/2023] Open
Abstract
Messenger RNA-based therapeutics have shown tremendous potential, as demonstrated by the rapid development of messenger RNA based vaccines for COVID-19. Nevertheless, distribution of mRNA vaccines worldwide has been hampered by mRNA's inherent thermal instability due to in-line hydrolysis, a chemical degradation reaction. Therefore, predicting and understanding RNA degradation is a crucial and urgent task. Here we present RNAdegformer, an effective and interpretable model architecture that excels in predicting RNA degradation. RNAdegformer processes RNA sequences with self-attention and convolutions, two deep learning techniques that have proved dominant in the fields of computer vision and natural language processing, while utilizing biophysical features of RNA. We demonstrate that RNAdegformer outperforms previous best methods at predicting degradation properties at nucleotide resolution for COVID-19 mRNA vaccines. RNAdegformer predictions also exhibit improved correlation with RNA in vitro half-life compared with previous best methods. Additionally, we showcase how direct visualization of self-attention maps assists informed decision-making. Further, our model reveals important features in determining mRNA degradation rates via leave-one-feature-out analysis.
Affiliation(s)
- Shujun He
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| | - Baizhen Gao
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| | - Rushant Sabnis
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| | - Qing Sun
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| |
|
13
|
Mai DHA, Nguyen LT, Lee EY. TSSNote-CyaPromBERT: Development of an integrated platform for highly accurate promoter prediction and visualization of Synechococcus sp. and Synechocystis sp. through a state-of-the-art natural language processing model BERT. Front Genet 2022; 13:1067562. [PMID: 36523764 PMCID: PMC9745317 DOI: 10.3389/fgene.2022.1067562] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/12/2022] [Accepted: 11/17/2022] [Indexed: 07/30/2023] Open
Abstract
Since the introduction of the first transformer model with a unique self-attention mechanism, natural language processing (NLP) models have attained state-of-the-art (SOTA) performance on various tasks. As DNA is the blueprint of life, it can be viewed as an unusual language, with its characteristic lexicon and grammar. Therefore, NLP models may provide insights into the meaning of the sequential structure of DNA. In the current study, we employed and compared the performance of popular SOTA NLP models (i.e., XLNET, BERT, and a variant DNABERT trained on the human genome) to predict and analyze the promoters in freshwater cyanobacterium Synechocystis sp. PCC 6803 and the fastest growing cyanobacterium Synechococcus elongatus sp. UTEX 2973. These freshwater cyanobacteria are promising hosts for phototrophically producing value-added compounds from CO2. Through a custom pipeline, promoters and non-promoters from Synechococcus elongatus sp. UTEX 2973 were used to train the model. The trained model achieved an AUROC score of 0.97 and F1 score of 0.92. During cross-validation with promoters from Synechocystis sp. PCC 6803, the model achieved an AUROC score of 0.96 and F1 score of 0.91. To increase accessibility, we developed an integrated platform (TSSNote-CyaPromBERT) to facilitate large dataset extraction, model training, and promoter prediction from public dRNA-seq datasets. Furthermore, various visualization tools have been incorporated to address the "black box" issue of deep learning and feature analysis. The learning transfer ability of large language models may help identify and analyze promoter regions for newly isolated strains with similar lineages.
|
14
|
Dong B, Li M, Jiang B, Gao B, Li D, Zhang T. Antimicrobial Peptides Prediction method based on sequence multidimensional feature embedding. Front Genet 2022; 13:1069558. [DOI: 10.3389/fgene.2022.1069558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Accepted: 11/02/2022] [Indexed: 11/18/2022] Open
Abstract
Antimicrobial peptides (AMPs) are alkaline substances with efficient bactericidal activity produced in living organisms. As the best substitute for antibiotics, they have received increasing attention in scientific research and clinical application. AMPs can be produced by almost all organisms and are capable of killing a wide variety of pathogenic microorganisms. In addition to being antibacterial, natural AMPs have many other therapeutically important activities, such as wound healing, antioxidant and immunomodulatory effects. Discovering new AMPs with wet-lab experimental methods is expensive and difficult, and bioinformatics technology can effectively address this problem. Recently, some deep learning methods have been applied to the prediction of AMPs and achieved good results. To further improve the prediction accuracy of AMPs, this paper designs a new deep learning method based on sequence multidimensional representation. By encoding and embedding sequence features and then feeding them into the model to identify AMPs, high-precision classification of AMPs and Non-AMPs with lengths of 10–200 is achieved. The results show that our method improved accuracy by 1.05% compared to the most advanced model in independent data validation without decreasing other indicators.
|
15
|
Zhang ZM, Zhao JP, Wei PJ, Zheng CH. iPromoter-CLA: Identifying promoters and their strength by deep capsule networks with bidirectional long short-term memory. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2022; 226:107087. [PMID: 36099675 DOI: 10.1016/j.cmpb.2022.107087] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/24/2022] [Revised: 05/14/2022] [Accepted: 08/23/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND AND OBJECTIVE The promoter is a fragment of DNA and a specific sequence with transcriptional regulation function in DNA. Promoters are located upstream of the transcription start site and are used to initiate downstream gene expression. So far, promoter identification is mainly achieved by biological methods, which often require more effort. Identifying promoter types through computational methods has become a more effective approach to classification and prediction. METHODS In this study, we proposed a new capsule network and recurrent neural network hybrid model to identify promoters and predict their strength. Firstly, we used one-hot encoding to represent the DNA sequence. Secondly, we used three one-dimensional convolutional layers, a one-dimensional convolutional capsule layer and a digit capsule layer to learn local features. Thirdly, a bidirectional long short-term memory network was utilized to extract global features. Finally, we adopted the self-attention mechanism to improve the contribution of relatively important features, which further enhances the performance of the model. RESULTS Our model attains cross-validation accuracies of 86% and 73.46% in prokaryotic promoter recognition and promoter strength prediction, respectively, which showcases better performance compared with the existing approaches in both the first-layer promoter identification and the second-layer promoter strength prediction. CONCLUSIONS Our model not only combines a convolutional neural network and a capsule layer but also uses a self-attention mechanism to better capture hidden information features from the perspective of sequence. Thus, we hope that our model can be widely applied to other components.
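The one-hot encoding step named in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the channel order `ACGT`, the function name `one_hot_encode`, and the choice to map ambiguous bases such as `N` to an all-zero row are assumptions made here for illustration.

```python
# Illustrative sketch only; not taken from the iPromoter-CLA implementation.
ALPHABET = "ACGT"  # assumed channel order


def one_hot_encode(sequence):
    """Return one 4-element row per nucleotide in the sequence."""
    rows = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        index = ALPHABET.find(base)
        if index >= 0:
            row[index] = 1
        # Bases outside ACGT (for example N) stay all-zero: an assumption.
        rows.append(row)
    return rows


encoded = one_hot_encode("ACGTN")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```

A matrix of this shape (sequence length by 4) is the kind of input that one-dimensional convolutional layers typically consume.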
Affiliation(s)
- Zhi-Min Zhang
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
| | - Jian-Ping Zhao
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China.
| | - Pi-Jing Wei
- Institutes of Physical Science and Information Technology, Anhui University, Hefei, China
| | - Chun-Hou Zheng
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China; School of Artificial Intelligence, Anhui University, Hefei, China
| |
|
16
|
Omics Data and Data Representations for Deep Learning-Based Predictive Modeling. Int J Mol Sci 2022; 23:ijms232012272. [PMID: 36293133 PMCID: PMC9603455 DOI: 10.3390/ijms232012272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2022] [Revised: 10/03/2022] [Accepted: 10/12/2022] [Indexed: 11/25/2022] Open
Abstract
Medical discoveries mainly depend on the capability to process and analyze biological datasets, which inundate the scientific community and are still expanding as the cost of next-generation sequencing technologies is decreasing. Deep learning (DL) is a viable method to exploit this massive data stream since it has advanced quickly through successive innovations. However, an obstacle to scientific progress emerges: DL is difficult to apply to biology because both fields are evolving at a breakneck pace, making it hard for an individual to occupy the front lines of both of them. This paper aims to bridge the gap and help computer scientists bring their valuable expertise into the life sciences. This work provides an overview of the most common types of biological data and data representations that are used to train DL models, with additional information on the models themselves and the various tasks that are being tackled. This is the essential information a DL expert with no background in biology needs in order to participate in DL-based research projects in biomedicine, biotechnology, and drug discovery. Alternatively, this study could also be useful to researchers in biology to understand and utilize the power of DL to gain better insights into and extract important information from the omics data.
|
17
|
DeeProPre: A promoter predictor based on deep learning. Comput Biol Chem 2022; 101:107770. [PMID: 36116322 DOI: 10.1016/j.compbiolchem.2022.107770] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/09/2022] [Revised: 08/06/2022] [Accepted: 09/11/2022] [Indexed: 11/21/2022]
Abstract
The promoter is a DNA sequence recognized, bound and transcribed by RNA polymerase. It is usually located upstream of, or at the 5' end of, the transcription start site (TSS). Studies have shown that the structure of the promoter affects its affinity for RNA polymerase, thus affecting the level of gene expression. Therefore, the correct identification of core promoters and common structural genes is of great significance in the field of biomedicine. At present, many methods have been proposed to improve the accuracy of promoter recognition, but their performance still needs to be further improved. In this study, a deep learning algorithm (DeeProPre) based on bidirectional long short-term memory (BiLSTM) and convolutional neural network (CNN) was proposed. Firstly, the supervised embedding layer was applied to map the sequence to a high-dimensional space. Secondly, two 1D convolutional layers, a BiLSTM layer and an attention mechanism layer were used for extracting features. Finally, a fully connected layer activated by the sigmoid function was used to obtain the probability of classification into target categories. This model can identify the promoter region of eukaryotes with high accuracy, providing an analytical basis for further understanding of promoter physiological functions and studies of gene transcription mechanisms. The source code of DeeProPre is freely available at https://github.com/zzwwmmm/DeeProPre/tree/master.
|
18
|
Liu SZ, Xu YC, Tan XY, Zhao T, Zhang DG, Yang H, Luo Z. Transcriptional Regulation and Protein Localization of Zip10, Zip13 and Zip14 Transporters of Freshwater Teleost Yellow Catfish Pelteobagrus fulvidraco Following Zn Exposure in a Heterologous HEK293T Model. Int J Mol Sci 2022; 23:ijms23148034. [PMID: 35887381 PMCID: PMC9321221 DOI: 10.3390/ijms23148034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2022] [Revised: 07/15/2022] [Accepted: 07/18/2022] [Indexed: 12/04/2022] Open
Abstract
Zip family proteins are involved in the control of zinc (Zn) ion homeostasis. The present study cloned the promoters and investigated the transcription responses and protein subcellular localizations of three LIV-1 subfamily members (zip10, zip13, and zip14) from common freshwater teleost yellow catfish, Pelteobagrus fulvidraco, using in vitro cultured HEK293T model cells. The 2278 bp, 1917 bp, and 1989 bp sequences of zip10, zip13, and zip14 promoters, respectively, were subcloned into pGL3-Basic plasmid for promoter activity analysis. The pcDNA3.1 plasmid coding EGFP tagged pfZip10, pfZip13, and pfZip14 were generated for subsequent confocal microscope analysis. Several potential transcription factors’ binding sites were predicted within the promoters. In vitro promoter analysis in the HEK293T cells showed that high Zn administration significantly reduced the transcriptional activities of the zip10, zip13, and zip14 promoters. The −2017 bp/−2004 bp MRE in the zip10 promoter, the −360 bp/−345 bp MRE in the zip13 promoter, and the −1457 bp/−1442 bp MRE in the zip14 promoter were functional loci that were involved in the regulation of the three zips. The −606 bp/−594 bp KLF4 binding site in the zip13 promoter was a functional locus responsible for zinc-responsive regulation of zip13. The −1383 bp/−1375 bp STAT3 binding site in the zip14 promoter was a functional locus responsible for zinc-responsive regulation of zip14. Moreover, confocal microscope analysis indicated that zinc incubation significantly reduced the fluorescence intensity of pfZip10-EGFP and pfZip14-EGFP but had no significant influence on pfZip13-EGFP fluorescence intensity. Further investigation found that pfZip10 localizes on cell membranes, pfZip14 colocalized with both cell membranes and lysosome, and pfZip13 colocalized with intracellular ER and Golgi. Our research illustrated the transcription regulation of zip10, zip13, and zip14 from P. fulvidraco under zinc administration, which provided a reference value for the mechanisms involved in Zip-family-mediated control of zinc homeostasis in vertebrates.
Affiliation(s)
- Sheng-Zan Liu
- Hubei Hongshan Laboratory, Fishery College, Huazhong Agricultural University, Wuhan 430070, China; (S.-Z.L.); (Y.-C.X.); (X.-Y.T.); (T.Z.); (D.-G.Z.); (H.Y.)
| | - Yi-Chuang Xu
- Hubei Hongshan Laboratory, Fishery College, Huazhong Agricultural University, Wuhan 430070, China; (S.-Z.L.); (Y.-C.X.); (X.-Y.T.); (T.Z.); (D.-G.Z.); (H.Y.)
| | - Xiao-Ying Tan
- Hubei Hongshan Laboratory, Fishery College, Huazhong Agricultural University, Wuhan 430070, China; (S.-Z.L.); (Y.-C.X.); (X.-Y.T.); (T.Z.); (D.-G.Z.); (H.Y.)
| | - Tao Zhao
- Hubei Hongshan Laboratory, Fishery College, Huazhong Agricultural University, Wuhan 430070, China; (S.-Z.L.); (Y.-C.X.); (X.-Y.T.); (T.Z.); (D.-G.Z.); (H.Y.)
| | - Dian-Guang Zhang
- Hubei Hongshan Laboratory, Fishery College, Huazhong Agricultural University, Wuhan 430070, China; (S.-Z.L.); (Y.-C.X.); (X.-Y.T.); (T.Z.); (D.-G.Z.); (H.Y.)
| | - Hong Yang
- Hubei Hongshan Laboratory, Fishery College, Huazhong Agricultural University, Wuhan 430070, China; (S.-Z.L.); (Y.-C.X.); (X.-Y.T.); (T.Z.); (D.-G.Z.); (H.Y.)
| | - Zhi Luo
- Hubei Hongshan Laboratory, Fishery College, Huazhong Agricultural University, Wuhan 430070, China; (S.-Z.L.); (Y.-C.X.); (X.-Y.T.); (T.Z.); (D.-G.Z.); (H.Y.)
- Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, China
- Correspondence: ; Tel.: +86-27-8728-2113; Fax: +86-27-8728-2114
| |
|
19
|
BERT-Promoter: An improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection. Comput Biol Chem 2022; 99:107732. [PMID: 35863177 DOI: 10.1016/j.compbiolchem.2022.107732] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/16/2022] [Accepted: 07/12/2022] [Indexed: 02/01/2023]
Abstract
A promoter is a sequence of DNA that initializes the process of transcription and regulates when and where genes are expressed in the organism. Because of its importance in molecular biology, identifying DNA promoters is a challenging task that can provide useful information about their functions and related diseases. Several computational models have been developed over the past decade for the early prediction of promoters from high-throughput sequencing. Although some useful predictors have been proposed, there remain shortfalls in those models and there is an urgent need to enhance the predictive performance to meet practical requirements. In this study, we proposed a novel architecture that incorporated transformer natural language processing (NLP) and explainable machine learning to address this problem. More specifically, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model was employed to encode DNA sequences, and SHapley Additive exPlanations (SHAP) analysis served as a feature selection step to look at the top-rank BERT encodings. At the last stage, different machine learning classifiers were implemented to learn the top features and produce the prediction outcomes. This study not only predicted the DNA promoters but also their activities (strong or weak promoters). Overall, several experiments showed an accuracy of 85.5% and 76.9% for these two levels, respectively. Our performance was superior to previously published predictors on the same dataset in most measurement metrics. We named our predictor BERT-Promoter and it is freely available at https://github.com/khanhlee/bert-promoter.
|
20
|
Asim MN, Ibrahim MA, Malik MI, Razzak I, Dengel A, Ahmed S. Histone-Net: a multi-paradigm computational framework for histone occupancy and modification prediction. COMPLEX INTELL SYST 2022. [DOI: 10.1007/s40747-022-00802-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Abstract
Deep exploration of histone occupancy and covalent post-translational modifications (e.g., acetylation, methylation) is essential to decode gene expression regulation, chromosome packaging, DNA damage, and transcriptional activation. Existing computational approaches are unable to precisely predict histone occupancy and modifications mainly due to the use of sub-optimal statistical representation of histone sequences. For the establishment of an improved histone occupancy and modification landscape for multiple histone markers, the paper in hand presents an end-to-end computational multi-paradigm framework “Histone-Net”. To learn local and global residue context aware sequence representation, Histone-Net generates unsupervised higher order residue embeddings (DNA2Vec) and presents a different application of language modelling, where it encapsulates histone occupancy and modification information while generating higher order residue embeddings (SuperDNA2Vec) in a supervised manner. We perform an intrinsic and extrinsic evaluation of both presented distributed representation learning schemes. A comprehensive empirical evaluation of Histone-Net over ten benchmark histone markers data sets for three different histone sequence analysis tasks indicates that SuperDNA2Vec sequence representation and softmax classifier-based approach outperforms state-of-the-art approach by an average accuracy of 7%. To eliminate the overhead of training separate binary classifiers for all ten histone markers, Histone-Net is evaluated in multi-label classification paradigm, where it produces decent performance for simultaneous prediction of histone occupancy, acetylation, and methylation.
|
21
|
Abstract
As the vital technology of natural language understanding, sentence representation reasoning technology mainly focuses on sentence representation methods and reasoning models. Although the performance has been improved, there are still some problems, such as incomplete sentence semantic expression, lack of depth of reasoning model, and lack of interpretability of the reasoning process. Given the reasoning model’s lack of reasoning depth and interpretability, a deep fusion matching network is designed in this paper, which mainly includes a coding layer, matching layer, dependency convolution layer, information aggregation layer, and inference prediction layer. Based on a deep matching network, the matching layer is improved. Furthermore, the heuristic matching algorithm replaces the bidirectional long-short memory neural network to simplify the interactive fusion. As a result, it improves the reasoning depth and reduces the complexity of the model; the dependency convolution layer uses the tree-type convolution network to extract the sentence structure information along with the sentence dependency tree structure, which improves the interpretability of the reasoning process. Finally, the performance of the model is verified on several datasets. The results show that the reasoning effect of the model is better than that of the shallow reasoning model, and the accuracy rate on the SNLI test set reaches 89.0%. At the same time, the semantic correlation analysis results show that the dependency convolution layer is beneficial in improving the interpretability of the reasoning process.
|
22
|
Qiao H, Zhang S, Xue T, Wang J, Wang B. iPro-GAN: A novel model based on generative adversarial learning for identifying promoters and their strength. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2022; 215:106625. [PMID: 35038653 DOI: 10.1016/j.cmpb.2022.106625] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/11/2021] [Revised: 12/13/2021] [Accepted: 01/06/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND AND OBJECTIVE A promoter is a component of the gene that can specifically bind RNA polymerase, determines where transcription starts, and also determines the transcription efficiency of the gene. Promoters can be divided into strong promoters and weak promoters because their structures and the interaction time interval are quite different. The functional variation of the promoter can lead to a variety of diseases. Therefore, identifying promoters and their strength is necessary and has important biological significance. A novel and promising model based on deep learning is proposed to achieve it. METHODS In this work, we build a powerful model named iPro-GAN for identification of promoters and their strength. First, we collect benchmark datasets and independent datasets for training and testing. Then, a Moran-based spatial auto-cross correlation method is used for feature extraction. Finally, a deep convolution generative adversarial network with 10-fold cross validation is applied for classification. The first layer of the model is used to identify the promoter and the second layer is used to determine its type. RESULTS On the benchmark data set, the accuracy of the first layer predictor is 93.15%, and the accuracy of the second layer predictor is 92.30%. On the independent data set, the accuracy of the first layer predictor is 86.77%, and the accuracy of the second layer predictor is 91.66%. In particular, breakthrough progress has been made in the identification of promoters' strength. CONCLUSIONS These results are far higher than those of the existing best predictor, which indicates that our model is serviceable and practicable for identifying promoters and their strength. Furthermore, the datasets and source codes are available from this link: https://github.com/Bovbene/iPro-GAN.
Affiliation(s)
- Huijuan Qiao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China.
| | - Tian Xue
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Jinyue Wang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Bowei Wang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| |
|
23
|
Exploring Language Markers of Mental Health in Psychiatric Stories. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12042179] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 12/21/2022]
Abstract
Diagnosing mental disorders is complex due to the genetic, environmental and psychological contributors and the individual risk factors. Language markers for mental disorders can help to diagnose a person. Research thus far on language markers and the associated mental disorders has been done mainly with the Linguistic Inquiry and Word Count (LIWC) program. In order to improve on this research, we employed a range of Natural Language Processing (NLP) techniques using LIWC, spaCy, fastText and RobBERT to analyse Dutch psychiatric interview transcriptions with both rule-based and vector-based approaches. Our primary objective was to predict whether a patient had been diagnosed with a mental disorder, and if so, the specific mental disorder type. Furthermore, the second goal of this research was to find out which words are language markers for which mental disorder. LIWC in combination with the random forest classification algorithm performed best in predicting whether a person had a mental disorder or not (accuracy: 0.952; Cohen’s kappa: 0.889). SpaCy in combination with random forest predicted best which particular mental disorder a patient had been diagnosed with (accuracy: 0.429; Cohen’s kappa: 0.304).
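The abstract above reports Cohen's kappa alongside accuracy. As a reminder of what that statistic measures, the following is a minimal sketch of the standard formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name and the toy labels are hypothetical and are not data from the study.

```python
# Illustrative sketch of Cohen's kappa; the labels below are made-up toy data.
def cohens_kappa(y_true, y_pred):
    """Chance-corrected agreement between two equal-length label sequences."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # Observed agreement: fraction of items where both sequences match.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement if both labelings were independent.
    expected = sum(
        (y_true.count(label) / n) * (y_pred.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1.0 - expected)


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(y_true, y_pred)
print(round(kappa, 3))
```

In this toy case six of eight labels agree (accuracy 0.75) while chance agreement is 0.5, so kappa is 0.5. This shows why kappa is usually lower than raw accuracy.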
|
24
|
Kabir M, Nantasenamat C, Kanthawong S, Charoenkwan P, Shoombuatong W. Large-scale comparative review and assessment of computational methods for phage virion proteins identification. EXCLI JOURNAL 2022; 21:11-29. [PMID: 35145365 PMCID: PMC8822302 DOI: 10.17179/excli2021-4411] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/12/2021] [Accepted: 11/29/2021] [Indexed: 12/11/2022]
Abstract
Phage virion proteins (PVPs) are effective at recognizing and binding to host cell receptors while having no deleterious effects on human or animal cells. Understanding their functional mechanisms is regarded as a critical goal that will aid in rational antibacterial drug discovery and development. Although high-throughput experimental methods for identifying PVPs are considered the gold standard for exploring crucial PVP features, these procedures are frequently time-consuming and labor-intensive. Thus far, more than ten sequence-based predictors have been established for the in silico identification of PVPs in conjunction with traditional experimental approaches. As a result, a revised and more thorough assessment is extremely desirable. With this purpose in mind, we first conduct a thorough survey and evaluation of a vast array of 13 state-of-the-art PVP predictors. These PVP predictors can be classified into three groups according to the types of machine learning (ML) algorithms employed (i.e. traditional ML-based methods, ensemble-based methods and deep learning-based methods). Subsequently, we explored which factors are important for building more accurate and stable predictors, including training/independent datasets, feature encoding algorithms, feature selection methods, core algorithms, performance evaluation metrics/strategies and web servers. Finally, we provide insights and future perspectives for the design and development of new and more effective computational approaches for the detection and characterization of PVPs.
Affiliation(s)
- Muhammad Kabir
- School of Systems and Technology, Department of Computer Science, University of Management and Technology, Lahore, Pakistan, 54770
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
| | - Sakawrat Kanthawong
- Department of Microbiology, Faculty of Medicine, Khon Kaen University, Khon Kaen, Thailand, 40002
| | - Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, Thailand, 50200
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
| |
|
25
|
Harigua-Souiai E, Heinhane MM, Abdelkrim YZ, Souiai O, Abdeljaoued-Tej I, Guizani I. Deep Learning Algorithms Achieved Satisfactory Predictions When Trained on a Novel Collection of Anticoronavirus Molecules. Front Genet 2021; 12:744170. [PMID: 34912370 PMCID: PMC8667578 DOI: 10.3389/fgene.2021.744170] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/19/2021] [Accepted: 09/30/2021] [Indexed: 12/26/2022] Open
Abstract
Drug discovery and repurposing against COVID-19 is a highly relevant topic with huge efforts dedicated to delivering novel therapeutics targeting SARS-CoV-2. In this context, computer-aided drug discovery is of interest in orienting the early high throughput screenings and in optimizing the hit identification rate. We herein propose a pipeline for Ligand-Based Drug Discovery (LBDD) against SARS-CoV-2. Through an extensive search of the literature and multiple steps of filtering, we integrated information on 2,610 molecules having a validated effect against SARS-CoV and/or SARS-CoV-2. The chemical structures of these molecules were encoded through multiple systems to be readily useful as input to conventional machine learning (ML) algorithms or deep learning (DL) architectures. We assessed the performances of seven ML algorithms and four DL algorithms in achieving molecule classification into two classes: active and inactive. The Random Forests (RF), Graph Convolutional Network (GCN), and Directed Acyclic Graph (DAG) models achieved the best performances. These models were further optimized through hyperparameter tuning and achieved ROC-AUC scores through cross-validation of 85, 83, and 79% for RF, GCN, and DAG models, respectively. An external validation step on the FDA-approved drugs collection revealed a superior potential of DL algorithms to achieve drug repurposing against SARS-CoV-2 based on the dataset herein presented. Namely, GCN and DAG achieved more than 50% of the true positive rate assessed on the confirmed hits of a PubChem bioassay.
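The ROC-AUC scores quoted in the abstract above can be read as the probability that a randomly chosen active molecule receives a higher score than a randomly chosen inactive one. The sketch below computes ROC-AUC through that pairwise (Mann-Whitney) interpretation. It is an illustration only: the function name and the toy labels and scores are hypothetical, not values from the study.

```python
# Illustrative sketch of ROC-AUC via pairwise comparison; toy data only.
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0, 1, 0]            # 1 = active, 0 = inactive (toy labels)
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]  # hypothetical model scores
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The quadratic pairwise loop is chosen for clarity. For large datasets a rank-based computation gives the same value more efficiently.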
Affiliation(s)
- Emna Harigua-Souiai
- Laboratory of Molecular Epidemiology and Experimental Pathology-LR16IPT04, Institut Pasteur de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| | - Mohamed Mahmoud Heinhane
- Laboratory of Molecular Epidemiology and Experimental Pathology-LR16IPT04, Institut Pasteur de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| | - Yosser Zina Abdelkrim
- Laboratory of Molecular Epidemiology and Experimental Pathology-LR16IPT04, Institut Pasteur de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| | - Oussama Souiai
- Laboratory of BioInformatics BioMathematics and BioStatistics (BIMS)-LR20IPT09, Institut Pasteur de Tunis, University of Tunis El Manar, Tunis, Tunisia
| | - Ines Abdeljaoued-Tej
- Laboratory of BioInformatics BioMathematics and BioStatistics (BIMS)-LR20IPT09, Institut Pasteur de Tunis, University of Tunis El Manar, Tunis, Tunisia
- Engineering School of Statistics and Information Analysis, University of Carthage, Ariana, Tunisia
| | - Ikram Guizani
- Laboratory of Molecular Epidemiology and Experimental Pathology-LR16IPT04, Institut Pasteur de Tunis, Université de Tunis El Manar, Tunis, Tunisia
| |
|
26
|
Kim J, Jeong SY, Kim BC, Byun BH, Lim I, Kong CB, Song WS, Lim SM, Woo SK. Prediction of Neoadjuvant Chemotherapy Response in Osteosarcoma Using Convolutional Neural Network of Tumor Center 18F-FDG PET Images. Diagnostics (Basel) 2021; 11:diagnostics11111976. [PMID: 34829324 PMCID: PMC8617812 DOI: 10.3390/diagnostics11111976] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/31/2021] [Revised: 10/14/2021] [Accepted: 10/20/2021] [Indexed: 12/24/2022] Open
Abstract
We compared the accuracy of predicting the response to neoadjuvant chemotherapy (NAC) in osteosarcoma patients between machine learning approaches applied to the whole tumor using fluorine-18 fluorodeoxyglucose (18F-FDG) uptake heterogeneity features and a convolutional neural network applied to the intratumor image region. In 105 patients with osteosarcoma, 18F-FDG positron emission tomography/computed tomography (PET/CT) images were acquired before (baseline PET0) and after NAC (PET1). Patients were divided into responders and non-responders to neoadjuvant chemotherapy. Quantitative 18F-FDG heterogeneity features were calculated using LIFEX version 4.0. Receiver operating characteristic (ROC) curve analysis of 18F-FDG uptake heterogeneity features was used to predict the response to NAC. Machine learning (ML) algorithms and 2-dimensional convolutional neural network (2D CNN) deep learning networks were evaluated for predicting NAC response with the baseline PET0 images of the 105 patients. ML was performed using the entire tumor image. The accuracy of the 2D CNN prediction model was evaluated using total tumor slices, the center 20 slices, the center 10 slices, and the center slice. A total of 80 patients were used for k-fold validation in five groups of 16 patients. The CNN test accuracy was estimated using 25 patients. The areas under the ROC curves (AUCs) for baseline PET maximum standardized uptake value (SUVmax), total lesion glycolysis (TLG), metabolic tumor volume (MTV), and gray level size zone matrix (GLSZM) were 0.532, 0.507, 0.510, and 0.626, respectively. The test accuracies of texture-feature machine learning with random forest and support vector machine were 0.55 and 0.54, respectively. The k-fold validation accuracy and validation accuracy were 0.968 ± 0.01 and 0.610 ± 0.04, respectively. The test accuracies for total tumor slices, the center 20 slices, the center 10 slices, and the center slice were 0.625, 0.616, 0.628, and 0.760, respectively.
Machine learning on baseline PET0 texture features predicted NAC response poorly, but the 2D CNN using 18F-FDG baseline PET0 images could predict the treatment response before chemotherapy in osteosarcoma. Additionally, the 2D CNN prediction model using a tumor center slice of 18F-FDG PET images acquired before NAC can help decide whether to perform NAC to treat osteosarcoma patients.
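The validation scheme mentioned in this abstract (80 patients split into five groups of 16) corresponds to a standard five-fold partition. The following is a minimal sketch of such a partition, assuming a simple random shuffle; the helper names `k_fold_indices` and `train_validation_splits` and the seed are hypothetical and do not come from the paper.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k equally sized folds."""
    if n_samples % k != 0:
        raise ValueError("this sketch assumes n_samples is divisible by k")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    fold_size = n_samples // k
    return [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]


def train_validation_splits(folds):
    """Yield (train, validation) index lists, holding out one fold at a time."""
    for held_out, validation in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, validation


# 80 patients, five folds of 16, as stated in the abstract.
folds = k_fold_indices(80, 5)
splits = list(train_validation_splits(folds))
```

Each of the five splits trains on 64 patients and validates on the remaining 16, so every patient is used for validation exactly once.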
Affiliation(s)
- Jingyu Kim
- Radiological & Medico-Oncological Sciences, University of Science & Technology, Seoul 34113, Korea;
| | - Su Young Jeong
- College of Medicine, University of Ulsan, Seoul 05505, Korea;
| | - Byung-Chul Kim
- Department of Nuclear Medicine, Korea Institute of Radiology and Medical Sciences, Seoul 01812, Korea; (B.-C.K.); (B.-H.B.); (I.L.); (S.M.L.)
| | - Byung-Hyun Byun
- Department of Nuclear Medicine, Korea Institute of Radiology and Medical Sciences, Seoul 01812, Korea; (B.-C.K.); (B.-H.B.); (I.L.); (S.M.L.)
| | - Ilhan Lim
- Department of Nuclear Medicine, Korea Institute of Radiology and Medical Sciences, Seoul 01812, Korea; (B.-C.K.); (B.-H.B.); (I.L.); (S.M.L.)
| | - Chang-Bae Kong
- Department of Orthopedic Surgery, Korea Institute of Radiology and Medical Sciences, Seoul 01812, Korea; (C.-B.K.); (W.S.S.)
| | - Won Seok Song
- Department of Orthopedic Surgery, Korea Institute of Radiology and Medical Sciences, Seoul 01812, Korea; (C.-B.K.); (W.S.S.)
| | - Sang Moo Lim
- Department of Nuclear Medicine, Korea Institute of Radiology and Medical Sciences, Seoul 01812, Korea; (B.-C.K.); (B.-H.B.); (I.L.); (S.M.L.)
| | - Sang-Keun Woo
- Radiological & Medico-Oncological Sciences, University of Science & Technology, Seoul 34113, Korea;
- Department of Nuclear Medicine, Korea Institute of Radiology and Medical Sciences, Seoul 01812, Korea; (B.-C.K.); (B.-H.B.); (I.L.); (S.M.L.)
- Correspondence:
| |
|
27
|
MacPhillamy C, Pitchford WS, Alinejad-Rokny H, Low WY. Opportunity to improve livestock traits using 3D genomics. Anim Genet 2021; 52:785-798. [PMID: 34494283 DOI: 10.1111/age.13135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/24/2021] [Indexed: 11/30/2022]
Abstract
The advent of high-throughput chromosome conformation capture and sequencing (Hi-C) has enabled researchers to probe the 3D architecture of the mammalian genome in a genome-wide manner. Simultaneously, advances in epigenomic assays, such as chromatin immunoprecipitation and sequencing (ChIP-seq) and DNase-seq, have enabled researchers to study cis-regulatory interactions and chromatin accessibility across the same genome-wide scale. The use of these data has revealed many unique insights into gene regulation and disease pathomechanisms in several model organisms. With the advent of these high-throughput sequencing technologies, there has been an ever-increasing number of datasets available for study; however, this is often limited to model organisms. Livestock species play critical roles in the economies of developing and developed nations alike. Despite this, they are greatly underrepresented in the 3D genomics space; Hi-C and related technologies have the potential to revolutionise livestock breeding by enabling a more comprehensive understanding of how production traits are controlled. The growth in human and model organism Hi-C data has seen a surge in the availability of computational tools for use in 3D genomics, with some tools using machine learning techniques to predict features and improve dataset quality. In this review, we provide an overview of the 3D genome and discuss the status of 3D genomics in livestock before delving into advancing the field by drawing inspiration from research in human and mouse. We end by offering future directions for livestock research in the field of 3D genomics.
Affiliation(s)
- C MacPhillamy
- Davies Livestock Research Centre, The University of Adelaide, Roseworthy Campus, Mudla Wirra Rd, Roseworthy, SA, 5371, Australia
| | - W S Pitchford
- Davies Livestock Research Centre, The University of Adelaide, Roseworthy Campus, Mudla Wirra Rd, Roseworthy, SA, 5371, Australia
| | - H Alinejad-Rokny
- Biological & Medical Machine Learning Lab, The Graduate School of Biomedical Engineering, UNSW Sydney, Sydney, NSW, 2052, Australia.,School of Computer Science and Engineering, The University of New South Wales (UNSW Sydney), Sydney, NSW, 2052, Australia
| | - W Y Low
- Davies Livestock Research Centre, The University of Adelaide, Roseworthy Campus, Mudla Wirra Rd, Roseworthy, SA, 5371, Australia
| |
|
28
|
Zhao H, Zhang J, Fu X, Mao D, Qi X, Liang S, Meng G, Song Z, Yang R, Guo Z, Tong B, Sun M, Zuo B, Li G. Integrated bioinformatics analysis of the NEDD4 family reveals a prognostic value of NEDD4L in clear-cell renal cell cancer. PeerJ 2021; 9:e11880. [PMID: 34458018 PMCID: PMC8378337 DOI: 10.7717/peerj.11880] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 02/03/2021] [Accepted: 07/07/2021] [Indexed: 12/20/2022] Open
Abstract
The members of the Nedd4-like E3 family participate in various biological processes. However, their role in clear cell renal cell carcinoma (ccRCC) is not clear. This study systematically analyzed the Nedd4-like E3 family members in ccRCC data sets from multiple publicly available databases. NEDD4L was identified as the only NEDD4 family member differentially expressed in ccRCC compared with normal samples. Bioinformatics tools were used to characterize the function of NEDD4L in ccRCC. Co-expression analysis, followed by gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis, indicated that NEDD4L might regulate cellular energy metabolism. A prognostic model developed by the LASSO Cox regression method showed a relatively good predictive value in training and testing data sets. The results revealed that NEDD4L was associated with biosynthesis and metabolism in ccRCC. Since NEDD4L is downregulated and dysregulation of metabolism is involved in tumor progression, NEDD4L might be a potential therapeutic target in ccRCC.
Affiliation(s)
- Hui Zhao
- Department of Urology, Affiliated Hospital of Weifang Medical University, Weifang, China.,Department of Urology, China Rehabilitation Research Centre, Rehabilitation School of Capital Medical University, Beijing, China
| | - Junjun Zhang
- Department of Oncology, The Third Xiangya Hospital of Central South University, Changsha, China
| | - Xiaoliang Fu
- Department of Urology, The Second Affiliated Hospital of Air Force Medical University, Xian, China
| | - Dongdong Mao
- Department of Urology, Affiliated Hospital of Weifang Medical University, Weifang, China
| | - Xuesen Qi
- Department of Urology, Affiliated Hospital of Weifang Medical University, Weifang, China
| | - Shuai Liang
- Department of Urology, Affiliated Hospital of Weifang Medical University, Weifang, China
| | - Gang Meng
- Department of Urology, Affiliated Hospital of Weifang Medical University, Weifang, China
| | - Zewen Song
- Department of Oncology, The Third Xiangya Hospital of Central South University, Changsha, China
| | - Ru Yang
- Henan Key Laboratory of Neurorestoratology, The First Affiliated Hospital of Xinxiang Medical University, Weihui, China
| | - Zhenni Guo
- College of Life Science and Agronomy, Zhoukou Normal University, Zhoukou, China
| | - Binghua Tong
- College of Life Science and Agronomy, Zhoukou Normal University, Zhoukou, China
| | - Meiqing Sun
- College of Life Science and Agronomy, Zhoukou Normal University, Zhoukou, China
| | - Baile Zuo
- Tumor Molecular Immunology and Immunotherapy Laboratory, School of Laboratory Medicine, Xinxiang Medical University, Xinxiang, China
| | - Guoyin Li
- College of Life Science and Agronomy, Zhoukou Normal University, Zhoukou, China.,Academy of Medical Science, Zhengzhou University, Zhengzhou, China
| |
|
29
|
Chen C, Shi H, Jiang Z, Salhi A, Chen R, Cui X, Yu B. DNN-DTIs: Improved drug-target interactions prediction using XGBoost feature selection and deep neural network. Comput Biol Med 2021; 136:104676. [PMID: 34375902 DOI: 10.1016/j.compbiomed.2021.104676] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 03/13/2021] [Revised: 07/18/2021] [Accepted: 07/19/2021] [Indexed: 02/03/2023]
Abstract
Analysis and prediction of drug-target interactions (DTIs) play an important role in understanding drug mechanisms, as well as drug repositioning and design. Machine learning (ML)-based methods for DTIs prediction can mitigate the shortcomings of time-consuming and labor-intensive experimental approaches, while providing new ideas and insights for drug design. We propose a novel pipeline for predicting drug-target interactions, called DNN-DTIs. First, the target information is characterized by a number of features, namely, pseudo-amino acid composition, pseudo position-specific scoring matrix, conjoint triad composition, transition and distribution, Moreau-Broto autocorrelation, and structural features. The drug compounds are subsequently encoded using substructure fingerprints. Next, eXtreme gradient boosting (XGBoost) is used to determine the subset of non-redundant features of importance. The optimal balanced set of sample vectors is obtained by applying the synthetic minority oversampling technique (SMOTE). Finally, a DTIs predictor, DNN-DTIs, is developed based on a deep neural network (DNN) via a layer-by-layer learning scheme. Experimental results indicate that DNN-DTIs achieves better performance than other state-of-the-art predictors with ACC values of 98.78%, 98.60%, 97.98%, 98.24% and 98.00% on Enzyme, Ion Channels (IC), GPCR, Nuclear Receptors (NR) and Kuang's datasets. Therefore, the accurate prediction performance of DNN-DTIs makes it a favored choice for contributing to the study of DTIs, especially drug repositioning.
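The class-balancing step named in this abstract, the synthetic minority oversampling technique (SMOTE), creates new minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that core idea under simplifying assumptions; it is not the DNN-DTIs implementation, and the function name `smote_like_oversample`, the neighbour count, and the toy points are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority class with three samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like_oversample(minority, n_new=4)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class.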
Affiliation(s)
- Cheng Chen
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Computer Science and Technology, Shandong University, Qingdao, 266237, China
| | - Han Shi
- Key Laboratory of Synthetic Biology, CAS Center for Excellence in Molecular Plant Sciences, Chinese Academy of Sciences, Shanghai, 200032, China
| | - Zhiwen Jiang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Adil Salhi
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
| | - Ruixin Chen
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Xuefeng Cui
- School of Computer Science and Technology, Shandong University, Qingdao, 266237, China
| | - Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; Key Laboratory of Computational Science and Application of Hainan Province, Haikou, 571158, China.
| |
|
30
|
Challa AP, Zaleski NM, Jerome RN, Lavieri RR, Shirey-Rice JK, Barnado A, Lindsell CJ, Aronoff DM, Crofford LJ, Harris RC, Alp Ikizler T, Mayer IA, Holroyd KJ, Pulley JM. Human and Machine Intelligence Together Drive Drug Repurposing in Rare Diseases. Front Genet 2021; 12:707836. [PMID: 34394194 PMCID: PMC8355705 DOI: 10.3389/fgene.2021.707836] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/10/2021] [Accepted: 07/06/2021] [Indexed: 01/31/2023] Open
Abstract
Repurposing is an increasingly attractive method within the field of drug development for its efficiency at identifying new therapeutic opportunities among approved drugs at greatly reduced cost and time compared with more traditional methods. Repurposing has generated significant interest in the realm of rare disease treatment as an innovative strategy for finding ways to manage these complex conditions. The selection of which agents should be tested in which conditions is currently informed by both human and machine discovery, yet the appropriate balance between these approaches, including the role of artificial intelligence (AI), remains a significant topic of discussion in drug discovery for rare diseases and other conditions. Our drug repurposing team at Vanderbilt University Medical Center synergizes machine learning techniques like phenome-wide association study (a powerful regression method for generating hypotheses about new indications for an approved drug) with the knowledge and creativity of scientific, legal, and clinical domain experts. While our computational approaches generate drug repurposing hits with a high probability of success in a clinical trial, human knowledge remains essential for the hypothesis creation, interpretation, and "go/no-go" decisions with which machines continue to struggle. Here, we reflect on our experience synergizing AI and human knowledge toward realizable patient outcomes, providing case studies from our portfolio that inform how we balance human knowledge and machine intelligence for drug repurposing in rare disease.
Affiliation(s)
- Anup P. Challa
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville, TN, United States
| | - Nicole M. Zaleski
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Rebecca N. Jerome
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Robert R. Lavieri
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Jana K. Shirey-Rice
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, United States
| | - April Barnado
- Division of Rheumatology and Immunology, Department of Medicine, Vanderbilt Medical Center, Nashville, TN, United States
| | - Christopher J. Lindsell
- Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, United States
| | - David M. Aronoff
- Division of Infectious Diseases, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Obstetrics and Gynecology, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Leslie J. Crofford
- Division of Rheumatology and Immunology, Department of Medicine, Vanderbilt Medical Center, Nashville, TN, United States
| | - Raymond C. Harris
- Division of Nephrology, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, United States
| | - T. Alp Ikizler
- Division of Nephrology, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Ingrid A. Mayer
- Division of Hematology/Oncology, Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Kenneth J. Holroyd
- Center for Technology Transfer and Commercialization, Vanderbilt University, Nashville, TN, United States
| | - Jill M. Pulley
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, United States
| |
|
31
|
Zhou YH, Saghapour E. ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data. Front Genet 2021; 12:691274. [PMID: 34276792 PMCID: PMC8283820 DOI: 10.3389/fgene.2021.691274] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/06/2021] [Accepted: 05/25/2021] [Indexed: 11/13/2022] Open
Abstract
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods mostly require scripting skills and are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often considered as a separate exercise from exploratory data analysis, but should be considered as part of the data exploration process. We have created a new graphical tool, ImputEHR, that is based on Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
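For readers unfamiliar with imputation, the simplest approach of the kind this abstract alludes to is column-mean imputation: each missing value is replaced by the mean of the observed values in the same column. The sketch below illustrates that baseline only; it is not ImputEHR code, and the function name `impute_column_means`, the use of `None` for missing values, and the toy records are assumptions.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in each column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        if not observed:
            raise ValueError(f"column {c} has no observed values")
        means.append(sum(observed) / len(observed))
    # Build a completed copy; the input rows are left unchanged.
    return [[means[c] if v is None else v for c, v in enumerate(row)] for row in rows]


# Hypothetical records with two numeric fields and some missing entries.
records = [[1.0, None], [3.0, 4.0], [None, 8.0]]
completed = impute_column_means(records)
```

Mean imputation ignores relationships between columns, which is one reason tree-based or neural-network imputers are often preferred for EHR data.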
Affiliation(s)
- Yi-Hui Zhou
- Department of Biological Science, North Carolina State University, Raleigh, NC, United States
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
| | - Ehsan Saghapour
- Department of Biological Science, North Carolina State University, Raleigh, NC, United States
| |
|
32
|
Lv H, Dao FY, Zulfiqar H, Lin H. DeepIPs: comprehensive assessment and computational identification of phosphorylation sites of SARS-CoV-2 infection using a deep learning-based approach. Brief Bioinform 2021; 22:6310410. [PMID: 34184738 PMCID: PMC8406875 DOI: 10.1093/bib/bbab244] [Citation(s) in RCA: 45] [Impact Index Per Article: 15.0] [Received: 04/08/2019] [Revised: 05/18/2020] [Accepted: 06/03/2021] [Indexed: 11/14/2022] Open
Abstract
The rapid spread of SARS-CoV-2 infection around the globe has caused a massive health and socioeconomic crisis. Identification of phosphorylation sites is an important step for understanding the molecular mechanisms of SARS-CoV-2 infection and the changes within host cell pathways. In this study, we present DeepIPs, the first deep-learning architecture designed specifically to identify phosphorylation sites in host cells infected with SARS-CoV-2. DeepIPs combines the most popular word embedding method with a convolutional neural network-long short-term memory network architecture to make the final prediction. The independent test demonstrates that DeepIPs improves the prediction performance compared with other existing tools for general phosphorylation site prediction. Based on the proposed model, a web server called DeepIPs was established and is freely accessible at http://lin-group.cn/server/DeepIPs. The source code of DeepIPs is freely available at the repository https://github.com/linDing-group/DeepIPs.
Affiliation(s)
- Hao Lv
- Center for Informational Biology at the University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Fu-Ying Dao
- Center for Informational Biology at the University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hasan Zulfiqar
- Center for Informational Biology at the University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hao Lin
- Center for Informational Biology at the University of Electronic Science and Technology of China, Chengdu 610054, China
| |
|
33
|
Chen Z, Wang X, Wang G, Xiao B, Ma Z, Huo H, Li W. A seven-lncRNA signature for predicting Ewing's sarcoma. PeerJ 2021; 9:e11599. [PMID: 34178467 PMCID: PMC8214847 DOI: 10.7717/peerj.11599] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/04/2021] [Accepted: 05/21/2021] [Indexed: 01/17/2023] Open
Abstract
Background: Long non-coding RNAs (lncRNAs) are a class of non-coding RNAs with unique characteristics. These RNAs can regulate cancer cells’ survival, proliferation, invasion, metastasis, and angiogenesis and are potential diagnostic and prognostic markers. We identified a seven-lncRNA signature related to the overall survival (OS) of patients with Ewing’s sarcoma (EWS). Methods: We used an expression profile from the Gene Expression Omnibus (GEO) database as a training cohort to screen out the OS-associated lncRNAs in EWS and further established a seven-lncRNA signature using univariate Cox regression and least absolute shrinkage and selection operator (LASSO) regression analysis. The prognostic lncRNA signature was validated in an external dataset from the International Cancer Genome Consortium (ICGC) as a validation cohort. Results: We obtained 10 survival-related lncRNAs from the Kaplan-Meier and ROC curve analysis (log-rank test P < 0.05; AUC > 0.6). Univariate Cox regression and LASSO regression analyses confirmed seven key lncRNAs, and we established a lncRNA signature to predict EWS prognosis. EWS patients in the training cohort were categorized into a low-risk group or a high-risk group based on their median risk score. The high-risk group’s survival time was significantly shorter than the low-risk group’s. This seven-lncRNA signature was further confirmed by the validation cohort. The area under the curve (AUC) for this lncRNA signature was up to 0.905 in the training group and 0.697 in the 3-year validation group. The nomogram’s calibration curves demonstrated that EWS probability in the two cohorts was consistent between the nomogram prediction and actual observation. Conclusion: We screened a seven-lncRNA signature to predict the EWS patients’ prognosis. Our findings provide a new reference for the current prognostic evaluation of EWS and a new direction for the diagnosis and treatment of EWS.
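The stratification described in this abstract (a risk score per patient, then a split at the median into low-risk and high-risk groups) is commonly computed as a coefficient-weighted sum of expression values. The sketch below shows that convention as an assumption; the coefficient values, lncRNA names, and patient data are invented for illustration and are not taken from the paper.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Linear risk score: sum of coefficient * expression over signature genes."""
    return sum(coefficients[name] * expression[name] for name in coefficients)


def split_by_median(scores):
    """Label each patient 'high' if the score exceeds the cohort median, else 'low'."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical two-lncRNA signature and four patients (illustrative values only).
coefficients = {"lncA": 0.5, "lncB": -0.25}
patients = {
    "p1": {"lncA": 2.0, "lncB": 4.0},
    "p2": {"lncA": 4.0, "lncB": 0.0},
    "p3": {"lncA": 0.0, "lncB": 4.0},
    "p4": {"lncA": 8.0, "lncB": 4.0},
}
scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
groups = split_by_median(scores)
```

In a real analysis the coefficients would come from the fitted Cox model, and the two groups would then be compared with Kaplan-Meier curves and a log-rank test.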
Affiliation(s)
- Zhihui Chen
- Department of Orthopedics, Second Affiliated Hospital of Shaanxi University of Traditional Chinese Medicine, Xianyang, Shaanxi, China
| | - Xinyu Wang
- Department of Preventive Medicine, School of Public Health, Nanchang University, Nanchang, Jiangxi, China
| | - Guozhu Wang
- Department of Orthopedics, Second Affiliated Hospital of Shaanxi University of Traditional Chinese Medicine, Xianyang, Shaanxi, China
| | - Bin Xiao
- Department of Orthopedics, Second Affiliated Hospital of Shaanxi University of Traditional Chinese Medicine, Xianyang, Shaanxi, China
| | - Zhe Ma
- Department of Orthopedics, Second Affiliated Hospital of Shaanxi University of Traditional Chinese Medicine, Xianyang, Shaanxi, China
| | - Hongliang Huo
- Department of Orthopedics, Second Affiliated Hospital of Shaanxi University of Traditional Chinese Medicine, Xianyang, Shaanxi, China
| | - Weiwei Li
- Department of Orthopedics, Second Affiliated Hospital of Shaanxi University of Traditional Chinese Medicine, Xianyang, Shaanxi, China
| |
|
34
|
Iuchi H, Matsutani T, Yamada K, Iwano N, Sumi S, Hosoda S, Zhao S, Fukunaga T, Hamada M. Representation learning applications in biological sequence analysis. Comput Struct Biotechnol J 2021; 19:3198-3208. [PMID: 34141139 PMCID: PMC8190442 DOI: 10.1016/j.csbj.2021.05.039] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 02/26/2021] [Revised: 05/10/2021] [Accepted: 05/20/2021] [Indexed: 12/16/2022] Open
Abstract
Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze a substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention. In this method, biological sequences are regarded as sentences while the single nucleic acids/amino acids or k-mers in these sequences represent the words. Embedding is an essential step in NLP, which performs the conversion of these words into vectors. Specifically, representation learning is an approach used for this transformation process, which can be applied to biological sequences. Vectorized biological sequences can then be applied for function and structure estimation, or as input for other probabilistic models. Considering the importance and growing trend for the application of representation learning to biological research, in the present study, we have reviewed the existing knowledge in representation learning for biological sequence analysis.
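The "sequence as sentence, k-mer as word" view described in this abstract can be made concrete with a small tokenizer. The sketch below splits a sequence into overlapping k-mers and maps each distinct k-mer to an integer ID, which is the usual input to an embedding layer; the function names, the choice of k = 3, and the example sequences are assumptions for illustration.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a biological sequence into overlapping k-mers ("words")."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocabulary(token_lists):
    """Assign each distinct k-mer an integer ID in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


# Two short hypothetical DNA "sentences".
sentences = [kmer_tokens("ATGCGA"), kmer_tokens("GCGATG")]
vocab = build_vocabulary(sentences)
encoded = [[vocab[t] for t in s] for s in sentences]
```

The integer sequences in `encoded` are what a representation learning model (for example, a word2vec-style embedding) would consume to produce one vector per k-mer.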
Affiliation(s)
- Hitoshi Iuchi
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
| | - Taro Matsutani
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Keisuke Yamada
- School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Natsuki Iwano
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Shunsuke Sumi
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Department of Life Science Frontiers, Center for iPS Cell Research and Application, Kyoto University, Kyoto 606-8507, Japan
| | - Shion Hosoda
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Shitao Zhao
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Tsukasa Fukunaga
- Waseda Institute for Advanced Study, Waseda University, Tokyo 169-0051, Japan
- Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-0032, Japan
| | - Michiaki Hamada
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan
|
35
|
Zhang S, Liu C, Zou X, Geng X, Zhou X, Fan X, Zhu D, Zhang H, Zhu W. MicroRNA panel in serum reveals novel diagnostic biomarkers for prostate cancer. PeerJ 2021; 9:e11441. [PMID: 34055487 PMCID: PMC8141284 DOI: 10.7717/peerj.11441] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/17/2020] [Accepted: 04/21/2021] [Indexed: 12/17/2022] Open
Abstract
Purpose: MicroRNAs (miRNAs), which can be stably preserved and detected in serum or plasma, could act as biomarkers in cancer diagnosis. Prostate cancer (PC) is the second most common cancer in males by incidence. This study aimed to establish a miRNA panel in peripheral serum that could act as a non-invasive biomarker to help diagnose PC. Methods: A total of 86 PC patients and 86 normal control serum samples were analyzed through a four-stage experimental process using quantitative real-time polymerase chain reaction. Logistic regression was used to construct a diagnostic model based on the differentially expressed miRNAs in serum. Receiver operating characteristic curves were constructed to evaluate the diagnostic accuracy. We also compared the 3-miRNA panel with previously reported biomarkers and verified it in four public datasets. In addition, the expression characteristics of the identified miRNAs were further explored in tissue and serum exosome samples. Results: We identified a 3-miRNA signature including up-regulated miR-146a-5p, miR-24-3p and miR-93-5p for PC detection. Areas under the receiver operating characteristic curve of the 3-miRNA panel for the training, testing and external validation phases were 0.819, 0.831 and 0.814, respectively. The identified signature has very stable diagnostic performance in the large cohorts of four public datasets. Compared with previously identified miRNA biomarkers, the 3-miRNA signature in this study has superior performance in diagnosing PC. Moreover, the expression level of miR-93-5p was also elevated in exosomes from PC samples. However, in PC tissues, none of the three miRNAs showed significantly dysregulated expression. Conclusions: We established a three-miRNA panel (miR-146a-5p, miR-24-3p and miR-93-5p) in peripheral serum that could act as a non-invasive biomarker to help diagnose PC.
Affiliation(s)
- Shiyu Zhang
- Department of Oncology, First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu Province, China
| | - Cheng Liu
- Department of Gastroenterology, First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu Province, China
| | - Xuan Zou
- Fudan University Shanghai Cancer Center, Fudan University Shanghai Cancer Center, Shanghai, Shanghai, China
| | - Xiangnan Geng
- Department of Clinical Engineer, First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu Province, China
| | - Xin Zhou
- Department of Oncology, First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu Province, China
| | - XingChen Fan
- Department of Oncology, First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu Province, China
| | - Danxia Zhu
- Department of Oncology, The Third Affiliated Hospital of Soochow University, Changzhou, Jiangsu Province, China
| | - Huo Zhang
- Department of Oncology, Northern Jiangsu People's Hospital Affiliated to Yangzhou University, Yangzhou, Jiangsu Province, China
| | - Wei Zhu
- Department of Oncology, First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu Province, China
|
36
|
El Boujnouni H, Rahouti M, El Boujnouni M. Identification of SARS-CoV-2 origin: Using Ngrams, principal component analysis and Random Forest algorithm. Informatics in Medicine Unlocked 2021; 24:100577. [PMID: 33898732 PMCID: PMC8056990 DOI: 10.1016/j.imu.2021.100577] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/24/2021] [Revised: 04/12/2021] [Accepted: 04/13/2021] [Indexed: 01/22/2023] Open
Abstract
COVID-19 is an infectious disease caused by the newly discovered SARS-CoV-2 virus. This virus causes a respiratory tract infection; symptoms include dry cough, fever, tiredness and, in more severe cases, breathing difficulty. SARS-CoV-2 is an extremely contagious virus that is spreading rapidly all over the world, and the scientific community is working tirelessly to find an effective treatment. This paper aims to determine the origin of this virus by comparing its nucleic acid sequence with all members of the Coronaviridae family. This study uses a new approach based on the combination of three powerful techniques: Ngrams (for text categorization), Principal Component Analysis (for dimensionality reduction) and the Random Forest algorithm (for supervised classification). The experimental results have shown that a large set of SARS-CoV-2 genomes, collected from different locations around the world, present significant similarities to those found in pangolins. This finding confirms some previous results obtained by other methods, which also suggest that pangolins should be considered as possible hosts in the emergence of the new coronavirus.
Affiliation(s)
- Hamoucha El Boujnouni
- Research Center of Plant and Microbial Biotechnologies, Biodiversity, and Environment, Faculty of Sciences, Mohammed V University in Rabat, PO Box 1014, Morocco
| | - Mohamed Rahouti
- Research Center of Plant and Microbial Biotechnologies, Biodiversity, and Environment, Faculty of Sciences, Mohammed V University in Rabat, PO Box 1014, Morocco
| | - Mohamed El Boujnouni
- Laboratory of Information Technologies, National School of Applied Sciences, Chouaib Doukkali University in El Jadida, PO Box 1166, Morocco
|
37
|
MET Exon 14 Skipping: A Case Study for the Detection of Genetic Variants in Cancer Driver Genes by Deep Learning. Int J Mol Sci 2021; 22:ijms22084217. [PMID: 33921709 PMCID: PMC8072630 DOI: 10.3390/ijms22084217] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/25/2021] [Revised: 04/13/2021] [Accepted: 04/17/2021] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Disruption of alternative splicing (AS) is frequently observed in cancer and might represent an important signature for tumor progression and therapy. Exon skipping (ES) represents one of the most frequent AS events, and in non-small cell lung cancer (NSCLC) MET exon 14 skipping was shown to be targetable. METHODS We constructed neural networks (NN/CNN) specifically designed to detect MET exon 14 skipping events using RNAseq data. Furthermore, for discovery purposes we also developed a sparsely connected autoencoder to identify uncharacterized MET isoforms. RESULTS The neural networks had a MET exon 14 skipping detection rate greater than 94% when tested on a manually curated set of 690 TCGA bronchus and lung samples. When the networks were applied globally to 2605 TCGA samples, we observed that the majority of false positives were characterized by a blurry coverage of exon 14. Interestingly, they share a common coverage peak in the second intron, and we speculate that this event could be the transcription signature of a LINE1 (Long Interspersed Nuclear Element 1)-MET (Mesenchymal Epithelial Transition receptor tyrosine kinase) fusion. CONCLUSIONS Taken together, our results indicate that neural networks can be an effective tool to provide a quick classification of pathological transcription events, and sparsely connected autoencoders could represent the basis for the development of an effective discovery tool.
|
38
|
A Cascade Graph Convolutional Network for Predicting Protein-Ligand Binding Affinity. Int J Mol Sci 2021; 22:ijms22084023. [PMID: 33919681 PMCID: PMC8070477 DOI: 10.3390/ijms22084023] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 02/20/2021] [Revised: 04/06/2021] [Accepted: 04/08/2021] [Indexed: 11/17/2022] Open
Abstract
Accurate prediction of the binding affinity between protein and ligand is a very important step in drug discovery. Although many methods based on different assumptions and rules exist, the prediction performance for protein-ligand binding affinity is not yet satisfactory. This paper proposes a new cascade graph-based convolutional neural network architecture that deals with non-Euclidean irregular data. We represent the molecule as a graph and use a simple linear transformation to deal with the sparsity problem of the one-hot encoding of the original data. The first stage adopts an ARMA graph convolutional neural network to learn the characteristics of atomic space in the protein-ligand complex. In the second stage, a variant of the MPNN graph convolutional neural network is introduced with chemical bond information and interactive atomic features. Finally, the architecture passes through a global add pool and a fully connected layer, and outputs a single value as the predicted binding affinity. Experiments on the PDBbind v2016 data set showed that our method is better than most of the current methods. Our method is also comparable to the state-of-the-art method on the data set, and is more intuitive and simple.
|
39
|
Liu Z, Chen Q, Lan W, Pan H, Hao X, Pan S. GADTI: Graph Autoencoder Approach for DTI Prediction From Heterogeneous Network. Front Genet 2021; 12:650821. [PMID: 33912218 PMCID: PMC8072283 DOI: 10.3389/fgene.2021.650821] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/08/2021] [Accepted: 03/12/2021] [Indexed: 12/26/2022] Open
Abstract
Identifying drug–target interactions (DTIs) is the basis for drug development. However, using biochemical experiments to discover drug-target interactions has low coverage and high costs. Many computational methods have been developed to predict potential drug-target interactions based on known drug-target interactions, but the accuracy of these methods still needs to be improved. In this article, a graph autoencoder approach for DTI prediction (GADTI) is proposed to discover potential interactions between drugs and targets using a heterogeneous network, which integrates diverse drug-related and target-related datasets. Its encoder consists of two components: a graph convolutional network (GCN) and a random walk with restart (RWR). The decoder is DistMult, a matrix factorization model, which uses embedding vectors from the encoder to discover potential DTIs. The combination of GCN and RWR can provide nodes with more information through a larger neighborhood, and it can also avoid the over-smoothing and computational complexity caused by multi-layer message passing. Based on 10-fold cross-validation, we conduct three experiments in different scenarios. The results show that GADTI is superior to the baseline methods in both the area under the receiver operating characteristic curve and the area under the precision–recall curve. In addition, based on the latest DrugBank dataset (V5.1.8), the case study shows that 54.8% of newly approved DTIs are predicted by GADTI.
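The random walk with restart (RWR) named in this abstract can be illustrated with the textbook iteration p ← (1 − c)·W·p + c·e, where W is the column-normalised adjacency matrix, e is the one-hot seed vector and c is the restart probability. The sketch below is a generic version of that iteration, not GADTI's implementation; the function name, the restart value of 0.3 and the toy graph are assumptions for illustration.

```python
def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * e until the L1 change is below tol."""
    n = len(adj)
    # Column-normalise the adjacency matrix so each column sums to 1.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    w = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    e = [1.0 if i == seed else 0.0 for i in range(n)]  # restart distribution
    p = e[:]
    for _ in range(max_iter):
        new_p = [(1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * e[i]
                 for i in range(n)]
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Hypothetical 3-node path graph 0 - 1 - 2, walking from node 0.
path_graph = [[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]]
scores = random_walk_with_restart(path_graph, seed=0)
print(scores)
```

Nodes closer to the seed receive higher scores, which is how RWR lets a node gather information from a neighborhood larger than one message-passing step.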
Affiliation(s)
- Zhixian Liu
- School of Medical, Guangxi University, Nanning, China.,School of Electronics and Information Engineering, Beibu Gulf University, Qinzhou, China
| | - Qingfeng Chen
- School of Computer, Electronic and Information, Guangxi University, Nanning, China
| | - Wei Lan
- School of Computer, Electronic and Information, Guangxi University, Nanning, China
| | - Haiming Pan
- School of Computer, Electronic and Information, Guangxi University, Nanning, China
| | - Xinkun Hao
- School of Computer, Electronic and Information, Guangxi University, Nanning, China
| | - Shirui Pan
- Department of Data Science and AI, Monash University, Melbourne, VIC, Australia
|
40
|
Rampelli S, Fabbrini M, Candela M, Biagi E, Brigidi P, Turroni S. G2S: A New Deep Learning Tool for Predicting Stool Microbiome Structure From Oral Microbiome Data. Front Genet 2021; 12:644516. [PMID: 33897763 PMCID: PMC8062976 DOI: 10.3389/fgene.2021.644516] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/21/2020] [Accepted: 03/09/2021] [Indexed: 12/15/2022] Open
Abstract
Deep learning methodologies have revolutionized prediction in many fields and show the potential to do the same in microbial metagenomics. However, deep learning is still largely unexplored in the field of microbiology, with only a few software tools designed to work with microbiome data. Within the meta-community theory, we foresee new perspectives for the development and application of deep learning algorithms in the field of the human microbiome. In this context, we developed G2S, a bioinformatic tool for taxonomic prediction of the human fecal microbiome directly from the oral microbiome data of the same individual. The tool uses a deep convolutional neural network trained on paired oral and fecal samples from populations across the globe, which allows inferring the stool microbiome at the family level more accurately than other available approaches. The tool can be used in retrospective studies, where fecal sampling was not performed, and especially in the field of paleomicrobiology, as a unique opportunity to recover data related to ancient gut microbiome configurations. G2S was validated on already characterized oral and fecal sample pairs, and then applied to ancient microbiome data from dental calculi, to derive putative intestinal components in medieval subjects.
Affiliation(s)
- Simone Rampelli
- Unit of Microbiome Science and Biotechnology, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Marco Fabbrini
- Unit of Microbiome Science and Biotechnology, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy.,Department of Medical and Surgical Sciences, University of Bologna, Bologna, Italy
| | - Marco Candela
- Unit of Microbiome Science and Biotechnology, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Elena Biagi
- Unit of Microbiome Science and Biotechnology, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Patrizia Brigidi
- Department of Medical and Surgical Sciences, University of Bologna, Bologna, Italy
| | - Silvia Turroni
- Unit of Microbiome Science and Biotechnology, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| |
Collapse
|
41
|
Mitrea D, Badea R, Mitrea P, Brad S, Nedevschi S. Hepatocellular Carcinoma Automatic Diagnosis within CEUS and B-Mode Ultrasound Images Using Advanced Machine Learning Methods. Sensors (Basel, Switzerland) 2021; 21:2202. [PMID: 33801125 PMCID: PMC8004125 DOI: 10.3390/s21062202] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/21/2021] [Revised: 03/12/2021] [Accepted: 03/16/2021] [Indexed: 02/06/2023]
Abstract
Hepatocellular Carcinoma (HCC) is the most common malignant liver tumor, being present in 70% of liver cancer cases. It usually evolves on top of the cirrhotic parenchyma. The most reliable method for HCC diagnosis is the needle biopsy, which is an invasive, dangerous method. In our research, specific techniques for non-invasive, computerized HCC diagnosis are developed by exploiting the information from ultrasound images. In this work, we assessed the possibility of performing the automatic diagnosis of HCC within B-mode ultrasound and Contrast-Enhanced Ultrasound (CEUS) images, using advanced machine learning methods based on Convolutional Neural Networks (CNN). The recognition performance was evaluated separately on B-mode ultrasound images and on CEUS images, as well as on combined B-mode ultrasound and CEUS images. For this purpose, we considered combining the input images directly, performing feature-level fusion, and then providing the resulting data as input to representative CNN classifiers. In addition, we experimented with several multimodal combined classifiers, obtained by fusion at the classifier level and at the decision level of two different branches based on the same CNN architecture, as well as on different CNN architectures. Various combination methods, as well as the dimensionality reduction method of Kernel Principal Component Analysis (KPCA), were involved in this process. These results were compared with those obtained on the same dataset when employing advanced texture analysis techniques in conjunction with conventional classification methods, and also with equivalent state-of-the-art approaches. An accuracy above 97% was achieved when our new methodology was applied.
Affiliation(s)
- Delia Mitrea
- Department of Computer Science, Faculty of Automation and Computer Science, Technical University of Cluj-Napoca, Baritiu Street, No. 26-28, 400027 Cluj-Napoca, Romania
| | - Radu Badea
- Medical Imaging Department, Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca, Babes Street, No. 8, 400012 Cluj-Napoca, Romania
- Regional Institute of Gastroenterology and Hepatology, Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca, 19-21 Croitorilor Street, 400162 Cluj-Napoca, Romania
| | - Paulina Mitrea
- Department of Computer Science, Faculty of Automation and Computer Science, Technical University of Cluj-Napoca, Baritiu Street, No. 26-28, 400027 Cluj-Napoca, Romania
| | - Stelian Brad
- Department of Design Engineering and Robotics, Faculty of Machine Building, Technical University of Cluj-Napoca, Muncii Boulevard, No. 103-105, 400641 Cluj-Napoca, Romania
| | - Sergiu Nedevschi
- Department of Computer Science, Faculty of Automation and Computer Science, Technical University of Cluj-Napoca, Baritiu Street, No. 26-28, 400027 Cluj-Napoca, Romania
| |
Collapse
|
42
|
Auslander N, Gussow AB, Koonin EV. Incorporating Machine Learning into Established Bioinformatics Frameworks. Int J Mol Sci 2021; 22:2903. [PMID: 33809353 PMCID: PMC8000113 DOI: 10.3390/ijms22062903] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 02/15/2021] [Revised: 03/08/2021] [Accepted: 03/10/2021] [Indexed: 12/23/2022] Open
Abstract
The exponential growth of biomedical data in recent years has urged the application of numerous machine learning techniques to address emerging problems in biology and clinical research. By enabling the automatic feature extraction, selection, and generation of predictive models, these methods can be used to efficiently study complex biological systems. Machine learning techniques are frequently integrated with bioinformatic methods, as well as curated databases and biological networks, to enhance training and validation, identify the best interpretable features, and enable feature and model investigation. Here, we review recently developed methods that incorporate machine learning within the same framework with techniques from molecular evolution, protein structure analysis, systems biology, and disease genomics. We outline the challenges posed for machine learning, and, in particular, deep learning in biomedicine, and suggest unique opportunities for machine learning techniques integrated with established bioinformatics approaches to overcome some of these challenges.
Affiliation(s)
| | | | - Eugene V. Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
43
|
Su R, Zhang D, Liu J, Cheng C. MSU-Net: Multi-Scale U-Net for 2D Medical Image Segmentation. Front Genet 2021; 12:639930. [PMID: 33679900 PMCID: PMC7928319 DOI: 10.3389/fgene.2021.639930] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 12/10/2020] [Accepted: 01/20/2021] [Indexed: 11/15/2022] Open
Abstract
To address two limitations of U-Net, namely a convolution kernel with a fixed receptive field and an optimal network width that is not known in advance, we propose multi-scale U-Net (MSU-Net) for medical image segmentation. First, multiple convolution sequences are used to extract more semantic features from the images. Second, convolution kernels with different receptive fields are used to make features more diverse. The problem of unknown network width is alleviated by efficient integration of convolution kernels with different receptive fields. In addition, the multi-scale block is extended to other variants of the original U-Net to verify its universality. Five different medical image segmentation datasets are used to evaluate MSU-Net. A variety of imaging modalities are included in these datasets, such as electron microscopy, dermoscopy, ultrasound, etc. The Intersection over Union (IoU) scores of MSU-Net on the five datasets are 0.771, 0.867, 0.708, 0.900, and 0.702, respectively. Experimental results show that MSU-Net achieves the best performance on different datasets. Our implementation is available at https://github.com/CN-zdy/MSU_Net.
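For readers unfamiliar with the Intersection over Union (IoU) metric reported in this abstract, a minimal sketch for binary segmentation masks follows. It is an illustration only, not code from the MSU-Net repository; the function name `iou`, the nested-list mask format, and the convention of returning 1.0 for two empty masks are assumptions.

```python
def iou(pred_mask, true_mask):
    """Intersection over Union for two binary masks given as nested lists of 0/1."""
    pred = [v for row in pred_mask for v in row]
    true = [v for row in true_mask for v in row]
    intersection = sum(1 for p, t in zip(pred, true) if p and t)
    union = sum(1 for p, t in zip(pred, true) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Hypothetical 2x2 masks: one overlapping pixel out of three labelled pixels.
predicted = [[1, 1],
             [0, 0]]
reference = [[1, 0],
             [1, 0]]
print(iou(predicted, reference))
```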
Affiliation(s)
- Run Su
- Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, China
- Science Island Branch of Graduate School, University of Science and Technology of China, Hefei, China
| | - Deyun Zhang
- School of Engineering, Anhui Agricultural University, Hefei, China
| | - Jinhuai Liu
- Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, China
- Science Island Branch of Graduate School, University of Science and Technology of China, Hefei, China
| | - Chuandong Cheng
- Department of Neurosurgery, The First Affiliated Hospital of University of Science and Technology of China (USTC), Hefei, China
- Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Anhui Province Key Laboratory of Brain Function and Brain Disease, Hefei, China
| |
Collapse
|
44
|
Le NQK, Ho QT, Nguyen TTD, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform 2021; 22:6128847. [PMID: 33539511 DOI: 10.1093/bib/bbab005] [Citation(s) in RCA: 75] [Impact Index Per Article: 25.0] [Received: 12/18/2020] [Revised: 01/01/2021] [Accepted: 01/03/2021] [Indexed: 01/11/2023] Open
Abstract
Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple, yet powerful language model that achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embedding to capture the semantics and context of the words in which they appeared. In this study, we present a novel technique by incorporating BERT-based multilingual model in bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We then observed that our BERT-based features improved more than 5-10% in terms of sensitivity, specificity, accuracy and Matthews correlation coefficient compared to the current state-of-the-art features in bioinformatics. Moreover, advanced experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential in learning BERT features better than other traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
| | - Quang-Thai Ho
- College of Information and Communication Technology, Can Tho University, Vietnam
| | | | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Taiwan
|
45
|
Wu F, Yang R, Zhang C, Zhang L. A deep learning framework combined with word embedding to identify DNA replication origins. Sci Rep 2021; 11:844. [PMID: 33436981 PMCID: PMC7804333 DOI: 10.1038/s41598-020-80670-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/23/2020] [Accepted: 12/24/2020] [Indexed: 01/29/2023] Open
Abstract
DNA replication influences the inheritance of genetic information in the DNA life cycle. As the distribution of replication origins (ORIs) is the major determinant for precisely regulating the replication process, the correct identification of ORIs is significant for understanding DNA replication mechanisms and the regulatory mechanisms of gene expression. For eukaryotes in particular, multiple ORIs exist in each of their gene sequences to complete replication in a reasonable period of time. To simplify the identification process for eukaryotic ORIs, most existing methods are built on traditional machine learning algorithms and target gene sequences of a fixed length. Consequently, the identification results are not satisfactory and leave considerable room for improvement. To overcome the limitations of previous studies, this paper develops sequence segmentation methods and employs the word embedding technique 'Word2vec' to convert gene sequences into word vectors, thereby capturing the inner correlations of gene sequences with different lengths. Then, a deep learning framework for the ORI identification task is constructed from a convolutional neural network with an embedding layer. An analysis of the dimensionality-reduced similarity diagram indicates that Word2vec can effectively transform the inner relationships among words into numerical features. For the four species in this study, the best models achieve overall accuracies of 0.975, 0.765, 0.885 and 0.967, Matthews correlation coefficients of 0.940, 0.530, 0.771 and 0.934, and AUCs of 0.975, 0.800, 0.888 and 0.981, which indicates that the proposed predictor is stable and classifies both ORIs and non-ORIs with high confidence. Compared with state-of-the-art methods, the proposed predictor achieves ORI identification with significant improvement. It is therefore reasonable to anticipate that the proposed method will make a useful high-throughput tool for genome analysis.
Affiliation(s)
- Feng Wu
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
| | - Runtao Yang
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China.
| | - Chengjin Zhang
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
| | - Lina Zhang
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
| |
Collapse
|
46
|
Shujaat M, Wahab A, Tayara H, Chong KT. pcPromoter-CNN: A CNN-Based Prediction and Classification of Promoters. Genes (Basel) 2020; 11:genes11121529. [PMID: 33371507 PMCID: PMC7767505 DOI: 10.3390/genes11121529] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 11/19/2020] [Revised: 12/11/2020] [Accepted: 12/18/2020] [Indexed: 01/13/2023] Open
Abstract
A promoter is a small region within the DNA structure that has an important role in initiating transcription of a specific gene in the genome. Different types of promoters are recognized by their different functions. Due to the importance of promoter functions, computational tools for the prediction and classification of promoters are highly desired. Promoters resemble each other; therefore, their precise classification is an important challenge. In this study, we propose a convolutional neural network (CNN)-based tool, pcPromoter-CNN, for the prediction of promoters and their classification into subclasses σ70, σ54, σ38, σ32, σ28 and σ24. This CNN-based tool uses a one-hot encoding scheme for promoter classification. The tool's architecture was trained and tested on a benchmark dataset. To evaluate its classification performance, we used four evaluation metrics. The model exhibited notable improvement over existing state-of-the-art tools.
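The one-hot encoding scheme mentioned in this abstract can be sketched as below. This is a generic illustration, not the pcPromoter-CNN code: the A/C/G/T column order and the all-zero row for ambiguous bases such as N are assumptions made for the example.

```python
ALPHABET = "ACGT"  # assumed column order


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:  # assumption: unknown bases (e.g. N) stay all-zero
            row[index[base]] = 1
        rows.append(row)
    return rows


# Hypothetical 5-base fragment; a CNN would consume the resulting L x 4 matrix.
encoded = one_hot_encode("ACGTN")
for row in encoded:
    print(row)
```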
Affiliation(s)
- Muhammad Shujaat
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Department of Computer Sciences, Bahria University, Lahore 54000, Pakistan
| | - Abdul Wahab
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Korea
- Correspondence: (H.T.); (K.T.C.)
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
- Correspondence: (H.T.); (K.T.C.)
| |
Collapse
|
47
|
Le NQK, Do DT, Hung TNK, Lam LHT, Huynh TT, Nguyen NTK. A Computational Framework Based on Ensemble Deep Neural Networks for Essential Genes Identification. Int J Mol Sci 2020; 21:E9070. [PMID: 33260643 PMCID: PMC7730808 DOI: 10.3390/ijms21239070] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 10/04/2020] [Revised: 11/25/2020] [Accepted: 11/26/2020] [Indexed: 01/13/2023] Open
Abstract
Essential genes contain key information of genomes that could be the key to a comprehensive understanding of life and evolution. Because of their importance, studies of essential genes have been considered a crucial problem in computational biology. Computational methods for identifying essential genes have become increasingly popular as a way to reduce the cost and time of traditional experiments. A few models have addressed this problem, but performance is still not satisfactory because of high-dimensional features and the use of traditional machine learning algorithms. Thus, there is a need for a novel model that improves predictive performance on this problem using DNA sequence features. This study took advantage of a natural language processing (NLP) model for learning biological sequences by treating them as natural language words. To learn from the NLP features, a supervised learning model was then employed in the form of an ensemble deep neural network. Our proposed method could identify essential genes with sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC) values of 60.2%, 84.6%, 76.3%, 0.449, and 0.814, respectively. The overall performance exceeded that of the single models without ensembling, as well as the state-of-the-art predictors on the same benchmark dataset. This indicated the effectiveness of the proposed method in determining essential genes, in particular, and other sequencing problems, in general.
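The threshold-based metrics quoted in this abstract (sensitivity, specificity, accuracy and MCC) all follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `binary_metrics`, the convention of returning 0.0 when a denominator is zero, and the example counts are assumptions and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    def safe_div(num, den):
        return num / den if den else 0.0  # assumed convention for empty classes

    sensitivity = safe_div(tp, tp + fn)   # true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=6, fp=2, tn=8, fn=4)
print(metrics)
```

AUC is not included because it depends on the ranked prediction scores rather than on a single confusion matrix.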
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan
- Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
| | - Duyen Thi Do
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei 106, Taiwan
| | - Truong Nguyen Khanh Hung
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
- Department of Orthopedic and Trauma, Cho Ray Hospital, Ho Chi Minh 70000, Vietnam
| | - Luu Ho Thanh Lam
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
- Intensive Care Unit, Children’s Hospital 2, Ho Chi Minh 70000, Vietnam
| | - Tuan-Tu Huynh
- Department of Electrical Engineering, Yuan Ze University, Taoyuan 320, Taiwan
- Department of Electrical Electronic and Mechanical Engineering, Lac Hong University, Dong Nai 76120, Vietnam
| | - Ngan Thi Kim Nguyen
- School of Nutrition and Health Sciences, Taipei Medical University, Taipei 110, Taiwan
| |
|
48
|
Kwiecien K, Brzoza P, Bak M, Majewski P, Skulimowska I, Bednarczyk K, Cichy J, Kwitniewski M. The methylation status of the chemerin promoter region located from -252 to +258 bp regulates constitutive but not acute-phase cytokine-inducible chemerin expression levels. Sci Rep 2020; 10:13702. [PMID: 32792625 PMCID: PMC7426834 DOI: 10.1038/s41598-020-70625-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/25/2019] [Accepted: 07/29/2020] [Indexed: 12/05/2022] Open
Abstract
Chemerin is a chemoattractant protein with adipokine properties encoded by the retinoic acid receptor responder 2 (RARRES2) gene. It has gained more attention in the past few years due to its multilevel impact on metabolism and immune responses. However, the mechanisms controlling the constitutive and regulated expression of RARRES2 in a variety of cell types remain obscure. To our knowledge, this report is the first to show that DNA methylation plays an important role in the cell-specific expression of RARRES2 in adipocytes, hepatocytes, and B lymphocytes. Using luciferase reporter assays, we determined the proximal fragment of the RARRES2 gene promoter, located from -252 to +258 bp, to be a key regulator of transcription. Moreover, we showed that chemerin expression is regulated in murine adipocytes by the acute-phase cytokines interleukin 1β (IL-1β) and oncostatin M (OSM). In contrast with adipocytes, these cytokines elicited a weak response, if any, in mouse hepatocytes, suggesting that the effects of IL-1β and OSM on chemerin expression are specific to fat tissue. Together, our findings highlight previously uncharacterized mediators and mechanisms that control chemerin expression.
Affiliation(s)
- Kamila Kwiecien
- Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
| | - Piotr Brzoza
- Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
| | - Maciej Bak
- Swiss Institute of Bioinformatics, Biozentrum, University of Basel, 4056, Basel, Switzerland
| | - Pawel Majewski
- Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
| | - Izabella Skulimowska
- Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
| | - Kamil Bednarczyk
- Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
| | - Joanna Cichy
- Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
| | - Mateusz Kwitniewski
- Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland.
| |
|
49
|
Do DT, Le TQT, Le NQK. Using deep neural networks and biological subwords to detect protein S-sulfenylation sites. Brief Bioinform 2020; 22:5866114. [PMID: 32613242 DOI: 10.1093/bib/bbaa128] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 03/12/2020] [Revised: 05/11/2020] [Accepted: 05/26/2020] [Indexed: 12/11/2022] Open
Abstract
Protein S-sulfenylation is a crucial post-translational modification (PTM) in which a hydroxyl group covalently binds to the thiol of cysteine. Recent studies have shown that this modification plays an important role in signal transduction, transcriptional regulation and apoptosis. To date, the dynamics of sulfenic acids in proteins remain unclear because of their fleeting nature. Identifying S-sulfenylation sites could therefore be key to deciphering their structures and functions, which are important in cell biology and disease. However, for lack of effective methods, researchers in this field are largely limited to a handful of wet-lab techniques that are time-consuming and not cost-effective. This motivated us to develop an in silico model for detecting S-sulfenylation sites from protein sequence information alone. In this study, protein sequences were treated as natural language sentences composed of biological subwords, and a deep neural network was then used to perform classification. On the independent dataset, the model achieved a sensitivity of 85.71%, a specificity of 69.47%, an accuracy of 77.09%, a Matthews correlation coefficient of 0.5554 and an area under the curve of 0.833. These results suggest that the proposed method (fastSulf-DNN) achieves excellent performance in predicting S-sulfenylation sites compared with other well-known tools on a benchmark dataset.
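To illustrate the idea of reading a protein sequence as a sentence of subwords, the sketch below splits a sequence into overlapping character n-grams, one common way to form subwords. Whether this matches the tokenization actually used in fastSulf-DNN is an assumption; the function name `char_ngrams`, its parameters and the example peptide are hypothetical.

```python
def char_ngrams(sequence, n_min=1, n_max=3):
    """Split a sequence into overlapping character n-grams ("subwords").

    All n-grams of length n_min are listed first, then n_min + 1, and so on
    up to n_max, each in left-to-right order along the sequence.
    """
    seq = sequence.upper()
    grams = []
    for n in range(n_min, n_max + 1):
        # A sequence of length L contains L - n + 1 windows of width n.
        for i in range(len(seq) - n + 1):
            grams.append(seq[i:i + n])
    return grams


# Hypothetical short peptide fragment centred on a cysteine (C).
print(char_ngrams("ACDC", n_min=2, n_max=3))
```

The resulting subword list can then be mapped to embedding vectors and passed to a classifier; that downstream step is not shown here.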
Affiliation(s)
- Duyen Thi Do
- Faculty of Applied Sciences, Ton Duc Thang University
| | | | - Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University
| |
|