1
Maciejewski E, Horvath S, Ernst J. CMImpute: cross-species and tissue imputation of species-level DNA methylation samples across mammalian species. Genome Biol 2025; 26:133. [PMID: 40394556 DOI: 10.1186/s13059-025-03561-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2024] [Accepted: 03/26/2025] [Indexed: 05/22/2025] Open
Abstract
The large-scale application of the mammalian methylation array has substantially expanded the availability of DNA methylation data in mammalian species. However, this data captures only a small portion of species-tissue combinations. To address this, we develop CMImpute (Cross-species Methylation Imputation), a method based on a conditional variational autoencoder, to impute DNA methylation representing species-tissue combinations. We demonstrate that CMImpute achieves strong sample-wise correlation between imputed and observed values. Using CMImpute and data from 348 species and 59 tissue types, we impute methylation data for 19,786 new species-tissue combinations. We expect CMImpute will be a useful resource for DNA methylation analyses.
Affiliation(s)
- Emily Maciejewski
  - Computer Science Department, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Steve Horvath
  - Dept. of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Dept. of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Altos Labs, Cambridge, WA14 2DT, UK
- Jason Ernst
  - Computer Science Department, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Molecular Biology Institute, University of California, Los Angeles, Los Angeles, CA, 90095, USA
2
Hajebi Khaniki S, Shokoohi F. Data-Driven Identification of Early Cancer-Associated Genes via Penalized Trans-Dimensional Hidden Markov Models. Biomolecules 2025; 15:294. [PMID: 40001597 PMCID: PMC11853217 DOI: 10.3390/biom15020294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2025] [Revised: 02/13/2025] [Accepted: 02/13/2025] [Indexed: 02/27/2025] Open
Abstract
Colorectal cancer (CRC) is a significant worldwide health problem due to its high prevalence, mortality rates, and frequent diagnosis at advanced stages. While diagnostic and therapeutic approaches have evolved, the underlying mechanisms driving CRC initiation and progression are not yet fully understood. Early detection is critical for improving patient survival, as initial cancer stages often exhibit epigenetic changes, such as DNA methylation, that regulate gene expression and tumor progression. Identifying DNA methylation patterns and key survival-related genes in CRC could thus enhance diagnostic accuracy and extend patient lifespans. In this study, we apply two of our recently developed methods for identifying differential methylation and analyzing survival using a sparse, finite mixture of accelerated failure time regression models, focusing on key genes and pathways in CRC datasets. Our approach outperforms two other leading methods, yielding robust findings and identifying novel differentially methylated cytosines. We found that CRC patient survival time follows a two-component mixture regression model, where genes CDH11, EPB41L3, and DOCK2 are active in the more aggressive form of CRC, whereas TMEM215, PPP1R14A, GPR158, and NAPSB are active in the less aggressive form.
Affiliation(s)
- Saeedeh Hajebi Khaniki
  - Department of Biostatistics, School of Health, Mashhad University of Medical Sciences, Mashhad 9137673119, Iran
- Farhad Shokoohi
  - Department of Mathematical Sciences, University of Nevada Las Vegas, Las Vegas, NV 89154, USA
3
Apoorva, Handa V, Batra S, Arora V. Advancing epigenetic profiling in cervical cancer: machine learning techniques for classifying DNA methylation patterns. 3 Biotech 2024; 14:264. [PMID: 39391214 PMCID: PMC11461404 DOI: 10.1007/s13205-024-04107-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Accepted: 09/24/2024] [Indexed: 10/12/2024] Open
Abstract
This study investigates the ability to predict DNA methylation patterns in cervical cancer cells using decision-tree-based ensemble approaches and neural network-based models. The research findings suggest that a model based on random forest achieves a significant prediction accuracy of 91.35%. This projection was derived from comprehensive experimentation and a meticulous performance evaluation of the random forest model, employing a range of measures including Accuracy, Sensitivity, Specificity, Matthews Correlation Coefficient, F1-score, Recall, and Precision. The results indicate that the random forest model exhibits superior performance compared to other tree-based models such as the Simple Decision Tree and XGBoost, as well as neural network-based models including Convolutional Neural Networks, Feed Forward Networks, and Wavelet Neural Networks. The findings indicate that using random forest-based techniques has great potential for future study and might be highly valuable in clinical applications, especially in improving diagnostic and treatment strategies based on epigenetic profiles.
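The evaluation measures listed in this abstract can all be derived from a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `binary_metrics` and the counts are hypothetical and are not taken from the study. Note that sensitivity and recall are the same quantity under these definitions.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # true positive rate, identical to recall
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "recall": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only (not results from the paper).
metrics = binary_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting the Matthews correlation coefficient alongside accuracy is useful when classes are imbalanced, because accuracy alone can look high for a classifier that favours the majority class.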
Affiliation(s)
- Apoorva
  - Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, India
- Vikas Handa
  - Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, India
- Shalini Batra
  - Computer Science & Engineering Department, Thapar Institute of Engineering & Technology, Patiala, India
- Vinay Arora
  - Computer Science & Engineering Department, Thapar Institute of Engineering & Technology, Patiala, India
4
Yu X, Xu J, Song B, Zhu R, Liu J, Liu YF, Ma YJ. The role of epigenetics in women's reproductive health: the impact of environmental factors. Front Endocrinol (Lausanne) 2024; 15:1399757. [PMID: 39345884 PMCID: PMC11427273 DOI: 10.3389/fendo.2024.1399757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2024] [Accepted: 08/28/2024] [Indexed: 10/01/2024] Open
Abstract
This paper explores the significant role of epigenetics in women's reproductive health, focusing on the impact of environmental factors. It highlights the crucial link between epigenetic modifications, such as DNA methylation and histone post-translational modifications, and reproductive health issues, including infertility and pregnancy complications. The paper reviews the influence of pollutants like PM2.5, heavy metals, and endocrine disruptors on gene expression through epigenetic mechanisms, emphasizing the need to understand how diet, lifestyle choices, and exposure to chemicals affect gene expression and reproductive health. Future research directions include deeper investigation into epigenetics in female reproductive health and leveraging gene editing to mitigate epigenetic changes for improving IVF success rates and managing reproductive disorders.
Affiliation(s)
- Xinru Yu
  - College of Pharmacy, Shandong University of Traditional Chinese Medicine, Jinan, Shandong, China
- Jiawei Xu
  - College of Traditional Chinese Medicine, Shandong University of Traditional Chinese Medicine School, Jinan, Shandong, China
- Bihan Song
  - College of Traditional Chinese Medicine, Shandong University of Traditional Chinese Medicine School, Jinan, Shandong, China
- Runhe Zhu
  - College of Traditional Chinese Medicine, Shandong University of Traditional Chinese Medicine School, Jinan, Shandong, China
- Jiaxin Liu
  - College of Pharmacy, Shandong University of Traditional Chinese Medicine, Jinan, Shandong, China
- Yi Fan Liu
  - Medical College, Shandong University of Traditional Chinese Medicine, Jinan, Shandong, China
- Ying Jie Ma
  - The First Clinical College, Shandong University of Traditional Chinese Medicine, Jinan, Shandong, China
5
Borodko DD, Zhenilo SV, Sharko FS. Search for differentially methylated regions in ancient and modern genomes. Vavilovskii Zhurnal Genet Selektsii 2023; 27:820-828. [PMID: 38213708 PMCID: PMC10777292 DOI: 10.18699/vjgb-23-95] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2023] [Revised: 10/04/2023] [Accepted: 10/05/2023] [Indexed: 01/13/2024] Open
Abstract
Currently, active research is focused on investigating the mechanisms that regulate the development of various pathologies and their evolutionary dynamics. Epigenetic mechanisms, such as DNA methylation, play a significant role in evolutionary processes, as their changes have a faster impact on the phenotype compared to mutagenesis. In this study, we attempted to develop an algorithm for identifying differentially methylated regions associated with metabolic syndrome, which have undergone methylation changes in humans during the transition from a hunter-gatherer to a sedentary lifestyle. The application of existing whole-genome bisulfite sequencing methods is limited for ancient samples due to their low quality and fragmentation, and the approach to obtaining DNA methylation profiles differs significantly between ancient hunter-gatherer samples and modern tissues. In this study, we validated DamMet, an algorithm for reconstructing ancient methylomes. Application of DamMet to Neanderthal and Denisovan genomes showed a moderate level of correlation with previously published methylation profiles and demonstrated an underestimation of methylation levels in the reconstructed profiles by an average of 15-20 %. Additionally, we developed a new Python-based algorithm that allows for the comparison of methylomes in ancient and modern samples, despite the absence of methylation profiles in modern bone tissue within the context of obesity. This analysis involves a two-step data processing approach, where the first step involves the identification and filtration of tissue-specific methylation regions, and the second step focuses on the direct search for differentially methylated regions in specific areas associated with the researcher's target condition. By applying this algorithm to test data, we identified 38 differentially methylated regions associated with obesity, the majority of which were located in promoter regions. 
The pipeline demonstrated sufficient efficiency in detecting these regions. These results confirm the feasibility of reconstructing DNA methylation profiles in ancient samples and comparing them with modern methylomes. Furthermore, possibilities for further methodological development and the implementation of a new step for studying differentially methylated positions associated with evolutionary processes are discussed.
Affiliation(s)
- D D Borodko
  - Federal Research Center "Fundamentals of Biotechnology" of the Russian Academy of Sciences, Moscow, Russia
- S V Zhenilo
  - Federal Research Center "Fundamentals of Biotechnology" of the Russian Academy of Sciences, Moscow, Russia
- F S Sharko
  - Federal Research Center "Fundamentals of Biotechnology" of the Russian Academy of Sciences, Moscow, Russia
6
Maciejewski E, Horvath S, Ernst J. Cross-species and tissue imputation of species-level DNA methylation samples across mammalian species. bioRxiv 2023:2023.11.26.568769. [PMID: 38076978 PMCID: PMC10705269 DOI: 10.1101/2023.11.26.568769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2023]
Abstract
DNA methylation data offers valuable insights into various aspects of mammalian biology. The recent introduction and large-scale application of the mammalian methylation array has significantly expanded the availability of such data across conserved sites in many mammalian species. In our study, we consider 13,245 samples profiled on this array encompassing 348 species and 59 tissues from 746 species-tissue combinations. While having some coverage of many different species and tissue types, this data captures only 3.6% of potential species-tissue combinations. To address this gap, we developed CMImpute (Cross-species Methylation Imputation), a method based on a Conditional Variational Autoencoder, to impute DNA methylation for non-profiled species-tissue combinations. In cross-validation, we demonstrate that CMImpute achieves a strong correlation with actual observed values, surpassing several baseline methods. Using CMImpute we imputed methylation data for 19,786 new species-tissue combinations. We believe that both CMImpute and our imputed data resource will be useful for DNA methylation analyses across a wide range of mammalian species.
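The coverage figures in this abstract are mutually consistent, which a short calculation makes explicit. The sketch below assumes that "potential species-tissue combinations" means every species paired with every tissue; the variable names are illustrative only.

```python
# Figures quoted in the abstract above.
n_species = 348
n_tissues = 59
n_observed = 746  # species-tissue combinations with at least one profiled sample

# Assumption: the space of potential combinations is the full cross product.
n_possible = n_species * n_tissues
coverage = n_observed / n_possible
n_unprofiled = n_possible - n_observed

print(n_possible)         # 20532
print(f"{coverage:.1%}")  # 3.6%
print(n_unprofiled)       # 19786
```

Under that assumption, the 19,786 imputed combinations are exactly the unprofiled remainder of the 348 x 59 grid.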
Affiliation(s)
- Emily Maciejewski
  - Computer Science Department, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Steve Horvath
  - Dept. of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Dept. of Biostatistics, Fielding School of Public Health, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Altos Labs, Cambridge, UK, WA14 2DT
- Jason Ernst
  - Computer Science Department, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research at University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Molecular Biology Institute, University of California, Los Angeles, Los Angeles, CA 90095, USA
7
Park S, Rehman MU, Ullah F, Tayara H, Chong KT. iCpG-Pos: an accurate computational approach for identification of CpG sites using positional features on single-cell whole genome sequence data. Bioinformatics 2023; 39:btad474. [PMID: 37555812 PMCID: PMC10444964 DOI: 10.1093/bioinformatics/btad474] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/02/2023] [Revised: 05/11/2023] [Accepted: 08/08/2023] [Indexed: 08/10/2023] Open
Abstract
MOTIVATION The investigation of DNA methylation can shed light on the processes underlying human well-being and help determine overall human health. However, insufficient coverage makes it challenging to implement single-stranded DNA methylation sequencing technologies, highlighting the need for an efficient prediction model. Models are required to create an understanding of the underlying biological systems and to project single-cell (methylated) data accurately. RESULTS In this study, we developed positional features for predicting CpG sites. Positional characteristics of the sequence are derived using data from CpG regions and the separation between nearby CpG sites. Multiple optimized classifiers and different ensemble learning approaches are evaluated. The OPTUNA framework is used to optimize the algorithms. The CatBoost algorithm followed by the stacking algorithm outperformed existing DNA methylation identifiers. AVAILABILITY AND IMPLEMENTATION The data and methodologies used in this study are openly accessible to the research community. Researchers can access the positional features and algorithms used for predicting CpG site methylation patterns. To achieve superior performance, we employed the CatBoost algorithm followed by the stacking algorithm, which outperformed existing DNA methylation identifiers. The proposed iCpG-Pos approach utilizes only positional features, resulting in a substantial reduction in computational complexity compared to other known approaches for detecting CpG site methylation patterns. In conclusion, our study introduces a novel approach, iCpG-Pos, for predicting CpG site methylation patterns. By focusing on positional features, our model offers both accuracy and efficiency, making it a promising tool for advancing DNA methylation research and its applications in human health and well-being.
Affiliation(s)
- Sehi Park
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Mobeen Ur Rehman
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Farman Ullah
  - College of Information Technology in the United Arab Emirates University (UAEU), Abu Dhabi 15551, UAE
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
  - Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
8
Meijer M, Franke B, Sandi C, Klein M. Epigenome-wide DNA methylation in externalizing behaviours: A review and combined analysis. Neurosci Biobehav Rev 2023; 145:104997. [PMID: 36566803 PMCID: PMC12042733 DOI: 10.1016/j.neubiorev.2022.104997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Revised: 11/24/2022] [Accepted: 12/09/2022] [Indexed: 12/24/2022]
Abstract
DNA methylation (DNAm) is one of the most frequently studied epigenetic mechanisms facilitating the interplay of genomic and environmental factors, which can contribute to externalizing behaviours and related psychiatric disorders. Previous epigenome-wide association studies (EWAS) for externalizing behaviours have been limited in sample size, and, therefore, candidate genes and biomarkers with robust evidence are still lacking. We 1) performed a systematic literature review of EWAS of attention-deficit/hyperactivity disorder (ADHD)- and aggression-related behaviours conducted in peripheral tissue and cord blood and 2) combined the most strongly associated DNAm sites observed in individual studies (p < 10⁻³) to identify candidate genes and biological systems for ADHD and aggressive behaviours. We observed enrichment for neuronal processes and neuronal cell marker genes for ADHD. Astrocyte and granulocyte cell markers among genes annotated to DNAm sites were relevant for both ADHD and aggression-related behaviours. Only 1% of the most significant epigenetic findings for ADHD/ADHD symptoms were likely to be directly explained by genetic factors involved in ADHD. Finally, we discuss how the field would greatly benefit from larger sample sizes and harmonization of assessment instruments.
Affiliation(s)
- Mandy Meijer
  - Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, the Netherlands; Laboratory of Behavioural Genetics, Brain Mind Institute, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Barbara Franke
  - Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, the Netherlands; Department of Psychiatry, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, the Netherlands
- Carmen Sandi
  - Laboratory of Behavioural Genetics, Brain Mind Institute, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Marieke Klein
  - Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, the Netherlands; Department of Psychiatry, University of California, La Jolla, San Diego, CA, 92093, USA
9
Iqbal W, Zhou W. Computational Methods for Single-cell DNA Methylome Analysis. Genomics, Proteomics & Bioinformatics 2023; 21:48-66. [PMID: 35718270 PMCID: PMC10372927 DOI: 10.1016/j.gpb.2022.05.007] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 12/31/2021] [Revised: 04/28/2022] [Accepted: 05/10/2022] [Indexed: 11/19/2022]
Abstract
Dissecting intercellular epigenetic differences is key to understanding tissue heterogeneity. Recent advances in single-cell DNA methylome profiling have presented opportunities to resolve this heterogeneity at the maximum resolution. While these advances enable us to explore frontiers of chromatin biology and better understand cell lineage relationships, they pose new challenges in data processing and interpretation. This review surveys the current state of computational tools developed for single-cell DNA methylome data analysis. We discuss critical components of single-cell DNA methylome data analysis, including data preprocessing, quality control, imputation, dimensionality reduction, cell clustering, supervised cell annotation, cell lineage reconstruction, gene activity scoring, and integration with transcriptome data. We also highlight unique aspects of single-cell DNA methylome data analysis and discuss how techniques common to other single-cell omics data analyses can be adapted to analyze DNA methylomes. Finally, we discuss existing challenges and opportunities for future development.
Affiliation(s)
- Waleed Iqbal
  - Center for Computational and Genomic Medicine, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Wanding Zhou
  - Center for Computational and Genomic Medicine, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA; Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
10
Dodlapati S, Jiang Z, Sun J. Completing Single-Cell DNA Methylome Profiles via Transfer Learning Together With KL-Divergence. Front Genet 2022; 13:910439. [PMID: 35938031 PMCID: PMC9353187 DOI: 10.3389/fgene.2022.910439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2022] [Accepted: 05/25/2022] [Indexed: 11/13/2022] Open
Abstract
The high level of sparsity in methylome profiles obtained using whole-genome bisulfite sequencing when only a small amount of biological material is available limits its value in the study of systems in which large samples are difficult to assemble, such as mammalian preimplantation embryonic development. The recently developed computational methods for addressing the sparsity by imputing missing values have their limits when the required minimum data coverage or profiles of the same tissue in other modalities are not available. In this study, we explored the use of transfer learning together with Kullback-Leibler (KL) divergence to train predictive models for completing methylome profiles with very low coverage (below 2%). Transfer learning was used to leverage less sparse profiles that are typically available for different tissues for the same species, while KL divergence was employed to maximize the usage of information carried in the input data. A deep neural network was adopted to extract both DNA sequence and local methylation patterns for imputation. Our study of training models for completing methylome profiles of bovine oocytes and early embryos demonstrates the effectiveness of transfer learning and KL divergence, with individual increases of 29.98% and 29.43%, respectively, in prediction performance and a 38.70% increase when the two were used together. The drastically increased data coverage (43.80-73.6%) after imputation powers downstream analyses involving methylomes that cannot be effectively done using the very low coverage profiles (0.06-1.47%) before imputation.
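For readers unfamiliar with the quantity named in this abstract, the Kullback-Leibler divergence between two discrete distributions p and q is the sum over outcomes of p(i) * log(p(i) / q(i)). The sketch below only illustrates that definition; the abstract does not say how the divergence enters the training objective, and the function name and the two-state example (methylated versus unmethylated probabilities) are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes.

    Terms with p[i] == 0 contribute nothing; q[i] must be positive wherever p[i] is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical probabilities of (methylated, unmethylated) at one CpG site.
target = [0.5, 0.5]
predicted = [0.9, 0.1]

print(round(kl_divergence(target, predicted), 4))  # 0.5108
print(kl_divergence(target, target))               # 0.0
```

The divergence is zero only when the two distributions agree and is not symmetric in its arguments, so the direction used in a loss function is a modelling choice.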
Affiliation(s)
- Sanjeeva Dodlapati
  - Department of Computer Science, Old Dominion University, Norfolk, VA, United States
- Zongliang Jiang
  - School of Animal Sciences, AgCenter, Louisiana State University, Baton Rouge, LA, United States
- Jiangwen Sun
  - Department of Computer Science, Old Dominion University, Norfolk, VA, United States
11
Tanić M, Moghul I, Rodney S, Dhami P, Vaikkinen H, Ambrose J, Barrett J, Feber A, Beck S. Comparison and imputation-aided integration of five commercial platforms for targeted DNA methylome analysis. Nat Biotechnol 2022; 40:1478-1487. [PMID: 35654977 DOI: 10.1038/s41587-022-01336-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/06/2021] [Accepted: 04/28/2022] [Indexed: 11/09/2022]
Abstract
Targeted bisulfite sequencing (TBS) has become the method of choice for the cost-effective, targeted analysis of the human methylome at base-pair resolution. In this study, we benchmarked five commercially available TBS platforms across 11 samples: three hybridization capture-based (Agilent, Roche and Illumina) and two reduced-representation-based (Diagenode and NuGen). Two samples were also compared with whole-genome DNA methylation sequencing with the Illumina and Oxford Nanopore platforms. We assessed workflow complexity, on/off-target performance, coverage, accuracy and reproducibility. Although all platforms produced robust and reproducible data, major differences in the number and identity of the CpG sites covered make it difficult to compare datasets generated on different platforms. To overcome this limitation, we applied imputation and show that it improves interoperability from an average of 10.35% (0.8 million) to 97% (7.6 million) common CpG sites. Our study provides guidance on which TBS platform to use for different methylome features and offers an imputation-based harmonization solution that allows comparative, integrative analysis.
Affiliation(s)
- Miljana Tanić
  - University College London, UCL Cancer Institute, London, UK
  - Institute for Oncology and Radiology of Serbia, Experimental Oncology Department, Belgrade, Serbia
- Ismail Moghul
  - University College London, UCL Cancer Institute, London, UK
- Simon Rodney
  - University College London, UCL Cancer Institute, London, UK
- Pawan Dhami
  - University College London, UCL Cancer Institute, London, UK
  - NIHR Biomedical Research Centre, Guy's and St. Thomas' NHS Foundation Trust, Great Maze Pond, London, UK
- Heli Vaikkinen
  - NIHR Biomedical Research Centre, Guy's and St. Thomas' NHS Foundation Trust, Great Maze Pond, London, UK
  - University College London, Genomics and Genome Engineering Translational Technology Platform, London, UK
- John Ambrose
  - University College London, Bill Lyons Informatics Centre, London, UK
- James Barrett
  - University College London, UCL Cancer Institute, London, UK
- Andrew Feber
  - University College London, UCL Cancer Institute, London, UK
  - University College London, Division of Surgery and Interventional Science, London, UK
  - Royal Marsden Hospital, Molecular Pathology, London, UK
- Stephan Beck
  - University College London, UCL Cancer Institute, London, UK
12
Rahman MS, Chowdhury AH, Amrin M. Accuracy comparison of ARIMA and XGBoost forecasting models in predicting the incidence of COVID-19 in Bangladesh. PLOS Global Public Health 2022; 2:e0000495. [PMID: 36962227 PMCID: PMC10021465 DOI: 10.1371/journal.pgph.0000495] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/29/2021] [Accepted: 04/27/2022] [Indexed: 04/19/2023]
Abstract
Accurate predictive time series modelling is important in public health planning and response during the emergence of a novel pandemic. Therefore, the aims of the study are three-fold: (a) to model the overall trend of COVID-19 confirmed cases and deaths in Bangladesh; (b) to generate a short-term forecast of 8 weeks of COVID-19 cases and deaths; (c) to compare the predictive accuracy of the Autoregressive Integrated Moving Average (ARIMA) and eXtreme Gradient Boosting (XGBoost) for precise modelling of non-linear features and seasonal trends of the time series. The data were collected from the onset of the epidemic in Bangladesh from the Directorate General of Health Service (DGHS) and Institute of Epidemiology, Disease Control and Research (IEDCR). The daily confirmed cases and deaths of COVID-19 over 633 days in Bangladesh were divided into several training and test sets. The ARIMA and XGBoost models were fitted using the training data, the test sets were used to evaluate each model's ability to forecast, and the predictive performances were then averaged to choose the best model. The predictive accuracy of the models was assessed using the mean absolute error (MAE), mean percentage error (MPE), root mean square error (RMSE) and mean absolute percentage error (MAPE). The findings reveal the existence of a nonlinear trend and weekly seasonality in the dataset. The average error measures of the ARIMA model for both COVID-19 confirmed cases and deaths were lower than those of the XGBoost model. Hence, in our study, the ARIMA model performed better than the XGBoost model in predicting COVID-19 confirmed cases and deaths in Bangladesh. The suggested prediction model might play a critical role in estimating the spread of a novel pandemic in Bangladesh and similar countries.
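The four error measures named in this abstract have standard definitions, sketched below. The function name `forecast_errors`, the sign convention for MPE (actual minus predicted), and the sample series are assumptions for illustration and are not taken from the study. The percentage measures are undefined for any day on which the actual value is zero.

```python
import math


def forecast_errors(actual, predicted):
    """Return MAE, MPE, RMSE and MAPE for two equal-length series.

    Assumes every actual value is non-zero, since MPE and MAPE divide by it.
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mpe = 100 * sum(e / a for e, a in zip(errors, actual)) / n
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"MAE": mae, "MPE": mpe, "RMSE": rmse, "MAPE": mape}


# Hypothetical daily counts, for illustration only.
actual = [100, 200, 400]
predicted = [110, 180, 400]

scores = forecast_errors(actual, predicted)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy series the over- and under-predictions cancel in MPE while MAPE remains positive, which is why the two are usually reported together: MPE indicates bias and MAPE indicates typical relative error.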
13
14
De Waele G, Clauwaert J, Menschaert G, Waegeman W. CpG Transformer for imputation of single-cell methylomes. Bioinformatics 2021; 38:597-603. [PMID: 34718418 PMCID: PMC8756163 DOI: 10.1093/bioinformatics/btab746] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 06/08/2021] [Revised: 10/19/2021] [Accepted: 10/25/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION The adoption of current single-cell DNA methylation sequencing protocols is hindered by incomplete coverage, outlining the need for effective imputation techniques. The task of imputing single-cell (methylation) data requires models to build an understanding of underlying biological processes. RESULTS We adapt the transformer neural network architecture to operate on methylation matrices through combining axial attention with sliding window self-attention. The obtained CpG Transformer displays state-of-the-art performances on a wide range of scBS-seq and scRRBS-seq datasets. Furthermore, we demonstrate the interpretability of CpG Transformer and illustrate its rapid transfer learning properties, allowing practitioners to train models on new datasets with a limited computational and time budget. AVAILABILITY AND IMPLEMENTATION CpG Transformer is freely available at https://github.com/gdewael/cpg-transformer. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gaetan De Waele
  - Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent 9000, Belgium
- Jim Clauwaert
  - Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent 9000, Belgium
- Gerben Menschaert
  - Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent 9000, Belgium
|
15
|
Li G, Raffield L, Logue M, Miller MW, Santos HP, O'Shea TM, Fry RC, Li Y. CUE: CpG impUtation ensemble for DNA methylation levels across the human methylation450 (HM450) and EPIC (HM850) BeadChip platforms. Epigenetics 2021; 16:851-861. [PMID: 33016200 PMCID: PMC8330997 DOI: 10.1080/15592294.2020.1827716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2020] [Revised: 08/08/2020] [Accepted: 09/09/2020] [Indexed: 10/23/2022]
Abstract
DNA methylation at CpG dinucleotides is one of the most extensively studied epigenetic marks. With technological advancements, geneticists can profile DNA methylation with multiple reliable approaches. However, profiling platforms can differ substantially in the CpGs they assess, consequently hindering integrated analysis across platforms. Here, we present CpG impUtation Ensemble (CUE), which leverages multiple classical statistical and modern machine learning methods, to impute from the Illumina HumanMethylation450 (HM450) BeadChip to the Illumina HumanMethylationEPIC (HM850) BeadChip. Data were analysed from two population cohorts with methylation measured both by HM450 and HM850: the Extremely Low Gestational Age Newborns (ELGAN) study (n = 127, placenta) and the VA Boston Posttraumatic Stress Disorder (PTSD) genetics repository (n = 144, whole blood). Cross-validation results show that CUE achieves the lowest predicted root-mean-square error (RMSE) (0.026 in PTSD) and the highest accuracy (99.97% in PTSD) compared with five individual methods tested, including k-nearest-neighbours, logistic regression, penalized functional regression, random forest, and XGBoost. Finally, among all 339,033 HM850-only CpG sites shared between ELGAN and PTSD, CUE successfully imputed 289,604 (85.4%) sites, where success was defined as RMSE < 0.05 and accuracy >95% in PTSD. In summary, CUE is a valuable tool for imputing CpG methylation from the HM450 to HM850 platform.
Affiliation(s)
- Gang Li
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Laura Raffield
  - Department of Genetics, University of North Carolina, Chapel Hill, NC, USA
- Mark Logue
  - National Center for PTSD: Behavioral Sciences Division at VA Boston Healthcare System, Boston, MA, USA
  - Department of Psychiatry, Boston University School of Medicine, Boston, MA, USA
  - Biomedical Genetics, Boston University School of Medicine, Boston, MA, USA
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
- Mark W Miller
  - National Center for PTSD: Behavioral Sciences Division at VA Boston Healthcare System, Boston, MA, USA
  - Department of Psychiatry, Boston University School of Medicine, Boston, MA, USA
- Hudson P Santos
  - School of Nursing, University of North Carolina, Chapel Hill, NC, USA
  - Institute for Environmental Health Solutions, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC, USA
- T Michael O'Shea
  - Department of Pediatrics, School of Medicine, University of North Carolina, Chapel Hill, NC, USA
- Rebecca C Fry
  - Institute for Environmental Health Solutions, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC, USA
  - Department of Environmental Sciences and Engineering, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC, USA
  - Curriculum in Toxicology and Environmental Medicine, University of North Carolina, Chapel Hill, NC, USA
- Yun Li
  - Department of Genetics, University of North Carolina, Chapel Hill, NC, USA
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC, USA
  - Department of Computer Science, University of North Carolina, Chapel Hill, NC, USA
|
16
|
Minegishi R, Gotoh O, Tanaka N, Maruyama R, Chang JT, Mori S. A method of sample-wise region-set enrichment analysis for DNA methylomics. Epigenomics 2021; 13:1081-1093. [PMID: 34241544 DOI: 10.2217/epi-2021-0065] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/21/2022]
Abstract
Aim: Gene set analysis has commonly been used to interpret DNA methylome data. However, summarizing the DNA methylation level of a gene is challenging due to variability in the number, density and methylation levels of CpG sites, and the numerous intergenic CpGs. Instead, we propose to use region sets to annotate the DNA methylome. Methods: We developed single sample region-set enrichment analysis for DNA methylome (methyl-ssRSEA) to conduct sample-wise, region-set enrichment analysis. Results: Methyl-ssRSEA can handle both microarray- and sequencing-based platforms and reproducibly recover the known biology from the methylation profiles of peripheral blood cells and breast cancers. The performance was superior to existing tools for region-set analysis in discriminating blood cell types. Conclusion: Methyl-ssRSEA offers a novel way to functionally interpret the DNA methylome in the cell.
Affiliation(s)
- Ryu Minegishi
  - Project for Development of Innovative Research on Cancer Therapeutics, Cancer Precision Medicine Center, Japanese Foundation for Cancer Research, Tokyo, Japan
- Osamu Gotoh
  - Project for Development of Innovative Research on Cancer Therapeutics, Cancer Precision Medicine Center, Japanese Foundation for Cancer Research, Tokyo, Japan
- Norio Tanaka
  - Project for Development of Innovative Research on Cancer Therapeutics, Cancer Precision Medicine Center, Japanese Foundation for Cancer Research, Tokyo, Japan
- Reo Maruyama
  - Project for Cancer Epigenomics, Cancer Institute, Japanese Foundation for Cancer Research, Tokyo, Japan
- Jeffrey T Chang
  - Department of Integrative Biology & Pharmacology, University of Texas Health Science Center, Houston, TX 77030, USA
- Seiichi Mori
  - Project for Development of Innovative Research on Cancer Therapeutics, Cancer Precision Medicine Center, Japanese Foundation for Cancer Research, Tokyo, Japan
|
17
|
Ke Y, Rao J, Zhao H, Lu Y, Xiao N, Yang Y. Accurate prediction of genome-wide RNA secondary structure profile based on extreme gradient boosting. Bioinformatics 2021; 36:4576-4582. [PMID: 32467966 DOI: 10.1093/bioinformatics/btaa534] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/18/2019] [Revised: 05/01/2020] [Accepted: 05/23/2020] [Indexed: 12/20/2022]
Abstract
MOTIVATION: RNA secondary structure plays a vital role in fundamental cellular processes, and identification of RNA secondary structure is a key step to understand RNA functions. Recently, a few experimental methods were developed to profile genome-wide RNA secondary structure, i.e. the pairing probability of each nucleotide, through high-throughput sequencing techniques. However, these high-throughput methods have low precision and cannot cover all nucleotides due to limited sequencing coverage. RESULTS: Here, we have developed a new method for the prediction of genome-wide RNA secondary structure profile from RNA sequence based on the extreme gradient boosting technique. The method achieves predictions with areas under the receiver operating characteristic curve (AUC) >0.9 on three different datasets, and an AUC of 0.888 in another independent test on the recently released Zika virus data. These AUCs are consistently >5% greater than those of the CROSS method, recently developed based on a shallow neural network. Further analysis on the 1000 Genome Project data showed that our predicted unpaired probabilities are highly correlated (>0.8) with the minor allele frequencies at synonymous, non-synonymous mutations, and mutations in untranslated regions, which were higher than those generated by RNAplfold. Moreover, the prediction over all human mRNA was consistent with the previous observation that there is a periodic distribution of unpaired probability on codons. The accurate predictions by our method indicate that such a model trained on genome-wide experimental data might be an alternative to analytical methods. AVAILABILITY AND IMPLEMENTATION: GRASP is available for academic use at https://github.com/sysu-yanglab/GRASP. SUPPLEMENTARY INFORMATION: Supplementary data are available online.
Affiliation(s)
- Yaobin Ke
  - School of Data and Computer Science, Guangzhou 510000, China
- Jiahua Rao
  - School of Data and Computer Science, Guangzhou 510000, China
- Huiying Zhao
  - Sun Yat-sen Memorial Hospital, Guangzhou 510000, China
- Yutong Lu
  - School of Data and Computer Science, Guangzhou 510000, China
- Nong Xiao
  - School of Data and Computer Science, Guangzhou 510000, China
- Yuedong Yang
  - School of Data and Computer Science, Guangzhou 510000, China
  - Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University) of Ministry of Education, Guangzhou 510000, China
|
18
|
The progress on the estimation of DNA methylation level and the detection of abnormal methylation. Quantitative Biology 2021. [DOI: 10.15302/j-qb-022-0289] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
|
19
|
Alim M, Ye GH, Guan P, Huang DS, Zhou BS, Wu W. Comparison of ARIMA model and XGBoost model for prediction of human brucellosis in mainland China: a time-series study. BMJ Open 2020; 10:e039676. [PMID: 33293308 PMCID: PMC7722837 DOI: 10.1136/bmjopen-2020-039676] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Indexed: 12/17/2022]
Abstract
OBJECTIVES: Human brucellosis is a public health problem endangering health and property in China. Predicting the trend and the seasonality of human brucellosis is of great significance for its prevention. In this study, a comparison between the autoregressive integrated moving average (ARIMA) model and the eXtreme Gradient Boosting (XGBoost) model was conducted to determine which was more suitable for predicting the occurrence of brucellosis in mainland China. DESIGN: Time-series study. SETTING: Mainland China. METHODS: Data on human brucellosis in mainland China were provided by the National Health and Family Planning Commission of China. The data were divided into a training set and a test set. The training set was composed of the monthly incidence of human brucellosis in mainland China from January 2008 to June 2018, and the test set was composed of the monthly incidence from July 2018 to June 2019. The mean absolute error (MAE), root mean square error (RMSE) and mean absolute percentage error (MAPE) were used to evaluate the effects of model fitting and prediction. RESULTS: The number of human brucellosis patients in mainland China increased from 30 002 in 2008 to 40 328 in 2018. There was an increasing trend and obvious seasonal distribution in the original time series. For the training set, the MAE, RMSE and MAPE of the ARIMA(0,1,1)×(0,1,1)12 model were 338.867, 450.223 and 10.323, respectively, and the MAE, RMSE and MAPE of the XGBoost model were 189.332, 262.458 and 4.475, respectively. For the test set, the MAE, RMSE and MAPE of the ARIMA(0,1,1)×(0,1,1)12 model were 529.406, 586.059 and 17.676, respectively, and the MAE, RMSE and MAPE of the XGBoost model were 249.307, 280.645 and 7.643, respectively. CONCLUSIONS: The performance of the XGBoost model was better than that of the ARIMA model. The XGBoost model is more suitable for predicting cases of human brucellosis in mainland China.
Affiliation(s)
- Mirxat Alim
  - Department of Epidemiology, China Medical University, Shenyang, China
- Guo-Hua Ye
  - Department of Epidemiology, China Medical University, Shenyang, China
- Peng Guan
  - Department of Epidemiology, China Medical University, Shenyang, China
- De-Sheng Huang
  - Department of Mathematics, China Medical University, Shenyang, China
- Bao-Sen Zhou
  - Department of Epidemiology, China Medical University, Shenyang, China
- Wei Wu
  - Department of Epidemiology, China Medical University, Shenyang, China
|
20
|
Yu F, Xu C, Deng HW, Shen H. A novel computational strategy for DNA methylation imputation using mixture regression model (MRM). BMC Bioinformatics 2020; 21:552. [PMID: 33261550 PMCID: PMC7708217 DOI: 10.1186/s12859-020-03865-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/08/2020] [Accepted: 11/09/2020] [Indexed: 12/19/2022]
Abstract
BACKGROUND: DNA methylation is an important heritable epigenetic mark that plays a crucial role in transcriptional regulation and the pathogenesis of various human disorders. The commonly used DNA methylation measurement approaches, e.g., Illumina Infinium HumanMethylation-27 and -450 BeadChip arrays (27 K and 450 K arrays) and reduced representation bisulfite sequencing (RRBS), only cover a small proportion of the total CpG sites in the human genome, which considerably limits the scope of the DNA methylation analysis in those studies. RESULTS: We proposed a new computational strategy to impute the methylation value at the unmeasured CpG sites using the mixture of regression model (MRM) of radial basis functions, integrating information of neighboring CpGs and the similarities in local methylation patterns across subjects and across multiple genomic regions. Our method achieved better imputation accuracy than a set of competing methods on both simulated and empirical data, particularly when the missing rate is high. By applying MRM to an RRBS dataset from subjects with low versus high bone mineral density (BMD), we recovered methylation values of ~ 300 K CpGs in the promoter regions of chromosome 17 and identified some novel differentially methylated CpGs that are significantly associated with BMD. CONCLUSIONS: Our method is applicable to a wide range of methylation studies. By expanding the coverage of the methylation dataset to unmeasured sites, it can significantly enhance the discovery of novel differential methylation signals and thus reveal the mechanisms underlying various human disorders/traits.
Affiliation(s)
- Fangtang Yu
  - Center for Bioinformatics and Genomics, Department of Biostatistics and Data Science, School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA, 70112, USA
- Chao Xu
  - Center for Bioinformatics and Genomics, Department of Biostatistics and Data Science, School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA, 70112, USA
- Hong-Wen Deng
  - Center for Bioinformatics and Genomics, Department of Biostatistics and Data Science, School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA, 70112, USA
- Hui Shen
  - Center for Bioinformatics and Genomics, Department of Biostatistics and Data Science, School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA, 70112, USA
|
21
|
Roels J, Thénoz M, Szarzyńska B, Landfors M, De Coninck S, Demoen L, Provez L, Kuchmiy A, Strubbe S, Reunes L, Pieters T, Matthijssens F, Van Loocke W, Erarslan-Uysal B, Richter-Pechańska P, Declerck K, Lammens T, De Moerloose B, Deforce D, Van Nieuwerburgh F, Cheung LC, Kotecha RS, Mansour MR, Ghesquière B, Van Camp G, Berghe WV, Kowalczyk JR, Szczepański T, Davé UP, Kulozik AE, Goossens S, Curtis DJ, Taghon T, Dawidowska M, Degerman S, Van Vlierberghe P. Aging of preleukemic thymocytes drives CpG island hypermethylation in T-cell acute lymphoblastic leukemia. Blood Cancer Discov 2020; 1:274-289. [PMID: 33179015 PMCID: PMC7116343 DOI: 10.1158/2643-3230.bcd-20-0059] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/17/2020] [Revised: 08/06/2020] [Accepted: 09/15/2020] [Indexed: 12/20/2022]
Abstract
Cancer cells display DNA hypermethylation at specific CpG islands in comparison to their normal healthy counterparts, but the mechanism that drives this so-called CpG island methylator phenotype (CIMP) remains poorly understood. Here, we show that CpG island methylation in human T-cell acute lymphoblastic leukemia (T-ALL) mainly occurs at promoters of Polycomb Repressor Complex 2 (PRC2) target genes that are not expressed in normal or malignant T-cells and which display a reciprocal association with H3K27me3 binding. In addition, we revealed that this aberrant methylation profile reflects the epigenetic history of T-ALL and is established already in pre-leukemic, self-renewing thymocytes that precede T-ALL development. Finally, we unexpectedly uncover that this age-related CpG island hypermethylation signature in T-ALL is completely resistant to the FDA-approved hypomethylating agent Decitabine. Altogether, we here provide conceptual evidence for the involvement of a pre-leukemic phase characterized by self-renewing thymocytes in the pathogenesis of human T-ALL.
Affiliation(s)
- Juliette Roels
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
- Morgan Thénoz
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
- Mattias Landfors
  - Department of Medical Biosciences, Umeå University, Umeå, Sweden
- Stien De Coninck
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
- Lisa Demoen
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
- Lien Provez
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
- Anna Kuchmiy
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
  - Department of Diagnostic Sciences, Ghent University, Ghent, Belgium
- Steven Strubbe
  - Department of Diagnostic Sciences, Ghent University, Ghent, Belgium
- Lindy Reunes
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
- Tim Pieters
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
- Filip Matthijssens
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
- Wouter Van Loocke
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
- Büşra Erarslan-Uysal
  - Department of Pediatric Oncology, Hematology, and Immunology, University of Heidelberg, and Hopp Children's Cancer Center at NCT Heidelberg, Heidelberg, Germany
  - Molecular Medicine Partnership Unit (MMPU), European Molecular Biology Laboratory (EMBL), University of Heidelberg, Heidelberg, Germany
- Paulina Richter-Pechańska
  - Department of Pediatric Oncology, Hematology, and Immunology, University of Heidelberg, and Hopp Children's Cancer Center at NCT Heidelberg, Heidelberg, Germany
  - Molecular Medicine Partnership Unit (MMPU), European Molecular Biology Laboratory (EMBL), University of Heidelberg, Heidelberg, Germany
- Ken Declerck
  - Laboratory of Protein Chemistry, Proteomics and Epigenetic Signaling (PPES) and Integrated Personalized and Precision Oncology Network (IPPON), Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
- Tim Lammens
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
  - Department of Pediatric Hematology-Oncology and Stem Cell Transplantation, Ghent University Hospital, Ghent, Belgium
- Barbara De Moerloose
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
  - Department of Pediatric Hematology-Oncology and Stem Cell Transplantation, Ghent University Hospital, Ghent, Belgium
- Dieter Deforce
  - Laboratory of Pharmaceutical Biotechnology, Ghent University, Ghent, Belgium
- Laurence C Cheung
  - Telethon Kids Cancer Centre, Telethon Kids Institute, University of Western Australia, Perth, Western Australia
  - School of Pharmacy and Biomedical Sciences, Curtin University, Perth, Western Australia
- Rishi S Kotecha
  - Telethon Kids Cancer Centre, Telethon Kids Institute, University of Western Australia, Perth, Western Australia
  - School of Pharmacy and Biomedical Sciences, Curtin University, Perth, Western Australia
- Marc R Mansour
  - Department of Haematology, University College London Cancer Institute, London, England
- Bart Ghesquière
  - Metabolomics Expertise Center, VIB Center for Cancer Biology, Leuven, Belgium
- Guy Van Camp
  - Center of Medical Genetics, University of Antwerp, Antwerp, Belgium
- Wim Vanden Berghe
  - Laboratory of Protein Chemistry, Proteomics and Epigenetic Signaling (PPES) and Integrated Personalized and Precision Oncology Network (IPPON), Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
- Jerzy R Kowalczyk
  - Department of Pediatric Hematology, Oncology and Transplantology, Medical University of Lublin, Lublin, Poland
- Tomasz Szczepański
  - Department of Pediatric Hematology and Oncology, Zabrze, Medical University of Silesia, Katowice, Poland
- Utpal P Davé
  - Roudebush Veterans Affairs Medical Center and Indiana University School of Medicine, Indianapolis, Indiana
- Andreas E Kulozik
  - Department of Pediatric Oncology, Hematology, and Immunology, University of Heidelberg, and Hopp Children's Cancer Center at NCT Heidelberg, Heidelberg, Germany
  - Molecular Medicine Partnership Unit (MMPU), European Molecular Biology Laboratory (EMBL), University of Heidelberg, Heidelberg, Germany
- Steven Goossens
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
  - Department of Diagnostic Sciences, Ghent University, Ghent, Belgium
- David J Curtis
  - Australian Centre for Blood Diseases (ACBD), Monash University, Melbourne, Australia
- Tom Taghon
  - Department of Diagnostic Sciences, Ghent University, Ghent, Belgium
- Sofie Degerman
  - Department of Medical Biosciences, Umeå University, Umeå, Sweden
  - Department of Clinical Microbiology, Umeå University, Umeå, Sweden
- Pieter Van Vlierberghe
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
  - Cancer Research Institute Ghent (CRIG), Ghent, Belgium
|
22
|
A survey on single and multi omics data mining methods in cancer data classification. J Biomed Inform 2020; 107:103466. [DOI: 10.1016/j.jbi.2020.103466] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 12/11/2019] [Revised: 05/01/2020] [Accepted: 05/31/2020] [Indexed: 01/09/2023]
|
23
|
Mendik P, Dobronyi L, Hári F, Kerepesi C, Maia-Moço L, Buszlai D, Csermely P, Veres DV. Translocatome: a novel resource for the analysis of protein translocation between cellular organelles. Nucleic Acids Res 2020; 47:D495-D505. [PMID: 30380112 PMCID: PMC6324082 DOI: 10.1093/nar/gky1044] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 08/15/2018] [Accepted: 10/25/2018] [Indexed: 01/02/2023]
Abstract
Here we present Translocatome, the first dedicated database of human translocating proteins (URL: http://translocatome.linkgroup.hu). The core of the Translocatome database is the manually curated data set of 213 human translocating proteins listing the source of their experimental validation, several details of their translocation mechanism, their local compartmentalized interactome, as well as their involvement in signalling pathways and disease development. In addition, using the well-established and widely used gradient boosting machine learning tool, XGBoost, Translocatome provides translocation probability values for 13 066 human proteins identifying 1133 and 3268 high- and low-confidence translocating proteins, respectively. The database has user-friendly search options with a UniProt autocomplete quick search and advanced search for proteins filtered by their localization, UniProt identifiers, translocation likelihood or data complexity. Download options of search results, manually curated and predicted translocating protein sets are available on its website. The update of the database is helped by its manual curation framework and connection to the previously published ComPPI compartmentalized protein–protein interaction database (http://comppi.linkgroup.hu). As shown by the application examples of merlin (NF2) and tumor protein 63 (TP63) Translocatome allows a better comprehension of protein translocation as a systems biology phenomenon and can be used as a discovery-tool in the protein translocation field.
Affiliation(s)
- Péter Mendik
  - Department of Medical Chemistry, Semmelweis University, Budapest, Hungary
- Levente Dobronyi
  - Department of Medical Chemistry, Semmelweis University, Budapest, Hungary
- Ferenc Hári
  - Department of Medical Chemistry, Semmelweis University, Budapest, Hungary
- Csaba Kerepesi
  - Institute for Computer Science and Control (MTA SZTAKI), Hungarian Academy of Sciences, Budapest, Hungary
  - Institute of Mathematics, Eötvös Loránd University, Budapest, Hungary
- Leonardo Maia-Moço
  - Department of Medical Chemistry, Semmelweis University, Budapest, Hungary
  - Cancer Biology and Epigenetics Group, Research Center of Portuguese Oncology Institute of Porto, Portugal
- Donát Buszlai
  - Department of Medical Chemistry, Semmelweis University, Budapest, Hungary
- Peter Csermely
  - Department of Medical Chemistry, Semmelweis University, Budapest, Hungary
- Daniel V Veres
  - Department of Medical Chemistry, Semmelweis University, Budapest, Hungary
  - Turbine Ltd., Budapest, Hungary
|
24
|
Tang J, Zou J, Zhang X, Fan M, Tian Q, Fu S, Gao S, Fan S. PretiMeth: precise prediction models for DNA methylation based on single methylation mark. BMC Genomics 2020; 21:364. [PMID: 32414326 PMCID: PMC7227319 DOI: 10.1186/s12864-020-6768-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/29/2019] [Accepted: 05/04/2020] [Indexed: 11/29/2022]
Abstract
Background: The computational prediction of methylation levels at single CpG resolution is a promising way to explore the methylation levels of CpGs not covered by existing array techniques, especially for the large reserves of 450 K beadchip array data. General prediction models concentrate on improving the overall prediction accuracy for the bulk of CpG loci while neglecting whether each locus is precisely predicted. This limits the application of the prediction results, especially when performing downstream analysis with high precision requirements. Results: Here we report PretiMeth, a method for constructing precise prediction models for each single CpG locus. PretiMeth uses a logistic regression algorithm to build a prediction model for each locus of interest. Only one DNA methylation feature, the one sharing the most similar methylation pattern with the CpG locus to be predicted, is applied in the model. We found that PretiMeth outperformed other algorithms in prediction accuracy and remained robust across platforms and cell types. Furthermore, PretiMeth was applied to The Cancer Genome Atlas (TCGA) data; intensive analysis based on the precise prediction results showed that several CpG loci and genes (differentially methylated between the tumor and normal samples) were worthy of further biological validation. Conclusion: The precise prediction of single CpG loci is important for both methylation array data expansion and downstream analysis of prediction results. PretiMeth achieved precise modeling for each CpG locus by using only one significant feature, which also suggests that our precise prediction models could be used for reference in probe set design when the DNA methylation beadchip is updated. PretiMeth is provided as an open source tool via https://github.com/JxTang-bioinformatics/PretiMeth.
Affiliation(s)
- Jianxiong Tang
  - School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Jianxiao Zou
  - School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Xiaoran Zhang
  - School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
  - Department of Automation, Tsinghua University, Beijing, 100084, China
- Mei Fan
  - Chengdu Women's and Children's Central Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Qi Tian
  - School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Shuyao Fu
  - School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Shihong Gao
  - School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Shicai Fan
  - School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 611731, China
|
25
|
Wu C, Muhataer X, Wang W, Deng M, Jin R, Lian Z, Luo D, Li Y, Yang X. Abnormal DNA methylation patterns in patients with infection‑caused leukocytopenia based on methylation microarrays. Mol Med Rep 2020; 21:2335-2348. [PMID: 32323775 PMCID: PMC7185277 DOI: 10.3892/mmr.2020.11061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2018] [Accepted: 02/07/2020] [Indexed: 11/05/2022]
Abstract
The present study aimed to investigate the association between gene methylation and leukocytopenia from the perspective of gene regulation. A total of 30 patients confirmed as having post-infection leukocytopenia at People's Hospital of Xinjiang Uygur Autonomous Region between January 2016 and June 2017 were successively recruited as the leukocytopenia group; 30 patients with post-infection leukocytosis were enrolled as the leukocytosis group. In addition, 30 healthy volunteers who received a health examination at the hospital during the same period were included as the normal control group. In each group, four individuals were randomly selected for whole genome methylation screening. After selection of key methylation sites, the remaining samples in each group were used for verification using matrix-assisted laser desorption/ionization-time of flight mass spectrometry. The levels of serum complement factors C3 and C5 in the leukocytopenia group were significantly lower than those in the other two groups (P<0.05). According to whole-genome DNA methylation detection, 66 and 27 methylation loci may be associated with leukocytopenia and leukocytosis, respectively. Most of these abnormal loci are located on chromosomes 2, 6, 7, 1, 17 and 11. The rates of WW domain containing E3 ubiquitin protein ligase 2 gene methylation at cytosine-phosphate-guanine (CpG)_1, CpG_5/6 and CpG_7 in the leukocytopenia group were higher than in the other two groups (P<0.05); the rate of AKT2 CpG_1 methylation was higher in the leukocytopenia group than in the other two groups (P<0.05); the rate of calcium-binding atopy-related autoantigen 1 gene CpG_2 methylation was higher in the leukocytosis group than in the normal control group (P<0.05); and the rate of NADPH oxidase 5 gene CpG_3 methylation was higher in the leukocytosis group than in the normal control group (P<0.05). 
Chemotactic factor secretion and cell migration abnormalities, ubiquitination modification disorders and reduced oxidative burst may participate in infection-complicated leukocytopenia. The results of this study shed new light on the molecular biological mechanisms of infection-complicated leukocytopenia and provide novel avenues for diagnosis and treatment.
Affiliation(s)
- Chao Wu
- Department of Respiratory and Critical Care Medicine, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, Xinjiang Uygur Autonomous Region 830001, P.R. China
| | - Xirennayi Muhataer
- Department of Respiratory and Critical Care Medicine, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, Xinjiang Uygur Autonomous Region 830001, P.R. China
| | - Wenyi Wang
- Department of Respiratory and Critical Care Medicine, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, Xinjiang Uygur Autonomous Region 830001, P.R. China
| | - Mingqin Deng
- Department of Respiratory and Critical Care Medicine, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, Xinjiang Uygur Autonomous Region 830001, P.R. China
| | - Rong Jin
- Department of Respiratory and Critical Care Medicine, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, Xinjiang Uygur Autonomous Region 830001, P.R. China
| | - Zhichuang Lian
- Department of Respiratory and Critical Care Medicine, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, Xinjiang Uygur Autonomous Region 830001, P.R. China
| | - Dan Luo
- Department of Respiratory and Critical Care Medicine, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, Xinjiang Uygur Autonomous Region 830001, P.R. China
| | - Yafang Li
- Department of Respiratory and Critical Care Medicine, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, Xinjiang Uygur Autonomous Region 830001, P.R. China
| | - Xiaohong Yang
- Department of Respiratory and Critical Care Medicine, People's Hospital of Xinjiang Uygur Autonomous Region, Urumqi, Xinjiang Uygur Autonomous Region 830001, P.R. China
|
26
|
Lv X, Chen J, Lu Y, Chen Z, Xiao N, Yang Y. Accurately Predicting Mutation-Caused Stability Changes from Protein Sequences Using Extreme Gradient Boosting. J Chem Inf Model 2020; 60:2388-2395. [PMID: 32203653 DOI: 10.1021/acs.jcim.0c00064] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 12/14/2022]
Abstract
Accurately predicting the impact of point mutations on protein stability plays a crucial role in protein design and engineering. In this study, we proposed a novel method (BoostDDG) to predict stability changes upon point mutations from protein sequences based on extreme gradient boosting. We extracted features comprehensively from evolutionary information and predicted structures and performed feature selection by a strategy of sequential forward selection. The features and parameters were optimized by homologue-based cross-validation to avoid overfitting. Finally, we found that 14 features from six groups led to the highest Pearson correlation coefficient (PCC) of 0.535, which is consistent with the 0.540 on an independent test. Our method consistently outperformed other sequence-based methods on three precompiled test sets and on 7363 variants of two proteins (PTEN and TPMT). These results highlighted that BoostDDG is a powerful tool for predicting stability changes upon point mutations from protein sequences.
Affiliation(s)
- Xuan Lv
- State Key Laboratory of High-Performance Computing, School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China
| | - Jianwen Chen
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
| | - Yutong Lu
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
| | - Zhiguang Chen
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
| | - Nong Xiao
- State Key Laboratory of High-Performance Computing, School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China.,School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
| | - Yuedong Yang
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China.,Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-sen University, Ministry of Education, Guangzhou, Guangdong 510275, China
|
27
|
Xu T, Zheng X, Li B, Jin P, Qin Z, Wu H. A comprehensive review of computational prediction of genome-wide features. Brief Bioinform 2020; 21:120-134. [PMID: 30462144 PMCID: PMC10233247 DOI: 10.1093/bib/bby110] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 07/10/2018] [Revised: 10/15/2018] [Accepted: 10/16/2018] [Indexed: 12/15/2022] Open
Abstract
There are significant correlations among different types of genetic, genomic and epigenomic features within the genome. These correlations make in silico feature prediction possible through statistical or machine learning models. With the accumulation of a vast amount of high-throughput data, feature prediction has gained significant interest lately, and a plethora of papers have been published in the past few years. Here we provide a comprehensive review of these published works, categorized by prediction target, including protein binding site, enhancer, DNA methylation, chromatin structure and gene expression. We also discuss some important points and possible future directions.
Affiliation(s)
- Tianlei Xu
- Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA
| | - Xiaoqi Zheng
- Department of Mathematics, Shanghai Normal University, Shanghai, China
| | - Ben Li
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
| | - Peng Jin
- Department of Human Genetics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
| | - Zhaohui Qin
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
| | - Hao Wu
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
|
28
|
Quang D, Xie X. FactorNet: A deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods 2019; 166:40-47. [PMID: 30922998 PMCID: PMC6708499 DOI: 10.1016/j.ymeth.2019.03.020] [Citation(s) in RCA: 98] [Impact Index Per Article: 16.3] [Received: 11/01/2018] [Revised: 03/05/2019] [Accepted: 03/20/2019] [Indexed: 01/08/2023] Open
Abstract
Due to the large numbers of transcription factors (TFs) and cell types, querying binding profiles of all valid TF/cell type pairs is not experimentally feasible. To address this issue, we developed a convolutional-recurrent neural network model, called FactorNet, to computationally impute the missing binding data. FactorNet trains on binding data from reference cell types to make predictions on testing cell types by leveraging a variety of features, including genomic sequences, genome annotations, gene expression, and signal data, such as DNase I cleavage. FactorNet implements several convenient strategies to reduce runtime and memory consumption. By visualizing the neural network models, we can interpret how the model predicts binding. We also investigate the variables that affect cross-cell type accuracy, and offer suggestions to improve upon this field. Our method ranked among the top teams in the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge, achieving first place on six of the 13 final round evaluation TF/cell type pairs, the most of any competing team. The FactorNet source code is publicly available, allowing users to reproduce our methodology from the ENCODE-DREAM Challenge.
Affiliation(s)
- Daniel Quang
- University of California, Department of Computer Science, Irvine, CA 92697, United States.
| | - Xiaohui Xie
- University of California, Department of Computer Science, Irvine, CA 92697, United States.
|
29
|
Zhou L, Ng HK, Drautz-Moses DI, Schuster SC, Beck S, Kim C, Chambers JC, Loh M. Systematic evaluation of library preparation methods and sequencing platforms for high-throughput whole genome bisulfite sequencing. Sci Rep 2019; 9:10383. [PMID: 31316107 PMCID: PMC6637168 DOI: 10.1038/s41598-019-46875-5] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 07/05/2019] [Accepted: 07/05/2019] [Indexed: 01/30/2023] Open
Abstract
Whole genome bisulfite sequencing (WGBS), with its ability to interrogate methylation status at single CpG site resolution epigenome-wide, is a powerful technique for use in molecular experiments. Here, we aim to advance strategies for accurate and efficient WGBS for application in future large-scale epidemiological studies. We systematically compared the performance of three WGBS library preparation methods with low DNA input requirement (Swift Biosciences Accel-NGS, Illumina TruSeq and QIAGEN QIAseq) on two state-of-the-art sequencing platforms (Illumina NovaSeq and HiSeq X), and also assessed concordance between data generated by WGBS and methylation arrays. Swift achieved the highest proportion of CpG sites assayed and effective coverage at 26x (P < 0.001). TruSeq suffered from the highest proportion of PCR duplicates, while QIAseq failed to deliver across all quality metrics. There was little difference in performance between NovaSeq and HiSeq X, with the exception of higher read duplication rate on the NovaSeq (P < 0.05), likely attributable to the higher cluster densities on its flow cells. Systematic biases exist between WGBS and methylation arrays, with lower precision observed for WGBS across the range of depths investigated. To achieve a level of precision broadly comparable to the methylation array, a minimum coverage of 100x is recommended.
Affiliation(s)
- Li Zhou
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, 308232, Singapore
| | - Hong Kiat Ng
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, 308232, Singapore
| | - Daniela I Drautz-Moses
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, 637551, Singapore
| | - Stephan C Schuster
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, 637551, Singapore
| | - Stephan Beck
- University College London Cancer Institute, London, WC1E 6BT, United Kingdom
| | | | - John Campbell Chambers
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, 308232, Singapore
- Department of Epidemiology and Biostatistics, Imperial College London, London, W2 1PG, United Kingdom
- Department of Cardiology, Ealing Hospital, London North West Healthcare NHS Trust, Southall, UB1 3HW, United Kingdom
- Imperial College Healthcare NHS Trust, London, W12 0HS, United Kingdom
| | - Marie Loh
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, 308232, Singapore.
- Department of Epidemiology and Biostatistics, Imperial College London, London, W2 1PG, United Kingdom.
- Translational Laboratory in Genetic Medicine, Agency for Science, Technology and Research (A*STAR), Singapore, 138648, Singapore.
|
30
|
Jiang L, Wang C, Tang J, Guo F. LightCpG: a multi-view CpG sites detection on single-cell whole genome sequence data. BMC Genomics 2019; 20:306. [PMID: 31014252 PMCID: PMC6480911 DOI: 10.1186/s12864-019-5654-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 12/12/2018] [Accepted: 03/27/2019] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND: DNA methylation plays an important role in multiple biological processes that are closely related to human health. The study of DNA methylation can provide insight into the mechanisms behind human health and can also have a positive effect on the assessment of human health status. However, the available sequencing technology is limited by incomplete CpG coverage. Therefore, it is crucial to discover an efficient and convenient method capable of distinguishing between the states of CpG sites. Previous studies focused on identifying methylation states of the CpG sites in single cells, and evaluated only sequence information or structural information. RESULTS: In this paper, we propose a novel model, LightCpG, which combines the positional features with the sequence and structural features to provide information on the CpG sites at two stages. Next, we used the LightGBM model for training the CpG site identification, and further utilized sample extraction and merged features to reduce the training time. Our results indicate that our method achieves outstanding performance in recognition of DNA methylation. The average AUC values of our method using the 25 human hepatocellular carcinoma (HCC) cell datasets and six human hepatoblastoma-derived (HepG2) cell datasets were 0.9616 and 0.9213, respectively. Moreover, the average training times for our method on the HCC and HepG2 datasets were 8.3 and 5.06 s, respectively. Furthermore, the computational complexity of our model was much lower compared with other available methods that detect methylation states of the CpG sites. CONCLUSIONS: In summary, LightCpG is an accurate model for identifying the DNA methylation status of CpG sites in single cells. Furthermore, three types of feature extraction methods and two strategies used in LightCpG are helpful for other prediction problems.
Affiliation(s)
- Limin Jiang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Chongqing Wang
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
| | - Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
| | - Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
|