1
Kucukakcali Z, Akbulut S, Colak C. Prediction of genomic biomarkers for endometriosis using the transcriptomic dataset. World J Clin Cases 2025; 13:104556. [DOI: 10.12998/wjcc.v13.i20.104556] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2024] [Revised: 03/03/2025] [Accepted: 03/13/2025] [Indexed: 04/09/2025] Open
Abstract
BACKGROUND Endometriosis is a clinical condition characterized by the presence of endometrial glands outside the uterine cavity. While its incidence remains mostly uncertain, endometriosis impacts around 180 million women worldwide. Despite the presentation of several epidemiological and clinical explanations, the precise mechanism underlying the disease remains ambiguous. In recent years, researchers have examined the hereditary dimension of the disease. Genetic research has aimed to discover the gene or genes responsible for the disease through association or linkage studies involving candidate genes or DNA mapping techniques.
AIM To identify genetic biomarkers linked to endometriosis by the application of machine learning (ML) approaches.
METHODS This case-control study used an open-access transcriptomic data set of endometriosis patients and controls. We included data from 22 controls and 16 endometriosis patients. We used AdaBoost, XGBoost, Stochastic Gradient Boosting, and Bagged Classification and Regression Trees (CART) for classification with five-fold cross-validation. We evaluated model performance using accuracy, balanced accuracy, sensitivity, specificity, positive predictive value, negative predictive value and F1 score.
RESULTS Bagged CART gave the best classification metrics. The metrics obtained from this model are 85.7%, 85.7%, 100%, 75%, 75%, 100% and 85.7% for accuracy, balanced accuracy, sensitivity, specificity, positive predictive value, negative predictive value and F1 score, respectively. Based on the variable importance of modeling, we can use the genes CUX2, CLMP, CEP131, EHD4, CDH24, ILRUN, LINC01709, HOTAIR, SLC30A2 and NKG7 and other transcripts with inaccessible gene names as potential biomarkers for endometriosis.
CONCLUSION This study determined possible genomic biomarkers for endometriosis using transcriptomic data from patients with/without endometriosis. The applied ML model successfully classified endometriosis and created a highly accurate diagnostic prediction model. Future genomic studies could explain the underlying pathology of endometriosis, and a non-invasive diagnostic method could replace the invasive ones.
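The seven performance measures named in the abstract above all follow from the four cells of a binary confusion matrix. The sketch below shows that arithmetic. The counts are hypothetical and are not taken from the paper: they were chosen because a 7-sample split with 3 true positives, 0 false negatives, 3 true negatives and 1 false positive reproduces the reported accuracy, sensitivity, specificity, predictive values and F1 score. For these counts the balanced accuracy comes out at 87.5%, so they should be read as an illustration only, not as the study's actual test split.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute the seven metrics listed in the abstract from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on the positive (disease) class
    specificity = tn / (tn + fp)   # recall on the negative (control) class
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }


# Hypothetical counts, assumed for illustration only (not reported in the paper).
metrics = classification_metrics(tp=3, fn=0, tn=3, fp=1)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With so few test samples, each metric can only move in coarse steps, which is worth keeping in mind when comparing models on data sets of this size.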
Affiliation(s)
- Zeynep Kucukakcali
  - Department of Biostatistics and Medical Informatics, Inonu University Faculty of Medicine, Malatya 44280, Türkiye
- Sami Akbulut
  - Department of Biostatistics and Medical Informatics, Inonu University Faculty of Medicine, Malatya 44280, Türkiye
  - Surgery and Liver Transplant Institute, Inonu University Faculty of Medicine, Malatya 44280, Türkiye
- Cemil Colak
  - Department of Biostatistics and Medical Informatics, Inonu University Faculty of Medicine, Malatya 44280, Türkiye
2
Sanches PHG, de Melo NC, Porcari AM, de Carvalho LM. Integrating Molecular Perspectives: Strategies for Comprehensive Multi-Omics Integrative Data Analysis and Machine Learning Applications in Transcriptomics, Proteomics, and Metabolomics. Biology 2024; 13:848. [PMID: 39596803 PMCID: PMC11592251 DOI: 10.3390/biology13110848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2024] [Revised: 07/19/2024] [Accepted: 07/25/2024] [Indexed: 11/29/2024]
Abstract
With the advent of high-throughput technologies, the field of omics has made significant strides in characterizing biological systems at various levels of complexity. Transcriptomics, proteomics, and metabolomics are the three most widely used omics technologies, each providing unique insights into different layers of a biological system. However, analyzing each omics data set separately may not provide a comprehensive understanding of the subject under study. Therefore, integrating multi-omics data has become increasingly important in bioinformatics research. In this article, we review strategies for integrating transcriptomics, proteomics, and metabolomics data, including co-expression analysis, metabolite-gene networks, constraint-based models, pathway enrichment analysis, and interactome analysis. We discuss combined omics integration approaches, correlation-based strategies, and machine learning techniques that utilize one or more types of omics data. By presenting these methods, we aim to provide researchers with a better understanding of how to integrate omics data to gain a more comprehensive view of a biological system, facilitating the identification of complex patterns and interactions that might be missed by single-omics analyses.
Affiliation(s)
- Pedro H. Godoy Sanches
  - MS4Life Laboratory of Mass Spectrometry, Health Sciences Postgraduate Program, São Francisco University, Bragança Paulista 12916-900, SP, Brazil
- Nicolly Clemente de Melo
  - Graduate Program in Biomedicine, São Francisco University, Bragança Paulista 12916-900, SP, Brazil
- Andreia M. Porcari
  - MS4Life Laboratory of Mass Spectrometry, Health Sciences Postgraduate Program, São Francisco University, Bragança Paulista 12916-900, SP, Brazil
- Lucas Miguel de Carvalho
  - Post Graduate Program in Health Sciences, São Francisco University, Bragança Paulista 12916-900, SP, Brazil
3
Ambeskovic A, McCall MN, Woodsmith J, Juhl H, Land H. Exon-Skipping-Based Subtyping of Colorectal Cancers. Gastroenterology 2024:S0016-5085(24)05357-5. [PMID: 39181169 DOI: 10.1053/j.gastro.2024.08.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Revised: 07/24/2024] [Accepted: 08/14/2024] [Indexed: 08/27/2024]
Abstract
BACKGROUND & AIMS The identification of colorectal cancer (CRC) molecular subtypes has prognostic and potentially diagnostic value for patients, yet reliable subtyping remains unavailable in the clinic. The current consensus molecular subtype (CMS) classification in CRCs is based on complex RNA expression patterns quantified at the gene level. The clinical application of these methods, however, is challenging due to high uncertainty of single-sample classification and associated costs. Alternative splicing, which strongly contributes to transcriptome diversity, has rarely been used for tissue type classification. Here, we present an AS-based CRC subtyping framework sensitive to differential exon use that can be adapted for clinical application. METHODS Unsupervised clustering was used to measure the strength of association between different categories of alternative splicing and CMSs. To build a classifier, the ground truth for CMS labels was derived from expression data quantified at the gene level. Feature selection was achieved through bootstrapping and L1-penalized estimation. The resulting feature space was used to construct a subtype prediction framework applicable to single and multiple samples. The performance of the models was evaluated on unseen CRCs from 2 independent sources (Indivumed, n = 129; The Cancer Genome Atlas, n = 99). RESULTS We developed a CRC subtype identifier based on 29 exon-skipping events that accurately classifies unseen tumors and enables more precise differentiation of subtypes characterized by distinct biological and prognostic features as compared to classifiers based on gene expression. CONCLUSIONS Here, we demonstrate that a small number of exon-skipping events can reliably classify CRC subtypes using individual patient specimens in a manner suitable to clinical application.
Affiliation(s)
- Aslihan Ambeskovic
  - Department of Biomedical Genetics, University of Rochester Medical Center, Rochester, New York
- Matthew N McCall
  - Department of Biomedical Genetics, University of Rochester Medical Center, Rochester, New York
  - Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, New York
  - Wilmot Cancer Institute, University of Rochester Medical Center, Rochester, New York
- Hartmut Land
  - Department of Biomedical Genetics, University of Rochester Medical Center, Rochester, New York
  - Wilmot Cancer Institute, University of Rochester Medical Center, Rochester, New York
4
Feng S, Wang Z, Jin Y, Xu S. TabDEG: Classifying differentially expressed genes from RNA-seq data based on feature extraction and deep learning framework. PLoS One 2024; 19:e0305857. [PMID: 39037985 PMCID: PMC11262683 DOI: 10.1371/journal.pone.0305857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Accepted: 06/05/2024] [Indexed: 07/24/2024] Open
Abstract
Traditional models for identifying differentially expressed genes (DEGs) have limitations on small-sample datasets because they require distribution assumptions to be met; otherwise, sample variation results in high false positive/negative rates. In contrast, tabular data models based on deep learning (DL) frameworks do not need to consider the data distribution types and sample variation. However, applying DL to RNA-Seq data is still a challenge due to the lack of proper labeling and the small sample size compared to the number of genes. Data augmentation (DA) extracts data features using different methods and procedures, which can significantly increase complementary pseudo-values from limited data without significant additional cost. Based on this, we combine DA with a DL framework-based tabular data model and propose a model, TabDEG, to predict DEGs and their up-regulation/down-regulation directions from gene expression data obtained from The Cancer Genome Atlas database. Compared to five counterpart methods, TabDEG has high sensitivity and low misclassification rates. Experiments show that TabDEG is robust and effective in enhancing data features to facilitate classification of high-dimensional, small-sample datasets, and validate that TabDEG-predicted DEGs map to important gene ontology terms and pathways associated with cancer.
Affiliation(s)
- Sifan Feng
  - School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou, Guangdong, China
- Zhenyou Wang
  - School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou, Guangdong, China
- Yinghua Jin
  - School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou, Guangdong, China
- Shengbin Xu
  - School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou, Guangdong, China
5
Ilangovan H, Kothiyal P, Hoadley KA, Elgart R, Eley G, Eslami P. Harmonizing heterogeneous transcriptomics datasets for machine learning-based analysis to identify spaceflown murine liver-specific changes. NPJ Microgravity 2024; 10:61. [PMID: 38862523 PMCID: PMC11167036 DOI: 10.1038/s41526-024-00379-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2023] [Accepted: 03/08/2024] [Indexed: 06/13/2024] Open
Abstract
NASA has employed high-throughput molecular assays to identify sub-cellular changes impacting human physiology during spaceflight. Machine learning (ML) methods hold the promise to improve our ability to identify important signals within highly dimensional molecular data. However, the inherent limitation of study subject numbers within a spaceflight mission minimizes the utility of ML approaches. To overcome the sample power limitations, data from multiple spaceflight missions must be aggregated while appropriately addressing intra- and inter-study variabilities. Here we describe an approach to log transform, scale and normalize data from six heterogeneous, mouse liver-derived transcriptomics datasets (ntotal = 137) which enabled ML-methods to classify spaceflown vs. ground control animals (AUC ≥ 0.87) while mitigating the variability from mission-of-origin. Concordance was found between liver-specific biological processes identified from harmonized ML-based analysis and study-by-study classical omics analysis. This work demonstrates the feasibility of applying ML methods on integrated, heterogeneous datasets of small sample size.
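The abstract above names three harmonization steps (log transform, scale, normalize) without giving their exact form. As a rough sketch of the general idea, the code below log-transforms raw counts and then z-scores each gene within its study of origin, so that study-level offsets are removed before the data are pooled. The per-study z-score, the `log2(x + 1)` transform and the function name `harmonize` are assumptions made for illustration; they are not the authors' published pipeline.

```python
import math
from statistics import mean, pstdev


def harmonize(studies):
    """Log-transform counts, then z-score each gene within each study.

    studies maps a study id to a list of samples; each sample is a list of
    raw expression values with genes in the same order across all studies.
    """
    harmonized = {}
    for study_id, samples in studies.items():
        logged = [[math.log2(x + 1) for x in sample] for sample in samples]
        n_genes = len(logged[0])
        # Per-gene mean and standard deviation computed inside this study only.
        means = [mean([s[g] for s in logged]) for g in range(n_genes)]
        # A constant gene has sd 0; fall back to 1.0 to avoid division by zero.
        sds = [pstdev([s[g] for s in logged]) or 1.0 for g in range(n_genes)]
        harmonized[study_id] = [
            [(s[g] - means[g]) / sds[g] for g in range(n_genes)] for s in logged
        ]
    return harmonized


# Toy data: two hypothetical missions measured on very different scales.
toy = {
    "mission_A": [[0, 7], [3, 15]],
    "mission_B": [[100, 1000], [400, 3000]],
}
pooled = harmonize(toy)
print(pooled["mission_A"])
print(pooled["mission_B"])
```

After this step every gene has mean 0 within each study, so a downstream classifier cannot separate samples simply by mission-level scale differences.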
Affiliation(s)
- Hari Ilangovan
  - Science Applications International Corporation (SAIC), Reston, VA, 20190, USA
- Katherine A Hoadley
  - Department of Genetics, Computational Medicine Program, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Greg Eley
  - Scimentis LLC, Statham, GA, 30666, USA
- Parastou Eslami
  - Universal Artificial Intelligence Inc, Boston, MA, 02130, USA
6
Vural-Ozdeniz M, Calisir K, Acar R, Yavuz A, Ozgur MM, Dalgıc E, Konu O. CAP-RNAseq: an integrated pipeline for functional annotation and prioritization of co-expression clusters. Brief Bioinform 2024; 25:bbad536. [PMID: 38279653 PMCID: PMC10818169 DOI: 10.1093/bib/bbad536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 12/04/2023] [Accepted: 12/21/2023] [Indexed: 01/28/2024] Open
Abstract
Cluster analysis is one of the most widely used exploratory methods for visualization and grouping of gene expression patterns across multiple samples or treatment groups. Although several existing online tools can annotate clusters with functional terms, there is no all-in-one webserver to effectively prioritize genes/clusters using gene essentiality as well as congruency of mRNA-protein expression. Hence, we developed CAP-RNAseq that makes possible (1) upload and clustering of bulk RNA-seq data followed by identification, annotation and network visualization of all or selected clusters; and (2) prioritization using DepMap gene essentiality and/or dependency scores as well as the degree of correlation between mRNA and protein levels of genes within an expression cluster. In addition, CAP-RNAseq has an integrated primer design tool for the prioritized genes. Herein, we showed using comparisons with the existing tools and multiple case studies that CAP-RNAseq can uniquely aid in the discovery of co-expression clusters enriched with essential genes and prioritization of novel biomarker genes that exhibit high correlations between their mRNA and protein expression levels. CAP-RNAseq is applicable to RNA-seq data from different contexts including cancer and available at http://konulabapps.bilkent.edu.tr:3838/CAPRNAseq/ and the docker image is downloadable from https://hub.docker.com/r/konulab/caprnaseq.
Affiliation(s)
- Kubra Calisir
  - Department of Molecular Biology and Genetics, Bilkent University, Ankara, Türkiye
- Rana Acar
  - Department of Molecular Biology and Genetics, Bilkent University, Ankara, Türkiye
- Aysenur Yavuz
  - Department of Molecular Biology and Genetics, Bilkent University, Ankara, Türkiye
- Mustafa M Ozgur
  - Department of Molecular Biology and Genetics, Bilkent University, Ankara, Türkiye
- Ertugrul Dalgıc
  - Department of Medical Biology, School of Medicine, Zonguldak Bülent Ecevit University, Zonguldak, Türkiye
- Ozlen Konu
  - Department of Neuroscience, Bilkent University, Ankara, Türkiye
  - Department of Molecular Biology and Genetics, Bilkent University, Ankara, Türkiye
7
Cascianelli S, Galzerano A, Masseroli M. Supervised Relevance-Redundancy assessments for feature selection in omics-based classification scenarios. J Biomed Inform 2023; 144:104457. [PMID: 37488024 DOI: 10.1016/j.jbi.2023.104457] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/28/2023] [Revised: 06/05/2023] [Accepted: 07/19/2023] [Indexed: 07/26/2023]
Abstract
BACKGROUND AND OBJECTIVE Many classification tasks in translational bioinformatics and genomics are characterized by the high dimensionality of potential features and unbalanced sample distribution among classes. This can affect classifier robustness and increase the risk of overfitting, curse of dimensionality and generalization leaks; furthermore and most importantly, this can prevent obtaining adequate patient stratification required for precision medicine in facing complex diseases, like cancer. Setting up a feature selection strategy able to extract only proper predictive features by removing irrelevant, redundant, and noisy ones is crucial to achieving valuable results on the desired task. METHODS We propose a new feature selection approach, called ReRa, based on supervised Relevance-Redundancy assessments. ReRa consists of a customized step of relevance-based filtering, to identify a reduced subset of meaningful features, followed by a supervised similarity-based procedure to minimize redundancy. This latter step innovatively uses a combination of global and class-specific similarity assessments to remove redundant features while preserving those differentiated across classes, even when these classes are strongly unbalanced. RESULTS We compared ReRa with several existing feature selection methods to obtain feature spaces on which performing breast cancer patient subtyping using several classifiers: we considered two use cases based on gene or transcript isoform expression. In the vast majority of the assessed scenarios, when using ReRa-selected feature spaces, the performances were significantly increased compared to simple feature filtering, LASSO regularization, or even MRmr - another Relevance-Redundancy method. 
The two use cases represent an insightful example of translational application, taking advantage of ReRa capabilities to investigate and enhance a clinically-relevant patient stratification task, which could be easily applied also to other cancer types and diseases. CONCLUSIONS ReRa approach has the potential to improve the performance of machine learning models used in an unbalanced classification scenario. Compared to another Relevance-Redundancy approach like MRmr, ReRa does not require tuning the number of preserved features, ensures efficiency and scalability over huge initial dimensionalities and allows re-evaluation of all previously selected features at each iteration of the redundancy assessment, to ultimately preserve only the most relevant and class-differentiated features.
Affiliation(s)
- Silvia Cascianelli
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci, 32, Milano, 20133, Italy
- Arianna Galzerano
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci, 32, Milano, 20133, Italy
- Marco Masseroli
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci, 32, Milano, 20133, Italy
8
Shakola F, Palejev D, Ivanov I. A Framework for Comparison and Assessment of Synthetic RNA-Seq Data. Genes (Basel) 2022; 13:2362. [PMID: 36553629 PMCID: PMC9778097 DOI: 10.3390/genes13122362] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/09/2022] [Revised: 12/05/2022] [Accepted: 12/06/2022] [Indexed: 12/16/2022] Open
Abstract
The ever-growing number of methods for the generation of synthetic bulk and single cell RNA-seq data have multiple and diverse applications. They are often aimed at benchmarking bioinformatics algorithms for purposes such as sample classification, differential expression analysis, correlation and network studies and the optimization of data integration and normalization techniques. Here, we propose a general framework to compare synthetically generated RNA-seq data and select a data-generating tool that is suitable for a set of specific study goals. As there are multiple methods for synthetic RNA-seq data generation, researchers can use the proposed framework to make an informed choice of an RNA-seq data simulation algorithm and software that are best suited for their specific scientific questions of interest.
Affiliation(s)
- Felitsiya Shakola
  - GATE Institute, Sofia University, 125 Tsarigradsko Shosse, Bl. 2, 1113 Sofia, Bulgaria
- Dean Palejev
  - Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Acad. G. Bonchev St., Bl. 8, 1113 Sofia, Bulgaria
- Ivan Ivanov
  - Department of Veterinary Physiology and Pharmacology, Texas A&M University, College Station, TX 77843, USA
9
Abdelwahab O, Awad N, Elserafy M, Badr E. A feature selection-based framework to identify biomarkers for cancer diagnosis: A focus on lung adenocarcinoma. PLoS One 2022; 17:e0269126. [PMID: 36067196 PMCID: PMC9447897 DOI: 10.1371/journal.pone.0269126] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/14/2021] [Accepted: 05/15/2022] [Indexed: 12/23/2022] Open
Abstract
Lung cancer (LC) represents most of the cancer incidences in the world. There are many types of LC, but Lung Adenocarcinoma (LUAD) is the most common type. Although RNA-seq and microarray data provide a vast amount of gene expression data, most of the genes are insignificant to clinical diagnosis. Feature selection (FS) techniques overcome the high dimensionality and sparsity issues of the large-scale data. We propose a framework that applies an ensemble of feature selection techniques to identify genes highly correlated to LUAD. Utilizing LUAD RNA-seq data from the Cancer Genome Atlas (TCGA), we employed mutual information (MI) and recursive feature elimination (RFE) feature selection techniques along with support vector machine (SVM) classification model. We have also utilized Random Forest (RF) as an embedded FS technique. The results were integrated and candidate biomarker genes across all techniques were identified. The proposed framework has identified 12 potential biomarkers that are highly correlated with different LC types, especially LUAD. A predictive model has been trained utilizing the identified biomarker expression profiling and performance of 97.99% was achieved. In addition, upon performing differential gene expression analysis, we could find that all 12 genes were significantly differentially expressed between normal and LUAD tissues, and strongly correlated with LUAD according to previous reports. We here propose that using multiple feature selection methods effectively reduces the number of identified biomarkers and directly affects their biological relevance.
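The abstract above says that results from several feature selection techniques were integrated to find candidate genes shared across all of them, but it does not state the integration rule. One simple reading, sketched below, is to keep the genes that appear in the top-k list of every method. The intersection rule, the value of `top_k`, the function name `consensus_features` and the gene labels `G1`..`G5` are all hypothetical and chosen only for illustration.

```python
def consensus_features(rankings, top_k):
    """Return the features found in the top_k of every ranking, sorted by name.

    rankings maps a method name to a list of features ordered from most to
    least important according to that method.
    """
    top_sets = [set(ranked[:top_k]) for ranked in rankings.values()]
    return sorted(set.intersection(*top_sets))


# Hypothetical rankings from three selection methods (placeholder gene labels).
rankings = {
    "mutual_information": ["G1", "G2", "G3", "G4", "G5"],
    "svm_rfe": ["G2", "G1", "G5", "G3", "G4"],
    "random_forest": ["G3", "G2", "G1", "G4", "G5"],
}
print(consensus_features(rankings, top_k=3))  # ['G1', 'G2']
```

Requiring agreement across methods shrinks the candidate list, which matches the abstract's point that combining selection methods reduces the number of identified biomarkers.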
Affiliation(s)
- Omar Abdelwahab
  - University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt
- Nourelislam Awad
  - University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt
  - Center of Informatics Science, Nile University, Giza, Egypt
- Menattallah Elserafy
  - University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt
  - Center for Genomics, Helmy Institute for Medical Sciences, Zewail City of Science and Technology, Giza, Egypt
- Eman Badr
  - University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt
  - Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
10
Wang S, Li M, Ng SB. Research on Infant Health Diagnosis and Intelligence Development Based on Machine Learning and Health Information Statistics. Front Public Health 2022; 10:846598. [PMID: 35719653 PMCID: PMC9201248 DOI: 10.3389/fpubh.2022.846598] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2021] [Accepted: 02/22/2022] [Indexed: 11/18/2022] Open
Abstract
Intelligent health diagnosis for young children aims to maintain and promote their healthy development and to provide a better future for their physical and mental health. The biological basis of intelligence is the structure and function of the human brain, and the key to improving the intelligence level of infants is to improve the quality of brain development, especially early brain development. Based on machine learning and health information statistics, this paper studies infant health diagnosis and the development of intelligence and of physical and mental health. The sample data are pre-processed, and a filtering method based on machine learning and health information statistics is used for feature screening. Compared with traditional statistical methods, machine learning and health information statistical methods can better extract the hidden information in big data on children's physical and mental health development, and have better learning and generalization ability. Machine learning theory is used to analyze and mine infant health diagnosis and intelligence development, establish a health state model, and show people the status of their infant's physical and mental health development intuitively by means of data. Moreover, the accumulation of these big data is very important in the field of medical and health research driven by big data.
Affiliation(s)
- Siyu Wang
  - Teachers College, Chengdu University, Chengdu, China
- Min Li
  - Teachers College, Chengdu University, Chengdu, China
11
Pramana S, Hardiyanta IKY, Hidayat FY, Mariyah S. A comparative assessment on gene expression classification methods of RNA-seq data generated using next-generation sequencing (NGS). NARRA J 2022; 2:e60. [PMID: 38450388 PMCID: PMC10914053 DOI: 10.52225/narra.v2i1.60] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2021] [Accepted: 03/22/2022] [Indexed: 03/08/2024]
Abstract
Next-generation sequencing, or massively parallel sequencing, has revolutionized genomic research. RNA sequencing (RNA-Seq) can profile gene expression for molecular diagnosis and disease classification, and can provide potential disease markers. Several methods proposed for classifying gene expression are based on microarray data, which are measured on a continuous scale, or require a normal distribution assumption. As RNA-Seq data do not meet those requirements, these methods cannot be applied directly. In this study, we compare several classifiers, including Logistic Regression, Support Vector Machine, Classification and Regression Trees and Random Forest. A simulation study with different parameters, such as overdispersion and differential expression rate, is conducted and the results are compared with two mRNA experimental datasets. To measure predictive accuracy, six performance indicators are used: Percentage Correctly Classified, Area Under the Receiver Operating Characteristic (ROC) Curve, Kolmogorov-Smirnov Statistic, Partial Gini Index, H-measure and Brier Score. The results show that Random Forest outperforms the other classification algorithms.
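Two of the six indicators listed in the abstract above, the Brier score and the area under the ROC curve, are easy to compute directly from predicted probabilities. The sketch below uses their standard definitions: the Brier score is the mean squared difference between predicted probability and the 0/1 label, and the AUC is the fraction of positive/negative pairs in which the positive sample receives the higher score (ties count as half). The labels and probabilities are made up for illustration and do not come from the study.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # A positive scoring above a negative is a win; a tie counts as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for four samples.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
print(brier_score(y_true, y_prob))
print(roc_auc(y_true, y_prob))  # 0.75
```

The pairwise AUC is quadratic in the number of samples, which is fine for the small sample sizes typical of expression studies; a rank-based formula would be preferable for large data sets.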
Affiliation(s)
- Siti Mariyah
  - Politeknik Statistika STIS, Jakarta, Indonesia
  - School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
12
Kim J, Yoon Y, Park HJ, Kim YH. Comparative Study of Classification Algorithms for Various DNA Microarray Data. Genes (Basel) 2022; 13:494. [PMID: 35328048 PMCID: PMC8951024 DOI: 10.3390/genes13030494] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/15/2022] [Accepted: 03/07/2022] [Indexed: 12/19/2022] Open
Abstract
Microarrays are applications of electrical engineering and technology in biology that allow simultaneous measurement of the expression of numerous genes, and they can be used to analyze specific diseases. This study undertakes classification analyses of various microarrays to compare the performance of classification algorithms over different data traits. The datasets were classified into test and control groups using five machine learning methods, MultiLayer Perceptron (MLP), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and k-Nearest Neighbors (KNN), and the resulting accuracies were compared. k-fold cross-validation was used to evaluate performance, and the results were analyzed by comparing the five methods. The experiments showed that the two tree-based methods, DT and RF, followed similar trends, and that the remaining three methods, MLP, SVM, and KNN, followed similar trends. DT and RF generally performed worse than the other methods, except on one dataset. This suggests that, for the effective classification of microarray data, selecting a classification algorithm suited to the data traits is crucial to ensure optimum performance.
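The k-fold cross-validation mentioned in the abstract above partitions the samples into k folds and uses each fold once as the held-out set. The sketch below generates the index splits only; it deliberately does no shuffling or class stratification, both of which are usually advisable in practice. The function name `k_fold_indices` and the sample count are illustrative assumptions, not details from the paper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is held out exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


# Ten hypothetical samples split into three folds.
folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A classifier would be fitted on each `train_idx` subset and scored on the matching `test_idx` subset, with the k scores averaged to give the reported accuracy.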
Affiliation(s)
- Jingeun Kim
  - Department of IT Convergence Engineering, Gachon University, Seongnam-daero 1342, Seongnam-si 13120, Korea
- Yourim Yoon
  - Department of Computer Engineering, College of Information Technology, Gachon University, Seongnam-daero 1342, Sujeong-gu, Seongnam-si 13120, Korea
- Hye-Jin Park
  - Department of Food Science and Biotechnology, College of BioNano Technology, Gachon University, Seongnam-daero 1342, Sujeong-gu, Seongnam-si 13120, Korea
- Yong-Hyuk Kim
  - School of Software, Kwangwoon University, 20 Kwangwoon-ro, Nowon-gu, Seoul 01897, Korea
  - Department of Cell and Regenerative Biology, School of Medicine and Public Health, University of Wisconsin-Madison, 1111 Highland Ave, Madison, WI 53705, USA
13
Using machine learning to detect the differential usage of novel gene isoforms. BMC Bioinformatics 2022; 23:45. [PMID: 35042461 PMCID: PMC8764765 DOI: 10.1186/s12859-022-04576-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2021] [Accepted: 01/10/2022] [Indexed: 11/24/2022] Open
Abstract
Background Differential isoform usage is an important driver of inter-individual phenotypic diversity and is linked to various diseases and traits. However, accurately detecting the differential usage of different gene transcripts between groups can be difficult, in particular in less well annotated genomes where the spectrum of transcript isoforms is largely unknown. Results We investigated whether machine learning approaches can detect differential isoform usage based purely on the distribution of reads across a gene region. We illustrate that gradient boosting and elastic net approaches can successfully identify large numbers of genes showing potential differential isoform usage between Europeans and Africans, that are enriched among relevant biological pathways and significantly overlap those identified by previous approaches. We demonstrate that diversity at the 3′ and 5′ ends of genes are primary drivers of these differences between populations. Conclusion Machine learning methods can effectively detect differential isoform usage from read fraction data, and can provide novel insights into the biological differences between groups. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04576-3.
14
Kakati T, Bhattacharyya DK, Kalita JK, Norden-Krichmar TM. DEGnext: classification of differentially expressed genes from RNA-seq data using a convolutional neural network with transfer learning. BMC Bioinformatics 2022; 23:17. [PMID: 34991439 PMCID: PMC8734099 DOI: 10.1186/s12859-021-04527-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/15/2020] [Accepted: 12/13/2021] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND A limitation of traditional differential expression analysis on small datasets involves the possibility of false positives and false negatives due to sample variation. Considering the recent advances in deep learning (DL) based models, we wanted to expand the state-of-the-art in disease biomarker prediction from RNA-seq data using DL. However, application of DL to RNA-seq data is challenging due to absence of appropriate labels and smaller sample size as compared to number of genes. Deep learning coupled with transfer learning can improve prediction performance on novel data by incorporating patterns learned from other related data. With the emergence of new disease datasets, biomarker prediction would be facilitated by having a generalized model that can transfer the knowledge of trained feature maps to the new dataset. To the best of our knowledge, there is no Convolutional Neural Network (CNN)-based model coupled with transfer learning to predict the significant upregulating (UR) and downregulating (DR) genes from both trained and untrained datasets. RESULTS We implemented a CNN model, DEGnext, to predict UR and DR genes from gene expression data obtained from The Cancer Genome Atlas database. DEGnext uses biologically validated data along with logarithmic fold change values to classify differentially expressed genes (DEGs) as UR and DR genes. We applied transfer learning to our model to leverage the knowledge of trained feature maps to untrained cancer datasets. DEGnext's results were competitive (ROC scores between 88 and 99%) with those of five traditional machine learning methods: Decision Tree, K-Nearest Neighbors, Random Forest, Support Vector Machine, and XGBoost. DEGnext was robust and effective in terms of transferring learned feature maps to facilitate classification of unseen datasets. Additionally, we validated that the predicted DEGs from DEGnext were mapped to significant Gene Ontology terms and pathways related to cancer. CONCLUSIONS DEGnext can classify DEGs into UR and DR genes from RNA-seq cancer datasets with high performance. This type of analysis, using biologically relevant fine-tuning data, may aid in the exploration of potential biomarkers and can be adapted for other disease datasets.
Affiliation(s)
- Tulika Kakati
- Department of Epidemiology and Biostatistics, University of California, Irvine, Irvine, CA, USA; Department of Computer Science, Tezpur University, Assam, India
- Jugal K Kalita
- Department of Computer Science, University of Colorado, Colorado Springs, Colorado Springs, CO, USA
- Trina M Norden-Krichmar
- Department of Epidemiology and Biostatistics, University of California, Irvine, Irvine, CA, USA.
15
Eshun RB, Kamrul Islam AKM, Bikdash MU. Identification of Significantly Expressed Gene Mutations for Automated Classification of Benign and Malignant Prostate Cancer. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2021; 2021:2437-2443. [PMID: 34891773 DOI: 10.1109/embc46164.2021.9630460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Among males, prostate cancer (PCa) is the cancer type with the highest prevalence and the second leading cause of cancer deaths. Current screening methods for prostate cancer, such as prostate-specific antigen (PSA) testing and digital rectal exam (DRE), lack effectiveness. Machine learning models have been used to predict PCa progression, Gleason score, and laterality. In this research paper, we have employed novel machine learning techniques such as a Bayesian approach, support vector machines (SVM), decision trees, logistic regression, K-nearest neighbors, random forest and AdaBoost to distinguish malignant prostate cancers from benign ones. Moreover, different feature extraction strategies are proposed to improve the detection performance and identify potential genomic biomarkers. The results show that the Lasso feature set yielded high performance across the models, with SVM achieving a classification accuracy of 97%. The Lasso and SVM combination reported many significant biomarker genes and gene mutations including but not restricted to CA2320112, CA2328529, and CA2436168.
16
Gupta R, Kleinjans J, Caiment F. Identifying novel transcript biomarkers for hepatocellular carcinoma (HCC) using RNA-Seq datasets and machine learning. BMC Cancer 2021; 21:962. [PMID: 34445986 PMCID: PMC8394105 DOI: 10.1186/s12885-021-08704-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 12/16/2020] [Accepted: 08/09/2021] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND Hepatocellular carcinoma (HCC) is one of the leading causes of cancer death in the world owing to limitations in its prognosis. The current prognosis approaches include radiological examination and detection of serum biomarkers; however, both have limited efficiency and are ineffective in early prognosis. Due to such limitations, we propose to use RNA-Seq data for evaluating putative higher accuracy biomarkers at the transcript level that could help in early prognosis. METHODS To identify such potential transcript biomarkers, RNA-Seq data for healthy liver and various HCC cell models were subjected to five different machine learning algorithms: random forest, K-nearest neighbor, Naïve Bayes, support vector machine, and neural networks. Various metrics, namely sensitivity, specificity, MCC, informedness, and AUC-ROC (except for support vector machine) were evaluated. The algorithms that produced the highest values for all metrics were chosen to extract the top features that were subjected to recursive feature elimination. Through recursive feature elimination, the smallest number of features needed to differentiate between the healthy and HCC cell models was obtained. RESULTS From the metrics used, it is demonstrated that the efficiency of the known protein biomarkers for HCC is comparatively lower than that of complete transcriptomics data. Among the different machine learning algorithms, random forest and support vector machine demonstrated the best performance. Using recursive feature elimination on the top features of random forest and support vector machine, three transcripts were selected that had an accuracy of 0.97 and kappa of 0.93. Of the three transcripts, two were protein coding (PARP2-202 and SPON2-203) and one was a non-coding transcript (CYREN-211). Lastly, we demonstrated that these three selected transcripts outperformed three randomly chosen transcripts (15,000 combinations), and hence were not chance findings and could be interesting candidates for new HCC biomarker development. CONCLUSION Using RNA-Seq data combined with machine learning approaches can aid in finding novel transcript biomarkers. The three biomarkers identified, PARP2-202, SPON2-203, and CYREN-211, presented the highest accuracy among all transcripts in differentiating the healthy and HCC cell models. The machine learning pipeline developed in this study can be used for any RNA-Seq dataset to find novel transcript biomarkers. Code: www.github.com/rajinder4489/ML_biomarkers.
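The evaluation metrics named in this abstract (sensitivity, specificity, MCC, informedness, accuracy and kappa) can all be derived from a 2×2 confusion matrix. The sketch below is illustrative only and is not taken from the authors' pipeline; the function name `binary_metrics` and the example counts are hypothetical, and the code assumes both classes are present in the data.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    Hypothetical helper for illustration; assumes tp + fn > 0 and tn + fp > 0.
    """
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / n
    informedness = sensitivity + specificity - 1  # Youden's J

    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "informedness": informedness,
        "mcc": mcc,
        "kappa": kappa,
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(tp=45, fp=5, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With balanced classes, as in this made-up example, informedness, MCC and kappa coincide; on imbalanced data they generally diverge, which is one reason studies report several of them together.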
Affiliation(s)
- Rajinder Gupta
- Department of Toxicogenomics, School of Oncology and Developmental Biology (GROW), Maastricht University, Maastricht, The Netherlands
- Jos Kleinjans
- Department of Toxicogenomics, School of Oncology and Developmental Biology (GROW), Maastricht University, Maastricht, The Netherlands
- Florian Caiment
- Department of Toxicogenomics, School of Oncology and Developmental Biology (GROW), Maastricht University, Maastricht, The Netherlands.
17
A deep learning approach to identify gene targets of a therapeutic for human splicing disorders. Nat Commun 2021; 12:3332. [PMID: 34099697 PMCID: PMC8185002 DOI: 10.1038/s41467-021-23663-2] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 02/28/2020] [Accepted: 05/07/2021] [Indexed: 01/16/2023] Open
Abstract
Pre-mRNA splicing is a key controller of human gene expression. Disturbances in splicing due to mutation lead to dysregulated protein expression and contribute to a substantial fraction of human disease. Several classes of splicing modulator compounds (SMCs) have been recently identified and establish that pre-mRNA splicing represents a target for therapy. We describe herein the identification of BPN-15477, a SMC that restores correct splicing of ELP1 exon 20. Using transcriptome sequencing from treated fibroblast cells and a machine learning approach, we identify BPN-15477 responsive sequence signatures. We then leverage this model to discover 155 human disease genes harboring ClinVar mutations predicted to alter pre-mRNA splicing as targets for BPN-15477. Splicing assays confirm successful correction of splicing defects caused by mutations in CFTR, LIPA, MLH1 and MAPT. Subsequent validations in two disease-relevant cellular models demonstrate that BPN-15477 increases functional protein, confirming the clinical potential of our predictions.
18
A supervised machine learning-based methodology for analyzing dysregulation in splicing machinery: An application in cancer diagnosis. Artif Intell Med 2020; 108:101950. [PMID: 32972670 DOI: 10.1016/j.artmed.2020.101950] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/15/2019] [Revised: 08/15/2020] [Accepted: 08/18/2020] [Indexed: 02/06/2023]
Abstract
Deregulated splicing machinery components have been shown to be associated with the development of several types of cancer and, therefore, the determination of such alterations can help the development of tumor-specific molecular targets for early prognosis and therapy. Determining such splicing components, however, is not a straightforward task, mainly due to the heterogeneity of tumors, the variability across samples, and the fat-short characteristic of genomic datasets. In this work, a supervised machine learning-based methodology is proposed, allowing the determination of subsets of relevant splicing components that best discriminate samples. The methodology comprises three main phases: first, a ranking of features is determined by applying feature weighting algorithms that compute the importance of each splicing component; second, the best subset of features that allows the induction of an accurate classifier is determined by conducting an effective heuristic search; third, the confidence in the induced classifier is assessed by explaining its individual predictions and its global behavior. Finally, an extensive experimental study was conducted on a large collection of transcript-based datasets, illustrating the utility and benefit of the proposed methodology for analyzing dysregulation in splicing machinery.
19
Akter S, Xu D, Nagel SC, Bromfield JJ, Pelch KE, Wilshire GB, Joshi T. GenomeForest: An Ensemble Machine Learning Classifier for Endometriosis. AMIA Joint Summits on Translational Science Proceedings 2020; 2020:33-42. [PMID: 32477621 PMCID: PMC7233069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Endometriosis is a complex and high-impact disease affecting 176 million women worldwide, with a diagnostic latency of 4 to 11 years due to the lack of a definitive clinical symptom or a minimally invasive diagnostic method. In this study, we developed a new ensemble machine learning classifier based on chromosomal partitioning, named GenomeForest, applied it to classifying endometriosis vs. control patients using 38 RNA-seq and 80 enrichment-based DNA-methylation (MBD-seq) datasets, and assessed its performance in six different experiments. The ensemble machine learning models provided an avenue for identifying several candidate biomarker genes with a very high F1 score: a near-perfect F1 score (0.968) for the transcriptomics dataset and a very high F1 score (0.918) for the methylomics dataset. We hope that in the future a less invasive biopsy can be used to diagnose endometriosis using the findings from such ensemble machine learning classifiers, as demonstrated in this study.
Affiliation(s)
- Dong Xu
- Informatics Institute
- Electrical Engineering and Computer Science
- Christopher S. Bond Life Sciences Center
- Susan C Nagel
- OB/GYN and Women's Health, University of Missouri, Columbia, MO
- John J Bromfield
- OB/GYN and Women's Health, University of Missouri, Columbia, MO
- Trupti Joshi
- Informatics Institute
- Christopher S. Bond Life Sciences Center
- Health Management and Informatics, University of Missouri, Columbia, MO
20
Pathway-guided analysis identifies Myc-dependent alternative pre-mRNA splicing in aggressive prostate cancers. Proc Natl Acad Sci U S A 2020; 117:5269-5279. [PMID: 32086391 PMCID: PMC7071906 DOI: 10.1073/pnas.1915975117] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Indexed: 12/11/2022] Open
Abstract
We sought to define the landscape of alternative pre-mRNA splicing in prostate cancers and the relationship of exon choice to known cancer driver alterations. To do so, we compiled a metadataset composed of 876 RNA-sequencing (RNA-Seq) samples from five publicly available sources representing a range of prostate phenotypes from normal tissue to drug-resistant metastases. We subjected these samples to exon-level analysis with rMATS-turbo, purpose-built software designed for large-scale analyses of splicing, and identified 13,149 high-confidence cassette exon events with variable incorporation across samples. We then developed a computational framework, pathway enrichment-guided activity study of alternative splicing (PEGASAS), to correlate transcriptional signatures of 50 different cancer driver pathways with these alternative splicing events. We discovered that Myc signaling was correlated with incorporation of a set of 1,039 cassette exons enriched in genes encoding RNA binding proteins. Using a human prostate epithelial transformation assay, we confirmed the Myc regulation of 147 of these exons, many of which introduced frameshifts or encoded premature stop codons. Our results connect changes in alternative pre-mRNA splicing to oncogenic alterations common in prostate and many other cancers. We also establish a role for Myc in regulating RNA splicing by controlling the incorporation of nonsense-mediated decay-determinant exons in genes encoding RNA binding proteins.
21
Fiosina J, Fiosins M, Bonn S. Explainable Deep Learning for Augmentation of Small RNA Expression Profiles. J Comput Biol 2020; 27:234-247. [PMID: 31855058 PMCID: PMC7047095 DOI: 10.1089/cmb.2019.0320] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 01/07/2023] Open
Abstract
The lack of well-structured metadata annotations complicates the reusability and interpretation of the growing amount of publicly available RNA expression data. The machine learning-based prediction of metadata (data augmentation) can considerably improve the quality of expression data annotation. In this study, we systematically benchmark deep learning (DL) and random forest (RF)-based metadata augmentation of tissue, age, and sex using small RNA (sRNA) expression profiles. We use 4243 annotated sRNA-Seq samples from the sRNA expression atlas database to train and test the augmentation performance. In general, the DL machine learner outperforms the RF method in almost all tested cases. The average cross-validated prediction accuracy of the DL algorithm for tissues is 96.5%, for sex is 77%, and for age is 77.2%. The average tissue prediction accuracy for a completely new data set is 83.1% (DL) and 80.8% (RF). To understand which sRNAs influence DL predictions, we employ backpropagation-based feature importance scores using the DeepLIFT method, which enable us to obtain information on biological relevance of sRNAs.
Affiliation(s)
- Jelena Fiosina
- Clausthal University of Technology, Institute of Informatics, Clausthal-Zellerfeld, Germany
- Maksims Fiosins
- German Center for Neurodegenerative Diseases, Tübingen, Germany; Institute for Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; Genevention GmbH, Göttingen, Germany. Address correspondence to: Dr. Maksims Fiosins, German Center for Neurodegenerative Diseases, Otfried-Müller Str. 23, 72076 Tübingen, Germany
- Stefan Bonn
- German Center for Neurodegenerative Diseases, Tübingen, Germany; Institute for Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
22
Klén R, Karhunen M, Elo LL. Likelihood contrasts: a machine learning algorithm for binary classification of longitudinal data. Sci Rep 2020; 10:1016. [PMID: 31974488 PMCID: PMC6978422 DOI: 10.1038/s41598-020-57924-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 06/13/2019] [Accepted: 12/31/2019] [Indexed: 12/02/2022] Open
Abstract
Machine learning methods have gained popularity in biomedical research in recent years. However, very few of them support the analysis of longitudinal data, where several samples are collected from an individual over time. Additionally, most of the available longitudinal machine learning methods assume that the measurements are aligned in time, which is often not the case in real data. Here, we introduce a robust longitudinal machine learning method, named likelihood contrasts (LC), which supports study designs with unaligned time points. Our LC method is a binary classifier, which uses linear mixed models for modelling and log-likelihood for decision making. To demonstrate the benefits of our approach, we compared it with existing methods in four simulated and three real data sets. In each simulated data set, LC was the most accurate method, while the real data sets further supported the robust performance of the method. LC is also computationally efficient and easy to use.
Affiliation(s)
- Riku Klén
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland; Turku PET Centre, University of Turku, Turku, Finland
- Markku Karhunen
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland
- Laura L Elo
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland.
23
Akter S, Xu D, Nagel SC, Bromfield JJ, Pelch K, Wilshire GB, Joshi T. Machine Learning Classifiers for Endometriosis Using Transcriptomics and Methylomics Data. Front Genet 2019; 10:766. [PMID: 31552087 PMCID: PMC6737999 DOI: 10.3389/fgene.2019.00766] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 04/18/2019] [Accepted: 07/19/2019] [Indexed: 12/29/2022] Open
Abstract
Endometriosis is a complex, common, yet poorly understood gynecological disorder that affects about 176 million women worldwide, significantly impairs their quality of life, and imposes an economic burden. Neither a definitive clinical symptom nor a minimally invasive diagnostic method is available, thus leading to an average of 4 to 11 years of diagnostic latency. Discovery of relevant biological patterns from microarray expression or next generation sequencing (NGS) data has been advanced over the last several decades by applying various machine learning tools. We performed machine learning analysis using 38 RNA-seq and 80 enrichment-based DNA methylation (MBD-seq) datasets. We examined how well various supervised machine learning methods, such as decision tree, partial least squares discriminant analysis (PLSDA), support vector machine, and random forest, perform in classifying endometriosis from control samples when trained on both transcriptomics and methylomics data. The assessment was done from two different perspectives for improving classification performance: a) the effect of three different normalization techniques and b) the effect of differential analysis using the generalized linear model (GLM). Several candidate biomarker genes were identified by multiple machine learning experiments, including NOTCH3, SNAPC2, B4GALNT1, SMAP2, DDB2, GTF3C5, and PTOV1 from the transcriptomics data analysis and TRPM6, RASSF2, TNIP2, RP3-522J7.6, FGD3, and MFSD14B from the methylomics data analysis. We concluded that an appropriate machine learning diagnostic pipeline for endometriosis should use TMM normalization for transcriptomics data, quantile or voom normalization for methylomics data, and GLM for feature space reduction and classification performance maximization.
Affiliation(s)
- Sadia Akter
- Informatics Institute, University of Missouri, Columbia, MO, United States
- Dong Xu
- Informatics Institute, University of Missouri, Columbia, MO, United States
- Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
- Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Susan C. Nagel
- OB/GYN and Women’s Health, University of Missouri School of Medicine, Columbia, MO, United States
- John J. Bromfield
- OB/GYN and Women’s Health, University of Missouri School of Medicine, Columbia, MO, United States
- Katherine Pelch
- OB/GYN and Women’s Health, University of Missouri School of Medicine, Columbia, MO, United States
- Trupti Joshi
- Informatics Institute, University of Missouri, Columbia, MO, United States
- Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Health Management and Informatics, University of Missouri, Columbia, MO, United States
24
Al-Shaer AE, Flentke GR, Berres ME, Garic A, Smith SM. Exon level machine learning analyses elucidate novel candidate miRNA targets in an avian model of fetal alcohol spectrum disorder. PLoS Comput Biol 2019; 15:e1006937. [PMID: 30973878 PMCID: PMC6478348 DOI: 10.1371/journal.pcbi.1006937] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 10/15/2018] [Revised: 04/23/2019] [Accepted: 03/11/2019] [Indexed: 12/20/2022] Open
Abstract
Gestational alcohol exposure causes fetal alcohol spectrum disorder (FASD) and is a prominent cause of neurodevelopmental disability. Whole transcriptome sequencing (RNA-Seq) offers insights into mechanisms underlying FASD, but gene-level analysis provides limited information regarding complex transcriptional processes such as alternative splicing and non-coding RNAs. Moreover, traditional analytical approaches that use multiple hypothesis testing with a false discovery rate adjustment prioritize genes based on an adjusted p-value, which is not always biologically relevant. We addressed these limitations with a novel approach and implemented an unsupervised machine learning model, which we applied to an exon-level analysis to reduce data complexity to the most likely functionally relevant exons, without loss of novel information. This was performed on an RNA-Seq paired-end dataset derived from alcohol-exposed neural fold-stage chick crania, wherein alcohol causes facial deficits recapitulating those of FASD. A principal component analysis along with k-means clustering was utilized to extract exons that deviated from baseline expression. This identified 6857 differentially expressed exons representing 1251 geneIDs; 391 of these genes were identified in a prior gene-level analysis of this dataset. It also identified exons encoding 23 microRNAs (miRNAs) having significantly differential expression profiles in response to alcohol. We developed an RDAVID pipeline to identify KEGG pathways represented by these exons, and separately identified predicted KEGG pathways targeted by these miRNAs. Several of these (ribosome biogenesis, oxidative phosphorylation) were identified in our prior gene-level analysis. Other pathways are crucial to facial morphogenesis and represent both novel (focal adhesion, FoxO signaling, insulin signaling) and known (Wnt signaling) alcohol targets. Importantly, there was substantial overlap between the exomes themselves and the predicted miRNA targets, suggesting these miRNAs contribute to the gene-level expression changes. Our novel application of unsupervised machine learning in conjunction with statistical analyses facilitated the discovery of signaling pathways and miRNAs that inform mechanisms underlying FASD.
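The clustering idea mentioned in this abstract, separating features that deviate from baseline from those that do not, can be illustrated with a toy one-dimensional k-means (Lloyd's algorithm). This is a minimal sketch under stated assumptions and not the authors' pipeline: it omits the principal component analysis step, the function name `kmeans_1d` is hypothetical, and the input scores are invented stand-ins for per-exon deviation values.

```python
def kmeans_1d(values, k=2, max_iters=100):
    """Toy Lloyd's k-means on scalar values (requires k >= 2 and len(values) >= k).

    Returns (labels, centers). Initial centers are spread evenly over the
    sorted values so that the result is deterministic.
    """
    ordered = sorted(values)
    centers = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(max_iters):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:  # converged: assignments can no longer change
            break
        centers = new_centers
    return labels, centers

# Hypothetical deviation scores: most features sit near zero, two stand out.
scores = [0.1, -0.2, 0.0, 0.15, 5.0, 5.2, -0.1]
labels, centers = kmeans_1d(scores, k=2)
outliers = [s for s, lab in zip(scores, labels) if lab == 1]
print(labels, centers, outliers)
```

In a real analysis the clustering would typically run on multi-dimensional principal component scores rather than raw scalars, and the number of clusters would need to be justified from the data.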
Affiliation(s)
- Abrar E. Al-Shaer
- Nutrition Research Institute, Department of Nutrition, University of North Carolina at Chapel Hill, Kannapolis, North Carolina, United States of America
- George R. Flentke
- Nutrition Research Institute, Department of Nutrition, University of North Carolina at Chapel Hill, Kannapolis, North Carolina, United States of America
- Mark E. Berres
- Department of Nutritional Sciences, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Ana Garic
- Department of Nutritional Sciences, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Susan M. Smith
- Nutrition Research Institute, Department of Nutrition, University of North Carolina at Chapel Hill, Kannapolis, North Carolina, United States of America
25
Park E, Pan Z, Zhang Z, Lin L, Xing Y. The Expanding Landscape of Alternative Splicing Variation in Human Populations. Am J Hum Genet 2018; 102:11-26. [PMID: 29304370 PMCID: PMC5777382 DOI: 10.1016/j.ajhg.2017.11.002] [Citation(s) in RCA: 246] [Impact Index Per Article: 35.1] [Received: 08/30/2017] [Accepted: 11/03/2017] [Indexed: 12/16/2022] Open
Abstract
Alternative splicing is a tightly regulated biological process by which the number of gene products for any given gene can be greatly expanded. Genomic variants in splicing regulatory sequences can disrupt splicing and cause disease. Recent developments in sequencing technologies and computational biology have allowed researchers to investigate alternative splicing at an unprecedented scale and resolution. Population-scale transcriptome studies have revealed many naturally occurring genetic variants that modulate alternative splicing and consequently influence phenotypic variability and disease susceptibility in human populations. Innovations in experimental and computational tools such as massively parallel reporter assays and deep learning have enabled the rapid screening of genomic variants for their causal impacts on splicing. In this review, we describe technological advances that have greatly increased the speed and scale at which discoveries are made about the genetic variation of alternative splicing. We summarize major findings from population transcriptomic studies of alternative splicing and discuss the implications of these findings for human genetics and medicine.
Affiliation(s)
- Eddie Park
- Department of Microbiology, Immunology, & Molecular Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Zhicheng Pan
- Bioinformatics Interdepartmental Graduate Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Zijun Zhang
- Bioinformatics Interdepartmental Graduate Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Lan Lin
- Department of Microbiology, Immunology, & Molecular Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Yi Xing
- Department of Microbiology, Immunology, & Molecular Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA; Bioinformatics Interdepartmental Graduate Program, University of California, Los Angeles, Los Angeles, CA 90095, USA.