1
|
Naveed M, Saad Mughal M, Aziz T, Jabeen K, Ali Khan A, Alhomrani M, Alsanie WF, Alamri AS. The Prominence of the Broad-Spectrum Protease inhibitor gene A2ML1 as a potential biomarker in cervical cancer diagnostics using Immunotherapeutic and Multi-Omics approaches. Int Immunopharmacol 2024; 142:113126. [PMID: 39265356 DOI: 10.1016/j.intimp.2024.113126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2024] [Revised: 08/27/2024] [Accepted: 09/06/2024] [Indexed: 09/14/2024]
Abstract
One of the venereal tumors that threaten human life is cervical cancer. A2ML1 is detected in advanced-stage cancer patients and is found to be strongly associated with cervical cancer. A2ML1 was shown to be substantially expressed in cervical cancer in this study, which used data from the TCGA database. Those with high A2ML1 expression had a lower chance of survival than patients with low A2ML1 expression. Both univariate and multivariate Cox regression analyses were utilized to investigate the relationship between clinical variables and overall survival rates. An investigation into the link between A2ML1 and immune infiltration was subsequently conducted. Utilizing the immune cell database, research was conducted to investigate the dispersion of 24 immune cells and their correlation to A2ML1 expression. In addition to this, the favorable correlation between immune cells and A2ML1 was validated using all three immune cell methodologies. The Genomics of Drug Sensitivity in Cancer database was used to confirm the idea that there is a link between A2ML1 expression and the efficacy of chemotherapy or immunotherapy. The findings demonstrated that A2ML1 is a potential biomarker for cervical cancer diagnostics. This biomarker may be used to chaperone immunotherapy, as well as to explain the elucidates of cervical cancer caused by the immunological microenvironment.
Collapse
Affiliation(s)
- Muhammad Naveed
- Department of Biotechnology, Faculty of Science and Technology, University of Central Punjab, Lahore 54590, Pakistan.
| | - Muhammad Saad Mughal
- Department of Biotechnology, Faculty of Science and Technology, University of Central Punjab, Lahore 54590, Pakistan
| | - Tariq Aziz
- Laboratory of Animal Health Food Hygiene and Quality University of Ioannina Arta Greece.
| | - Khizra Jabeen
- Department of Biotechnology, Faculty of Science and Technology, University of Central Punjab, Lahore 54590, Pakistan
| | - Ayaz Ali Khan
- Department of Biotechnology University of Malakand Chakdara Dir Lower
| | - Majid Alhomrani
- Department of Clinical Laboratory Sciences, The faculty of Applied Medical Sciences, Taif University, Taif, Saudi Arabia
| | - Walaa F Alsanie
- Department of Clinical Laboratory Sciences, The faculty of Applied Medical Sciences, Taif University, Taif, Saudi Arabia
| | - Abdulhakeem S Alamri
- Department of Clinical Laboratory Sciences, The faculty of Applied Medical Sciences, Taif University, Taif, Saudi Arabia
| |
Collapse
|
2
|
Bakir-Gungor B, Temiz M, Inal Y, Cicekyurt E, Yousef M. CCPred: Global and population-specific colorectal cancer prediction and metagenomic biomarker identification at different molecular levels using machine learning techniques. Comput Biol Med 2024; 182:109098. [PMID: 39293338 DOI: 10.1016/j.compbiomed.2024.109098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2024] [Revised: 08/29/2024] [Accepted: 08/31/2024] [Indexed: 09/20/2024]
Abstract
Colorectal cancer (CRC) ranks as the third most common cancer globally and the second leading cause of cancer-related deaths. Recent research highlights the pivotal role of the gut microbiota in CRC development and progression. Understanding the complex interplay between disease development and metagenomic data is essential for CRC diagnosis and treatment. Current computational models employ machine learning to identify metagenomic biomarkers associated with CRC, yet there is a need to improve their accuracy through a holistic biological knowledge perspective. This study aims to evaluate CRC-associated metagenomic data at species, enzymes, and pathway levels via conducting global and population-specific analyses. These analyses utilize relative abundance values from human gut microbiome sequencing data and robust classification models are built for disease prediction and biomarker identification. For global CRC prediction and biomarker identification, the features that are identified by SelectKBest (SKB), Information Gain (IG), and Extreme Gradient Boosting (XGBoost) methods are combined. Population-based analysis includes within-population, leave-one-dataset-out (LODO) and cross-population approaches. Four classification algorithms are employed for CRC classification. Random Forest achieved an AUC of 0.83 for species data, 0.78 for enzyme data and 0.76 for pathway data globally. On the global scale, potential taxonomic biomarkers include ruthenibacterium lactatiformanas; enzyme biomarkers include RNA 2' 3' cyclic 3' phosphodiesterase; and pathway biomarkers include pyruvate fermentation to acetone pathway. This study underscores the potential of machine learning models trained on metagenomic data for improved disease prediction and biomarker discovery. The proposed model and associated files are available at https://github.com/TemizMus/CCPRED.
Collapse
Affiliation(s)
- Burcu Bakir-Gungor
- Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, 38080, Turkey
| | - Mustafa Temiz
- Department of Electrical and Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, 38080, Turkey.
| | - Yasin Inal
- Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, 38080, Turkey
| | - Emre Cicekyurt
- Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, 38080, Turkey
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, 13206, Israel; Galilee Digital Health Research Center (GDH), Zefat Academic College, Israel
| |
Collapse
|
3
|
Voskergian D, Jayousi R, Yousef M. Topic selection for text classification using ensemble topic modeling with grouping, scoring, and modeling approach. Sci Rep 2024; 14:23516. [PMID: 39384798 PMCID: PMC11464685 DOI: 10.1038/s41598-024-74022-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2024] [Accepted: 09/23/2024] [Indexed: 10/11/2024] Open
Abstract
TextNetTopics (Yousef et al. in Front Genet 13:893378, 2022. https://doi.org/10.3389/fgene.2022.893378 ) is a recently developed approach that performs text classification-based topics (a topic is a group of terms or words) extracted from a Latent Dirichlet Allocation topic modeling as features rather than individual words. Following this approach enables TextNetTopics to fulfill dimensionality reduction while preserving and embedding more thematic and semantic information into the text document representations. In this article, we introduced a novel approach, the Ensemble Topic Model for Topic Selection (ENTM-TS), an advancement of TextNetTopics. ENTM-TS integrates multiple topic models using the Grouping, Scoring, and Modeling approach, thereby mitigating the performance variability introduced by employing individual topic modeling methods within TextNetTopics. Additionally, we performed a thorough comparative study to evaluate TextNetTopics' performance using eleven state-of-the-art topic modeling algorithms. We used the extracted topics for each as input to the G component in the TextNetTopics tool to select the most compelling topic model regarding their predictive behavior for text classification. We conducted our comprehensive evaluation utilizing the Drug-Induced Liver Injury textual dataset from the CAMDA community and the WOS-5736 dataset. The experimental results show that the Latent Semantic Indexing provides comparable performance measures with fewer discriminative features when compared with other topic modeling methods. Moreover, our evaluation reveals that the performance of ENTM-TS surpasses or aligns with the optimal outcomes obtained from individual topic models across the two datasets, establishing it as a robust and effective enhancement in text classification tasks.
Collapse
Affiliation(s)
- Daniel Voskergian
- Computer Engineering Department, Al-Quds University, Jerusalem, Palestine.
| | - Rashid Jayousi
- Computer Science Department, Al-Quds University, Jerusalem, Palestine
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel.
| |
Collapse
|
4
|
Qumsiyeh E, Salah Z, Yousef M. miRGediNET: A comprehensive examination of common genes in miRNA-Target interactions and disease associations: Insights from a grouping-scoring-modeling approach. Heliyon 2023; 9:e22666. [PMID: 38090011 PMCID: PMC10711121 DOI: 10.1016/j.heliyon.2023.e22666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2023] [Revised: 11/15/2023] [Accepted: 11/16/2023] [Indexed: 06/15/2024] Open
Abstract
In the broad and complex field of biological data analysis, researchers frequently gather information from a single source or database. Despite being a widespread practice, this has disadvantages. Relying exclusively on a single source can limit our comprehension as it may omit various perspectives that could be obtained by combining multiple knowledge bases. Acknowledging this shortcoming, we report on miRGediNET, a novel approach combining information from three biological databases. Our investigation focuses on microRNAs (miRNAs), small non-coding RNA molecules that regulate gene expression post-transcriptionally. We delve deeply into the knowledge of these miRNA's interactions with genes and the possible effects these interactions may have on different diseases. The scientific community has long recognized a direct correlation between the progression of specific diseases and miRNAs, as well as the genes they target. By using miRGediNET, we go beyond simply acknowledging this relationship. Rather, we actively look for the critical genes that could act as links between the actions of miRNAs and the mechanisms underlying disease. Our methodology, which carefully identifies and investigates these important genes, is supported by a strategic framework that may open up new possibilities for comprehending diseases and creating treatments. We have developed a tool on the Knime platform as a concrete application of our research. This tool serves as both a validation of our study and an invitation to the larger community to interact with, investigate, and build upon our findings. miRGediNET is publicly accessible on GitHub at https://github.com/malikyousef/miRGediNET, providing a collaborative environment for additional research and innovation for enthusiasts and fellow researchers.
Collapse
Affiliation(s)
- Emma Qumsiyeh
- Department of Computer Science and Information Technology, Al-Quds University, Palestine
| | - Zaidoun Salah
- Molecular Genetics and Genetic Toxicology, Arab American University, Ramallah, Palestine
| | - Malik Yousef
- Information Technology Engineering, Al-Quds University, Abu Dis, Palestine
| |
Collapse
|
5
|
Voskergian D, Bakir-Gungor B, Yousef M. TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information. Front Genet 2023; 14:1243874. [PMID: 37867598 PMCID: PMC10585361 DOI: 10.3389/fgene.2023.1243874] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2023] [Accepted: 08/30/2023] [Indexed: 10/24/2023] Open
Abstract
With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles' content with valuable information that can be useful in document classification and categorization. However, shortness, data sparseness, limited word occurrences, and the inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms on these short texts, making their classification a challenging task. This study firstly explores the performance of our earlier study, TextNetTopics on the short text. Secondly, here we propose an advanced version called TextNetTopics Pro, which is a novel short-text classification framework that utilizes a promising combination of lexical features organized in topics of words and topic distribution extracted by a topic model to alleviate the data-sparseness problem when classifying short texts. We evaluate our proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is related to the Biomedical field, and the other one is related to Computer Science publications. Additionally, we comparatively evaluate the predictive performance of the models generated with and without using the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling the imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.
Collapse
Affiliation(s)
- Daniel Voskergian
- Computer Engineering Department, Faculty of Engineering, Al-Quds University, Jerusalem, Palestine
| | - Burcu Bakir-Gungor
- Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Türkiye
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center, Zefat Academic College, Zefat, Israel
| |
Collapse
|
6
|
Ersoz NS, Bakir-Gungor B, Yousef M. GeNetOntology: identifying affected gene ontology terms via grouping, scoring, and modeling of gene expression data utilizing biological knowledge-based machine learning. Front Genet 2023; 14:1139082. [PMID: 37671046 PMCID: PMC10476493 DOI: 10.3389/fgene.2023.1139082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2023] [Accepted: 07/05/2023] [Indexed: 09/07/2023] Open
Abstract
Introduction: Identifying significant sets of genes that are up/downregulated under specific conditions is vital to understand disease development mechanisms at the molecular level. Along this line, in order to analyze transcriptomic data, several computational feature selection (i.e., gene selection) methods have been proposed. On the other hand, uncovering the core functions of the selected genes provides a deep understanding of diseases. In order to address this problem, biological domain knowledge-based feature selection methods have been proposed. Unlike computational gene selection approaches, these domain knowledge-based methods take the underlying biology into account and integrate knowledge from external biological resources. Gene Ontology (GO) is one such biological resource that provides ontology terms for defining the molecular function, cellular component, and biological process of the gene product. Methods: In this study, we developed a tool named GeNetOntology which performs GO-based feature selection for gene expression data analysis. In the proposed approach, the process of Grouping, Scoring, and Modeling (G-S-M) is used to identify significant GO terms. GO information has been used as the grouping information, which has been embedded into a machine learning (ML) algorithm to select informative ontology terms. The genes annotated with the selected ontology terms have been used in the training part to carry out the classification task of the ML model. The output is an important set of ontologies for the two-class classification task applied to gene expression data for a given phenotype. Results: Our approach has been tested on 11 different gene expression datasets, and the results showed that GeNetOntology successfully identified important disease-related ontology terms to be used in the classification model. Discussion: GeNetOntology will assist geneticists and scientists to identify a range of disease-related genes and ontologies in transcriptomic data analysis, and it will also help doctors design diagnosis platforms and improve patient treatment plans.
Collapse
Affiliation(s)
- Nur Sebnem Ersoz
- Department of Bioengineering, Graduate School of Engineering and Science, Abdullah Gul University, Kayseri, Türkiye
| | - Burcu Bakir-Gungor
- Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Türkiye
- Department of Bioengineering, Faculty of Life and Natural Sciences, Abdullah Gul University, Kayseri, Türkiye
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
| |
Collapse
|
7
|
Kuzudisli C, Bakir-Gungor B, Bulut N, Qaqish B, Yousef M. Review of feature selection approaches based on grouping of features. PeerJ 2023; 11:e15666. [PMID: 37483989 PMCID: PMC10358338 DOI: 10.7717/peerj.15666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2022] [Accepted: 06/08/2023] [Indexed: 07/25/2023] Open
Abstract
With the rapid development in technology, large amounts of high-dimensional data have been generated. This high dimensionality including redundancy and irrelevancy poses a great challenge in data analysis and decision making. Feature selection (FS) is an effective way to reduce dimensionality by eliminating redundant and irrelevant data. Most traditional FS approaches score and rank each feature individually; and then perform FS either by eliminating lower ranked features or by retaining highly-ranked features. In this review, we discuss an emerging approach to FS that is based on initially grouping features, then scoring groups of features rather than scoring individual features. Despite the presence of reviews on clustering and FS algorithms, to the best of our knowledge, this is the first review focusing on FS techniques based on grouping. The typical idea behind FS through grouping is to generate groups of similar features with dissimilarity between groups, then select representative features from each cluster. Approaches under supervised, unsupervised, semi supervised and integrative frameworks are explored. The comparison of experimental results indicates the effectiveness of sequential, optimization-based (i.e., fuzzy or evolutionary), hybrid and multi-method approaches. When it comes to biological data, the involvement of external biological sources can improve analysis results. We hope this work's findings can guide effective design of new FS approaches using feature grouping.
Collapse
Affiliation(s)
- Cihan Kuzudisli
- Department of Computer Engineering, Hasan Kalyoncu University, Gaziantep, Turkey
- Department of Electrical and Computer Engineering, Abdullah Gul University, Kayseri, Turkey
| | - Burcu Bakir-Gungor
- Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
| | - Nurten Bulut
- Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
| | - Bahjat Qaqish
- Department of Biostatistics, University of North Carolina at Chapel Hill, North Carolina, Chapel Hill, United States of America
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center, Zefat Academic College, Zefat, Israel
| |
Collapse
|
8
|
Whittaker CA, Kucukural A, Gates C, Wilkins OM, Bell GW, Hutchinson JN, Polson SW, Dragon J. Functional Annotation Routines Used by ABRF Bioinformatics Core Facilities - Observations, Comparisons, and Considerations. J Biomol Tech 2023; 34:3fc1f5fe.0b74b9db. [PMID: 37089874 PMCID: PMC10121236 DOI: 10.7171/3fc1f5fe.0b74b9db] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/30/2023]
Abstract
The functional annotation of gene lists is a common analysis routine required for most genomics experiments, and bioinformatics core facilities must support these analyses. In contrast to methods such as the quantitation of RNA-Seq reads or differential expression analysis, our research group noted a lack of consensus in our preferred approaches to functional annotation. To investigate this observation, we selected 4 experiments that represent a range of experimental designs encountered by our cores and analyzed those data with 6 tools used by members of the Association of Biomolecular Resource Facilities (ABRF) Genomic Bioinformatics Research Group (GBIRG). To facilitate comparisons between tools, we focused on a single biological result for each experiment. These results were represented by a gene set, and we analyzed these gene sets with each tool considered in our study to map the result to the annotation categories presented by each tool. In most cases, each tool produces data that would facilitate identification of the selected biological result for each experiment. For the exceptions, Fisher's exact test parameters could be adjusted to detect the result. Because Fisher's exact test is used by many functional annotation tools, we investigated input parameters and demonstrate that, while background set size is unlikely to have a significant impact on the results, the numbers of differentially expressed genes in an annotation category and the total number of differentially expressed genes under consideration are both critical parameters that may need to be modified during analyses. In addition, we note that differences in the annotation categories tested by each tool, as well as the composition of those categories, can have a significant impact on results.
Collapse
Affiliation(s)
- Charles A. Whittaker
- Barbara K. Ostrom (1978) Bioinformatics and Computing Core FacilitySwanson Biotechnology CenterKoch Institute at the Massachusetts Institute of TechnologyCambridgeMassachusetts02139USA
| | - Alper Kucukural
- Bioinformatics CoreUniversity of Massachusetts Medical SchoolWorcesterMassachusetts01605USA
| | - Chris Gates
- BRCF Bioinformatics CoreUniversity of MichiganAnn ArborMichigan48109USA
| | - Owen Michael Wilkins
- Department of Biomedical Data ScienceGeisel School of Medicine at DartmouthHanoverNew Hampshire03755USA
- Dartmouth Cancer CenterDartmouth Hitchcock Medical CenterLebanonNew Hampshire03756USA
| | - George W. Bell
- Bioinformatics and Research ComputingWhitehead InstituteCambridgeMassachusetts02142USA
| | - John N. Hutchinson
- Harvard T.H. Chan School of Public HealthDepartment of BiostatisticsBostonMassachusetts02115USA
| | - Shawn W. Polson
- Bioinformatics CoreCenter for Bioinformatics and Computational BiologyUniversity of DelawareDelaware Biotechnology InstituteNewarkDelaware19713USA
| | - Julie Dragon
- Vermont Integrative Genomics Resource and Vermont Biomedical Research Network Bioinformatic CoreUniversity of VermontBurlingtonVermont05405USA
| |
Collapse
|
9
|
Unlu Yazici M, Marron JS, Bakir-Gungor B, Zou F, Yousef M. Invention of 3Mint for feature grouping and scoring in multi-omics. Front Genet 2023; 14:1093326. [PMID: 37007972 PMCID: PMC10050723 DOI: 10.3389/fgene.2023.1093326] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2022] [Accepted: 02/27/2023] [Indexed: 03/17/2023] Open
Abstract
Advanced genomic and molecular profiling technologies accelerated the enlightenment of the regulatory mechanisms behind cancer development and progression, and the targeted therapies in patients. Along this line, intense studies with immense amounts of biological information have boosted the discovery of molecular biomarkers. Cancer is one of the leading causes of death around the world in recent years. Elucidation of genomic and epigenetic factors in Breast Cancer (BRCA) can provide a roadmap to uncover the disease mechanisms. Accordingly, unraveling the possible systematic connections between-omics data types and their contribution to BRCA tumor progression is crucial. In this study, we have developed a novel machine learning (ML) based integrative approach for multi-omics data analysis. This integrative approach combines information from gene expression (mRNA), microRNA (miRNA) and methylation data. Due to the complexity of cancer, this integrated data is expected to improve the prediction, diagnosis and treatment of disease through patterns only available from the 3-way interactions between these 3-omics datasets. In addition, the proposed method bridges the interpretation gap between the disease mechanisms that drive onset and progression. Our fundamental contribution is the 3 Multi-omics integrative tool (3Mint). This tool aims to perform grouping and scoring of groups using biological knowledge. Another major goal is improved gene selection via detection of novel groups of cross-omics biomarkers. Performance of 3Mint is assessed using different metrics. Our computational performance evaluations showed that the 3Mint classifies the BRCA molecular subtypes with lower number of genes when compared to the miRcorrNet tool which uses miRNA and mRNA gene expression profiles in terms of similar performance metrics (95% Accuracy). The incorporation of methylation data in 3Mint yields a much more focused analysis. The 3Mint tool and all other supplementary files are available at https://github.com/malikyousef/3Mint/.
Collapse
Affiliation(s)
- Miray Unlu Yazici
- Department of Bioengineering, Abdullah Gül University, Kayseri, Türkiye
| | - J. S. Marron
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC, United States
| | - Burcu Bakir-Gungor
- Department of Bioengineering, Abdullah Gül University, Kayseri, Türkiye
- Department of Computer Engineering, Abdullah Gul University, Kayseri, Türkiye
| | - Fei Zou
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center, Zefat Academic College, Zefat, Israel
- *Correspondence: Malik Yousef,
| |
Collapse
|
10
|
Yousef M, Ozdemir F, Jaber A, Allmer J, Bakir-Gungor B. PriPath: identifying dysregulated pathways from differential gene expression via grouping, scoring, and modeling with an embedded feature selection approach. BMC Bioinformatics 2023; 24:60. [PMID: 36823571 PMCID: PMC9947447 DOI: 10.1186/s12859-023-05187-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2022] [Accepted: 02/14/2023] [Indexed: 02/25/2023] Open
Abstract
BACKGROUND Cell homeostasis relies on the concerted actions of genes, and dysregulated genes can lead to diseases. In living organisms, genes or their products do not act alone but within networks. Subsets of these networks can be viewed as modules that provide specific functionality to an organism. The Kyoto encyclopedia of genes and genomes (KEGG) systematically analyzes gene functions, proteins, and molecules and combines them into pathways. Measurements of gene expression (e.g., RNA-seq data) can be mapped to KEGG pathways to determine which modules are affected or dysregulated in the disease. However, genes acting in multiple pathways and other inherent issues complicate such analyses. Many current approaches may only employ gene expression data and need to pay more attention to some of the existing knowledge stored in KEGG pathways for detecting dysregulated pathways. New methods that consider more precompiled information are required for a more holistic association between gene expression and diseases. RESULTS PriPath is a novel approach that transfers the generic process of grouping and scoring, followed by modeling to analyze gene expression with KEGG pathways. In PriPath, KEGG pathways are utilized as the grouping function as part of a machine learning algorithm for selecting the most significant KEGG pathways. A machine learning model is trained to differentiate between diseases and controls using those groups. We have tested PriPath on 13 gene expression datasets of various cancers and other diseases. Our proposed approach successfully assigned biologically and clinically relevant KEGG terms to the samples based on the differentially expressed genes. We have comparatively evaluated the performance of PriPath against other tools, which are similar in their merit. For each dataset, we manually confirmed the top results of PriPath in the literature and found that most predictions can be supported by previous experimental research. CONCLUSIONS PriPath can thus aid in determining dysregulated pathways, which applies to medical diagnostics. In the future, we aim to advance this approach so that it can perform patient stratification based on gene expression and identify druggable targets. Thereby, we cover two aspects of precision medicine.
Collapse
Affiliation(s)
- Malik Yousef
- Department of Information Systems, Zefat Academic College, 13206, Zefat, Israel. .,Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel.
| | - Fatma Ozdemir
- grid.440414.10000 0004 0558 2628Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Turkey ,grid.5570.70000 0004 0490 981XUniversity Institute of Digital Communication Systems, Ruhr-University, Bochum, Germany
| | - Amhar Jaber
- grid.440414.10000 0004 0558 2628Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Turkey
| | - Jens Allmer
- grid.454318.f0000 0004 0431 5034Medical Informatics and Bioinformatics, Institute for Measurement Engineering and Sensor Technology, Hochschule Ruhr West, University of Applied Sciences, Mülheim an der Ruhr, Germany
| | - Burcu Bakir-Gungor
- grid.440414.10000 0004 0558 2628Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Turkey
| |
Collapse
|
11
|
Jabeer A, Temiz M, Bakir-Gungor B, Yousef M. miRdisNET: Discovering microRNA biomarkers that are associated with diseases utilizing biological knowledge-based machine learning. Front Genet 2023; 13:1076554. [PMID: 36712859 PMCID: PMC9877296 DOI: 10.3389/fgene.2022.1076554] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2022] [Accepted: 12/30/2022] [Indexed: 01/14/2023] Open
Abstract
During recent years, biological experiments and increasing evidence have shown that microRNAs play an important role in the diagnosis and treatment of human complex diseases. Therefore, to diagnose and treat human complex diseases, it is necessary to reveal the associations between a specific disease and related miRNAs. Although current computational models based on machine learning attempt to determine miRNA-disease associations, the accuracy of these models need to be improved, and candidate miRNA-disease relations need to be evaluated from a biological perspective. In this paper, we propose a computational model named miRdisNET to predict potential miRNA-disease associations. Specifically, miRdisNET requires two types of data, i.e., miRNA expression profiles and known disease-miRNA associations as input files. First, we generate subsets of specific diseases by applying the grouping component. These subsets contain miRNA expressions with class labels associated with each specific disease. Then, we assign an importance score to each group by using a machine learning method for classification. Finally, we apply a modeling component and obtain outputs. One of the most important outputs of miRdisNET is the performance of miRNA-disease prediction. Compared with the existing methods, miRdisNET obtained the highest AUC value of .9998. Another output of miRdisNET is a list of significant miRNAs for disease under study. The miRNAs identified by miRdisNET are validated via referring to the gold-standard databases which hold information on experimentally verified microRNA-disease associations. miRdisNET has been developed to predict candidate miRNAs for new diseases, where miRNA-disease relation is not yet known. In addition, miRdisNET presents candidate disease-disease associations based on shared miRNA knowledge. The miRdisNET tool and other supplementary files are publicly available at: https://github.com/malikyousef/miRdisNET.
Collapse
Affiliation(s)
- Amhar Jabeer
- Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Turkey
| | - Mustafa Temiz
- Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Turkey
| | - Burcu Bakir-Gungor
- Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Turkey
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
| |
Collapse
|
12
|
Tyagi P, Singh D, Mathur S, Singh A, Ranjan R. Upcoming progress of transcriptomics studies on plants: An overview. FRONTIERS IN PLANT SCIENCE 2022; 13:1030890. [PMID: 36589087 PMCID: PMC9798009 DOI: 10.3389/fpls.2022.1030890] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/29/2022] [Accepted: 10/27/2022] [Indexed: 06/17/2023]
Abstract
Transcriptome sequencing or RNA-Sequencing is a high-resolution, sensitive and high-throughput next-generation sequencing (NGS) approach used to study non-model plants and other organisms. In other words, it is an assembly of RNA transcripts from individual or whole samples of functional and developmental stages. RNA-Seq is a significant technique for identifying gene predictions and mining functional analysis that improves gene ontology understanding mechanisms of biological processes, molecular functions, and cellular components, but there is limited information available on this topic. Transcriptomics research on different types of plants can assist researchers to understand functional genes in better ways and regulatory processes to improve breeding selection and cultivation practices. In recent years, several advancements in RNA-Seq technology have been made for the characterization of the transcriptomes of distinct cell types in biological tissues in an efficient manner. RNA-Seq technologies are briefly introduced and examined in terms of their scientific applications. In a nutshell, it introduces all transcriptome sequencing and analysis techniques, as well as their applications in plant biology research. This review will focus on numerous existing and forthcoming strategies for improving transcriptome sequencing technologies for functional gene mining in various plants using RNA- Seq technology, based on the principles, development, and applications.
Collapse
|
13
|
Qumsiyeh E, Showe L, Yousef M. GediNET for discovering gene associations across diseases using knowledge based machine learning approach. Sci Rep 2022; 12:19955. [PMID: 36402891 PMCID: PMC9675776 DOI: 10.1038/s41598-022-24421-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2022] [Accepted: 11/15/2022] [Indexed: 11/21/2022] Open
Abstract
The most common approaches to discovering genes associated with specific diseases are based on machine learning and use a variety of feature selection techniques to identify significant genes that can serve as biomarkers for a given disease. More recently, the integration in this process of prior knowledge-based approaches has shown significant promise in the discovery of new biomarkers with potential translational applications. In this study, we developed a novel approach, GediNET, that integrates prior biological knowledge to gene Groups that are shown to be associated with a specific disease such as a cancer. The novelty of GediNET is that it then also allows the discovery of significant associations between that specific disease and other diseases. The initial step in this process involves the identification of gene Groups. The Groups are then subjected to a Scoring component to identify the top performing classification Groups. The top-ranked gene Groups are then used to train a Machine Learning Model. The process of Grouping, Scoring and Modelling (G-S-M) is used by GediNET to identify other diseases that are similarly associated with this signature. GediNET identifies these relationships through Disease-Disease Association (DDA) based machine learning. DDA explores novel associations between diseases and identifies relationships which could be used to further improve approaches to diagnosis, prognosis, and treatment. The GediNET KNIME workflow can be downloaded from: https://github.com/malikyousef/GediNET.git or https://kni.me/w/3kH1SQV_mMUsMTS .
Collapse
Affiliation(s)
- Emma Qumsiyeh
- Information Technology Engineering, Al-Quds University, Abu Dis, Palestine.
| | - Louise Showe
- The Wistar Institute, Philadelphia, PA, 19104, USA
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, 13206, Zefat, Israel.
- Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel.
| |
Collapse
|
14
|
Yousef M, Voskergian D. TextNetTopics: Text Classification Based Word Grouping as Topics and Topics’ Scoring. Front Genet 2022; 13:893378. [PMID: 35795215 PMCID: PMC9251539 DOI: 10.3389/fgene.2022.893378] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2022] [Accepted: 05/25/2022] [Indexed: 11/28/2022] Open
Abstract
Medical document classification is one of the active research problems and the most challenging within the text classification domain. Medical datasets often contain massive feature sets where many features are considered irrelevant, redundant, and add noise, thus, reducing the classification performance. Therefore, to obtain a better accuracy of a classification model, it is crucial to choose a set of features (terms) that best discriminate between the classes of medical documents. This study proposes TextNetTopics, a novel approach that applies feature selection by considering Bag-of-topics (BOT) rather than the traditional approach, Bag-of-words (BOW). Thus our approach performs topic selections rather than words selection. TextNetTopics is based on the generic approach entitled G-S-M (Grouping, Scoring, and Modeling), developed by Yousef and his colleagues and used mainly in biological data. The proposed approach suggests scoring topics to select the top topics for training the classifier. This study applied TextNetTopics to textual data to respond to the CAMDA challenge. TextNetTopics outperforms various feature selection approaches while highly performing when applying the model to the validation data provided by the CAMDA. Additionally, we have applied our algorithm to different textual datasets.
Collapse
Affiliation(s)
- Malik Yousef
- Zefat Academic College, Zefat, Israel
- *Correspondence: Malik Yousef, ; Daniel Voskergian,
| | - Daniel Voskergian
- Computer Engineering Department, Al-Quds University, Jerusalem, Palestine
- *Correspondence: Malik Yousef, ; Daniel Voskergian,
| |
Collapse
|
15
|
Differentially Expressed Genes Reveal the Biomarkers and Molecular Mechanism of Osteonecrosis. JOURNAL OF HEALTHCARE ENGINEERING 2022; 2022:8684137. [PMID: 35035862 PMCID: PMC8759865 DOI: 10.1155/2022/8684137] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/03/2021] [Revised: 12/06/2021] [Accepted: 12/07/2021] [Indexed: 11/18/2022]
Abstract
Osteonecrosis is one of the most refractory orthopedic diseases, which seriously threatens the health of old patients. High-throughput sequencing (HTS) and microarray analysis have confirmed as an effective way for investigating the pathological mechanism of disease. In this study, GSE7716, GSE74089, and GSE123568 were obtained from Gene Expression Omnibus (GEO) database and used to identify differentially expressed genes (DEGs) by R language. Subsequently, the DEGs were analyzed with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment. Moreover, the protein-protein interaction (PPI) network of DEGs was analyzed by STRING database and Cytoscape. The results showed that 318 downregulated genes and 58 upregulated genes were observed in GSE7116; 690 downregulated genes and 1148 upregulated genes were screened from 34183 genes in GSE74089. The DEGs involved in progression of osteonecrosis involved inflammation, immunological rejection, and bacterial infection-related pathways. The GO enrichment showed that osteonecrosis was related with extracellular matrix, external encapsulating structure organization, skeletal system development, immune response activity, cell apoptosis, mononuclear cell differentiation, and serine/threonine kinase activity. Moreover, PPI network showed that the progression of osteonecrosis of the femoral head was related with CCND1, CDH1, ESR1, SPP1, LOX, JUN, ITGA, ABL1, and VEGF, and osteonecrosis of the jaw is related with ACTB, CXCR4, PTPRC, IL1B, CXCL8, TNF, JUN, PTGS2, FOS, and RHOA. In conclusion, this study identified the hub factors and pathways which might play important roles in progression of osteonecrosis and could be used as potential biomarkers for diagnosis and treatment of osteonecrosis.
Collapse
|
16
|
Yousef M, Goy G, Bakir-Gungor B. miRModuleNet: Detecting miRNA-mRNA Regulatory Modules. Front Genet 2022; 13:767455. [PMID: 35495139 PMCID: PMC9039401 DOI: 10.3389/fgene.2022.767455] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2021] [Accepted: 03/24/2022] [Indexed: 12/13/2022] Open
Abstract
Increasing evidence that microRNAs (miRNAs) play a key role in carcinogenesis has revealed the need for elucidating the mechanisms of miRNA regulation and the roles of miRNAs in gene-regulatory networks. A better understanding of the interactions between miRNAs and their mRNA targets will provide a better understanding of the complex biological processes that occur during carcinogenesis. Increased efforts to reveal these interactions have led to the development of a variety of tools to detect and understand these interactions. We have recently described a machine learning approach miRcorrNet, based on grouping and scoring (ranking) groups of genes, where each group is associated with a miRNA and the group members are genes with expression patterns that are correlated with this specific miRNA. The miRcorrNet tool requires two types of -omics data, miRNA and mRNA expression profiles, as an input file. In this study we describe miRModuleNet, which groups mRNA (genes) that are correlated with each miRNA to form a star shape, which we identify as a miRNA-mRNA regulatory module. A scoring procedure is then applied to each module to further assess their contribution in terms of classification. An important output of miRModuleNet is that it provides a hierarchical list of significant miRNA-mRNA regulatory modules. miRModuleNet was further validated on external datasets for their disease associations, and functional enrichment analysis was also performed. The application of miRModuleNet aids the identification of functional relationships between significant biomarkers and reveals essential pathways involved in cancer pathogenesis. The miRModuleNet tool and all other supplementary files are available at https://github.com/malikyousef/miRModuleNet/
Collapse
Affiliation(s)
- Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- *Correspondence: Malik Yousef,
| | - Gokhan Goy
- Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Turkey
- The Scientific and Technological Research Council of Turkey, Ankara, Turkey
| | - Burcu Bakir-Gungor
- Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Turkey
| |
Collapse
|
17
|
Prediction of Linear Cationic Antimicrobial Peptides Active against Gram-Negative and Gram-Positive Bacteria Based on Machine Learning Models. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12073631] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Antimicrobial peptides (AMPs) are considered as promising alternatives to conventional antibiotics in order to overcome the growing problems of antibiotic resistance. Computational prediction approaches receive an increasing interest to identify and design the best candidate AMPs prior to the in vitro tests. In this study, we focused on the linear cationic peptides with non-hemolytic activity, which are downloaded from the Database of Antimicrobial Activity and Structure of Peptides (DBAASP). Referring to the MIC (Minimum inhibition concentration) values, we have assigned a positive label to a peptide if it shows antimicrobial activity; otherwise, the peptide is labeled as negative. Here, we focused on the peptides showing antimicrobial activity against Gram-negative and against Gram-positive bacteria separately, and we created two datasets accordingly. Ten different physico-chemical properties of the peptides are calculated and used as features in our study. Following data exploration and data preprocessing steps, a variety of classification algorithms are used with 100-fold Monte Carlo Cross-Validation to build models and to predict the antimicrobial activity of the peptides. Among the generated models, Random Forest has resulted in the best performance metrics for both Gram-negative dataset (Accuracy: 0.98, Recall: 0.99, Specificity: 0.97, Precision: 0.97, AUC: 0.99, F1: 0.98) and Gram-positive dataset (Accuracy: 0.95, Recall: 0.95, Specificity: 0.95, Precision: 0.90, AUC: 0.97, F1: 0.92) after outlier elimination is applied. This prediction approach might be useful to evaluate the antibacterial potential of a candidate peptide sequence before moving to the experimental studies.
Collapse
|
18
|
Bhosale H, Ramakrishnan V, Jayaraman VK. Support vector machine-based prediction of pore-forming toxins (PFT) using distributed representation of reduced alphabets. J Bioinform Comput Biol 2021; 19:2150028. [PMID: 34693886 DOI: 10.1142/s0219720021500281] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
Bacterial virulence can be attributed to a wide variety of factors including toxins that harm the host. Pore-forming toxins are one class of toxins that confer virulence to the bacteria and are one of the promising targets for therapeutic intervention. In this work, we develop a sequence-based machine learning framework for the prediction of pore-forming toxins. For this, we have used distributed representation of the protein sequence encoded by reduced alphabet schemes based on conformational similarity and hydropathy index as input features to Support Vector Machines (SVMs). The choice of conformational similarity and hydropathy indices is based on the functional mechanism of pore-forming toxins. Our methodology achieves about 81% accuracy indicating that conformational similarity, an indicator of the flexibility of amino acids, along with hydrophobic index can capture the intrinsic features of pore-forming toxins that distinguish it from other types of transporter proteins. Increased understanding of the mechanisms of pore-forming toxins can further contribute to the use of such "mechanism-informed" features that may increase the prediction accuracy further.
Collapse
Affiliation(s)
- Hrushikesh Bhosale
- Department of Computer Science, FLAME University, Pune, Maharashtra, India
| | - Vigneshwar Ramakrishnan
- School of Chemical & Biotechnology, SASTRA Deemed-to-be University, Thanjavur, Tamilnadu, India
| | - Valadi K Jayaraman
- Department of Computer Science, FLAME University, Pune, Maharashtra, India
| |
Collapse
|
19
|
Yousef M, Goy G, Mitra R, Eischen CM, Jabeer A, Bakir-Gungor B. miRcorrNet: machine learning-based integration of miRNA and mRNA expression profiles, combined with feature grouping and ranking. PeerJ 2021; 9:e11458. [PMID: 34055490 PMCID: PMC8140596 DOI: 10.7717/peerj.11458] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2020] [Accepted: 04/25/2021] [Indexed: 11/20/2022] Open
Abstract
A better understanding of disease development and progression mechanisms at the molecular level is critical both for the diagnosis of a disease and for the development of therapeutic approaches. The advancements in high throughput technologies allowed to generate mRNA and microRNA (miRNA) expression profiles; and the integrative analysis of these profiles allowed to uncover the functional effects of RNA expression in complex diseases, such as cancer. Several researches attempt to integrate miRNA and mRNA expression profiles using statistical methods such as Pearson correlation, and then combine it with enrichment analysis. In this study, we developed a novel tool called miRcorrNet, which performs machine learning-based integration to analyze miRNA and mRNA gene expression profiles. miRcorrNet groups mRNAs based on their correlation to miRNA expression levels and hence it generates groups of target genes associated with each miRNA. Then, these groups are subject to a rank function for classification. We have evaluated our tool using miRNA and mRNA expression profiling data downloaded from The Cancer Genome Atlas (TCGA), and performed comparative evaluation with existing tools. In our experiments we show that miRcorrNet performs as good as other tools in terms of accuracy (reaching more than 95% AUC value). Additionally, miRcorrNet includes ranking steps to separate two classes, namely case and control, which is not available in other tools. We have also evaluated the performance of miRcorrNet using a completely independent dataset. Moreover, we conducted a comprehensive literature search to explore the biological functions of the identified miRNAs. We have validated our significantly identified miRNA groups against known databases, which yielded about 90% accuracy. Our results suggest that miRcorrNet is able to accurately prioritize pan-cancer regulating high-confidence miRNAs. miRcorrNet tool and all other supplementary files are available at https://github.com/malikyousef/miRcorrNet.
Collapse
Affiliation(s)
- Malik Yousef
- Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel.,Department of Information Systems, Zefat Academic College, Zefat, Israel
| | - Gokhan Goy
- Department of Computer Engineering, Abdullah Gül University, Kayseri, Turkey
| | - Ramkrishna Mitra
- Department of Cancer Biology, Sidney Kimmel Cancer Center, Thomas Jefferson University, Philadelphia, Pennsylvania, USA
| | - Christine M Eischen
- Department of Cancer Biology, Sidney Kimmel Cancer Center, Thomas Jefferson University, Philadelphia, Pennsylvania, USA
| | - Amhar Jabeer
- Department of Computer Engineering, Abdullah Gül University, Kayseri, Turkey
| | - Burcu Bakir-Gungor
- Department of Computer Engineering, Abdullah Gül University, Kayseri, Turkey
| |
Collapse
|