1
Durge AR, Shrimankar DD. DHFS-ECM: Design of a Dual Heuristic Feature Selection-based Ensemble Classification Model for the Identification of Bamboo Species from Genomic Sequences. Curr Genomics 2024; 25:185-201. [PMID: 39087000 PMCID: PMC11288165 DOI: 10.2174/0113892029268176240125055419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Revised: 01/16/2024] [Accepted: 01/16/2024] [Indexed: 08/02/2024]
Abstract
Background: Analyzing genomic sequences plays a crucial role in understanding biological diversity and classifying Bamboo species. Existing methods for genomic sequence analysis suffer from limitations such as complexity, low accuracy, and the need for constant reconfiguration in response to evolving genomic datasets. Aim: This study addresses these limitations by introducing a novel Dual Heuristic Feature Selection-based Ensemble Classification Model (DHFS-ECM) for the precise identification of Bamboo species from genomic sequences. Methods: The proposed DHFS-ECM method employs a Genetic Algorithm to perform dual heuristic feature selection. This process maximizes inter-class variance, leading to the selection of informative N-gram feature sets. Subsequently, intra-class variance levels are used to create optimal training and validation sets, ensuring comprehensive coverage of class-specific features. The selected features are then processed through an ensemble classification layer, combining multiple stratification models for species-specific categorization. Results: Comparative analysis with state-of-the-art methods demonstrates that DHFS-ECM achieves remarkable improvements in accuracy (9.5%), precision (5.9%), recall (8.5%), and AUC performance (4.5%). Importantly, the model maintains its performance even with an increased number of species classes due to the continuous learning facilitated by the Dual Heuristic Genetic Algorithm Model. Conclusion: DHFS-ECM offers several key advantages, including efficient feature extraction, reduced model complexity, enhanced interpretability, and increased robustness and accuracy through the ensemble classification layer. These attributes make DHFS-ECM a promising tool for real-time clinical applications and a valuable contribution to the field of genomic sequence analysis.
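To make the idea of selecting N-gram features by inter-class variance concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the Genetic Algorithm search described in the abstract is replaced here by a plain ranking of N-grams by the variance of their per-class mean frequency, and the function names (`ngram_profile`, `inter_class_variance`, `top_ngrams`) and the toy sequences are assumptions made only for illustration.

```python
from collections import Counter
from statistics import mean, pvariance


def ngram_profile(seq, n=3):
    """Relative frequency of each overlapping n-gram in one sequence."""
    seq = seq.upper()
    total = len(seq) - n + 1
    if total <= 0:
        return {}  # sequence shorter than n: no n-grams
    counts = Counter(seq[i:i + n] for i in range(total))
    return {gram: c / total for gram, c in counts.items()}


def inter_class_variance(profiles_by_class):
    """Score each n-gram by the variance of its mean frequency across classes.

    profiles_by_class maps a class label to a list of n-gram profiles.
    Simplification: a direct score, not a Genetic Algorithm search.
    """
    grams = {g for profiles in profiles_by_class.values()
             for profile in profiles for g in profile}
    scores = {}
    for g in grams:
        class_means = [mean(p.get(g, 0.0) for p in profiles)
                       for profiles in profiles_by_class.values()]
        scores[g] = pvariance(class_means)
    return scores


def top_ngrams(profiles_by_class, k=5):
    """Return the k n-grams with the highest inter-class variance."""
    scores = inter_class_variance(profiles_by_class)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


# Toy input (hypothetical sequences, not data from the paper).
toy = {
    "species_a": [ngram_profile("AAAAAA", 2), ngram_profile("AAAACA", 2)],
    "species_b": [ngram_profile("CCCCCC", 2), ngram_profile("CCCCAC", 2)],
}
selected = top_ngrams(toy, k=2)
```

An N-gram that is frequent in one class and rare in the others receives a high score, which is the property the abstract attributes to the selected feature sets. A heuristic search such as a Genetic Algorithm would instead evaluate whole subsets of N-grams rather than score each one independently.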
Affiliation(s)
- Aditi R Durge
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India
- Deepti D Shrimankar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India
2
Durge AR, Shrimankar DD, Sawarkar AD. Heuristic Analysis of Genomic Sequence Processing Models for High Efficiency Prediction: A Statistical Perspective. Curr Genomics 2022; 23:299-317. [PMID: 36778194 PMCID: PMC9878859 DOI: 10.2174/1389202923666220927105311] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/25/2022] [Revised: 08/29/2022] [Accepted: 09/01/2022] [Indexed: 11/22/2022]
Abstract
Genome sequences indicate a wide variety of characteristics, including species and sub-species type, genotype, diseases, growth indicators, and yield quality. To analyze and study the characteristics of genome sequences across different species, researchers have proposed various deep learning models, such as Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), and Multilayer Perceptrons (MLPs), which vary in evaluation performance, area of application, and the species processed. Because the algorithmic implementations differ so widely, it is difficult for research programmers to select the best genome processing model for their application. To facilitate this selection, the paper reviews a wide variety of such models and compares their performance in terms of accuracy, area of application, computational complexity, processing delay, precision, and recall. The review thus presents various deep learning and machine learning models that achieve different accuracies for different applications. For multiple genomic data, Repeated Incremental Pruning to Produce Error Reduction with Support Vector Machine (Ripper SVM) achieves 99.7% accuracy, and for cancer genomic data, the CNN Bayesian method achieves 99.27% accuracy. For Covid genome analysis, Bidirectional Long Short-Term Memory with CNN (BiLSTM CNN) exhibits the highest accuracy, 99.95%. A similar analysis of the precision and recall of the different models is also reviewed. Finally, the paper concludes with some observations on the genomic processing models and recommends applications for their efficient use.
Affiliation(s)
- Aditi R. Durge
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India
- Deepti D. Shrimankar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India. Address correspondence to this author at the Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India; Tel: 9860606477; E-mail:
- Ankush D. Sawarkar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India
3
Gutiérrez-Cárdenas J, Wang Z. Prediction of binding miRNAs involved with immune genes to the SARS-CoV-2 by using sequence features extraction and One-class SVM. Informatics in Medicine Unlocked 2022; 30:100958. [PMID: 35528315 PMCID: PMC9057929 DOI: 10.1016/j.imu.2022.100958] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2022] [Revised: 04/25/2022] [Accepted: 04/25/2022] [Indexed: 10/24/2022]
Abstract
The prediction of host human miRNA binding to the SARS-CoV-2 RNA sequence is of particular interest. This biological process could lead to virus repression, serve as a biomarker for diagnosis, or point to potential treatments for this disease. One source of concern is attempting to uncover the viral regions in which this binding could occur, as well as how this miRNA binding could affect the SARS-CoV-2 virus's processes. Using sequence features extracted from this base pairing, we predicted the relationships between miRNAs that interact with genes involved in immune function and bind to the SARS-CoV-2 genome in its 5' UTR region. We compared two supervised models, SVM and Random Forest, with an unsupervised One-Class SVM. When the confusion matrices were inspected, the results of the supervised models were misleading, resulting in a Type II error. However, with the latter model, we achieved an average accuracy of 92%, sensitivity of 96.18%, and specificity of 78%. We hypothesize that studying the binding of miRNAs that affect immunological genes and bind to the SARS-CoV-2 virus will lead to potential genetic therapies for fighting the disease or understanding how the immune system is affected when this type of viral infection occurs.
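The remark that a confusion matrix can expose misleading results is easier to see with numbers. The sketch below, in Python, computes accuracy, sensitivity, and specificity from confusion-matrix counts. The function name `confusion_metrics` and all counts are hypothetical, chosen only to illustrate how a model dominated by false negatives (Type II errors) can still report a high accuracy; they are not the values from the study.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: positives predicted correctly / missed (fn = Type II errors).
    tn/fp: negatives predicted correctly / flagged by mistake (fp = Type I errors).
    """
    total = tp + fn + tn + fp
    positives = tp + fn
    negatives = tn + fp
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / positives if positives else 0.0,
        "specificity": tn / negatives if negatives else 0.0,
    }


# Hypothetical counts: the classifier labels every sample as negative.
# All 10 true positives are missed, yet 90 of 100 samples are "correct".
all_negative = confusion_metrics(tp=0, fn=10, tn=90, fp=0)

# Hypothetical counts for a model that finds most positives.
balanced = confusion_metrics(tp=9, fn=1, tn=80, fp=10)
```

In the first case accuracy is 0.9 while sensitivity is 0.0, which is why the abstract reports sensitivity and specificity alongside accuracy rather than accuracy alone.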
Affiliation(s)
- Juan Gutiérrez-Cárdenas
- Universidad de Lima, Lima, Peru
- College of Science, Engineering and Technology, University of South Africa, Florida, 1710, South Africa
- Zenghui Wang
- College of Science, Engineering and Technology, University of South Africa, Florida, 1710, South Africa
4
Quillet A, Anouar Y, Lecroq T, Dubessy C. Prediction methods for microRNA targets in bilaterian animals: Toward a better understanding by biologists. Comput Struct Biotechnol J 2021; 19:5811-5825. [PMID: 34765096 PMCID: PMC8567327 DOI: 10.1016/j.csbj.2021.10.025] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/16/2021] [Revised: 09/20/2021] [Accepted: 10/15/2021] [Indexed: 12/13/2022]
Abstract
MicroRNAs (miRNAs) are small noncoding RNAs that regulate gene expression at the posttranscriptional level. Because of their wide network of interactions, miRNAs have become the focus of many studies over the past decade, particularly in animal species. To streamline the number of potential wet lab experiments, the use of miRNA target prediction tools is currently the first step undertaken. However, the predictions made may vary considerably depending on the tool used, which is mostly due to the complex and still not fully understood mechanism of action of miRNAs. The discrepancies complicate the choice of the tool for miRNA target prediction. To provide a comprehensive view of this issue, we highlight in this review the main characteristics of miRNA-target interactions in bilaterian animals, describe the prediction models currently used, and provide some insights for the evaluation of predictor performance.
Affiliation(s)
- Aurélien Quillet
- Normandie Université, UNIROUEN, INSERM, Laboratoire Différenciation et Communication Neuronale et Neuroendocrine, 76000 Rouen, France
- Youssef Anouar
- Normandie Université, UNIROUEN, INSERM, Laboratoire Différenciation et Communication Neuronale et Neuroendocrine, 76000 Rouen, France
- Thierry Lecroq
- Normandie Université, UNIROUEN, UNIHAVRE, INSA Rouen, Laboratoire d'Informatique du Traitement de l'Information et des Systèmes, 76000 Rouen, France
- Christophe Dubessy
- Normandie Université, UNIROUEN, INSERM, Laboratoire Différenciation et Communication Neuronale et Neuroendocrine, 76000 Rouen, France; Normandie Université, UNIROUEN, INSERM, PRIMACEN, 76000 Rouen, France
5
Abstract
Teaching the fundamentals of computer programming in a first course (CS1) is a complex and challenging activity for the professor. Nowadays, there are several teaching strategies for dealing with a CS1 at the university, one of which is the use of analogies to support the abstraction process that a student needs to carry out in order to appropriate fundamental concepts. This article presents the results of applying a discovery model that allowed for the extraction of patterns, linguistic analysis, textual analytics, and linked data when professors use analogies to teach the fundamental concepts of programming in a CS1 in university programs that train software developers. To that end, a discovery model based on machine learning and text mining was proposed, using natural language processing techniques for semantic vector space modeling, distributional semantics, and the generation of synthetic data. The discovery process was carried out using nine supervised learning methods, three unsupervised learning methods, and one semi-supervised learning method involving linguistic analysis techniques, text analytics, and linked data. The main findings showed that professors include keywords, which are part of the technical computer terminology, in the form of verbs in the statement of the analogy and combine them in quantitative contexts with neutral or positive phrases, where numerical examples, cooking recipes, and games were the most used categories. Finally, a structure is proposed for the construction of analogies to teach programming concepts, and this was validated by the professors and students.
6
Quillet A, Saad C, Ferry G, Anouar Y, Vergne N, Lecroq T, Dubessy C. Improving Bioinformatics Prediction of microRNA Targets by Ranks Aggregation. Front Genet 2020; 10:1330. [PMID: 32047509 PMCID: PMC6997536 DOI: 10.3389/fgene.2019.01330] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Received: 02/13/2019] [Accepted: 12/05/2019] [Indexed: 12/18/2022]
Abstract
microRNAs are noncoding RNAs which downregulate a large number of target mRNAs and modulate cell activity. Despite continued progress, bioinformatics prediction of microRNA targets remains a challenge since available software still suffers from a lack of accuracy and sensitivity. Moreover, these tools give fairly inconsistent results. Thus, in an attempt to circumvent these difficulties, we aggregated all human results of four important prediction algorithms (miRanda, PITA, SVmicrO, and TargetScan) showing additional characteristics in order to rerank them into a single list. Instead of deciding which prediction tool to use, our method helps biologists get the best microRNA target predictions from all aggregated databases. The resulting database is freely available through a webtool called miRabel, which can take a list of miRNAs, genes, or signaling pathways as search inputs. Receiver operating characteristic curve and precision-recall curve analyses carried out using experimentally validated data and very large data sets show that miRabel significantly improves the prediction of miRNA targets compared to the four algorithms used separately. Moreover, using the same analytical methods, miRabel shows significantly better predictions than other popular algorithms such as MBSTAR, miRWalk, ExprTarget and miRMap. Interestingly, an F-score analysis revealed that miRabel also significantly improves the relevance of the top results. The aggregation of results from different databases is therefore a powerful approach that can be generalized to many other species to improve miRNA target predictions. Thus, miRabel is an efficient tool to guide biologists in their search for miRNA targets and integrate them into a biological context.
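As a rough illustration of reranking several tools' predictions into a single list, here is a minimal mean-rank aggregation sketch in Python. The abstract does not specify the aggregation procedure used by miRabel, so this is a generic stand-in and not the published method. The function name `aggregate_ranks`, the tool labels, the target labels, and the rule of assigning a missing item the rank just past the end of a tool's list are all assumptions.

```python
def aggregate_ranks(rankings):
    """Merge ranked lists into one list ordered by mean rank.

    rankings maps a tool name to its predictions, best first.
    An item absent from a tool's list gets that list's length + 1
    as a penalty rank (an assumption of this sketch).
    """
    # Precompute 1-based positions per tool, keeping the first occurrence.
    positions = {}
    for tool, ranked in rankings.items():
        pos = {}
        for index, item in enumerate(ranked, start=1):
            pos.setdefault(item, index)
        positions[tool] = pos

    items = {item for pos in positions.values() for item in pos}
    mean_rank = {}
    for item in items:
        ranks = [pos.get(item, len(rankings[tool]) + 1)
                 for tool, pos in positions.items()]
        mean_rank[item] = sum(ranks) / len(ranks)

    # Lower mean rank is better; ties are broken alphabetically.
    return sorted(items, key=lambda item: (mean_rank[item], item))


# Hypothetical predictions from three unnamed tools.
predictions = {
    "tool_a": ["target_1", "target_2", "target_3"],
    "tool_b": ["target_2", "target_1"],
    "tool_c": ["target_2", "target_3", "target_1"],
}
merged = aggregate_ranks(predictions)
```

With these toy inputs, `target_2` comes first because two of the three tools rank it highest, which shows how aggregation can promote a target that no single tool would settle on its own.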
Affiliation(s)
- Aurélien Quillet
- Normandie Univ, UNIROUEN, INSERM, Laboratoire Différenciation et Communication Neuronale et Neuroendocrine, Rouen, France
- Chadi Saad
- Normandie Univ, UNIROUEN, INSERM, Laboratoire Différenciation et Communication Neuronale et Neuroendocrine, Rouen, France
- Gaëtan Ferry
- Normandie Univ, UNIROUEN, UNIHAVRE, INSA Rouen, Laboratoire d'Informatique du Traitement de l'Information et des Systèmes, Rouen, France
- Youssef Anouar
- Normandie Univ, UNIROUEN, INSERM, Laboratoire Différenciation et Communication Neuronale et Neuroendocrine, Rouen, France
- Nicolas Vergne
- Normandie Univ, UNIROUEN, CNRS, Laboratoire de Mathématiques Raphaël Salem, Rouen, France
- Thierry Lecroq
- Normandie Univ, UNIROUEN, UNIHAVRE, INSA Rouen, Laboratoire d'Informatique du Traitement de l'Information et des Systèmes, Rouen, France
- Christophe Dubessy
- Normandie Univ, UNIROUEN, INSERM, Laboratoire Différenciation et Communication Neuronale et Neuroendocrine, Rouen, France