1
|
Karimi-Fard A, Saidi A, TohidFar M, Emami SN. Novel candidate genes for environmental stresses response in Synechocystis sp. PCC 6803 revealed by machine learning algorithms. Braz J Microbiol 2024; 55:1219-1229. [PMID: 38705959 PMCID: PMC11153407 DOI: 10.1007/s42770-024-01338-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Accepted: 04/03/2024] [Indexed: 05/07/2024]
Abstract
Cyanobacteria have developed acclimation strategies to adapt to harsh environments, making them useful model organisms. Understanding the molecular mechanisms of tolerance to abiotic stresses can help elucidate how cells change their gene expression patterns in response to stress. Recent advances in sequencing techniques and bioinformatics analysis methods have led to the discovery of many genes involved in stress response. Synechocystis sp. PCC 6803 is a suitable microorganism for studying the transcriptome response under environmental stress. Therefore, for the first time, we employed two effective feature selection techniques, namely support vector machine recursive feature elimination (SVM-RFE) and LASSO (least absolute shrinkage and selection operator), to pinpoint the crucial genes responsive to environmental stresses in Synechocystis sp. PCC 6803. We applied these machine learning algorithms to transcriptomic data of Synechocystis sp. PCC 6803 under distinct conditions, encompassing light, salt, and iron stress. Seven candidate genes, namely sll1862, slr0650, sll0760, slr0091, ssl3044, slr1285, and slr1687, were selected by both the LASSO and SVM-RFE algorithms. RNA-seq analysis was performed to validate the efficiency of our feature selection approach in selecting the most important genes. The RNA-seq analysis revealed significantly high expression for five genes, namely sll1862, slr1687, ssl3044, slr1285, and slr0650, under ion stress condition. Among these five genes, ssl3044 and slr0650 could be introduced as new potential candidate genes for further confirmatory genetic studies to determine their roles in the response to abiotic stresses.
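The consensus idea in this abstract (keep only the genes picked by both LASSO and SVM-RFE) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the synthetic data, the `consensus_features` helper, the `alpha` value, and the number of features kept by RFE are all assumptions, and plain `Lasso` regression on 0/1 class labels stands in for whatever LASSO variant the study used. It assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def consensus_features(X, y, n_rfe=10, lasso_alpha=0.05):
    """Return (lasso_idx, rfe_idx, shared) as sets of column indices.

    Hypothetical helper: the parameter values are illustrative only.
    """
    # Standardize so that coefficient magnitudes are comparable across genes.
    X_std = StandardScaler().fit_transform(X)

    # LASSO: features with a non-zero coefficient count as selected.
    lasso = Lasso(alpha=lasso_alpha, max_iter=10000).fit(X_std, y)
    lasso_idx = set(np.flatnonzero(lasso.coef_).tolist())

    # SVM-RFE: repeatedly drop the feature with the smallest linear-SVM weight.
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=n_rfe, step=1)
    rfe.fit(X_std, y)
    rfe_idx = set(np.flatnonzero(rfe.support_).tolist())

    # Consensus: features chosen by both selectors.
    return lasso_idx, rfe_idx, lasso_idx & rfe_idx


# Synthetic stand-in for an expression matrix (samples x genes).
X, y = make_classification(n_samples=60, n_features=40, n_informative=5,
                           n_redundant=0, random_state=0)
lasso_idx, rfe_idx, shared = consensus_features(X, y)
print("selected by both:", sorted(shared))
```

Taking the intersection trades recall for stability: a gene has to survive two selectors with different inductive biases before it is reported as a candidate.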
Affiliation(s)
- Abbas Karimi-Fard
- Department of Cell and Molecular Biology, Faculty of Life Sciences and Biotechnology, Shahid Beheshti University, Tehran, Iran
| | - Abbas Saidi
- Department of Cell and Molecular Biology, Faculty of Life Sciences and Biotechnology, Shahid Beheshti University, Tehran, Iran.
| | - Masoud TohidFar
- Department of Cell and Molecular Biology, Faculty of Life Sciences and Biotechnology, Shahid Beheshti University, Tehran, Iran.
| | - Seyedeh Noushin Emami
- Department of Molecular Biosciences, Wenner-Gren Institute, Stockholm University, Stockholm, Sweden
|
2
|
Nekouie N, Romoozi M, Esmaeili M. A New Evolutionary Ensemble Learning of Multimodal Feature Selection from Microarray Data. Neural Process Lett 2023. [DOI: 10.1007/s11063-023-11159-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/15/2023]
|
3
|
An in-depth and contrasting survey of meta-heuristic approaches with classical feature selection techniques specific to cervical cancer. Knowl Inf Syst 2023. [DOI: 10.1007/s10115-022-01825-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/22/2023]
|
4
|
Alnazer I, Falou O, Bourdon P, Urruty T, Guillevin R, Khalil M, Shahin A, Fernandez-Maloigne C. Usefulness of computed tomography textural analysis in renal cell carcinoma nuclear grading. J Med Imaging (Bellingham) 2022; 9:054501. [PMID: 36120414 PMCID: PMC9467905 DOI: 10.1117/1.jmi.9.5.054501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2022] [Accepted: 08/24/2022] [Indexed: 09/15/2023]
Abstract
Purpose: To evaluate the usefulness of computed tomography (CT) texture descriptors integrated with machine-learning (ML) models in the identification of clear cell renal cell carcinoma (ccRCC) and, for the first time, papillary renal cell carcinoma (pRCC) tumor nuclear grades [World Health Organization (WHO)/International Society of Urologic Pathologists (ISUP) 1, 2, 3, and 4]. Approach: A total of 143 ccRCC and 21 pRCC patients were analyzed in this study. Texture features were extracted from late arterial phase CT images. A complete separation of training/validation and testing subsets from the beginning to the end of the pipeline was adopted. Feature dimension was reduced by collinearity analysis and Gini impurity-based feature selection. The synthetic minority over-sampling technique was employed for imbalanced datasets. The ML classifiers were logistic regression, SVM, RF, multi-layer perceptron, and k-NN. The differentiation between low grades/high grades, grade 1/grade 2, grade 3/grade 4, and between all grades was assessed for ccRCC and pRCC datasets. The classification performance was assessed and compared using several metrics. Results: Texture-based classifiers were able to efficiently identify ccRCC and pRCC grades. An accuracy and area under the receiver operating characteristic curve (AUC) of up to 91%/0.9, 91%/0.9, 90%/0.9, and 88%/1 were reached when discriminating ccRCC low grades/high grades, grade 1/grade 2, grade 3/grade 4, and all grades, respectively. An accuracy and AUC of up to 96%/1, 81%/0.8, 86%/0.9, and 88%/0.9 were found when differentiating pRCC low grades/high grades, grade 1/grade 2, grade 3/grade 4, and all grades, respectively. Conclusion: CT texture-based ML models can be used to assist radiologists in predicting the WHO/ISUP grade of ccRCC and pRCC pre-operatively.
Affiliation(s)
- Israa Alnazer
- Université de Poitiers, XLIM-ICONES, UMR CNRS 7252, Poitiers, France
- Laboratoire commun CNRS/SIEMENS I3M, Poitiers, France
- Lebanese University, AZM Center for Research in Biotechnology and Its Applications, EDST, Tripoli, Lebanon
| | - Omar Falou
- Lebanese University, AZM Center for Research in Biotechnology and Its Applications, EDST, Tripoli, Lebanon
- American University of Culture and Education, Koura, Lebanon
- Lebanese University, Faculty of Science, Tripoli, Lebanon
- Centre Hospitalier Universitaire de Poitiers, Poitiers, France
| | - Pascal Bourdon
- Université de Poitiers, XLIM-ICONES, UMR CNRS 7252, Poitiers, France
- Laboratoire commun CNRS/SIEMENS I3M, Poitiers, France
| | - Thierry Urruty
- Université de Poitiers, XLIM-ICONES, UMR CNRS 7252, Poitiers, France
- Laboratoire commun CNRS/SIEMENS I3M, Poitiers, France
| | - Rémy Guillevin
- Laboratoire commun CNRS/SIEMENS I3M, Poitiers, France
- Centre Hospitalier Universitaire de Poitiers, Poitiers, France
| | - Mohamad Khalil
- Lebanese University, AZM Center for Research in Biotechnology and Its Applications, EDST, Tripoli, Lebanon
| | - Ahmad Shahin
- Lebanese University, AZM Center for Research in Biotechnology and Its Applications, EDST, Tripoli, Lebanon
| | - Christine Fernandez-Maloigne
- Université de Poitiers, XLIM-ICONES, UMR CNRS 7252, Poitiers, France
- Laboratoire commun CNRS/SIEMENS I3M, Poitiers, France
|
5
|
Zanella L, Facco P, Bezzo F, Cimetta E. Feature Selection and Molecular Classification of Cancer Phenotypes: A Comparative Study. Int J Mol Sci 2022; 23:ijms23169087. [PMID: 36012350 PMCID: PMC9408964 DOI: 10.3390/ijms23169087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2022] [Revised: 08/09/2022] [Accepted: 08/11/2022] [Indexed: 11/16/2022]
Abstract
The classification of high dimensional gene expression data is key to the development of effective diagnostic and prognostic tools. Feature selection involves finding the best subset with the highest power in predicting class labels. Here, we conducted a comparative study focused on different combinations of feature selectors (Chi-Squared, mRMR, Relief-F, and Genetic Algorithms) and classification learning algorithms (Random Forests, PLS-DA, SVM, Regularized Logistic/Multinomial Regression, and kNN) to identify those with the best predictive capacity. The performance of each combination is evaluated through an empirical study on three benchmark cancer-related microarray datasets. Our results first suggest that the quality of the data relevant to the target classes is key for the successful classification of cancer phenotypes. We also proved that, for a given classification learning algorithm and dataset, all filters have a similar performance. Interestingly, filters achieve comparable or even better results with respect to the GA-based wrappers, while also being easier and faster to implement. Taken together, our findings suggest that simple, well-established feature selectors in combination with optimized classifiers guarantee good performances, with no need for complicated and computationally demanding methodologies.
Affiliation(s)
- Luca Zanella
- Department of Industrial Engineering (DII), University of Padova, 35131 Padova, Italy
| | - Pierantonio Facco
- Department of Industrial Engineering (DII), University of Padova, 35131 Padova, Italy
| | - Fabrizio Bezzo
- Department of Industrial Engineering (DII), University of Padova, 35131 Padova, Italy
| | - Elisa Cimetta
- Department of Industrial Engineering (DII), University of Padova, 35131 Padova, Italy
- Fondazione Istituto di Ricerca Pediatrica Città della Speranza (IRP), 35127 Padova, Italy
|
6
|
Vijayan A, Fatima S, Sowmya A, Vafaee F. Blood-based transcriptomic signature panel identification for cancer diagnosis: benchmarking of feature extraction methods. Brief Bioinform 2022; 23:6658855. [PMID: 35945147 DOI: 10.1093/bib/bbac315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Revised: 07/11/2022] [Accepted: 07/12/2022] [Indexed: 11/13/2022]
Abstract
Liquid biopsy has shown promise for cancer diagnosis due to its minimally invasive nature and the potential for novel biomarker discovery. However, the low concentration of relevant blood-based biosources and the heterogeneity of samples (i.e. the variability of relative abundance of molecules identified), pose major challenges to biomarker discovery. Moreover, the number of molecular measurements or features (e.g. transcript read counts) per sample could be in the order of several thousand, whereas the number of samples is often substantially lower, leading to the curse of dimensionality. These challenges, among others, elucidate the importance of a robust biomarker panel identification or feature extraction step wherein relevant molecular measurements are identified prior to classification for cancer detection. In this work, we performed a benchmarking study on 12 feature extraction methods using transcriptomic profiles derived from different blood-based biosources. The methods were assessed both in terms of their predictive performance and the robustness of the biomarker panels in diagnosing cancer or stratifying cancer subtypes. While performing the comparison, the feature extraction methods are categorized into feature subset selection methods and transformation methods. A transformation feature extraction method, namely partial least square discriminant analysis, was found to perform consistently superior in terms of classification performance. As part of the benchmarking study, a generic pipeline has been created and made available as an R package to ensure reproducibility of the results and allow for easy extension of this study to other datasets (https://github.com/VafaeeLab/bloodbased-pancancer-diagnosis).
Affiliation(s)
- Abhishek Vijayan
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia.,School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia
| | - Shadma Fatima
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia.,Ingham Institute, NSW, Australia
| | - Arcot Sowmya
- School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia.,UNSW Data Science Hub, University of New South Wales, Sydney, NSW, Australia
| | - Fatemeh Vafaee
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia.,UNSW Data Science Hub, University of New South Wales, Sydney, NSW, Australia
|
7
|
Binary Approaches of Quantum-Based Avian Navigation Optimizer to Select Effective Features from High-Dimensional Medical Data. MATHEMATICS 2022. [DOI: 10.3390/math10152770] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 01/16/2023]
Abstract
Many metaheuristic approaches have been developed to select effective features from different medical datasets in a feasible time. However, most of them cannot scale well to large medical datasets, where they fail to maximize the classification accuracy and simultaneously minimize the number of selected features. Therefore, this paper is devoted to developing an efficient binary version of the quantum-based avian navigation optimizer algorithm (QANA) named BQANA, utilizing the scalability of the QANA to effectively select the optimal feature subset from high-dimensional medical datasets using two different approaches. In the first approach, several binary versions of the QANA are developed using S-shaped, V-shaped, U-shaped, Z-shaped, and quadratic transfer functions to map the continuous solutions of the canonical QANA to binary ones. In the second approach, the QANA is mapped to binary space by converting each variable to 0 or 1 using a threshold. To evaluate the proposed algorithm, first, all binary versions of the QANA are assessed on different medical datasets with varied feature sizes, including Pima, HeartEW, Lymphography, SPECT Heart, PenglungEW, Parkinson, Colon, SRBCT, Leukemia, and Prostate tumor. The results show that the BQANA developed by the second approach is superior to other binary versions of the QANA to find the optimal feature subset from the medical datasets. Then, the BQANA was compared with nine well-known binary metaheuristic algorithms, and the results were statistically assessed using the Friedman test. The experimental and statistical results demonstrate that the proposed BQANA has merit for feature selection from medical datasets.
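The two binarization routes described in this abstract (transfer functions versus a plain threshold) can be sketched roughly as below. The abstract does not give the formulas, so the sigmoid S-shape, the `|tanh|` V-shape, the bit-flip rule for the V-shape, and the 0.5 threshold are commonly used textbook choices assumed here, not necessarily those of BQANA. All function names are hypothetical.

```python
import math
import random


def s_shaped(x):
    """Sigmoid transfer function: maps a real value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def v_shaped(x):
    """|tanh| transfer function: large |x| means a high chance of change."""
    return abs(math.tanh(x))


def binarize_s(position, rng):
    """S-shaped rule: set each bit to 1 with probability s_shaped(x)."""
    return [1 if rng.random() < s_shaped(x) else 0 for x in position]


def binarize_v(position, current_bits, rng):
    """V-shaped rule: flip the current bit with probability v_shaped(x)."""
    return [1 - b if rng.random() < v_shaped(x) else b
            for x, b in zip(position, current_bits)]


def binarize_threshold(position, threshold=0.5):
    """Threshold rule: deterministic, 1 if the variable exceeds the threshold."""
    return [1 if x > threshold else 0 for x in position]


rng = random.Random(0)                      # seeded for repeatability
position = [-2.0, 0.1, 0.8, 3.0]            # continuous solution from the optimizer
mask_threshold = binarize_threshold(position)
mask_s = binarize_s(position, rng)
mask_v = binarize_v(position, [0, 0, 0, 0], rng)
print(mask_threshold, mask_s, mask_v)
```

In each case a 1 in the resulting mask means the corresponding feature is kept, so the mask can index the columns of the dataset directly.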
|
8
|
Mining transcriptomic data to identify Saccharomyces cerevisiae signatures related to improved and repressed ethanol production under fermentation. PLoS One 2022; 17:e0259476. [PMID: 35881609 PMCID: PMC9321456 DOI: 10.1371/journal.pone.0259476] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/13/2021] [Accepted: 07/12/2022] [Indexed: 11/19/2022]
Abstract
Saccharomyces cerevisiae is known for its outstanding ability to produce ethanol in industry. Understanding the dynamics of gene expression in S. cerevisiae in response to fermentation could provide the information required to establish any program for improving ethanol production. Thus, as a new approach, this study was conducted to identify the genes that discriminate between improved and repressed ethanol production and to clarify the molecular responses to this process by mining transcriptomic data. The significantly differentially expressed probe sets were extracted from available microarray datasets related to yeast fermentation performance. To identify the probe sets that contribute most to discriminating ethanol content, 11 machine learning algorithms from RapidMiner were employed. Further analyses, including pathway enrichment and regulatory analysis, were performed on the discriminative probe sets. In addition, decision tree models were constructed, the performance of each model was evaluated, and the roots were identified. Based on the results, 171 probe sets were identified by at least 5 attribute weighting algorithms (AWAs), and 17 roots were recognized with 100% performance. Some of the top-ranked probe sets were found to be involved in carbohydrate metabolism, oxidative phosphorylation, and ethanol fermentation. Principal component analysis (PCA) and heatmap clustering validated the top-ranked selected probe sets. In addition, the top-ranked genes were validated on the GSE78759 and GSE5185 datasets. Of all the discriminative probe sets, OLI1 and CYC3 were identified as the roots with the best performance, supported by the largest number of weighting algorithms and linked to the top two significantly enriched pathways, porphyrin biosynthesis and oxidative phosphorylation. ADH5 and PDA1 were also recognized as differential top-ranked genes that contribute to ethanol production. According to the regulatory clustering analysis, Tup1 has a significant effect on the top-ranked target genes CYC3 and ADH5. This study provides a basic understanding of the molecular mechanisms of the S. cerevisiae cell and its responses to two different medium conditions (Mg2+ and Cu2+) during the fermentation process.
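The rule "identified by at least 5 attribute weighting algorithms" is a vote count across selectors. A minimal sketch of that rule is shown below; the selector names, probe-set IDs, and the `consensus_by_votes` helper are made up for illustration, and a smaller vote threshold is used to keep the toy example short. It does not reproduce the RapidMiner workflow.

```python
from collections import Counter


def consensus_by_votes(selections, min_votes):
    """Keep features chosen by at least `min_votes` selectors.

    `selections` maps a selector name to the features it picked.
    """
    # set() ensures a selector cannot vote twice for the same feature.
    votes = Counter(feature
                    for picked in selections.values()
                    for feature in set(picked))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical selector outputs (names and probe IDs are placeholders).
selections = {
    "info_gain": ["probe_1", "probe_2", "probe_3"],
    "gini_index": ["probe_1", "probe_3"],
    "relief": ["probe_3", "probe_4"],
}
kept = consensus_by_votes(selections, min_votes=2)
print(kept)  # ['probe_1', 'probe_3']
```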
|
9
|
Naor-Hoffmann S, Svetlitsky D, Sal-Man N, Orenstein Y, Ziv-Ukelson M. Predicting the pathogenicity of bacterial genomes using widely spread protein families. BMC Bioinformatics 2022; 23:253. [PMID: 35751023 PMCID: PMC9233384 DOI: 10.1186/s12859-022-04777-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2021] [Accepted: 04/13/2022] [Indexed: 11/15/2022]
Abstract
Background The human body is inhabited by a diverse community of commensal non-pathogenic bacteria, many of which are essential for our health. By contrast, pathogenic bacteria have the ability to invade their hosts and cause a disease. Characterizing the differences between pathogenic and commensal non-pathogenic bacteria is important for the detection of emerging pathogens and for the development of new treatments. Previous methods for classification of bacteria as pathogenic or non-pathogenic used either raw genomic reads or protein families as features. Using protein families instead of reads provided a better interpretability of the resulting model. However, the accuracy of protein-families-based classifiers can still be improved. Results We developed a wide scope pathogenicity classifier (WSPC), a new protein-content-based machine-learning classification model. We trained WSPC on a newly curated dataset of 641 bacterial genomes, where each genome belongs to a different species. A comparative analysis we conducted shows that WSPC outperforms existing models on two benchmark test sets. We observed that the most discriminative protein-family features in WSPC are widely spread among bacterial species. These features correspond to proteins that are involved in the ability of bacteria to survive and replicate during an infection, rather than proteins that are directly involved in damaging or invading the host.
Affiliation(s)
- Shaked Naor-Hoffmann
- Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel
| | - Dina Svetlitsky
- Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel
| | - Neta Sal-Man
- The Shraga Segal Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Be'er Sheva, Israel
| | - Yaron Orenstein
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Be'er Sheva, Israel
| | - Michal Ziv-Ukelson
- Department of Computer Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel.
|
10
|
Polewko-Klim A, Zhu S, Wu W, Xie Y, Cai N, Zhang K, Zhu Z, Qing T, Yuan Z, Xu K, Zhang T, Lu M, Ye W, Chen X, Suo C, Rudnicki WR. Identification of Candidate Therapeutic Genes for More Precise Treatment of Esophageal Squamous Cell Carcinoma and Adenocarcinoma. Front Genet 2022; 13:844542. [PMID: 35664298 PMCID: PMC9161154 DOI: 10.3389/fgene.2022.844542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2021] [Accepted: 04/20/2022] [Indexed: 11/23/2022]
Abstract
The standard therapy administered to patients with advanced esophageal cancer remains uniform, even though its two main histological subtypes, namely esophageal squamous cell carcinoma (SCC) and esophageal adenocarcinoma (AC), are increasingly considered to be different. The identification of potential drug target genes that differ between SCC and AC is crucial for more effective treatment of these diseases, given the high toxicity of chemotherapy and resistance to administered medications. Herein we attempted to identify and rank differentially expressed genes (DEGs) in SCC vs. AC using ensemble feature selection methods. RNA-seq data from The Cancer Genome Atlas and the Fudan-Taizhou Institute of Health Sciences (China) were used. Six feature filter algorithms were used to identify DEGs. We built robust predictive models for histological subtypes with the random forest (RF) classification algorithm. Pathway analysis was also performed to investigate the functional role of genes. In total, 294 informative DEGs (87 of them newly discovered) were identified. The areas under the receiver operating characteristic curve (AUC) were higher than 99.5% for all feature selection (FS) methods. Nine genes (i.e., ERBB3, ATP7B, ABCC3, GALNT14, CLDN18, GUCY2C, FGFR4, KCNQ5, and CACNA1B) may play a key role in the development of more directed anticancer therapy for SCC and AC patients. The first four of them are drug targets for chemotherapy and immunotherapy of esophageal cancer and are involved in pharmacokinetics and pharmacodynamics pathways. This research identified novel DEGs in SCC and AC, and detected four potential drug target genes (ERBB3, ATP7B, ABCC3, and GALNT14) and five drug-related genes.
Affiliation(s)
- Aneta Polewko-Klim
- Institute of Computer Science, University in Bialystok, Białystok, Poland
| | - Sibo Zhu
- Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, China
| | - Weicheng Wu
- Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
| | - Yijing Xie
- Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
| | - Ning Cai
- Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
| | - Kexun Zhang
- Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
| | - Zhen Zhu
- Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
| | - Tao Qing
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
| | - Ziyu Yuan
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
| | - Kelin Xu
- Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
| | - Tiejun Zhang
- Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
| | - Ming Lu
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
- Clinical Epidemiology Unit, Qilu Hospital of Shandong University, Jinan, China
| | - Weimin Ye
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
- Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden
| | - Xingdong Chen
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, China
| | - Chen Suo
- Department of Epidemiology, School of Public Health, Fudan University, Shanghai, China
- Fudan-Taizhou Institute of Health Sciences, Taizhou, China
- Shanghai Institute of Infectious Disease and Biosecurity, Shanghai, China
| | - Witold R. Rudnicki
- Institute of Computer Science, University in Bialystok, Białystok, Poland
- Computational Centre, University of Bialystok, Białystok, Poland
|
11
|
AlRashid SZ, Dosh MH, Obaid AJ. Classification of the Senescence-Accelerated Mouse (SAM) Strains With Its Behaviour Using Deep Learning. INTERNATIONAL JOURNAL OF E-COLLABORATION 2022. [DOI: 10.4018/ijec.304035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Microarray technology is a novel method to monitor the expression levels of a huge number of genes simultaneously. This study aims at (1) identifying the most important genes in the molecular senescence of the hippocampus and retina, for which two models of accelerated neurological senescence (S10 and S8) were available, using feature selection to reduce the size of the high-dimensional data; the process of gene selection is thus twofold, removing the irrelevant genes and selecting the informative genes; and (2) determining the associations among these genes or pathways, which would give insight into the mechanism behind this phenotype and improve on the current imperfect state-of-the-art estimates. In this study, gene selection methods have been implemented, including Analysis of Variance (ANOVA). The results show that the CNN model achieves an accuracy of 0.98 on a subset of genes selected by the ANOVA method. Thus, the selected gene subset achieves better classification accuracy with less processing time.
|
12
|
Combination of Reduction Detection Using TOPSIS for Gene Expression Data Analysis. BIG DATA AND COGNITIVE COMPUTING 2022. [DOI: 10.3390/bdcc6010024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
In high-dimensional data analysis, Feature Selection (FS) is one of the most fundamental issues in machine learning and requires the attention of researchers. These datasets are characterized by a huge feature space, out of which only a few features are significant for analysis. Thus, extracting the significant features is crucial. There are various techniques available for feature selection; among them, the filter techniques are significant in this community, as they can be used with any type of learning algorithm, drastically lower the running time of optimization algorithms, and improve the performance of the model. However, the effectiveness of a filter approach depends on the characteristics of the dataset as well as on the machine learning model. To avoid these issues, this research considers a combination of feature reduction (CFR), designing a pipeline of filter approaches for high-dimensional microarray data classification. Considering four filter approaches, sixteen combinations of pipelines are generated. The feature subset is reduced at different levels, and ultimately, the significant feature set is evaluated. The pipelined filter techniques are Correlation-Based Feature Selection (CBFS), Chi-Square Test (CST), Information Gain (InG), and Relief Feature Selection (RFS), and the classification techniques are Decision Tree (DT), Logistic Regression (LR), Random Forest (RF), and k-Nearest Neighbor (k-NN). The performance of CFR depends highly on the datasets as well as on the classifiers. Thereafter, the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) method is used to rank all reduction combinations and identify the superior filter combination among them.
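The TOPSIS ranking step mentioned at the end of this abstract follows a standard recipe: normalize the decision matrix, apply weights, find the ideal and anti-ideal points, and score each alternative by its relative closeness to the ideal. The sketch below implements that textbook recipe in plain Python (3.8 or later, for `math.dist`). The criteria (accuracy as a benefit, number of retained features as a cost), the equal weights, and the three example rows are illustrative assumptions, not values from the paper.

```python
import math


def topsis(matrix, weights, benefit):
    """Return a closeness score in [0, 1] for each row (alternative).

    matrix  : rows are alternatives, columns are criteria
    weights : one weight per criterion
    benefit : True if larger is better for that criterion, False if smaller is
    """
    n_crit = len(weights)
    # Vector normalization per column (fall back to 1.0 for an all-zero column).
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]

    # Ideal (best) and anti-ideal (worst) value for every criterion.
    columns = list(zip(*weighted))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]

    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        total = d_best + d_worst
        # Relative closeness to the ideal solution; 0.0 if all rows coincide.
        scores.append(d_worst / total if total > 0 else 0.0)
    return scores


# Hypothetical filter pipelines scored on (accuracy, number of features kept).
matrix = [[0.9, 10], [0.8, 20], [0.7, 30]]
scores = topsis(matrix, weights=[0.5, 0.5], benefit=[True, False])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 2]
```

The first row wins on both criteria, so it coincides with the ideal point and scores 1.0, while the last row coincides with the anti-ideal point and scores 0.0.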
|
13
|
Wu Y, Guo Y, Ma J, Sa Y, Li Q, Zhang N. Research Progress of Gliomas in Machine Learning. Cells 2021; 10:cells10113169. [PMID: 34831392 PMCID: PMC8622230 DOI: 10.3390/cells10113169] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/04/2021] [Revised: 11/04/2021] [Accepted: 11/05/2021] [Indexed: 12/29/2022]
Abstract
In the field of glioma research, the broad availability of genetic and image information generated by computer technologies and the boom in biomedical publications have led to the advent of the big-data era. Machine learning methods have been applied as possible approaches to speed up the data mining process. In this article, we review the present situation and future directions of machine learning applications in gliomas within the context of workflows that integrate analysis for precision cancer care. Publicly available tools and algorithms for key machine learning technologies in literature mining for glioma clinical research are reviewed and compared. Further, the existing machine learning solutions and their limitations in glioma prediction and diagnostics, such as overfitting and class imbalance, are critically analyzed.
|
14
|
Larco A, Montenegro C, Yanez C, Luján-Mora S. An experience selecting quality features of apps for people with disabilities using abductive approach to explanatory theory generation. PeerJ Comput Sci 2021; 7:e595. [PMID: 34435092 PMCID: PMC8356648 DOI: 10.7717/peerj-cs.595] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2021] [Accepted: 05/24/2021] [Indexed: 06/13/2023]
Abstract
This study determines one of the most relevant quality factors of apps for people with disabilities, utilizing the abductive approach to the generation of an explanatory theory. First, the abductive approach was concerned with the description of the results, established by the apps' quality assessment using the Mobile App Rating Scale (MARS) tool. However, because of the restrictions of MARS outputs, the critical quality factors could not be identified, which required the search for a new rule. Finally, the explanation of the case (the last component of the abductive approach) served to test the rule's new hypothesis. This problem was solved by applying a new quantitative model combining data mining techniques, which identified the most relevant MARS quality items. Hence, this research defines a much-needed theoretical and practical tool for academics and practitioners alike. Academics can experiment with the abductive reasoning procedure as an alternative to achieve positivism in research. This study is a first attempt to improve the MARS tool, aiming to provide specialists with relevant data, reduce noise effects, and achieve better predictive results to enhance their investigations. Furthermore, it offers a concise quality assessment of disability-related apps.
Affiliation(s)
- Andres Larco
  - Departamento de Informática y Ciencias de la Computación, Escuela Politécnica Nacional, Quito, Pichincha, Ecuador
- Carlos Montenegro
  - Departamento de Informática y Ciencias de la Computación, Escuela Politécnica Nacional, Quito, Pichincha, Ecuador
- Cesar Yanez
  - Departamento de Informática y Ciencias de la Computación, Escuela Politécnica Nacional, Quito, Pichincha, Ecuador
- Sergio Luján-Mora
  - Department of Software and Computing Systems, Universidad de Alicante, Alicante, Spain
|
15
|
Mohammed M, Mwambi H, Mboya IB, Elbashir MK, Omolo B. A stacking ensemble deep learning approach to cancer type classification based on TCGA data. Sci Rep 2021; 11:15626. [PMID: 34341396 PMCID: PMC8329290 DOI: 10.1038/s41598-021-95128-x] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 11/25/2020] [Accepted: 07/19/2021] [Indexed: 12/13/2022] Open
Abstract
Cancer tumor classification based on morphological characteristics alone has been shown to have serious limitations. Breast, lung, colorectal, thyroid, and ovarian cancers are the most commonly diagnosed cancers among women. Precise classification of cancers into their types is considered a vital problem for cancer diagnosis and therapy. In this paper, we proposed a stacking ensemble deep learning model based on a one-dimensional convolutional neural network (1D-CNN) to perform multi-class classification of the five common cancers among women based on RNASeq data. The RNASeq gene expression data were downloaded from the Pan-Cancer Atlas using the GDCquery function of the TCGAbiolinks package in R. We used the least absolute shrinkage and selection operator (LASSO) as the feature selection method. We compared the results of the proposed model, with and without LASSO, with the results of the single 1D-CNN and of machine learning methods that include support vector machines with radial basis function, linear, and polynomial kernels (SVM-R, SVM-L, SVM-P); artificial neural networks (ANN); k-nearest neighbors (KNN); and bagging trees. The results show that the proposed model, with and without LASSO, performs better than the other classifiers. The results also show that the machine learning methods (SVM-R, SVM-L, SVM-P, ANN, KNN, and bagging trees) perform better with under-sampling than with over-sampling techniques. This is supported by the statistical significance test of accuracy, where the p-values for the differences between SVM-R and SVM-P, SVM-R and ANN, and SVM-R and KNN were p = 0.003, p < 0.001, and p < 0.001, respectively. SVM-L also differed significantly from ANN (p = 0.009). Moreover, SVM-P and ANN, and SVM-P and KNN, were significantly different (p < 0.001 in both cases). In addition, ANN and bagging trees, and ANN and KNN, were significantly different (p < 0.001 and p = 0.004, respectively). Thus, the proposed model can help in the early detection and diagnosis of cancer in women, and hence aid in designing early treatment strategies to improve survival.
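To illustrate how LASSO acts as a feature selection method, the following is a minimal sketch of cyclic coordinate descent with soft-thresholding. It is not the authors' pipeline: it assumes a squared-error loss with a continuous response (the paper addresses multi-class classification), centred inputs without an intercept, and a hypothetical toy design matrix. Features whose coefficients are shrunk to exactly zero are the ones LASSO discards.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything within [-lam, lam] becomes 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of sample rows and y a list of responses. No intercept is
    fitted, so the columns of X and y are assumed to be centred beforehand.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return beta


def selected_features(beta, tol=1e-12):
    """Indices of features that keep a non-zero coefficient."""
    return [j for j, b in enumerate(beta) if abs(b) > tol]


# Hypothetical toy data: three orthogonal, centred features; the response
# depends strongly on feature 0 and only very weakly on feature 2.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [2.05, -2.05, 1.95, -1.95]

beta = lasso_coordinate_descent(X, y, lam=0.1)
print(selected_features(beta))
```

With the penalty set to 0.1 in this toy setting, the weak contribution of feature 2 falls below the threshold and is removed, which is the behaviour that makes LASSO useful for reducing thousands of genes to a small subset.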
Affiliation(s)
- Mohanad Mohammed
  - School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, Private Bag X01, Scottsville, 3209, South Africa
- Henry Mwambi
  - School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, Private Bag X01, Scottsville, 3209, South Africa
- Innocent B Mboya
  - School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, Private Bag X01, Scottsville, 3209, South Africa
  - Department of Epidemiology and Biostatistics, Kilimanjaro Christian Medical University College (KCMUCo), P. O. Box 2240, Moshi, Tanzania
- Murtada K Elbashir
  - College of Computer and Information Sciences, Jouf University, Sakaka, 72441, Saudi Arabia
  - Faculty of Mathematical and Computer Sciences, University of Gezira, Wad Madani, 11123, Sudan
- Bernard Omolo
  - School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, Private Bag X01, Scottsville, 3209, South Africa
  - Division of Mathematics and Computer Science, University of South Carolina-Upstate, 800 University Way, Spartanburg, USA
  - School of Public Health, Faculty of Health Sciences, University of Witwatersrand, Johannesburg, South Africa
|
16
|
Källberg D, Vidman L, Rydén P. Comparison of Methods for Feature Selection in Clustering of High-Dimensional RNA-Sequencing Data to Identify Cancer Subtypes. Front Genet 2021; 12:632620. [PMID: 33719342 PMCID: PMC7943624 DOI: 10.3389/fgene.2021.632620] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/23/2020] [Accepted: 02/03/2021] [Indexed: 11/13/2022] Open
Abstract
Cancer subtype identification is important to facilitate cancer diagnosis and select effective treatments. Clustering of cancer patients based on high-dimensional RNA-sequencing data can be used to detect novel subtypes, but only a subset of the features (e.g., genes) contains information related to the cancer subtype. Therefore, it is reasonable to assume that the clustering should be based on a set of carefully selected features rather than all features. Several feature selection methods have been proposed, but how and when to use these methods is still poorly understood. Thirteen feature selection methods were evaluated on four human cancer data sets, all with known subtypes (gold standards), which were only used for evaluation. The methods were characterized by the mean expression and standard deviation (SD) of the selected genes, the overlap with other methods, and their clustering performance, obtained by comparing the clustering result with the gold standard using the adjusted Rand index (ARI). The results were compared to a supervised approach as a positive control and to two negative controls in which either a random selection of genes or all genes were included. For all data sets, the best feature selection approach outperformed the negative control, and for two data sets the gain was substantial, with ARI increasing from (-0.01, 0.39) to (0.66, 0.72), respectively. No feature selection method completely outperformed the others, but using the dip-test statistic to select 1000 genes was overall a good choice. The commonly used approach, where genes with the highest SDs are selected, did not perform well in our study.
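The adjusted Rand index used above to score a clustering against a gold standard can be computed directly from the contingency table of the two labelings. The sketch below is a generic standard-library implementation of the usual pair-counting formula, not code from the study; the label lists are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting adjusted Rand index between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two samples are required")

    # Contingency counts: how many samples share each (true, predicted) label pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical example: cluster names differ, but the partition is identical.
gold = ["A", "A", "B", "B"]
found = [1, 1, 0, 0]
print(adjusted_rand_index(gold, found))
```

Because the index is corrected for chance, a random assignment scores near 0 and values can be negative, which is why a baseline of -0.01 in the abstract corresponds to no better than random agreement.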
Affiliation(s)
- David Källberg
  - Department of Statistics, USBE, Umeå University, Umeå, Sweden
  - Department of Mathematics and Mathematical Statistics, Umeå University, Umeå, Sweden
- Linda Vidman
  - Department of Mathematics and Mathematical Statistics, Umeå University, Umeå, Sweden
  - Department of Radiation Sciences, Oncology, Umeå University, Umeå, Sweden
- Patrik Rydén
  - Department of Mathematics and Mathematical Statistics, Umeå University, Umeå, Sweden
|
17
|
Lu L, Townsend KA, Daigle BJ. GEOlimma: differential expression analysis and feature selection using pre-existing microarray data. BMC Bioinformatics 2021; 22:44. [PMID: 33535967 PMCID: PMC7860207 DOI: 10.1186/s12859-020-03932-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/05/2019] [Accepted: 12/11/2020] [Indexed: 12/14/2022] Open
Abstract
Background: Differential expression and feature selection analyses are essential steps for the development of accurate diagnostic/prognostic classifiers of complicated human diseases using transcriptomics data. These steps are particularly challenging due to the curse of dimensionality and the presence of technical and biological noise. A promising strategy for overcoming these challenges is the incorporation of pre-existing transcriptomics data in the identification of differentially expressed (DE) genes. This approach has the potential to improve the quality of selected genes, increase classification performance, and enhance biological interpretability. While a number of methods have been developed that use pre-existing data for differential expression analysis, existing methods do not leverage the identities of experimental conditions to create a robust metric for identifying DE genes.
Results: In this study, we propose a novel differential expression and feature selection method, GEOlimma, which combines pre-existing microarray data from the Gene Expression Omnibus (GEO) with the widely applied Limma method for differential expression analysis. We first quantify differential gene expression across 2481 pairwise comparisons from 602 curated GEO Datasets, and we convert differential expression frequencies to DE prior probabilities. Genes with high DE prior probabilities show enrichment in cell growth and death, signal transduction, and cancer-related biological pathways, while genes with low prior probabilities were enriched in sensory system pathways. We then applied GEOlimma to four differential expression comparisons within two human disease datasets and performed differential expression, feature selection, and supervised classification analyses. Our results suggest that use of GEOlimma provides greater experimental power to detect DE genes compared to Limma, due to its increased effective sample size. Furthermore, in a supervised classification analysis using GEOlimma as a feature selection method, we observed similar or better classification performance than Limma given small, noisy subsets of an asthma dataset.
Conclusions: Our results demonstrate that GEOlimma is a more effective method for differential gene expression and feature selection analyses compared to the standard Limma method. Due to its focus on gene-level differential expression, GEOlimma also has the potential to be applied to other high-throughput biological datasets.
Affiliation(s)
- Liangqun Lu
  - Department of Biological Sciences, University of Memphis, Memphis, USA
  - Department of Computer Science, University of Memphis, Memphis, USA
- Kevin A Townsend
  - Department of Computer Science, University of Memphis, Memphis, USA
- Bernie J Daigle
  - Department of Biological Sciences, University of Memphis, Memphis, USA
  - Department of Computer Science, University of Memphis, Memphis, USA
|
18
|
Memory based cuckoo search algorithm for feature selection of gene expression dataset. Informatics in Medicine Unlocked 2021. [DOI: 10.1016/j.imu.2021.100572] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 11/18/2022] Open
|
19
|
Pham TA, Tran VQ, Vu HLT, Ly HB. Design deep neural network architecture using a genetic algorithm for estimation of pile bearing capacity. PLoS One 2020; 15:e0243030. [PMID: 33332377 PMCID: PMC7746167 DOI: 10.1371/journal.pone.0243030] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 08/22/2020] [Accepted: 11/16/2020] [Indexed: 11/19/2022] Open
Abstract
Determination of pile bearing capacity is essential in pile foundation design. This study focused on the use of evolutionary algorithms to optimize a Deep Learning Neural Network (DLNN) algorithm for predicting the bearing capacity of driven piles. For this purpose, a Genetic Algorithm (GA) was developed to select the most significant features in the raw dataset. After that, a GA-DLNN hybrid model was developed to select optimal parameters for the DLNN model, including the network algorithm, the activation function for hidden neurons, the number of hidden layers, and the number of neurons in each hidden layer. A database containing 472 driven pile static load test reports was used. The dataset was divided into three parts, namely the training set (60%), validation set (20%), and testing set (20%), for the construction, validation, and testing phases of the proposed model, respectively. Various quality assessment criteria, namely the coefficient of determination (R²), Index of Agreement (IA), mean absolute error (MAE), and root mean squared error (RMSE), were used to evaluate the performance of the machine learning (ML) algorithms. The GA-DLNN hybrid model was shown to be able to find an optimal set of parameters for the prediction process. The results showed that the hybrid model using only the most critical features gave higher accuracy than the hybrid model using all input variables.
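The data split and the four assessment criteria named above can be sketched in a few lines. This is an illustrative sketch rather than the authors' code, and it rests on assumptions the abstract does not confirm: R² is taken as 1 - SSE/SST, IA as Willmott's index of agreement, and the split is a simple seeded shuffle. The function names and sample values are hypothetical.

```python
import random
from math import sqrt


def split_60_20_20(items, seed=0):
    """Shuffle and split into training (60%), validation (20%) and testing (rest)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.6 * len(shuffled))
    n_val = int(0.2 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def regression_metrics(observed, predicted):
    """Return R2, IA, MAE and RMSE for paired observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    pairs = list(zip(observed, predicted))
    sq_err = sum((o - p) ** 2 for o, p in pairs)
    abs_err = sum(abs(o - p) for o, p in pairs)
    total_var = sum((o - mean_obs) ** 2 for o in observed)
    # Willmott's "potential error": spread of predictions and observations around the observed mean.
    potential = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2 for o, p in pairs)
    return {
        "R2": 1 - sq_err / total_var,   # assumes the observations are not all identical
        "IA": 1 - sq_err / potential,
        "MAE": abs_err / n,
        "RMSE": sqrt(sq_err / n),
    }


# 472 records, matching the database size quoted in the abstract.
train, val, test = split_60_20_20(range(472))
print(len(train), len(val), len(test))

# Hypothetical observed and predicted values, only to exercise the metrics.
print(regression_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

Keeping the testing set untouched until the end is what allows the reported scores to be read as estimates of performance on unseen piles rather than as a measure of fit.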
Affiliation(s)
- Tuan Anh Pham
  - University of Transport Technology, Hanoi, Vietnam
- Hai-Bang Ly
  - University of Transport Technology, Hanoi, Vietnam
|
20
|
Ghosh SK, Ghosh A. A Novel Human Diabetes Biomarker Recognition Approach Using Fuzzy Rough Multigranulation Nearest Neighbour Classifier Model. Interdiscip Sci 2020; 12:461-475. [PMID: 32920773 DOI: 10.1007/s12539-020-00391-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/09/2020] [Revised: 08/22/2020] [Accepted: 08/31/2020] [Indexed: 10/23/2022]
Abstract
The selection of gene identifiers from microarray databases is a challenging task, since a microarray contains a large number of gene attributes for only a few samples. This article proposes a novel fuzzy-rough set-based method for selecting gene expression features using the fuzzy-rough reduct under multi-granular space for human diabetes patients. First, the fuzzy multi-granular gain is computed from the expression datasets via fuzzy entropy, which reduces the dimension of the database. Thereafter, the features are selected from the microarray using the fuzzy-rough reduct and information gain with respect to their expression patterns. To reduce the computational cost, a decision-making scheme has been designed using a rough approximation of a fuzzy concept within the multi-granulation framework. Finally, we have recognized the association among the genomes that have expressively different expression patterns from controlled state to the diabetic state with respect to their impression, using a modified fuzzy-rough nearest neighbour classifier (FRNNC). Five standard diabetic microarray datasets were used to quantify the efficiency of the designed FRNNC model, which was validated with the F-measure using the NCBI diabetes gene expression database and outperformed existing methods.
Affiliation(s)
- Swarup Kr Ghosh
  - Department of Computer Science and Engineering, Sister Nivedita University, Kolkata, India
- Anupam Ghosh
  - Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
|
21
|
Pan YB, Zhu Y, Zhang QW, Zhang CH, Shao A, Zhang J. Prognostic and Predictive Value of a Long Non-coding RNA Signature in Glioma: A lncRNA Expression Analysis. Front Oncol 2020; 10:1057. [PMID: 32793467 PMCID: PMC7394186 DOI: 10.3389/fonc.2020.01057] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 03/07/2020] [Accepted: 05/27/2020] [Indexed: 01/16/2023] Open
Abstract
The current histologically based grading system for glioma does not accurately predict which patients will have better outcomes or benefit from adjuvant chemotherapy. We proposed that combining the expression profiles of multiple long non-coding RNAs (lncRNAs) into a single model could improve prediction accuracy. We included 1,094 glioma patients from three different datasets. Using the least absolute shrinkage and selection operator (LASSO) Cox regression model, we built a multiple-lncRNA-based classifier on the basis of a training set. The predictive and prognostic accuracy of the classifier was validated using an internal test set and two external independent sets. Using this classifier, we classified patients in the training set into high- or low-risk groups with significantly different overall survival (OS, HR = 8.42, 95% CI = 4.99–14.2, p < 0.0001). The prognostic power of the classifier was then assessed in the other sets. The classifier was an independent prognostic factor and had better prognostic value than clinicopathological risk factors. The patients in the high-risk group were found to have a favorable response to adjuvant chemotherapy (HR = 0.4, 95% CI = 0.25–0.64, p < 0.0001). We built a nomogram that integrated the 10-lncRNA-based classifier and four clinicopathological risk factors to predict 3 and 5 year OS. Gene set variation analysis (GSVA) showed that pathways related to tumorigenesis, undifferentiated cancer, and epithelial–mesenchymal transition were enriched in the high-risk groups. Our classifier built on 10-lncRNAs is a reliable prognostic and predictive tool for OS in glioma patients and could predict which patients would benefit from adjuvant chemotherapy.
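A signature of this kind is typically applied by computing a per-patient risk score as a weighted sum of expression values and then splitting patients into high- and low-risk groups. The sketch below shows that mechanism only. The lncRNA names, coefficients, and expression values are hypothetical, and the median cutoff is an assumption, since the abstract does not state how the threshold was chosen.

```python
from statistics import median

# Hypothetical LASSO Cox coefficients (illustrative values, not taken from the paper).
COEFFICIENTS = {"lncRNA_A": 0.8, "lncRNA_B": -0.4}


def risk_score(expression, coefficients):
    """Linear predictor: sum of coefficient * expression over the signature genes."""
    return sum(weight * expression[gene] for gene, weight in coefficients.items())


def stratify_by_median(scores):
    """Label each patient 'high' if the score exceeds the cohort median, else 'low'."""
    cutoff = median(scores.values())
    return {pid: ("high" if score > cutoff else "low") for pid, score in scores.items()}


# Hypothetical expression profiles for three patients.
patients = {
    "P1": {"lncRNA_A": 2.0, "lncRNA_B": 1.0},
    "P2": {"lncRNA_A": 0.5, "lncRNA_B": 3.0},
    "P3": {"lncRNA_A": 1.0, "lncRNA_B": 1.0},
}

scores = {pid: risk_score(expr, COEFFICIENTS) for pid, expr in patients.items()}
groups = stratify_by_median(scores)
print(groups)
```

In a validation setting, the coefficients and the cutoff would be fixed on the training set and applied unchanged to the test sets, so that the hazard ratios reported for those sets are not inflated by re-fitting.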
Affiliation(s)
- Yuan-Bo Pan
  - Department of Neurosurgery, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
- Yiming Zhu
  - Department of General Surgery, Shanghai Ninth People's Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China
- Qing-Wei Zhang
  - Division of Gastroenterology and Hepatology, Key Laboratory of Gastroenterology and Hepatology, Ministry of Health, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Institute of Digestive Disease, Shanghai Jiao Tong University, Shanghai, China
- Chi-Hao Zhang
  - Department of General Surgery, Shanghai Ninth People's Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China
- Anwen Shao
  - Department of Neurosurgery, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
- Jianmin Zhang
  - Department of Neurosurgery, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
  - Brain Research Institute, Zhejiang University, Hangzhou, China
  - Collaborative Innovation Center for Brain Science, Zhejiang University, Hangzhou, China
|
22
|
Klén R, Karhunen M, Elo LL. Likelihood contrasts: a machine learning algorithm for binary classification of longitudinal data. Sci Rep 2020; 10:1016. [PMID: 31974488 PMCID: PMC6978422 DOI: 10.1038/s41598-020-57924-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 06/13/2019] [Accepted: 12/31/2019] [Indexed: 12/02/2022] Open
Abstract
Machine learning methods have gained popularity in biomedical research in recent years. However, very few of them support the analysis of longitudinal data, where several samples are collected from an individual over time. Additionally, most of the available longitudinal machine learning methods assume that the measurements are aligned in time, which is often not the case in real data. Here, we introduce a robust longitudinal machine learning method, named likelihood contrasts (LC), which supports study designs with unaligned time points. Our LC method is a binary classifier that uses linear mixed models for modelling and log-likelihood for decision making. To demonstrate the benefits of our approach, we compared it with existing methods in four simulated and three real data sets. In each simulated data set, LC was the most accurate method, while the real data sets further supported the robust performance of the method. LC is also computationally efficient and easy to use.
Affiliation(s)
- Riku Klén
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland
  - Turku PET Centre, University of Turku, Turku, Finland
- Markku Karhunen
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland
- Laura L Elo
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland
|
23
|
Leclercq M, Vittrant B, Martin-Magniette ML, Scott Boyer MP, Perin O, Bergeron A, Fradet Y, Droit A. Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data. Front Genet 2019; 10:452. [PMID: 31156708 PMCID: PMC6532608 DOI: 10.3389/fgene.2019.00452] [Citation(s) in RCA: 63] [Impact Index Per Article: 10.5] [Received: 01/23/2019] [Accepted: 04/30/2019] [Indexed: 12/11/2022] Open
Abstract
The identification of biomarker signatures in omics molecular profiling is usually performed to predict outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML exploits various feature selection procedures to produce signatures associated with machine learning models that efficiently predict a specified outcome. To this end, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software has been implemented to automate all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML.
Affiliation(s)
- Mickael Leclercq
  - Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada
  - Département de Médecine Moléculaire, Université Laval, Québec City, QC, Canada
- Benjamin Vittrant
  - Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada
  - Département de Médecine Moléculaire, Université Laval, Québec City, QC, Canada
- Marie Laure Martin-Magniette
  - Institute of Plant Sciences Paris Saclay IPS2, CNRS, INRA, Université Paris-Sud, Université Evry, Université Paris-Saclay, Paris Diderot, Sorbonne Paris-Cité, Orsay, France
  - UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, Paris, France
- Marie Pier Scott Boyer
  - Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada
  - Département de Médecine Moléculaire, Université Laval, Québec City, QC, Canada
- Olivier Perin
  - Digital Sciences Department, L'Oréal Advanced Research, Aulnay-sous-bois, France
- Alain Bergeron
  - Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada
  - Département de Chirurgie, Oncology Axis, Université Laval, Québec City, QC, Canada
- Yves Fradet
  - Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada
  - Département de Chirurgie, Oncology Axis, Université Laval, Québec City, QC, Canada
- Arnaud Droit
  - Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada
  - Département de Médecine Moléculaire, Université Laval, Québec City, QC, Canada
|
24
|
Patil S, Naik G, Pai R, Gad R. Stacked Autoencoder for classification of glioma grade III and grade IV. Biomed Signal Process Control 2018. [DOI: 10.1016/j.bspc.2018.07.002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 01/15/2023]
|
25
|
Lopez R, Wang R, Seelig G. A molecular multi-gene classifier for disease diagnostics. Nat Chem 2018; 10:746-754. [PMID: 29713032 DOI: 10.1038/s41557-018-0056-1] [Citation(s) in RCA: 101] [Impact Index Per Article: 14.4] [Received: 07/21/2017] [Accepted: 03/29/2018] [Indexed: 11/09/2022]
Abstract
Despite its early promise as a diagnostic and prognostic tool, gene expression profiling remains cost-prohibitive and challenging to implement in a clinical setting. Here, we introduce a molecular computation strategy for analysing the information contained in complex gene expression signatures without the need for costly instrumentation. Our workflow begins by training a computational classifier on labelled gene expression data. This in silico classifier is then realized at the molecular level to enable expression analysis and classification of previously uncharacterized samples. Classification occurs through a series of molecular interactions between RNA inputs and engineered DNA probes designed to differentially weigh each input according to its importance. We validate our technology with two applications: a classifier for early cancer diagnostics and a classifier for differentiating viral and bacterial respiratory infections based on host gene expression. Together, our results demonstrate a general and modular framework for low-cost gene expression analysis.
Affiliation(s)
- Randolph Lopez
  - Department of Bioengineering, University of Washington, Seattle, WA, USA
  - Molecular Engineering & Sciences Institute, University of Washington, Seattle, WA, USA
- Ruofan Wang
  - Department of Biology, University of Washington, Seattle, WA, USA
  - Department of Microbiology, University of Washington, Seattle, WA, USA
- Georg Seelig
  - Molecular Engineering & Sciences Institute, University of Washington, Seattle, WA, USA
  - Department of Electrical Engineering, University of Washington, Seattle, WA, USA
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
|
26
|
|
27
|
Development of a two-stage gene selection method that incorporates a novel hybrid approach using the cuckoo optimization algorithm and harmony search for cancer classification. J Biomed Inform 2017; 67:11-20. [DOI: 10.1016/j.jbi.2017.01.016] [Citation(s) in RCA: 56] [Impact Index Per Article: 7.0] [Received: 04/08/2016] [Revised: 01/24/2017] [Accepted: 01/31/2017] [Indexed: 12/24/2022]
|
28
|
Dessì N, Pes B, Cannas LM. An Evolutionary Approach for Balancing Effectiveness and Representation Level in Gene Selection. Journal of Information Technology Research 2015. [DOI: 10.4018/jitr.2015040102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
As data mining develops and expands to new application areas, feature selection also reveals various aspects to be considered. This paper underlines two aspects that seem to categorize the large body of available feature selection algorithms: effectiveness and representation level. Effectiveness deals with selecting the minimum set of variables that maximizes the accuracy of a classifier, while the representation level concerns discovering how relevant the variables are for the domain of interest. To balance these aspects, the paper proposes an evolutionary framework for feature selection that expresses a hybrid method organized in layers, each of which exploits a specific search strategy. Extensive experiments on gene selection from DNA-microarray datasets are presented and discussed. Results indicate that the framework compares well with different hybrid methods proposed in the literature, as it is capable of finding well-suited subsets of informative features while improving classification accuracy.
Affiliation(s)
- Nicoletta Dessì
  - Department of Mathematics and Computer Science, Università degli Studi di Cagliari, Cagliari, Italy
- Barbara Pes
  - Department of Mathematics and Computer Science, Università degli Studi di Cagliari, Cagliari, Italy
- Laura Maria Cannas
  - Department of Mathematics and Computer Science, Università degli Studi di Cagliari, Cagliari, Italy
|