1
|
Xie W, Wang L, Yu K, Shi T, Li W. Improved multi-layer binary firefly algorithm for optimizing feature selection and classification of microarray data. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2022.104080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
2
|
Bhandari N, Walambe R, Kotecha K, Khare SP. A comprehensive survey on computational learning methods for analysis of gene expression data. Front Mol Biosci 2022; 9:907150. [PMID: 36458095 PMCID: PMC9706412 DOI: 10.3389/fmolb.2022.907150] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/29/2022] [Accepted: 09/28/2022] [Indexed: 09/19/2023] Open
Abstract
Computational analysis methods including machine learning have a significant impact in the fields of genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods are used for comparative analysis of gene expression data. However, more complex analysis for classification of sample observations, or discovery of feature genes requires sophisticated computational approaches. In this review, we compile various statistical and computational tools used in analysis of expression microarray data. Even though the methods are discussed in the context of expression microarrays, they can also be applied for the analysis of RNA sequencing and quantitative proteomics datasets. We discuss the types of missing values, and the methods and approaches usually employed in their imputation. We also discuss methods of data normalization, feature selection, and feature extraction. Lastly, methods of classification and class discovery along with their evaluation parameters are described in detail. We believe that this detailed review will help the users to select appropriate methods for preprocessing and analysis of their data based on the expected outcome.
Affiliation(s)
- Nikita Bhandari
- Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
| | - Rahee Walambe
- Electronics and Telecommunication Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
- Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India
| | - Ketan Kotecha
- Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
- Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India
| | - Satyajeet P. Khare
- Symbiosis School of Biological Sciences, Symbiosis International (Deemed University), Pune, India
| |
|
3
|
Synthesis of Silver Nano Particles Using Myricetin and the In-Vitro Assessment of Anti-Colorectal Cancer Activity: In-Silico Integration. Int J Mol Sci 2022; 23:ijms231911024. [PMID: 36232319 PMCID: PMC9570303 DOI: 10.3390/ijms231911024] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/07/2022] [Revised: 08/29/2022] [Accepted: 09/15/2022] [Indexed: 12/24/2022] Open
Abstract
The creation of novel anticancer treatments for a variety of human illnesses, including different malignancies and dangerous microbes, also potentially depends on nanoparticles including silver. Recently, metal nanoparticles have been successfully synthesized biologically using plant extracts. The natural flavonoid 3,3′,4′,5,5′,7-hexahydroxyflavone (myricetin) has anticancer properties. Not much is known about the regulatory effects of myricetin on the possible cell fate-determination mechanisms (such as apoptosis/proliferation) in colorectal cancer, because the majority of investigations related to the anticancer activity of myricetin have predominantly focused on the enhancement of tumor cell uncontrolled growth (i.e., apoptosis). Thus, we decided to explore the potential myricetin interactors and the associated biological functions by using an in-silico approach. Then, we focused on the main goal of the work, which involved the synthesis of silver nanoparticles and the labeling of myricetin with them. The synthesized silver nanoparticles were examined using UV-visible spectroscopy, dynamic light scattering spectroscopy, Fourier transform infrared spectroscopy, and scanning electron microscopy. In this study, we investigated the effects of myricetin on colorectal cancer, using numerous techniques to show myricetin’s effect on colon cancer cells. Transmission electron microscopy was employed to monitor morphological changes. Furthermore, we combined the results of the colorectal cancer gene expression dataset with those of the myricetin interactors and pathways. Based on the results, we conclude that myricetin is able to efficiently kill human colorectal cancer cell lines. Since it shares important biological roles and possible route components, myricetin may be a promising herbal treatment for colorectal cancer, as per an in-silico analysis of the TCGA dataset.
|
4
|
Murphy RG, Gilmore A, Senevirathne S, O'Reilly PG, LaBonte Wilson M, Jain S, McArt DG. Particle Swarm Optimization Artificial Intelligence technique for gene signature discovery in transcriptomic cohorts. Comput Struct Biotechnol J 2022; 20:5547-5563. [PMID: 36249564 PMCID: PMC9556859 DOI: 10.1016/j.csbj.2022.09.033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2022] [Revised: 09/22/2022] [Accepted: 09/22/2022] [Indexed: 11/12/2022] Open
Abstract
EBPSO identifies unique, accurate, and succinct gene signatures. Key genes within the signatures provide biological insights into their associated functions. A web-based micro-framework was developed for ease of use and real-time visualizations. EBPSO is a promising alternative to traditional single gene signature generation. Downstream analysis will help move these signatures towards clinical translation.
The development of gene signatures is key for delivering personalized medicine, despite only a few signatures being available for use in the clinic for cancer patients. Gene signature discovery tends to revolve around identifying a single signature. However, it has been shown that various highly predictive signatures can be produced from the same dataset. This study assumes that the presentation of top ranked signatures will allow greater efforts in the selection of gene signatures for validation on external datasets and for their clinical translation. Particle swarm optimization (PSO) is an evolutionary algorithm often used as a search strategy and largely represented as binary PSO (BPSO) in this domain. BPSO, however, fails to produce succinct feature sets for complex optimization problems, thus affecting its overall runtime and optimization performance. Enhanced BPSO (EBPSO) was developed to overcome these shortcomings. Thus, this study will validate unique candidate gene signatures for different underlying biology from EBPSO on transcriptomics cohorts. EBPSO was consistently seen to be as accurate as BPSO with substantially smaller feature signatures and significantly faster runtimes. 100% accuracy was achieved in all but two of the selected data sets. Using clinical transcriptomics cohorts, EBPSO has demonstrated the ability to identify accurate, succinct, and significantly prognostic signatures that are unique from one another. This has been proposed as a promising alternative to overcome the issues regarding traditional single gene signature generation. Interpretation of key genes within the signatures provided biological insights into the associated functions that were well correlated to their cancer type.
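For readers unfamiliar with the baseline that the abstract above compares against, the following is a minimal sketch of one update step of standard (textbook) binary PSO with a sigmoid transfer function. It is not the authors' EBPSO, whose modifications the abstract does not describe; the function names and the parameter values (`inertia`, `c1`, `c2`, `v_max`) are illustrative assumptions.

```python
import math
import random


def sigmoid(v):
    """Transfer function mapping a velocity to a bit-set probability."""
    return 1.0 / (1.0 + math.exp(-v))


def bpso_step(position, velocity, personal_best, global_best, rng,
              inertia=0.7, c1=1.5, c2=1.5, v_max=4.0):
    """Advance one particle by one iteration of standard binary PSO.

    position, personal_best, global_best: lists of 0/1 flags, one per gene.
    velocity: list of floats of the same length.
    Returns the new (position, velocity) pair.
    Parameter defaults are illustrative assumptions, not values from the paper.
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()
        # Pull the velocity towards the personal and global best bits.
        v = inertia * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        # Clamp so the sigmoid stays away from exactly 0 or 1.
        v = max(-v_max, min(v_max, v))
        new_velocity.append(v)
        # A gene is selected (bit = 1) with probability sigmoid(v).
        new_position.append(1 if rng.random() < sigmoid(v) else 0)
    return new_position, new_velocity


rng = random.Random(0)
pos, vel = bpso_step([0, 1, 0, 1], [0.0] * 4, [1, 1, 0, 0], [1, 0, 0, 1], rng)
```

Each bit marks whether a gene is in the candidate signature. A full search would evaluate a fitness function (for example, classifier accuracy on the selected genes) after every step and update the personal and global bests accordingly.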
|
5
|
A novel biomarker selection method combining graph neural network and gene relationships applied to microarray data. BMC Bioinformatics 2022; 23:303. [PMID: 35883022 PMCID: PMC9327232 DOI: 10.1186/s12859-022-04848-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/14/2022] [Accepted: 07/15/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The discovery of critical biomarkers is significant for clinical diagnosis and for drug research and development. Researchers usually obtain biomarkers from microarray data, which suffers from the curse of dimensionality. Feature selection in machine learning is usually used to solve this problem. However, most methods do not fully consider feature dependence, especially the real pathway relationships of genes. RESULTS Experimental results show that the proposed method is superior to classical algorithms and advanced methods in feature number and accuracy, and the selected features have more significance. METHOD This paper proposes a feature selection method based on a graph neural network. The proposed method uses the actual dependencies between features and the Pearson correlation coefficient to construct graph-structured data. Information dissemination and aggregation operations based on the graph neural network are applied to fuse node information on the graph-structured data. The redundant features are clustered by the spectral clustering method. Then, a feature ranking aggregation model using eight feature evaluation methods acts on each sub-cluster for different feature selection. CONCLUSION The proposed method can effectively remove redundant features. The algorithm's output has high stability and classification accuracy, and it can potentially select biomarkers.
|
6
|
You Y, Lai X, Pan Y, Zheng H, Vera J, Liu S, Deng S, Zhang L. Artificial intelligence in cancer target identification and drug discovery. Signal Transduct Target Ther 2022; 7:156. [PMID: 35538061 PMCID: PMC9090746 DOI: 10.1038/s41392-022-00994-0] [Citation(s) in RCA: 123] [Impact Index Per Article: 41.0] [Received: 07/04/2021] [Revised: 03/14/2022] [Accepted: 04/05/2022] [Indexed: 02/08/2023] Open
Abstract
Artificial intelligence is an advanced method to identify novel anticancer targets and discover novel drugs from biology networks because the networks can effectively preserve and quantify the interaction between components of cell systems underlying human diseases such as cancer. Here, we review and discuss how to employ artificial intelligence approaches to identify novel anticancer targets and discover drugs. First, we describe the scope of artificial intelligence biology analysis for novel anticancer target investigations. Second, we review and discuss the basic principles and theory of commonly used network-based and machine learning-based artificial intelligence algorithms. Finally, we showcase the applications of artificial intelligence approaches in cancer target identification and drug discovery. Taken together, the artificial intelligence models have provided us with a quantitative framework to study the relationship between network characteristics and cancer, thereby leading to the identification of potential anticancer targets and the discovery of novel drug candidates.
Affiliation(s)
- Yujie You
- College of Computer Science, Sichuan University, Chengdu, 610065, China
| | - Xin Lai
- Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Universitätsklinikum Erlangen, Erlangen, 91052, Germany
| | - Yi Pan
- Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Room D513, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen, 518055, China
| | - Huiru Zheng
- School of Computing, Ulster University, Belfast, BT15 1ED, UK
| | - Julio Vera
- Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Universitätsklinikum Erlangen, Erlangen, 91052, Germany
| | - Suran Liu
- College of Computer Science, Sichuan University, Chengdu, 610065, China
| | - Senyi Deng
- Institute of Thoracic Oncology, Department of Thoracic Surgery, West China Hospital, Sichuan University, Chengdu, 610065, China.
| | - Le Zhang
- College of Computer Science, Sichuan University, Chengdu, 610065, China.
- Key Laboratory of Systems Biology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou, 310024, China.
- Key Laboratory of Systems Health Science of Zhejiang Province, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, 310024, China.
| |
|
7
|
Pashaei E, Pashaei E. An efficient binary chimp optimization algorithm for feature selection in biomedical data classification. Neural Comput Appl 2022. [DOI: 10.1007/s00521-021-06775-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 01/15/2023]
|
8
|
Yu K, Xie W, Wang L, Zhang S, Li W. Determination of biomarkers from microarray data using graph neural network and spectral clustering. Sci Rep 2021; 11:23828. [PMID: 34903818 PMCID: PMC8668890 DOI: 10.1038/s41598-021-03316-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/15/2021] [Accepted: 12/02/2021] [Indexed: 11/26/2022] Open
Abstract
In bioinformatics, the rapid development of gene sequencing technology has produced an increasing amount of microarray data. This type of data shares the typical characteristics of small sample size and high feature dimensions. Searching microarray data for biomarkers, which capture the expression features of various diseases, is essential for disease classification. Feature selection, which aims to remove irrelevant and redundant features, has therefore become fundamental for the analysis of microarray data. There are a large number of redundant and irrelevant features in microarray data, which severely degrade classification effectiveness. We propose an innovative feature selection method with the goal of obtaining feature dependencies from a priori knowledge and removing redundant features using spectral clustering. In this paper, the graph structure is first constructed by using the gene interaction network as a priori knowledge, and then a link prediction method based on a graph neural network is proposed to enhance the graph-structured data. Finally, a feature selection method based on spectral clustering is proposed to determine biomarkers. The classification accuracy on DLBCL and Prostate can be improved by 10.90% and 16.22% compared to traditional methods. Link prediction provides an average classification accuracy improvement of 1.96% and 1.31%, and is up to 16.98% higher than the published method. The results show that the proposed method can make full use of a priori knowledge to effectively select disease prediction biomarkers with high classification accuracy.
Affiliation(s)
- Kun Yu
- College of Medicine and Bioinformation Engineering, Northeastern University, Shenyang, China
| | - Weidong Xie
- School of Computer Science and Engineering, Northeastern University, Shenyang, China
| | - Linjie Wang
- School of Computer Science and Engineering, Northeastern University, Shenyang, China
| | - Shoujia Zhang
- School of Computer Science and Engineering, Northeastern University, Shenyang, China
| | - Wei Li
- Key Laboratory of Intelligent Computing in Medical Image (MIIC), Northeastern University, Ministry of Education, Shenyang, China
| |
|
9
|
Bajrai LH, Sohrab SS, Alandijany TA, Mobashir M, Reyaz M, Kamal MA, Firoz A, Parveen S, Azhar EI. Gene Expression Profiling of Early Acute Febrile Stage of Dengue Infection and Its Comparative Analysis With Streptococcus pneumoniae Infection. Front Cell Infect Microbiol 2021; 11:707905. [PMID: 34778101 PMCID: PMC8581568 DOI: 10.3389/fcimb.2021.707905] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 05/10/2021] [Accepted: 09/30/2021] [Indexed: 02/05/2023] Open
Abstract
Infectious diseases are disorders caused by organisms such as bacteria, viruses, fungi, or parasites. Although many of them are permanently hazardous, a number of them live in and on our bodies and are normally harmless or even helpful. Under certain circumstances, some organisms may cause diseases, and these infectious diseases may be passed directly from person to person or via intermediate vectors including insects and other animals. Dengue virus and Streptococcus pneumoniae are critical and common sources of infectious diseases. So, it is critical to understand the gene expression profiles and their inferred functions by comparing normal and infected conditions. Here, we have analyzed the gene expression profiles for dengue hemorrhagic fever, dengue fever, and normal human datasets. Similarly, Streptococcus pneumoniae infection data were analyzed, and the two outcomes were compared. Our study leads to the conclusion that dengue hemorrhagic fever arises as a result of potential changes in the gene expression pattern, and the inferred functions obviously belong to the immune system, but there are also some additional potential pathways which are critical signaling pathways. In the case of pneumoniae infection, 19 pathways were enriched; almost all of these pathways are associated with the immune system, and 17 of the enriched pathways were shared with dengue infection, the exceptions being platelet activation and antigen processing and presentation.
In terms of the comparative study between dengue virus and Streptococcus pneumoniae infection, we conclude that cell adhesion molecules (CAMs), MAPK signaling pathway, natural killer cell mediated cytotoxicity, regulation of actin cytoskeleton, and cytokine-cytokine receptor interaction are commonly enriched in all three cases of dengue infection and in Streptococcus pneumoniae infection; focal adhesion was enriched in classical dengue fever vs. dengue hemorrhagic fever, dengue hemorrhagic fever vs. normal samples, and Streptococcus pneumoniae infection; and antigen processing and presentation and leukocyte transendothelial migration were enriched in classical dengue fever vs. normal samples, dengue hemorrhagic fever vs. normal samples, and Streptococcus pneumoniae infection.
Affiliation(s)
- Leena H Bajrai
- Special Infectious Agents Unit - BSL-3, King Fahd Medical Research Centre, King Abdulaziz University, Jeddah, Saudi Arabia; Biochemistry Department, Faculty of Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Sayed S Sohrab
- Special Infectious Agents Unit - BSL-3, King Fahd Medical Research Centre, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Thamir A Alandijany
- Special Infectious Agents Unit - BSL-3, King Fahd Medical Research Centre, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Mohammad Mobashir
- SciLifeLab, Department of Oncology and Pathology, Karolinska Institutet, Stockholm, Sweden
| | - Muddassir Reyaz
- Department of Healthcare Management, Jamia Hamdard, Hamdard Nagar, New Delhi, India
| | - Mohammad A Kamal
- West China School of Nursing/Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China; King Fahd Medical Research Center, King Abdulaziz University, Jeddah, Saudi Arabia; Enzymoics, Novel Global Community Educational Foundation, Hebersham, NSW, Australia
| | - Ahmad Firoz
- Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia
| | - Shabana Parveen
- Department of Bioscience, Jamia Millia Islamia, New Delhi, India
| | - Esam I Azhar
- Special Infectious Agents Unit - BSL-3, King Fahd Medical Research Centre, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
| |
|
10
|
Yu K, Xie W, Wang L, Li W. ILRC: a hybrid biomarker discovery algorithm based on improved L1 regularization and clustering in microarray data. BMC Bioinformatics 2021; 22:514. [PMID: 34686127 PMCID: PMC8532312 DOI: 10.1186/s12859-021-04443-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/04/2021] [Accepted: 10/13/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Finding significant genes or proteins from gene chip data for disease diagnosis and drug development is an important task. However, the challenge comes from the curse of dimensionality. It is of great significance to use machine learning methods to find important features in the data and build an accurate classification model. RESULTS The proposed method has proved superior to the published advanced hybrid feature selection method and the traditional feature selection method on different public microarray data sets. In addition, the biomarkers selected using our method match those provided by the cooperative hospital in a set of clinical cleft lip and palate data. METHOD In this paper, a feature selection algorithm, ILRC, based on clustering and improved L1 regularization is proposed. The features are first clustered, and the redundant features in the sub-clusters are deleted. Then all the remaining features are iteratively evaluated using ILR. The final result is given according to the cumulative weight reordering. CONCLUSION The proposed method can effectively remove redundant features. The algorithm's output has high stability and classification accuracy, and it can potentially select biomarkers.
Affiliation(s)
- Kun Yu
- College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, China
| | - Weidong Xie
- School of Computer Science and Engineering, Northeastern University, Shenyang, China
| | - Linjie Wang
- School of Computer Science and Engineering, Northeastern University, Shenyang, China
| | - Wei Li
- Key Laboratory of Intelligent Computing in Medical Image (MIIC), Northeastern University, Ministry of Education, Shenyang, China
| |
|
11
|
Zhong L, Meng Q, Chen Y, Du L, Wu P. A laminar augmented cascading flexible neural forest model for classification of cancer subtypes based on gene expression data. BMC Bioinformatics 2021; 22:475. [PMID: 34600466 PMCID: PMC8487515 DOI: 10.1186/s12859-021-04391-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/29/2021] [Accepted: 09/22/2021] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND Correctly classifying the subtypes of cancer is of great significance for the in-depth study of cancer pathogenesis and the realization of personalized treatment for cancer patients. In recent years, classification of cancer subtypes using deep neural networks and gene expression data has gradually become a research hotspot. However, most classifiers may face overfitting and low classification accuracy when dealing with small sample size and high-dimensional biology data. RESULTS In this paper, a laminar augmented cascading flexible neural forest (LACFNForest) model was proposed to complete the classification of cancer subtypes. This model is a cascading flexible neural forest using deep flexible neural forest (DFNForest) as the base classifier. A hierarchical broadening ensemble method was proposed, which ensures the robustness of classification results and avoids the waste of model structure and function as much as possible. We also introduced an output judgment mechanism to each layer of the forest to reduce the computational complexity of the model. The deep neural forest was extended to the densely connected deep neural forest to improve the prediction results. The experiments on RNA-seq gene expression data showed that LACFNForest has better performance in the classification of cancer subtypes compared to the conventional methods. CONCLUSION The LACFNForest model effectively improves the accuracy of cancer subtype classification with good robustness. It provides a new approach for the ensemble learning of classifiers in terms of structural design.
Affiliation(s)
- Lianxin Zhong
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
| | - Qingfang Meng
- School of Information Science and Engineering, University of Jinan, Jinan, China.
- Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China.
| | - Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
| | - Lei Du
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
| | - Peng Wu
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key laboratory of Network Based Intelligent Computing, Jinan, 250022, China
| |
|
12
|
Zheng X, Zhang C. Gene selection for microarray data classification via dual latent representation learning. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.07.047] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/26/2022]
|
13
|
Eldakhakhny BM, Al Sadoun H, Choudhry H, Mobashir M. In-Silico Study of Immune System Associated Genes in Case of Type-2 Diabetes With Insulin Action and Resistance, and/or Obesity. Front Endocrinol (Lausanne) 2021; 12:641888. [PMID: 33927693 PMCID: PMC8078136 DOI: 10.3389/fendo.2021.641888] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 12/15/2020] [Accepted: 03/08/2021] [Indexed: 12/11/2022] Open
Abstract
Type-2 diabetes and obesity are among the leading and most frequent human diseases and are highly complex and heterogeneous in terms of diagnostic and therapeutic approaches. Based on epidemiological evidence, it is known that patients suffering from obesity are at a significantly higher risk of type-2 diabetes. Several pieces of evidence support the hypothesis that these diseases are interlinked and that obesity may aggravate the risk(s) of type-2 diabetes. Multi-level unwanted alterations such as (epi-)genetic alterations, changes at the transcriptional level, and altered signaling pathways (receptor, cytoplasmic, and nuclear level) are the major sources that promote several complex diseases, and such a heterogeneous level of complexity is considered a major barrier in the development of therapeutics. With so many known challenges, it is critical to understand the relationships and the shared causes between type-2 diabetes and obesity, which are difficult to unravel. For this purpose, we have selected publicly available gene expression datasets for obesity and type-2 diabetes, have unraveled the genes and the pathways associated with the immune system, and have also focused on the T-cell signaling pathway and its components. We have applied a simplified computational approach to understanding differential gene expression patterns and the enriched pathways for obesity and type-2 diabetes. Furthermore, we have also analyzed genes at the network level. In the analysis, we observe that few genes are commonly differentially expressed, while a comparatively higher number of pathways are shared between the two diseases. There are only 4 pathways associated with the immune system in the case of obesity and 10 immune-associated pathways in the case of type-2 diabetes, and, among them, only 2 pathways are commonly altered.
Furthermore, we have presented SPNS1, PTPN6, CD247, FOS, and PIK3R5 as the overexpressed genes, which are the direct components of TCR signaling.
Affiliation(s)
- Basmah Medhat Eldakhakhny
- Department of Clinical Biochemistry, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Hadeel Al Sadoun
- Stem Cell Unit, Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Fahd Medical Research Center, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Hani Choudhry
- Cancer and Mutagenesis Unit, Department of Biochemistry, Cancer Metabolism and Epigenetic Unit, Faculty of Science, King Fahd Medical Research Center, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Mohammad Mobashir
- SciLifeLab, Department of Oncology and Pathology, Karolinska Institutet, Stockholm, Sweden
| |
|
14
|
Peng C, Wu X, Yuan W, Zhang X, Zhang Y, Li Y. MGRFE: Multilayer Recursive Feature Elimination Based on an Embedded Genetic Algorithm for Cancer Classification. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:621-632. [PMID: 31180870 DOI: 10.1109/tcbb.2019.2921961] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 06/09/2023]
Abstract
Microarray gene expression data have become a topic of great interest for cancer classification and for further research in the field of bioinformatics. Nonetheless, due to the "large p, small n" paradigm of limited biosamples and high-dimensional data, gene selection is becoming a demanding task, which is aimed at selecting a minimal number of discriminatory genes associated closely with a phenotype. Feature or gene selection is still a challenging problem owing to its nondeterministic polynomial time complexity and thus most of the existing feature selection algorithms utilize heuristic rules. A multilayer recursive feature elimination method based on an embedded integer-coded genetic algorithm, MGRFE, is proposed here, which is aimed at selecting the gene combination with minimal size and maximal information. On the basis of 19 benchmark microarray datasets including multiclass and imbalanced datasets, MGRFE outperforms state-of-the-art feature selection algorithms with better cancer classification accuracy and a smaller selected gene number. MGRFE could be regarded as a promising feature selection method for high-dimensional datasets especially gene expression data. Moreover, the genes selected by MGRFE have close biological relevance to cancer phenotypes. The source code of our proposed algorithm and all the 19 datasets used in this paper are available at https://github.com/Pengeace/MGRFE-GaRFE.
|
15
|
Categorizing SHR and WKY rats by chi2 algorithm and decision tree. Sci Rep 2021; 11:3463. [PMID: 33568725 PMCID: PMC7876131 DOI: 10.1038/s41598-021-82864-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2020] [Accepted: 01/18/2021] [Indexed: 11/08/2022] Open
Abstract
Classifying mental disorders has been a major issue in psychology in recent years. This article relates decision trees to an encoding of fMRI data that can simplify the analysis of different mental disorders and achieves a high ROC value, over 0.9. We encode fMRI information as a power-law distribution with integer elements using graph theory: the network is characterized by degrees that count the effective links whose Pearson correlation among voxels exceeds a threshold. When the degrees are ranked from low to high, the network equation can be fit by the power-law distribution. We use mentally disordered SHR and WKY rats as samples and employ a decision tree built with the chi2 algorithm to classify different states of mental disorder. This method not only provides the decision tree and encoding, but also enables the construction of a transformation matrix capable of connecting different mental disorders. Although the latter attempt is still in its infancy, it may contribute to unraveling the mystery of psychological processes.
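The degree encoding described in this abstract can be illustrated with a minimal sketch. It assumes each voxel is a short time series, that a link exists when the Pearson correlation between two voxels exceeds a chosen threshold, and that the ranked degree list is what would later be fit to a power law. The function names, the toy data, and the threshold value are hypothetical and are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def node_degrees(series, threshold):
    """Count, for every voxel, the links whose correlation exceeds the threshold."""
    degrees = [0] * len(series)
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            if pearson(series[i], series[j]) > threshold:
                degrees[i] += 1
                degrees[j] += 1
    return degrees


# Toy "voxel" time series (hypothetical values, for illustration only).
voxels = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 4.0],
]

degrees = node_degrees(voxels, threshold=0.7)
ranked = sorted(degrees)  # degrees ranked from low to high, as in the abstract
print(degrees, ranked)
```

Raising the threshold removes weaker links, so the same data yields a sparser network and a different ranked degree list.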
|
16
|
Mahendran N, Durai Raj Vincent PM, Srinivasan K, Chang CY. Machine Learning Based Computational Gene Selection Models: A Survey, Performance Evaluation, Open Issues, and Future Research Directions. Front Genet 2020; 11:603808. [PMID: 33362861 PMCID: PMC7758324 DOI: 10.3389/fgene.2020.603808] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 09/08/2020] [Accepted: 10/29/2020] [Indexed: 12/20/2022] Open
Abstract
Gene expression is the process of determining the physical characteristics of living beings by generating the necessary proteins. Gene expression takes place in two steps, transcription and translation. It is the flow of information from DNA to RNA with the help of enzymes, and the end products are proteins and other biochemical molecules. Many technologies can capture gene expression from DNA or RNA. One such technique is the DNA microarray. Other than being expensive, the main issue with DNA microarrays is that they generate high-dimensional data with a minimal sample size. The issue in handling such a heavyweight dataset is that the learning model will be over-fitted. This problem should be addressed by reducing the dimension of the data source by a considerable amount. In recent years, machine learning has gained popularity in the field of genomic studies. In the literature, many machine learning-based gene selection approaches have been discussed, which were proposed to improve dimensionality reduction precision. This paper presents an extensive review of recent work on machine learning-based gene selection, along with its performance analysis. The study categorizes various feature selection algorithms under supervised, unsupervised, and semi-supervised learning. Recent work on reducing the features for diagnosing tumors is discussed in detail. Furthermore, the performance of several methods discussed in the literature is analyzed. This study also lists and briefly discusses the open issues in handling high-dimensional, small-sample-size data.
Affiliation(s)
- Nivedhitha Mahendran
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- P. M. Durai Raj Vincent
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Kathiravan Srinivasan
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Chuan-Yu Chang
- Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan
|
17
|
Antonakoudis A, Barbosa R, Kotidis P, Kontoravdi C. The era of big data: Genome-scale modelling meets machine learning. Comput Struct Biotechnol J 2020; 18:3287-3300. [PMID: 33240470 PMCID: PMC7663219 DOI: 10.1016/j.csbj.2020.10.011] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 08/01/2020] [Revised: 10/07/2020] [Accepted: 10/08/2020] [Indexed: 12/15/2022] Open
Abstract
With omics data being generated at an unprecedented rate, genome-scale modelling has become pivotal in its organisation and analysis. However, machine learning methods have been gaining ground in cases where knowledge is insufficient to represent the mechanisms underlying such data or as a means for data curation prior to attempting mechanistic modelling. We discuss the latest advances in genome-scale modelling and the development of optimisation algorithms for network and error reduction, intracellular constraining and applications to strain design. We further review applications of supervised and unsupervised machine learning methods to omics datasets from microbial and mammalian cell systems and present efforts to harness the potential of both modelling approaches through hybrid modelling.
Affiliation(s)
- Cleo Kontoravdi
- Department of Chemical Engineering, Imperial College London, London SW7 2AZ, United Kingdom
|
18
|
Krishnamoorthy PKP, Kamal MA, Warsi MK, Alnajeebi AM, Ali HA, Helmi N, Izhari MA, Mustafa S, Firoz A, Mobashir M. In-silico study reveals immunological signaling pathways, their genes, and potential herbal drug targets in ovarian cancer. Informatics in Medicine Unlocked 2020. [DOI: 10.1016/j.imu.2020.100422] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 12/18/2022] Open
|
19
|
Chen T. The classification of gene expression profiles based on improved rotation forest algorithm. Journal of Intelligent & Fuzzy Systems 2019. [DOI: 10.3233/jifs-179115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Affiliation(s)
- Tao Chen
- School of Mathematics and Computer Science, Shaanxi University of Technology, Hanzhong, China
|
20
|
Hamidian AH, Esfandeh S, Zhang Y, Yang M. Simulation and optimization of nanomaterials application for heavy metal removal from aqueous solutions. Inorg Nano-Met Chem 2019. [DOI: 10.1080/24701556.2019.1653321] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/26/2022]
Affiliation(s)
- Amir Hossein Hamidian
- Department of Environmental Science and Engineering, Faculty of Natural Resources, University of Tehran, Karaj, Iran
- State Key Laboratory of Environmental Aquatic Chemistry, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing, China
- Sorour Esfandeh
- Department of Environmental Science and Engineering, Faculty of Natural Resources, University of Tehran, Karaj, Iran
- Yu Zhang
- State Key Laboratory of Environmental Aquatic Chemistry, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, PR China
- Min Yang
- State Key Laboratory of Environmental Aquatic Chemistry, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, PR China
|
21
|
Sharma A, Rani R. C-HMOSHSSA: Gene selection for cancer classification using multi-objective meta-heuristic and machine learning methods. Computer Methods and Programs in Biomedicine 2019; 178:219-235. [PMID: 31416551 DOI: 10.1016/j.cmpb.2019.06.029] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 02/10/2019] [Revised: 06/24/2019] [Accepted: 06/27/2019] [Indexed: 05/21/2023]
Abstract
BACKGROUND AND OBJECTIVE Over the last two decades, DNA microarray technology has emerged as a powerful tool for early cancer detection and prevention. It helps to provide a detailed overview of the complex microenvironment of a disease. Moreover, the online availability of thousands of gene expression assays has made microarray data classification an active research area. A common goal is to find a minimum subset of genes while maximizing the classification accuracy. METHODS In pursuit of a similar objective, we propose a framework (C-HMOSHSSA) for gene selection using the multi-objective spotted hyena optimizer (MOSHO) and the salp swarm algorithm (SSA). Real-life optimization problems with more than one objective usually face the challenge of maintaining both convergence and diversity. SSA maintains diversity but suffers from the overhead of maintaining the necessary information. MOSHO, on the other hand, requires little computational effort and is therefore used for maintaining the necessary information. The proposed algorithm is thus a hybrid that utilizes the features of both SSA and MOSHO to facilitate its exploration and exploitation capability. RESULTS Four different classifiers are trained on seven high-dimensional datasets using a subset of features (genes) obtained by applying the proposed hybrid gene selection algorithm. The results show that the proposed technique significantly outperforms existing state-of-the-art techniques. CONCLUSION It is also shown that new sets of informative and biologically relevant genes are successfully identified by the proposed technique. The proposed approach can also be applied to other problem domains of interest that involve feature selection.
Affiliation(s)
- Aman Sharma
- Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
- Rinkle Rani
- Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
|
22
|
Ning Z, Feng C, Song C, Liu W, Shang D, Li M, Wang Q, Zhao J, Liu Y, Chen J, Yu X, Zhang J, Li C. Topologically inferring active miRNA-mediated subpathways toward precise cancer classification by directed random walk. Mol Oncol 2019; 13:2211-2226. [PMID: 31408573 PMCID: PMC6763789 DOI: 10.1002/1878-0261.12563] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/08/2019] [Revised: 08/05/2019] [Accepted: 08/12/2019] [Indexed: 02/06/2023] Open
Abstract
Accurate predictions of classification biomarkers and disease status are indispensable for clinical cancer diagnosis and research. However, the robustness of conventional gene biomarkers is limited by issues with reproducibility across different measurement platforms and cohorts of patients. In this study, we collected 4775 samples from 12 different cancer datasets, which contained 4636 TCGA samples and 139 GEO samples. A new method was developed to detect miRNA‐mediated subpathway activities by using directed random walk (miDRW). To calculate the activity of each miRNA‐mediated subpathway, we constructed a global directed pathway network (GDPN) with genes as nodes. We then identified miRNAs with expression levels which were strongly inversely correlated with differentially expressed target genes in the GDPN. Finally, each miRNA‐mediated subpathway activity was integrated with the topological information, differential levels of miRNAs and genes, expression levels of genes, and target relationships between miRNAs and genes. The results showed that the proposed method yielded a more robust and accurate overall performance compared with other existing pathway‐based, miRNA‐based, and gene‐based classification methods. The high‐frequency miRNA‐mediated subpathways are more reliable in classifying samples and for selecting therapeutic strategies.
Affiliation(s)
- Ziyu Ning
- School of Medical Informatics, Harbin Medical University, Daqing, China
- Chenchen Feng
- School of Medical Informatics, Harbin Medical University, Daqing, China
- Chao Song
- School of Pharmacology, Harbin Medical University, Daqing, China
- Wei Liu
- Department of Mathematics, Heilongjiang Institute of Technology, Harbin, China
- Desi Shang
- College of Bioinformatics Science and Technology, Harbin Medical University, China
- Meng Li
- School of Medical Informatics, Harbin Medical University, Daqing, China
- Qiuyu Wang
- School of Medical Informatics, Harbin Medical University, Daqing, China
- Jianmei Zhao
- School of Medical Informatics, Harbin Medical University, Daqing, China
- Yuejuan Liu
- School of Medical Informatics, Harbin Medical University, Daqing, China
- Jiaxin Chen
- School of Medical Informatics, Harbin Medical University, Daqing, China
- Xiaoyang Yu
- The Higher Educational Key Laboratory for Measuring & Control Technology and Instrumentations of Heilongjiang Province, Harbin University of Science and Technology, China
- Jian Zhang
- School of Medical Informatics, Harbin Medical University, Daqing, China
- Chunquan Li
- School of Medical Informatics, Harbin Medical University, Daqing, China
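The abstract for this entry builds on a directed random walk over a gene network. As a rough illustration only, the sketch below implements a generic random walk with restart on a small directed graph. It is not the miDRW method itself: the restart probability, the handling of nodes without outgoing edges, and all names and values are assumptions made for the example.

```python
def random_walk_with_restart(nodes, edges, seed_weights,
                             restart=0.3, tol=1e-10, max_iter=1000):
    """Generic random walk with restart on a directed graph.

    nodes: list of node names; edges: list of (source, target) pairs;
    seed_weights: dict of initial node weights (normalized internally).
    """
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    total = sum(seed_weights.values())
    w0 = {n: seed_weights.get(n, 0.0) / total for n in nodes}
    w = dict(w0)

    for _ in range(max_iter):
        nxt = {n: restart * w0[n] for n in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = (1 - restart) * w[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # Assumption: a node with no outgoing edge sends its mass
                # back to the seed distribution so that total weight is kept.
                for n in nodes:
                    nxt[n] += (1 - restart) * w[src] * w0[n]
        if sum(abs(nxt[n] - w[n]) for n in nodes) < tol:
            return nxt
        w = nxt
    return w


# Hypothetical three-gene chain A -> B -> C, seeded at A.
scores = random_walk_with_restart(
    nodes=["A", "B", "C"],
    edges=[("A", "B"), ("B", "C")],
    seed_weights={"A": 1.0},
)
print(scores)
```

In this toy chain the score decays with distance from the seed, which is the kind of topology-aware weighting a random walk adds on top of plain expression values.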
|
23
|
Gene selection for microarray data classification via adaptive hypergraph embedded dictionary learning. Gene 2019; 706:188-200. [DOI: 10.1016/j.gene.2019.04.060] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 10/27/2018] [Revised: 04/03/2019] [Accepted: 04/22/2019] [Indexed: 01/19/2023]
|
24
|
Ye M, Wang W, Yao C, Fan R, Wang P. Gene Selection Method for Microarray Data Classification Using Particle Swarm Optimization and Neighborhood Rough Set. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190204150918] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/22/2022]
Abstract
Background: Mining knowledge from microarray data is one of the popular research topics in biomedical informatics. Gene selection is a significant research trend in biomedical data mining, since the accuracy of tumor identification relies heavily on the genes biologically relevant to the identified problems.
Objective: Various computational intelligence methods have been presented for selecting a small subset of informative genes from numerous genes for tumor identification. However, due to the high data dimensions, small sample size, and inherent noise, many computational methods face challenges in selecting a small gene subset.
Methods: In our study, we propose a novel algorithm, PSONRS_KNN, for gene selection based on the particle swarm optimization (PSO) algorithm along with the neighborhood rough set (NRS) reduction model and the K-nearest neighbor (KNN) classifier.
Results: First, the top-ranked candidate genes are obtained by the GainRatioAttributeEval preselection algorithm in WEKA. Then, the minimum possible meaningful set of genes is selected by combining PSO with NRS and the KNN classifier.
Conclusion: Experimental results on five microarray gene expression datasets demonstrate that the performance of the proposed method is better than that of existing state-of-the-art methods in terms of classification accuracy and the number of selected genes.
Affiliation(s)
- Mingquan Ye
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
- Weiwei Wang
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
- Chuanwen Yao
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
- Rong Fan
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
- Peipei Wang
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
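In a wrapper method such as the one summarized above, each candidate gene subset is scored by how well a classifier performs on it. The following sketch shows one plausible scoring step only: leave-one-out accuracy of a KNN classifier restricted to a candidate subset. The PSO search and the neighborhood rough set reduction are deliberately omitted, and the function name, toy expression values, and labels are hypothetical.

```python
from collections import Counter


def loo_knn_accuracy(samples, labels, genes, k=1):
    """Leave-one-out accuracy of KNN using only the gene indices in `genes`."""
    correct = 0
    for i, x in enumerate(samples):
        # Squared Euclidean distance to every other sample on the chosen genes.
        neighbours = sorted(
            (sum((x[g] - y[g]) ** 2 for g in genes), labels[j])
            for j, y in enumerate(samples)
            if j != i
        )
        votes = Counter(label for _, label in neighbours[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(samples)


# Toy expression matrix: rows are samples, columns are genes (made-up values).
samples = [
    [5.0, 0.1],
    [4.8, 0.9],
    [1.0, 0.2],
    [1.2, 0.8],
]
labels = ["tumor", "tumor", "normal", "normal"]

print(loo_knn_accuracy(samples, labels, genes=[0]))  # gene 0 separates the classes
print(loo_knn_accuracy(samples, labels, genes=[1]))  # gene 1 does not
```

A search procedure would call such a function repeatedly, typically trading accuracy against subset size when ranking candidates.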
|
25
|
Alanni R, Hou J, Azzawi H, Xiang Y. A novel gene selection algorithm for cancer classification using microarray datasets. BMC Med Genomics 2019; 12:10. [PMID: 30646919 PMCID: PMC6334429 DOI: 10.1186/s12920-018-0447-6] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 09/27/2018] [Accepted: 12/07/2018] [Indexed: 12/18/2022] Open
Abstract
Background Microarray datasets are an important medical diagnostic tool as they represent the states of a cell at the molecular level. Available microarray datasets for classifying cancer types generally have a fairly small sample size compared to the large number of genes involved. This is known as the curse of dimensionality, which is a challenging problem. Gene selection is a promising approach that addresses this problem and plays an important role in the development of efficient cancer classification, because only a small number of genes are related to the classification problem. Gene selection addresses many problems in microarray datasets, such as reducing the number of irrelevant and noisy genes and selecting the most related genes to improve the classification results. Methods An innovative Gene Selection Programming (GSP) method is proposed to select relevant genes for effective and efficient cancer classification. GSP is based on the Gene Expression Programming (GEP) method with a newly defined population initialization algorithm, a new fitness function definition, and improved mutation and recombination operators. A Support Vector Machine (SVM) with a linear kernel serves as the classifier of the GSP. Results Experimental results on ten microarray cancer datasets demonstrate that GSP is effective and efficient in eliminating irrelevant and redundant genes/features from microarray datasets. The comprehensive evaluations and comparisons with other methods show that GSP gives a better compromise in terms of all three evaluation criteria, i.e., classification accuracy, number of selected genes, and computational cost. The gene set selected by GSP shows superior performance in cancer classification compared to those selected by up-to-date representative gene selection methods. Conclusion The gene subset selected by GSP can achieve a higher classification accuracy with less processing time.
Affiliation(s)
- Russul Alanni
- School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia
- Jingyu Hou
- School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia
- Hasseeb Azzawi
- School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia
- Yong Xiang
- School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia
|
26
|
Combining DNA methylation and RNA sequencing data of cancer for supervised knowledge extraction. BioData Min 2018; 11:22. [PMID: 30386434 PMCID: PMC6203208 DOI: 10.1186/s13040-018-0184-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 10/27/2017] [Accepted: 10/11/2018] [Indexed: 11/26/2022] Open
Abstract
Background In the Next Generation Sequencing (NGS) era, a large amount of biological data is being sequenced, analyzed, and stored in many public databases, whose interoperability is often required to allow enhanced accessibility. The combination of heterogeneous NGS genomic data is an open challenge: the analysis of data from different experiments is a fundamental practice for the study of diseases. In this work, we propose to combine DNA methylation and RNA sequencing NGS experiments at the gene level for supervised knowledge extraction in cancer. Methods We retrieve DNA methylation and RNA sequencing datasets from The Cancer Genome Atlas (TCGA), focusing on Breast Invasive Carcinoma (BRCA), Thyroid Carcinoma (THCA), and Kidney Renal Papillary Cell Carcinoma (KIRP). We combine the RNA sequencing gene expression values with the gene methylation quantity, a new measure that we define to represent the methylation quantity associated with a gene. Additionally, we propose to analyze the combined data through tree- and rule-based classification algorithms (C4.5, Random Forest, RIPPER, and CAMUR). Results We extract more than 15,000 classification models (composed of gene sets), which distinguish the tumoral samples from the normal ones with an average accuracy of 95%. From the integrated experiments we obtain about 5000 classification models that consider both the gene measures related to the RNA sequencing and the DNA methylation experiments. Conclusions We compare the sets of genes obtained from the classifications on RNA sequencing and DNA methylation data with the genes obtained from the integration of the two experiments. The comparison yields several genes that are in common among the single experiments and the integrated ones (733 for BRCA, 35 for KIRP, and 861 for THCA) and 509 genes that are in common among the different experiments. Finally, we investigate the possible relationships among the different analyzed tumors by extracting a core set of 13 genes that appear in all tumors. A preliminary functional analysis confirms the relation of part of those genes (5 out of 13 and 279 out of 509) with cancer, suggesting that further studies should focus on the newly identified ones. Electronic supplementary material The online version of this article (10.1186/s13040-018-0184-6) contains supplementary material, which is available to authorized users.
|
27
|
Sheik Abdullah A, Selvakumar S. Assessment of the risk factors for type II diabetes using an improved combination of particle swarm optimization and decision trees by evaluation with Fisher’s linear discriminant analysis. Soft comput 2018. [DOI: 10.1007/s00500-018-3555-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/24/2022]
|
28
|
Prasad Y, Biswas K, Hanmandlu M. A recursive PSO scheme for gene selection in microarray data. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2018.06.019] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 10/28/2022]
|
29
|
Abstract
The increasingly rapid creation, sharing, and exchange of information confronts researchers and data scientists with the challenging task of analyzing data and extracting relevant information from it. To be able to learn from data, the dimensionality of the data should be reduced first. Feature selection (FS) can help to reduce the amount of data, but it is a very complex and computationally demanding task, especially in the case of high-dimensional datasets. Swarm intelligence (SI) has proven to be a technique that can solve NP-hard (non-deterministic polynomial time) computational problems. It is gaining popularity in solving different optimization problems and has been used successfully for FS in some applications. Given the lack of comprehensive surveys in this field, our objective was to fill the gap in coverage of SI algorithms for FS. We performed a comprehensive literature review of SI algorithms and provide a detailed overview of 64 different SI algorithms for FS, organized into eight major taxonomic categories. We propose a unified SI framework and use it to explain different approaches to FS. The methods, techniques, and settings that have been used for various FS aspects are explained. The datasets used most frequently for the evaluation of SI algorithms for FS are presented, as well as the most common application areas. Guidelines on how to develop SI approaches for FS are provided to support researchers and analysts in their data mining tasks, and existing issues and open questions are discussed. Using the proposed framework and the provided explanations, one should be able to design an SI approach for a specific FS problem.
|
30
|
Xia CQ, Han K, Qi Y, Zhang Y, Yu DJ. A Self-Training Subspace Clustering Algorithm under Low-Rank Representation for Cancer Classification on Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:1315-1324. [PMID: 28600258 PMCID: PMC5986621 DOI: 10.1109/tcbb.2017.2712607] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 05/04/2023]
Abstract
Accurate identification of cancer types is essential to cancer diagnosis and treatment. Since cancer tissue and normal tissue have different gene expression, gene expression data can be used as an efficient feature source for cancer classification. However, accurate cancer classification directly using original gene expression profiles remains challenging due to the intrinsically high-dimensional features and the small number of data samples. We propose a new self-training subspace clustering algorithm under low-rank representation, called SSC-LRR, for cancer classification on gene expression data. Low-rank representation (LRR) is first applied to extract discriminative features from the high-dimensional gene expression data; the self-training subspace clustering (SSC) method is then used to generate the cancer classification predictions. SSC-LRR was tested on two separate benchmark datasets in comparison with four state-of-the-art classification methods. It generated cancer classification predictions with an overall accuracy of 89.7 percent and a general correlation of 0.920, which are 18.9 and 24.4 percent higher, respectively, than those of the best control method. In addition, several genes (RNF114, HLA-DRB5, USP9Y, and PTPN20) were identified by SSC-LRR as new cancer identifiers that deserve further clinical investigation. Overall, the study demonstrates a new, sensitive avenue for recognizing cancer classes from large-scale gene expression data.
|
31
|
Bi-stage hierarchical selection of pathway genes for cancer progression using a swarm based computational approach. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2017.10.024] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 02/05/2023]
|
32
|
Gene selection for microarray data classification via subspace learning and manifold regularization. Med Biol Eng Comput 2017; 56:1271-1284. [PMID: 29256006 DOI: 10.1007/s11517-017-1751-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 05/30/2017] [Accepted: 11/03/2017] [Indexed: 10/18/2022]
Abstract
With the rapid development of DNA microarray technology, a large amount of genomic data has been generated. Classification of these microarray data is a challenging task, since gene expression data often contain thousands of genes but only a small number of samples. In this paper, an effective gene selection method is proposed to select the best subset of genes for microarray data, with the irrelevant and redundant genes removed. Compared with the original data, the selected gene subset can benefit the classification task. We formulate the gene selection task as a manifold regularized subspace learning problem. In detail, a projection matrix is used to project the original high-dimensional microarray data into a lower-dimensional subspace, with the constraint that the original genes can be well represented by the selected genes. Meanwhile, the local manifold structure of the original data is preserved by a Laplacian graph regularization term on the low-dimensional data space. The projection matrix can serve as an importance indicator of different genes. An iterative update algorithm is developed for solving the problem. Experimental results on six publicly available microarray datasets and one clinical dataset demonstrate that the proposed method performs better than other state-of-the-art methods in terms of microarray data classification.
|
33
|
Gao L, Ye M, Lu X, Huang D. Hybrid Method Based on Information Gain and Support Vector Machine for Gene Selection in Cancer Classification. Genomics Proteomics & Bioinformatics 2017; 15:389-395. [PMID: 29246519 PMCID: PMC5828665 DOI: 10.1016/j.gpb.2017.08.002] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Received: 01/12/2017] [Revised: 07/25/2017] [Accepted: 08/08/2017] [Indexed: 12/30/2022]
Abstract
It remains a great challenge to achieve sufficient cancer classification accuracy with the entire set of genes, due to the high dimensionality, small sample size, and high noise of gene expression data. We thus propose a hybrid gene selection method, Information Gain-Support Vector Machine (IG-SVM), in this study. IG was initially employed to filter irrelevant and redundant genes. Then, further removal of redundant genes was performed using SVM to eliminate the noise in the datasets more effectively. Finally, the informative genes selected by IG-SVM served as the input for the LIBSVM classifier. Compared to other related algorithms, IG-SVM showed the highest classification accuracy and superior performance, as evaluated using five cancer gene expression datasets with only a few selected genes. As an example, IG-SVM achieved a classification accuracy of 90.32% for colon cancer, which is difficult to classify accurately, based on only three genes: CSRP1, MYL9, and GUCA2B.
Affiliation(s)
- Lingyun Gao
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
- Mingquan Ye
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
- Xiaojie Lu
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
- Daobin Huang
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
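The information gain filter mentioned in this abstract can be sketched for a single gene as follows. The sketch assumes expression values are discretized by one fixed cut-off into "low" and "high" groups. That discretization rule, the function names, and the toy data are illustrative assumptions, not details from the paper, and the SVM stage is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, cutoff):
    """Information gain of splitting samples by `value <= cutoff` versus `> cutoff`."""
    low = [lab for v, lab in zip(values, labels) if v <= cutoff]
    high = [lab for v, lab in zip(values, labels) if v > cutoff]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(labels) - conditional


labels = ["tumor", "tumor", "normal", "normal"]
gene_a = [5.1, 4.8, 1.2, 0.9]  # hypothetical gene that tracks the class label
gene_b = [2.0, 0.5, 2.1, 0.4]  # hypothetical gene unrelated to the class label

ig_a = information_gain(gene_a, labels, cutoff=3.0)
ig_b = information_gain(gene_b, labels, cutoff=1.0)
print(ig_a, ig_b)
```

A filter step would compute this score for every gene and keep only the highest-ranked ones before any classifier-based refinement.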
|
34
|
Aziz R, Verma CK, Srivastava N. A novel approach for dimension reduction of microarray. Comput Biol Chem 2017; 71:161-169. [PMID: 29096382 DOI: 10.1016/j.compbiolchem.2017.10.009] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 08/24/2016] [Revised: 01/04/2017] [Accepted: 10/27/2017] [Indexed: 10/18/2022]
Abstract
This paper proposes a new hybrid search technique for feature (gene) selection (FS), called ICA+ABC, which uses independent component analysis (ICA) and artificial bee colony (ABC) to select informative genes based on a Naïve Bayes (NB) algorithm. An important trait of this technique is the optimization of the ICA feature vector using ABC. ICA+ABC is a hybrid search algorithm that combines the benefits of the extraction approach, which reduces the size of the data, and the wrapper approach, which optimizes the reduced feature vectors. The technique is evaluated on six standard gene expression classification datasets. Extensive experiments were conducted to compare the performance of ICA+ABC with the results obtained from the recently published Minimum Redundancy Maximum Relevance (mRMR)+ABC algorithm for the NB classifier. To check how ICA+ABC performs as a feature selection method with the NB classifier, it was also compared with combinations of ICA and popular filter techniques, and with other similar bio-inspired algorithms such as the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). The results show that ICA+ABC has a significant ability to generate small subsets of genes from the ICA feature vector that significantly improve the classification accuracy of the NB classifier compared to other previously suggested methods.
Affiliation(s)
- Rabia Aziz
- Department of Mathematics & Computer Application, Maulana Azad National Institute of Technology, Bhopal, M.P., 462003, India.
| | - C K Verma
- Department of Mathematics & Computer Application, Maulana Azad National Institute of Technology, Bhopal, M.P., 462003, India
| | - Namita Srivastava
- Department of Mathematics & Computer Application, Maulana Azad National Institute of Technology, Bhopal, M.P., 462003, India
| |
|
35
|
Identification of Biomarkers for Predicting Lymph Node Metastasis of Stomach Cancer Using Clinical DNA Methylation Data. DISEASE MARKERS 2017; 2017:5745724. [PMID: 28951630 PMCID: PMC5603126 DOI: 10.1155/2017/5745724] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 05/25/2017] [Revised: 07/14/2017] [Accepted: 07/24/2017] [Indexed: 12/16/2022]
Abstract
Background: Lymph node (LN) metastasis is an independent risk factor for stomach cancer recurrence, and the presence of LN metastasis has a great influence on the overall survival of stomach cancer patients. Accurate prediction of the presence of LN metastasis can therefore support credible prognosis evaluation of stomach cancer patients. Recently, increasing evidence has demonstrated that aberrant DNA methylation appears before symptoms of the disease become clinically apparent. Objective: To select key biomarkers for predicting the presence of LN metastasis in stomach cancer from clinical DNA methylation data using a machine learning method. Methods: To reduce the overfitting risk of the prediction task, we applied a three-step feature selection method according to the properties of DNA methylation data. Results: The feature selection procedure extracted several cancer-related and LN metastasis-related genes, such as TP73, PDX1, FUT8, HOXD1, NMT1, and SEMA3E. The prediction performance was evaluated on the public DNA methylation dataset. The results showed that the three-step feature selection procedure can largely improve the prediction performance and implied the reliability of the selected biomarkers. Conclusions: With the selected biomarkers, the prediction method can achieve higher accuracy in detecting LN metastasis, and the results also indirectly support the reliability of the selected biomarkers.
|
36
|
Verification of Three-Phase Dependency Analysis Bayesian Network Learning Method for Maize Carotenoid Gene Mining. BIOMED RESEARCH INTERNATIONAL 2017; 2017:1813494. [PMID: 28828382 PMCID: PMC5554554 DOI: 10.1155/2017/1813494] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2017] [Accepted: 06/27/2017] [Indexed: 11/17/2022]
Abstract
Background and Objective: Mining the genes related to maize carotenoid components is important for improving the carotenoid content and quality of maize. Methods: Using an entropy estimation method with a Gaussian kernel probability density estimator, we apply the three-phase dependency analysis (TPDA) Bayesian network structure learning method to construct the network of maize genes and carotenoid component traits. Results: Using two discretization methods with different discretization values, we compare the learning effect and efficiency of 10 Bayesian network structure learning methods. The method is verified and analyzed on the maize dataset of a global germplasm collection with 527 elite inbred lines. Conclusions: The results confirm the effectiveness of the TPDA method, which significantly outperforms the other 9 Bayesian network learning methods. It is an efficient method of mining genes for maize carotenoid component traits. The parameters obtained by the experiments will help carry out practical gene mining effectively in the future.
|
37
|
Pashaei E, Aydin N. Binary black hole algorithm for feature selection and classification on biological data. Appl Soft Comput 2017. [DOI: 10.1016/j.asoc.2017.03.002] [Citation(s) in RCA: 114] [Impact Index Per Article: 14.3] [Indexed: 11/29/2022]
|
38
|
Lazzarini N, Bacardit J. RGIFE: a ranked guided iterative feature elimination heuristic for the identification of biomarkers. BMC Bioinformatics 2017; 18:322. [PMID: 28666416 PMCID: PMC5493069 DOI: 10.1186/s12859-017-1729-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 12/06/2016] [Accepted: 06/13/2017] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND: Current -omics technologies are able to sense the state of a biological sample in a very wide variety of ways. Given the high dimensionality that typically characterises these data, relevant knowledge is often hidden and hard to identify. Machine learning methods, and particularly feature selection algorithms, have proven very effective over the years at identifying small but relevant subsets of variables from a variety of application domains, including -omics data. Many methods exist with varying trade-offs between the size of the identified variable subsets and the predictive power of such subsets. In this paper we focus on a heuristic for the identification of biomarkers called RGIFE: Rank Guided Iterative Feature Elimination. RGIFE is guided in its biomarker identification process by the information extracted from machine learning models and incorporates several mechanisms to ensure that it creates minimal and highly predictive feature sets. RESULTS: We compare RGIFE against five well-known feature selection algorithms using both synthetic and real (cancer-related transcriptomics) datasets. First, we assess the ability of the methods to identify relevant and highly predictive features. Then, using a prostate cancer dataset as a case study, we look at the biological relevance of the identified biomarkers. CONCLUSIONS: We propose RGIFE, a heuristic for the inference of reduced panels of biomarkers that obtains similar predictive performance to widely adopted feature selection methods while selecting significantly fewer features. Furthermore, focusing on the case study, we show the higher biological relevance of the biomarkers selected by our approach. The RGIFE source code is available at: http://ico2s.org/software/rgife.html.
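The abstract above does not spell out the algorithm, so the following is only a rough sketch of the general idea of rank-guided backward elimination, not the RGIFE implementation. Everything in it is an assumption made for illustration: the ranking score (spread of per-class feature means), the leave-one-out nearest-centroid classifier used to judge a removal, the function names, and the toy data. It also assumes every class has at least two samples.

```python
from statistics import mean


def centroid_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    labels = sorted(set(y))
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in labels:
            # Class centroid computed without the held-out sample i.
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroids[label] = [mean(r[f] for r in rows) for f in features]
        pred = min(
            labels,
            key=lambda c: sum((X[i][f] - m) ** 2
                              for f, m in zip(features, centroids[c])),
        )
        correct += pred == y[i]
    return correct / len(X)


def rank_features(X, y, features):
    """Order features from least to most informative (hypothetical score)."""
    labels = sorted(set(y))

    def spread(f):
        class_means = [mean(X[j][f] for j in range(len(X)) if y[j] == c)
                       for c in labels]
        return max(class_means) - min(class_means)

    return sorted(features, key=spread)


def iterative_elimination(X, y):
    """Drop low-ranked features while accuracy does not decrease."""
    features = list(range(len(X[0])))
    best = centroid_accuracy(X, y, features)
    while len(features) > 1:
        for candidate in rank_features(X, y, features):
            trial = [f for f in features if f != candidate]
            acc = centroid_accuracy(X, y, trial)
            if acc >= best:  # removal is accepted
                features, best = trial, acc
                break
        else:
            break  # no single removal preserves accuracy
    return features, best


# Toy data: only column 0 separates the two classes.
X_demo = [
    [0.0, 0.3, 0.1], [0.1, 0.1, 0.3], [0.2, 0.2, 0.2],
    [1.0, 0.2, 0.3], [1.1, 0.3, 0.1], [1.2, 0.1, 0.2],
]
y_demo = [0, 0, 0, 1, 1, 1]
print(iterative_elimination(X_demo, y_demo))
```

The sketch tries the lowest-ranked feature first and falls back to the next candidate only when a removal hurts accuracy, which is one simple way to let the ranking guide the search.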
Affiliation(s)
- Nicola Lazzarini
- ICOS research group, School of Computing Science, Newcastle-upon-Tyne, UK
| | - Jaume Bacardit
- ICOS research group, School of Computing Science, Newcastle-upon-Tyne, UK.
| |
|
39
|
Fu G, Wang G, Dai X. An adaptive threshold determination method of feature screening for genomic selection. BMC Bioinformatics 2017; 18:212. [PMID: 28403836 PMCID: PMC5389084 DOI: 10.1186/s12859-017-1617-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/09/2016] [Accepted: 03/28/2017] [Indexed: 11/10/2022] Open
Abstract
Background: Although the dimension of the entire genome can be extremely large, only a parsimonious set of influential SNPs is correlated with a particular complex trait and is important to the prediction of the trait. Efficiently and accurately selecting these influential SNPs from millions of candidates is in high demand, but poses challenges. We propose a backward elimination iterative distance correlation (BE-IDC) procedure to select the smallest subset of SNPs that guarantees sufficient prediction accuracy, while also solving the unclear threshold issue of traditional feature screening approaches. Results: Verified through six simulations, the adaptive threshold estimated by BE-IDC performed uniformly better than the fixed threshold methods that have been used in the current literature. We also applied BE-IDC to an Arabidopsis thaliana genome-wide dataset. Out of 216,130 SNPs, BE-IDC selected four influential SNPs and confirmed the same FRIGIDA gene that was reported by two other traditional methods. Conclusions: BE-IDC accommodates both the prediction accuracy and the computational speed that are highly demanded in genomic selection.
Affiliation(s)
- Guifang Fu
- Department of Mathematics and Statistics, Utah State University, Logan, 84322, UT, USA.
| | - Gang Wang
- Department of Mathematics and Statistics, Utah State University, Logan, 84322, UT, USA
| | - Xiaotian Dai
- Department of Mathematics and Statistics, Utah State University, Logan, 84322, UT, USA
| |
|
40
|
MotieGhader H, Gharaghani S, Masoudi-Sobhanzadeh Y, Masoudi-Nejad A. Sequential and Mixed Genetic Algorithm and Learning Automata (SGALA, MGALA) for Feature Selection in QSAR. IRANIAN JOURNAL OF PHARMACEUTICAL RESEARCH : IJPR 2017; 16:533-553. [PMID: 28979308 PMCID: PMC5603862] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
Abstract
Feature selection is of great importance in Quantitative Structure-Activity Relationship (QSAR) analysis. This problem has been addressed using meta-heuristic algorithms such as GA, PSO, and ACO. In this work, two novel hybrid meta-heuristic algorithms, Sequential GA and LA (SGALA) and Mixed GA and LA (MGALA), which are based on the Genetic Algorithm and Learning Automata, are proposed for QSAR feature selection. The SGALA algorithm uses the advantages of the Genetic Algorithm and Learning Automata sequentially, and the MGALA algorithm uses them simultaneously. We applied the proposed algorithms to select the minimum possible number of features from three different datasets and observed that the MGALA and SGALA algorithms had the best outcome, both independently and on average, compared to other feature selection algorithms. Through comparison of the proposed algorithms, we deduced that the rate of convergence to the optimal result of the MGALA and SGALA algorithms was better than that of the GA, ACO, PSO, and LA algorithms. Finally, the results of the GA, ACO, PSO, LA, SGALA, and MGALA algorithms were used as the input of an LS-SVR model, and the LS-SVR model had more predictive ability with the input from the SGALA and MGALA algorithms than with the input from all the other algorithms. Therefore, the results corroborate that not only is the predictive efficiency of the proposed algorithms better, but their rate of convergence is also superior to that of all the other algorithms.
Affiliation(s)
- Habib MotieGhader
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
| | - Sajjad Gharaghani
- Laboratory of Bioinformatics and Drug Design (LBD), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
| | - Yosef Masoudi-Sobhanzadeh
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
| | - Ali Masoudi-Nejad
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
| |
|
41
|
Motieghader H, Najafi A, Sadeghi B, Masoudi-Nejad A. A hybrid gene selection algorithm for microarray cancer classification using genetic algorithm and learning automata. INFORMATICS IN MEDICINE UNLOCKED 2017. [DOI: 10.1016/j.imu.2017.10.004] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Indexed: 01/06/2023] Open
|
42
|
Lai CM, Yeh WC, Chang CY. Gene selection using information gain and improved simplified swarm optimization. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.08.089] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Indexed: 10/21/2022]
|
43
|
Liu Y, Fei T, Zheng X, Brown M, Zhang P, Liu XS, Wang H. An Integrative Pharmacogenomic Approach Identifies Two-drug Combination Therapies for Personalized Cancer Medicine. Sci Rep 2016; 6:22120. [PMID: 26916442 PMCID: PMC4768263 DOI: 10.1038/srep22120] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 08/13/2015] [Accepted: 02/08/2016] [Indexed: 02/05/2023] Open
Abstract
An individual tumor harbors multiple molecular alterations that promote cell proliferation and prevent apoptosis and differentiation. Drugs that target specific molecular alterations have been introduced into personalized cancer medicine, but their effects can be modulated by the activities of other genes or molecules. Previous studies aiming to identify multiple molecular alterations for combination therapies are limited by available data. Given the recent large scale of available pharmacogenomic data, it is possible to systematically identify multiple biomarkers that contribute jointly to drug sensitivity, and to identify combination therapies for personalized cancer medicine. In this study, we used pharmacogenomic profiling data provided from two independent cohorts in a systematic in silico investigation of perturbed genes cooperatively associated with drug sensitivity. Our study predicted many pairs of molecular biomarkers that may benefit from the use of combination therapies. One of our predicted biomarker pairs, a mutation in the BRAF gene and upregulated expression of the PIM1 gene, was experimentally validated to benefit from a therapy combining BRAF inhibitor and PIM1 inhibitor in lung cancer. This study demonstrates how pharmacogenomic data can be used to systematically identify potentially cooperative genes and provide novel insights to combination therapies in personalized cancer medicine.
Affiliation(s)
- Yin Liu
- School of Life Science and Technology, Tongji University, Shanghai 200092, China.,Shanghai Key laboratory of tuberculosis, Shanghai Pulmonary Hospital, Shanghai 200433, China.,Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, MA 02215, USA
| | - Teng Fei
- Department of Medical Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02115, USA.,Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute, Boston, Massachusetts, MA 02115, USA
| | - Xiaoqi Zheng
- Department of Mathematics, Shanghai Normal University, Shanghai 200234, China
| | - Myles Brown
- Department of Medical Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02115, USA.,Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute, Boston, Massachusetts, MA 02115, USA
| | - Peng Zhang
- Department of Thoracic Surgery, Shanghai Pulmonary Hospital of Tongji University School of Medicine, Shanghai 200433, China
| | - X Shirley Liu
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, MA 02215, USA.,Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute, Boston, Massachusetts, MA 02115, USA
| | - Haiyun Wang
- School of Life Science and Technology, Tongji University, Shanghai 200092, China.,Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, MA 02215, USA
| |
|
44
|
Pashaei E, Ozen M, Aydin N. Improving medical diagnosis reliability using Boosted C5.0 decision tree empowered by Particle Swarm Optimization. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2016; 2015:7230-3. [PMID: 26737960 DOI: 10.1109/embc.2015.7320060] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 11/05/2022]
Abstract
Improving the accuracy of supervised classification algorithms in biomedical applications is an active area of research. In this study, we improve the performance of the Particle Swarm Optimization (PSO) combined with C4.5 decision tree (PSO+C4.5) classifier by applying the Boosted C5.0 decision tree as the fitness function. To evaluate the effectiveness of the proposed method, it is applied to 1 microarray dataset and 5 different medical datasets obtained from the UCI machine learning databases. Moreover, the results of the PSO + Boosted C5.0 implementation are compared to eight well-known benchmark classification methods (PSO+C4.5, support vector machine with a Radial Basis Function kernel, Classification And Regression Tree (CART), C4.5 decision tree, C5.0 decision tree, Boosted C5.0 decision tree, Naive Bayes, and Weighted K-Nearest Neighbor). Repeated five-fold cross-validation was used to assess the performance of the classifiers. Experimental results show that the proposed method not only improves the performance of PSO+C4.5 but also obtains higher classification accuracy than the other classification methods.
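The repeated five-fold cross-validation mentioned in the abstract can be illustrated with a small index-splitting sketch. The abstract does not state the number of repetitions, whether folds were stratified, or which tooling was used, so the defaults and function names below (`repeated_kfold_indices`, `mean_cv_score`, `repeats=3`) are assumptions for illustration only.

```python
import random


def repeated_kfold_indices(n_samples, k=5, repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle for every repetition
        folds = [order[i::k] for i in range(k)]  # k near-equal folds
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f
                         for i in fold]
            yield r, f, train_idx, test_idx


def mean_cv_score(n_samples, score_fn, k=5, repeats=3, seed=0):
    """Average score_fn(train_idx, test_idx) over all repeats and folds."""
    scores = [score_fn(train, test)
              for _, _, train, test
              in repeated_kfold_indices(n_samples, k, repeats, seed)]
    return sum(scores) / len(scores)
```

Here `score_fn` stands in for "train a classifier on the training indices and return its accuracy on the test indices"; averaging over several shuffled repetitions reduces the dependence of the estimate on any single partition.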
|
45
|
He H, Lin D, Zhang J, Wang Y, Deng HW. Biostatistics, Data Mining and Computational Modeling. TRANSLATIONAL BIOINFORMATICS 2016. [DOI: 10.1007/978-94-017-7543-4_2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/21/2022]
|
46
|
A Systematic Evaluation of Feature Selection and Classification Algorithms Using Simulated and Real miRNA Sequencing Data. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2015; 2015:178572. [PMID: 26508990 PMCID: PMC4609795 DOI: 10.1155/2015/178572] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 06/10/2015] [Accepted: 08/25/2015] [Indexed: 11/29/2022]
Abstract
Sequencing is widely used to discover associations between microRNAs (miRNAs) and diseases. However, the negative binomial distribution (NB) and high dimensionality of data obtained using sequencing can lead to low-power results and low reproducibility. Several statistical learning algorithms have been proposed to address sequencing data, and although evaluation of these methods is essential, such studies are relatively rare. The performance of seven feature selection (FS) algorithms, including baySeq, DESeq, edgeR, the rank sum test, lasso, particle swarm optimistic decision tree, and random forest (RF), was compared by simulation under different conditions based on the difference of the mean, the dispersion parameter of the NB, and the signal to noise ratio. Real data were used to evaluate the performance of RF, logistic regression, and support vector machine. Based on the simulation and real data, we discuss the behaviour of the FS and classification algorithms. The Apriori algorithm identified frequent item sets (mir-133a, mir-133b, mir-183, mir-937, and mir-96) from among the deregulated miRNAs of six datasets from The Cancer Genome Atlas. Taking these findings together and considering computational memory requirements, we propose a strategy that combines edgeR and DESeq for large sample sizes.
|
47
|
Zhang R, Li J, Du Q, Ren F. Basic farmland zoning and protection under spatial constraints with a particle swarm optimisation multiobjective decision model: a case study of Yicheng, China. ACTA ACUST UNITED AC 2015. [DOI: 10.1068/b130213p] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 10/23/2022]
Abstract
The rapid development of the Chinese economy has led to an increasing fraction of agricultural land being converted to nonagricultural uses. The zoning and protection of farmland with the best agricultural quality (basic farmland) is extremely intensive and complicated work. In this paper we establish a remote sensing, geographic information system, and particle swarm optimisation (PSO) multiobjective decision model (MODM) to calculate the optimum solution for basic farmland protection. Furthermore, a new particle evolution rule combined with a genetic algorithm is introduced to improve the solution performance. The PSO-based zoning model is then utilised in the case study of Yicheng, Hubei Province, China, to demonstrate that our MODM framework excels in providing an optimum solution for balancing the three objectives of basic farmland zoning and protection: maximising farmland spatial compactness, maximising farmland soil fertility, and minimising transportation cost. In particular, the model compares alternative Pareto-optimal scenarios in which several objectives can be achieved without compromising the other objectives to obtain a real and practical blueprint for action. Our model enables urban planners to test and compare the different scenarios under various particle swarm conditions. In addition, the PSO-based zoning model constitutes a true guide for real-world planners, and this model can be extended to specify basic farmland protection optimisations in other regions of China.
Affiliation(s)
- Ran Zhang
- School of Resource and Environmental Sciences, Wuhan University, Wuhan 430079, China; and Center for Assessment and Development of Real Estate Shenzhen, Shenzhen 518000, China
| | - Jing Li
- Department of Public Policy, City University of Hong Kong, Kowloon, Hong Kong
| | - Qingyun Du
- School of Resource and Environmental Science, Wuhan University, Wuhan 430079, China; and Key Laboratory of Geographic Information System, Wuhan University, Wuhan, 430079, China
| | - Fu Ren
- School of Resource and Environmental Science, Wuhan University, Wuhan 430079, China; and Key Laboratory of Geographic Information System, Wuhan University, Wuhan, 430079, China
| |
|
48
|
Mahmoud O, Harrison A, Perperoglou A, Gul A, Khan Z, Metodiev MV, Lausen B. A feature selection method for classification within functional genomics experiments based on the proportional overlapping score. BMC Bioinformatics 2014; 15:274. [PMID: 25113817 PMCID: PMC4141116 DOI: 10.1186/1471-2105-15-274] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 02/21/2014] [Accepted: 08/01/2014] [Indexed: 11/16/2022] Open
Abstract
Background: Microarray technology, as well as other functional genomics experiments, allows simultaneous measurement of thousands of genes within each sample. Both the prediction accuracy and the interpretability of a classifier could be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on overlapping analysis of expression data across classes. This method results in a novel measure, called the proportional overlapping score (POS), of a feature's relevance to a classification task. Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. The experimental results of classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves a better performance. Conclusions: A novel gene selection method, POS, is proposed. POS analyzes the overlap of expression values across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are exploited to produce the selected subset of genes.
Affiliation(s)
- Osama Mahmoud
- Department of Mathematical Sciences, University of Essex, Wivenhoe Park, CO4 3SQ Colchester, UK.
| |
|