1
Li Z, Song C, Yang J, Jia Z, Chen D, Yan C, Tian L, Wu X. Clustering algorithm based on DINNSM and its application in gene expression data analysis. Technol Health Care 2024:THC248020. [PMID: 38759052 DOI: 10.3233/thc-248020] [Indexed: 05/19/2024]
Abstract
BACKGROUND Selecting an appropriate similarity measurement method is crucial for obtaining biologically meaningful clustering modules. Commonly used measurement methods are insufficient in capturing the complexity of biological systems and fail to accurately represent their intricate interactions. OBJECTIVE This study aimed to obtain biologically meaningful gene modules by using a clustering algorithm based on a new similarity measurement method. METHODS A new algorithm called the Dual-Index Nearest Neighbor Similarity Measure (DINNSM) was proposed. The algorithm first calculated the similarity matrix between genes using Pearson's or Spearman's correlation. This matrix was then used to construct a nearest-neighbor table. The final similarity matrix was reconstructed using the number of shared genes in the nearest-neighbor table and their positions in it. RESULTS Experiments were conducted on five different gene expression datasets, and DINNSM was compared with five widely used similarity measurement techniques for gene expression data. The findings demonstrate that clustering with DINNSM as the similarity measure performed better than clustering with the alternative measurement techniques. CONCLUSIONS DINNSM provided more accurate insights into the intricate biological connections among genes, facilitating the identification of more accurate and biologically meaningful gene co-expression modules.
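The three steps named in the abstract (correlation matrix, nearest-neighbor table, similarity rebuilt from shared neighbors and their positions) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the DINNSM formula, so the scoring in `shared_neighbor_similarity` is a hypothetical rank-weighted shared-neighbor count, and all function names are invented for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def neighbor_table(expr, k):
    """For each gene, list its k most correlated other genes, best first."""
    genes = list(expr)
    table = {}
    for g in genes:
        others = [h for h in genes if h != g]
        others.sort(key=lambda h: pearson(expr[g], expr[h]), reverse=True)
        table[g] = others[:k]
    return table


def shared_neighbor_similarity(table, g, h, k):
    """Hypothetical score (NOT the paper's formula): each shared neighbor
    contributes more when it sits near the top of both neighbor lists."""
    rank_g = {n: i for i, n in enumerate(table[g])}
    rank_h = {n: i for i, n in enumerate(table[h])}
    shared = set(rank_g) & set(rank_h)
    return sum((k - rank_g[n]) * (k - rank_h[n]) for n in shared) / (k * k)
```

The score uses both indices the abstract mentions: how many neighbors two genes share and where those neighbors rank. The exact weighting is an assumption of this sketch.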
Affiliation(s)
- Zongjin Li
  - Department of Computer, Qinghai Normal University, Xining, Qinghai, China
- Changxin Song
  - Department of Mechanical Engineering and Information, Shanghai Urban Construction Vocational College, Shanghai, China
- Jiyu Yang
  - Department of Cardiovascular Medicine, Xining First People's Hospital, Xining, Qinghai, China
- Zeyu Jia
  - Department of Computer, Qinghai Normal University, Xining, Qinghai, China
- Dongzhen Chen
  - School of Materials Science and Engineering, Xi'an Polytechnic University, Xi'an, Shaanxi, China
- Chengying Yan
  - Department of Cardiovascular Medicine, Xining First People's Hospital, Xining, Qinghai, China
- Liqin Tian
  - Department of Computer, Qinghai Normal University, Xining, Qinghai, China
  - School of Computer, North China Institute of Science and Technology, Langfang, Hebei, China
- Xiaoming Wu
  - The Key Laboratory of Biomedical Information Engineering of Ministry of Education, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi, China
2
Mofidifar S, Yadegar A, Karimi-Jafari MH. A reconstructed genome-scale metabolic model of Helicobacter pylori for predicting putative drug targets in clarithromycin and rifampicin resistance conditions. Helicobacter 2024; 29:e13074. [PMID: 38615332 DOI: 10.1111/hel.13074] [Received: 11/22/2023] [Revised: 03/27/2024] [Accepted: 04/01/2024] [Indexed: 04/16/2024]
Abstract
BACKGROUND Helicobacter pylori is considered a true human pathogen for which rising drug resistance constitutes a serious concern globally. The present study aimed to reconstruct a genome-scale metabolic model (GSMM) to decipher the metabolic capability of H. pylori strains in response to clarithromycin and rifampicin, along with identification of novel drug targets. MATERIALS AND METHODS The iIT341 model of H. pylori was updated based on genome annotation data and biochemical knowledge from literature and databases. Context-specific models were generated by integrating the transcriptomic data of clarithromycin and rifampicin resistance into the model. Flux balance analysis was employed for identifying essential genes in each strain, which were further prioritized by non-homology to human proteins, virulence factor analysis, druggability, and broad-spectrum analysis. Additionally, metabolic differences between sensitive and resistant strains were investigated based on flux variability analysis and pathway enrichment analysis of transcriptomic data. RESULTS The reconstructed GSMM was named HpM485. Pathway enrichment and flux variability analyses demonstrated reduced activity in the ribosomal pathway in both clarithromycin- and rifampicin-resistant strains. A significant decrease was also detected in the activity of metabolic pathways of the clarithromycin-resistant strain. Moreover, 23 and 16 essential genes were exclusively detected in clarithromycin- and rifampicin-resistant strains, respectively. Based on prioritization analysis, cyclopropane fatty acid synthase and phosphoenolpyruvate synthase were identified as putative drug targets in clarithromycin- and rifampicin-resistant strains, respectively. CONCLUSIONS We present a robust and reliable metabolic model of H. pylori. This model can predict novel drug targets to combat drug resistance and explore the metabolic capability of H. pylori in various conditions.
Affiliation(s)
- Sepideh Mofidifar
  - Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
- Abbas Yadegar
  - Foodborne and Waterborne Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran
  - Gastroenterology and Liver Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran
3
Caudai C, Salerno E. Complementing Hi-C information for 3D chromatin reconstruction by ChromStruct. Front Bioinform 2024; 3:1287168. [PMID: 38318534 PMCID: PMC10840501 DOI: 10.3389/fbinf.2023.1287168] [Received: 09/01/2023] [Accepted: 12/20/2023] [Indexed: 02/07/2024] Open
Abstract
A multiscale method proposed elsewhere for reconstructing plausible 3D configurations of the chromatin in cell nuclei is recalled, based on the integration of contact data from Hi-C experiments and additional information coming from ChIP-seq, RNA-seq and ChIA-PET experiments. Provided that the additional data come from independent experiments, this kind of approach is supposed to leverage them to complement possibly noisy, biased or missing Hi-C records. When the different data sources are mutually concurrent, the resulting solutions are corroborated; otherwise, their validity would be weakened. Here, a problem of reliability arises, entailing an appropriate choice of the relative weights to be assigned to the different informational contributions. A series of experiments is presented that help to quantify the advantages and the limitations offered by this strategy. Whereas the advantages in accuracy are not always significant, the case of missing Hi-C data demonstrates the effectiveness of additional information in reconstructing the highly packed segments of the structure.
Affiliation(s)
- Claudia Caudai
  - Institute of Information Science and Technologies, National Research Council of Italy, Pisa, Italy
4
Buzzao D, Castresana-Aguirre M, Guala D, Sonnhammer ELL. Benchmarking enrichment analysis methods with the disease pathway network. Brief Bioinform 2024; 25:bbae069. [PMID: 38436561 PMCID: PMC10939300 DOI: 10.1093/bib/bbae069] [Received: 09/29/2023] [Revised: 01/10/2024] [Accepted: 02/03/2024] [Indexed: 03/05/2024] Open
Abstract
Enrichment analysis (EA) is a common approach to gain functional insights from genome-scale experiments. As a consequence, a large number of EA methods have been developed, yet it is unclear from previous studies which method is the best for a given dataset. The main issues with previous benchmarks include the complexity of correctly assigning true pathways to a test dataset, and lack of generality of the evaluation metrics, for which the rank of a single target pathway is commonly used. We here provide a generalized EA benchmark and apply it to the most widely used EA methods, representing all four categories of current approaches. The benchmark employs a new set of 82 curated gene expression datasets from DNA microarray and RNA-Seq experiments for 26 diseases, of which only 13 are cancers. In order to address the shortcomings of the single target pathway approach and to enhance the sensitivity evaluation, we present the Disease Pathway Network, in which related Kyoto Encyclopedia of Genes and Genomes pathways are linked. We introduce a novel approach to evaluate pathway EA by combining sensitivity and specificity to provide a balanced evaluation of EA methods. This approach identifies Network Enrichment Analysis methods as the overall top performers compared with overlap-based methods. By using randomized gene expression datasets, we explore the null hypothesis bias of each method, revealing that most of them produce skewed P-values.
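For context on the overlap-based category mentioned above, such methods commonly score a pathway with a one-sided hypergeometric (Fisher-type) test on the overlap between a gene list and the pathway. The sketch below shows that test only; it is not the benchmark's code, and the function and parameter names are assumptions made for this example.

```python
from math import comb


def hypergeom_enrichment_p(universe, pathway, hits, overlap):
    """P(X >= overlap) when `hits` genes are drawn without replacement
    from `universe` genes, of which `pathway` belong to the pathway."""
    upper = min(pathway, hits)
    total = comb(universe, hits)
    tail = sum(
        comb(pathway, i) * comb(universe - pathway, hits - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total
```

For example, with 10 genes in the universe, a 3-gene pathway, and a 3-gene hit list that lies entirely inside the pathway, the call `hypergeom_enrichment_p(10, 3, 3, 3)` evaluates 1/C(10,3) = 1/120.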
Affiliation(s)
- Davide Buzzao
  - Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 171 21 Solna, Sweden
- Dimitri Guala
  - Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 171 21 Solna, Sweden
- Erik L L Sonnhammer
  - Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 171 21 Solna, Sweden
5
Turfan D, Altunkaynak B, Yeniay Ö. A New Filter Approach Based on Effective Ranges for Classification of Gene Expression Data. Big Data 2023. [PMID: 37668992 DOI: 10.1089/big.2022.0086] [Indexed: 09/06/2023]
Abstract
Over the years, many studies have been carried out to reduce and eliminate the effects of diseases on human health. Gene expression data sets play a critical role in diagnosing and treating diseases. These data sets consist of thousands of genes and a small number of samples. This situation creates the curse of dimensionality, which makes such data sets difficult to analyze. One of the most effective strategies to solve this problem is feature selection, a preprocessing step that improves classification performance by selecting the most relevant and informative features. In this article, we propose a new statistically based filter method for feature selection named Effective Range-based Feature Selection Algorithm (FSAER). As an extension of the previous Effective Range based Gene Selection (ERGS) and Improved Feature Selection based on Effective Range (IFSER) algorithms, our novel method includes the advantages of both methods while taking into account the disjoint area. To illustrate the efficacy of the proposed algorithm, experiments have been conducted on six benchmark gene expression data sets. The results of FSAER and the other filter methods have been compared in terms of classification accuracy to demonstrate the effectiveness of the proposed method. For classification, support vector machines, the naive Bayes classifier, and k-nearest neighbor algorithms have been used.
Affiliation(s)
- Derya Turfan
  - Department of Statistics, Hacettepe University, Ankara, Turkey
- Özgür Yeniay
  - Department of Statistics, Hacettepe University, Ankara, Turkey
6
Möllenhoff K, Schorning K, Kappenberg F. Identifying alert concentrations using a model-based bootstrap approach. Biometrics 2023; 79:2076-2088. [PMID: 36385693 DOI: 10.1111/biom.13799] [Received: 06/10/2022] [Accepted: 10/31/2022] [Indexed: 11/18/2022]
Abstract
The determination of alert concentrations, where a pre-specified threshold of the response variable is exceeded, is an important goal of concentration-response studies. The traditional approach is based on investigating the measured concentrations and attaining statistical significance of the alert concentration by using a multiple t-test procedure. In this paper, we propose a new model-based method to identify alert concentrations, based on fitting a concentration-response curve and constructing a simultaneous confidence band for the difference of the response of a concentration compared to the control. In order to obtain these confidence bands, we use a bootstrap approach which can be applied to any functional form of the concentration-response curve. This particularly offers the possibility to investigate also those situations where the concentration-response relationship is not monotone and, moreover, to detect alerts at concentrations which were not measured during the study, providing a highly flexible framework for the problem at hand.
7
Donate-Correa J, Martín-Núñez E, Martin-Olivera A, Mora-Fernández C, Tagua VG, Ferri CM, López-Castillo Á, Delgado-Molinos A, López-Tarruella VC, Arévalo-Gómez MA, Pérez-Delgado N, González-Luis A, Navarro-González JF. Klotho inversely relates with carotid intima-media thickness in atherosclerotic patients with normal renal function (eGFR ≥60 mL/min/1.73 m²): a proof-of-concept study. Front Endocrinol (Lausanne) 2023; 14:1146012. [PMID: 37274332 PMCID: PMC10235765 DOI: 10.3389/fendo.2023.1146012] [Received: 01/16/2023] [Accepted: 04/26/2023] [Indexed: 06/06/2023] Open
Abstract
Introduction Klotho protein is predominantly expressed in the kidneys and has also been detected, to a lesser extent, in vascular tissue and peripheral blood circulating cells. Carotid artery intima-media thickness (CIMT) burden, a marker of subclinical atherosclerosis, has been associated with reductions in circulating Klotho levels in chronic kidney disease patients, who show reduced levels of this protein at all stages of the disease. However, the contribution of serum Klotho and its expression levels in peripheral blood circulating cells and in the carotid artery wall to CIMT in the absence of kidney impairment has not yet been evaluated. Methods We conducted a single-center study in 35 atherosclerotic patients with preserved kidney function (eGFR ≥60 mL/min/1.73 m²) subjected to elective carotid surgery. Serum levels of Klotho and the cytokines TNFα, IL6 and IL10 were determined by ELISA, and transcripts encoding Klotho (KL), TNF, IL6 and IL10 from vascular segments were measured by qRT-PCR. Klotho protein expression in the intima-media and adventitia areas was analyzed using immunohistochemistry. Results Patients with higher values of CIMT showed reduced Klotho levels in serum (430.8 [357.7-592.9] vs. 667.8 [632.5-712.9] pg/mL; p<0.001), mRNA expression in blood circulating cells and carotid artery wall (2.92 [2.06-4.8] vs. 3.69 [2.42-7.13] log.a.u., p=0.015; 0.41 [0.16-0.59] vs. 0.79 [0.37-1.4] log.a.u., p=0.013, respectively) and immunoreactivity in the intimal-medial area of the carotids (4.23 [4.15-4.27] vs. 4.49 [4.28-4.63] log µm², p=0.008). CIMT was inversely related with Klotho levels in serum (r=-0.717, p<0.001), blood mRNA expression (r=-0.426, p=0.011), and with carotid artery mRNA and immunoreactivity levels (r=-0.45, p=0.07; r=-0.455, p=0.006, respectively). Multivariate analysis showed that serum Klotho, together with the gene expression levels of tumor necrosis factor TNFα in blood circulating cells, were independent determinants of CIMT values (adjusted R² = 0.593, p<0.001). Discussion The results of this study in subjects with eGFR ≥60 mL/min/1.73 m² show that patients with carotid artery atherosclerosis and higher values of CIMT present reduced soluble Klotho levels, as well as decreased KL mRNA expression in peripheral blood circulating cells and Klotho protein levels in the intima-media of the carotid artery wall.
Affiliation(s)
- Javier Donate-Correa
  - Unidad de Investigación, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain
  - GEENDIAB (Grupo Español para el estudio de la Nefropatía Diabética), Sociedad Española de Nefrología, Santander, Spain
  - Instituto de Tecnologías Biomédicas, Universidad de La Laguna, Santa Cruz de Tenerife, Spain
  - RICORS2040 (Red de Investigación Renal-RD21/0005/0013), Instituto de Salud Carlos III, Madrid, Spain
- Ernesto Martín-Núñez
  - Unidad de Investigación, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain
  - GEENDIAB (Grupo Español para el estudio de la Nefropatía Diabética), Sociedad Española de Nefrología, Santander, Spain
- Alberto Martin-Olivera
  - Unidad de Investigación, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain
- Carmen Mora-Fernández
  - Unidad de Investigación, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain
  - GEENDIAB (Grupo Español para el estudio de la Nefropatía Diabética), Sociedad Española de Nefrología, Santander, Spain
  - RICORS2040 (Red de Investigación Renal-RD21/0005/0013), Instituto de Salud Carlos III, Madrid, Spain
- Víctor G. Tagua
  - Unidad de Investigación, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain
  - Instituto de Tecnologías Biomédicas, Universidad de La Laguna, Santa Cruz de Tenerife, Spain
  - Área de Medicina Preventiva y Salud Pública, Universidad de La Laguna, San Cristóbal de La Laguna, Spain
- Carla M. Ferri
  - Unidad de Investigación, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain
  - Escuela de Doctorado y Estudios de Posgrado, Universidad de La Laguna, San Cristóbal de La Laguna, Spain
- Ainhoa González-Luis
  - Unidad de Investigación, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain
  - Escuela de Doctorado y Estudios de Posgrado, Universidad de La Laguna, San Cristóbal de La Laguna, Spain
- Juan F. Navarro-González
  - Unidad de Investigación, Hospital Universitario Nuestra Señora de Candelaria (HUNSC), Santa Cruz de Tenerife, Spain
  - GEENDIAB (Grupo Español para el estudio de la Nefropatía Diabética), Sociedad Española de Nefrología, Santander, Spain
  - Instituto de Tecnologías Biomédicas, Universidad de La Laguna, Santa Cruz de Tenerife, Spain
  - RICORS2040 (Red de Investigación Renal-RD21/0005/0013), Instituto de Salud Carlos III, Madrid, Spain
  - Servicio de Nefrología, HUNSC, Santa Cruz de Tenerife, Spain
8
Wang Z, Gu H, Zhao M, Li D, Wang J. MSC-CSMC: A multi-objective semi-supervised clustering algorithm based on constraints selection and multi-source constraints for gene expression data. Front Genet 2023; 14:1135260. [PMID: 36923794 PMCID: PMC10008853 DOI: 10.3389/fgene.2023.1135260] [Received: 12/31/2022] [Accepted: 02/16/2023] [Indexed: 03/01/2023] Open
Abstract
Many clustering techniques have been proposed to group genes based on gene expression data. Among these methods, semi-supervised clustering techniques aim to improve clustering performance by incorporating supervisory information in the form of pairwise constraints. However, noisy constraints inevitably exist in the constraint set obtained on a practical unlabeled dataset, which degrades the performance of semi-supervised clustering. Moreover, multiple information sources are not integrated into multi-source constraints to improve clustering quality. To this end, this work proposes a new multi-objective semi-supervised clustering algorithm based on constraints selection and multi-source constraints (MSC-CSMC) for unlabeled gene expression data. The proposed method first uses the gene expression data and the gene ontology (GO), which describes gene annotation information, to form multi-source constraints. Then, the multi-source constraints are applied to the clustering by improving the constraint violation penalty weight in the semi-supervised clustering objective function. Furthermore, the constraints selection and cluster prototypes are put into the multi-objective evolutionary framework by adopting a mixed chromosome encoding strategy, which can select pairwise constraints suitable for clustering tasks through synergistic optimization to reduce the negative influence of noisy constraints. The proposed MSC-CSMC algorithm is tested on five benchmark gene expression datasets, and the results show that the proposed algorithm achieves superior performance.
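To make the idea of a constraint violation penalty concrete, the sketch below evaluates a generic objective in the style of pairwise-constrained k-means: within-cluster squared distance plus a fixed weight for every violated must-link or cannot-link constraint. It is an assumption-laden illustration, not the MSC-CSMC objective; the function name, the single shared `weight`, and the data layout are all hypothetical.

```python
def constrained_cost(points, labels, centroids, must_link, cannot_link, weight):
    """Within-cluster squared distance plus a penalty per violated constraint.

    points      : list of coordinate tuples
    labels      : cluster index assigned to each point
    centroids   : list of coordinate tuples, one per cluster
    must_link   : pairs (i, j) that should share a cluster
    cannot_link : pairs (i, j) that should be in different clusters
    weight      : penalty added for each violated pair (assumed constant here)
    """
    compactness = sum(
        sum((p - c) ** 2 for p, c in zip(points[i], centroids[labels[i]]))
        for i in range(len(points))
    )
    violations = sum(labels[i] != labels[j] for i, j in must_link)
    violations += sum(labels[i] == labels[j] for i, j in cannot_link)
    return compactness + weight * violations
```

A noisy constraint inflates this cost for an otherwise good partition, which is why selecting which constraints to trust matters.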
Affiliation(s)
- Zeyuan Wang
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
- Hong Gu
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
- Minghui Zhao
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
- Dan Li
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
- Jia Wang
  - Department of Breast Surgery, Second Hospital of Dalian Medical University, Dalian, Liaoning, China
9
Georgieva O. An Iterative Unsupervised Method for Gene Expression Differentiation. Genes (Basel) 2023; 14:412. [PMID: 36833339 PMCID: PMC9956932 DOI: 10.3390/genes14020412] [Received: 12/20/2022] [Revised: 01/24/2023] [Accepted: 02/01/2023] [Indexed: 02/09/2023] Open
Abstract
For several decades, understanding gene activity and its role in organisms' lives has been a research focus of scientists in different areas. Part of this work is the analysis of gene expression data to select differentially expressed genes. Methods that identify the genes of interest have been proposed based on statistical data analysis. The problem is that they do not agree well with one another, as distinct methods produce different results. By taking advantage of unsupervised data analysis, an iterative clustering procedure that finds differentially expressed genes shows promising results. In the present paper, a comparative study of the clustering methods applied to gene expression analysis is presented to justify the choice of the clustering algorithm implemented in the method. An investigation of different distance measures is provided to reveal those that increase the efficiency of the method in finding the real data structure. Further, the method is improved by incorporating an additional aggregation measure based on the standard deviation of the expression levels. Its usage increases the gene distinction, as additional differentially expressed genes are found. The method is summarized in a detailed procedure. The significance of the method is demonstrated by an analysis of two mouse strain data sets. The differentially expressed genes identified by the proposed method are compared with those selected by well-known statistical methods applied to the same data sets.
Affiliation(s)
- Olga Georgieva
  - Faculty of Mathematics and Informatics, Sofia University "St. Kliment Ohridski", 125 Tsarigradsko Shosse Blvd., bl. 2, 1113 Sofia, Bulgaria
10
Li D, Liang H, Qin P, Wang J. A self-training subspace clustering algorithm based on adaptive confidence for gene expression data. Front Genet 2023; 14:1132370. [PMID: 37025450 PMCID: PMC10070828 DOI: 10.3389/fgene.2023.1132370] [Received: 12/27/2022] [Accepted: 03/07/2023] [Indexed: 04/08/2023] Open
Abstract
Gene clustering is one of the important techniques for identifying co-expressed gene groups from gene expression data, and it provides a powerful tool for investigating functional relationships of genes in biological processes. Self-training is an important semi-supervised learning method and has exhibited good performance on the gene clustering problem. However, the self-training process inevitably suffers from mislabeling, the accumulation of which degrades semi-supervised learning performance on gene expression data. To solve this problem, this paper proposes a self-training subspace clustering algorithm based on adaptive confidence for gene expression data (SSCAC), which combines the low-rank representation of gene expression data and adaptive adjustment of label confidence to better guide the partition of unlabeled data. The superiority of the proposed SSCAC algorithm is mainly reflected in the following aspects. 1) In order to improve the discriminative property of gene expression data, the low-rank representation with distance penalty is used to mine the potential subspace structure of the data. 2) Considering the problem of mislabeling in self-training, a semi-supervised clustering objective function with label confidence is proposed, and a self-training subspace clustering framework is constructed on this basis. 3) In order to mitigate the negative impact of mislabeled data, an adaptive adjustment strategy based on the gravitational search algorithm is proposed for label confidence. Compared with a variety of state-of-the-art unsupervised and semi-supervised learning algorithms, the SSCAC algorithm has demonstrated its superiority through extensive experiments on two benchmark gene expression datasets.
Affiliation(s)
- Dan Li
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
- Hongnan Liang
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
- Pan Qin
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
  - *Correspondence: Pan Qin; Jia Wang
- Jia Wang
  - Department of Breast Surgery, The Second Hospital of Dalian Medical University, Dalian, Liaoning, China
  - *Correspondence: Pan Qin; Jia Wang
11
Sen Puliparambil B, Tomal JH, Yan Y. A Novel Algorithm for Feature Selection Using Penalized Regression with Applications to Single-Cell RNA Sequencing Data. Biology (Basel) 2022; 11:biology11101495. [PMID: 36290397 PMCID: PMC9598401 DOI: 10.3390/biology11101495] [Received: 07/11/2022] [Revised: 09/21/2022] [Accepted: 09/30/2022] [Indexed: 11/05/2022]
Abstract
With the emergence of single-cell RNA sequencing (scRNA-seq) technology, scientists are able to examine gene expression at single-cell resolution. Analysis of scRNA-seq data has its own challenges, which stem from its high dimensionality. Machine learning methods offer the potential for gene (feature) selection from high-dimensional scRNA-seq data. Even though there exist multiple machine learning methods that appear to be suitable for feature selection, such as penalized regression, there is no rigorous comparison of their performance across data sets, each of which poses its own challenges. Therefore, in this paper, we analyzed and compared multiple penalized regression methods for scRNA-seq data. Given the scRNA-seq data sets we analyzed, the results show that sparse group lasso (SGL) outperforms the other six methods (ridge, lasso, elastic net, drop lasso, group lasso, and big lasso) in terms of area under the receiver operating characteristic curve (AUC) and computation time. Building on these findings, we proposed a new algorithm for feature selection using penalized regression methods. The proposed algorithm works by selecting a small subset of genes and applying SGL to select the differentially expressed genes in scRNA-seq data. By using hierarchical clustering to group genes, the proposed method bypasses the need for domain-specific knowledge for gene grouping information. In addition, the proposed algorithm provided consistently better AUC for the data sets used.
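As background for the sparse group lasso named above, its penalty in the commonly used formulation mixes an L1 term on individual coefficients with a group-wise L2 term scaled by the square root of the group size. The sketch below computes only that penalty value; the function name and argument layout are assumptions for illustration, and the paper's own implementation may differ.

```python
from math import sqrt


def sgl_penalty(beta, groups, lam, alpha):
    """Sparse group lasso penalty:
    lam * (alpha * ||beta||_1 + (1 - alpha) * sum_g sqrt(|g|) * ||beta_g||_2).

    beta   : list of coefficients, one per gene
    groups : list of index lists partitioning the coefficients
             (for example, clusters from hierarchical clustering)
    lam    : overall penalty strength
    alpha  : mix between lasso (alpha=1) and group lasso (alpha=0)
    """
    l1 = sum(abs(b) for b in beta)
    group_term = sum(
        sqrt(len(g)) * sqrt(sum(beta[j] ** 2 for j in g)) for g in groups
    )
    return lam * (alpha * l1 + (1 - alpha) * group_term)
```

Setting `alpha=1` recovers the lasso penalty and `alpha=0` the group lasso penalty, which is why the method can drop whole gene groups as well as single genes within a retained group.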
Affiliation(s)
- Bhavithry Sen Puliparambil
  - Master of Science in Data Science Program, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
  - Correspondence:
- Jabed H. Tomal
  - Department of Mathematics and Statistics, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
- Yan Yan
  - Department of Computing Science, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
12
Zhang S, Zhang M. Use of SVM-based ensemble feature selection method for gene expression data analysis. Stat Appl Genet Mol Biol 2022; 21:sagmb-2022-0002. [PMID: 35848211 DOI: 10.1515/sagmb-2022-0002] [Received: 01/15/2022] [Accepted: 07/01/2022] [Indexed: 11/15/2022]
Abstract
Gene selection is one of the key steps in gene expression data analysis. An SVM-based ensemble feature selection method is proposed in this paper. Firstly, the method builds many subsets by using Monte Carlo sampling. Secondly, it ranks all the features on each of the subsets and integrates the rankings to obtain a final ranking list. Finally, the optimum feature set is determined by a backward feature elimination strategy. This method is applied to the analysis of 4 public datasets (Leukemia, Prostate, Colorectal, and SMK_CAN), resulting in 7, 10, 13, and 32 selected features, respectively. The AUCs obtained from independent test sets are 0.9867, 0.9796, 0.9571, and 0.9575, respectively. These results indicate that the features selected by the proposed method can improve sample classification accuracy, and that the method is thus effective for gene selection from gene expression data.
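The first two steps described above (Monte Carlo subsets, then per-subset ranking and integration) can be sketched as follows. Two assumptions are made loudly: the per-subset ranker here is a simple class-mean gap rather than the SVM used in the paper, chosen so the example needs only the standard library, and rankings are integrated by averaging rank positions, which the abstract does not specify. The backward elimination step is omitted, and all names are hypothetical.

```python
import random


def mean_gap_ranking(X, y):
    """Stand-in ranker (NOT an SVM): order features by the absolute
    difference between the class-1 mean and the class-0 mean."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    n_features = len(X[0])
    gap = [
        abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: gap[j], reverse=True)


def ensemble_ranking(X, y, n_subsets=50, fraction=0.7, seed=0):
    """Average each feature's rank over random sample subsets."""
    rng = random.Random(seed)
    n = len(X)
    size = max(2, int(fraction * n))
    rank_sum = [0] * len(X[0])
    used = 0
    for _ in range(n_subsets):
        idx = rng.sample(range(n), size)
        labels = [y[i] for i in idx]
        if len(set(labels)) < 2:  # skip subsets missing one class
            continue
        order = mean_gap_ranking([X[i] for i in idx], labels)
        for rank, feature in enumerate(order):
            rank_sum[feature] += rank
        used += 1
    if used == 0:
        raise ValueError("no subset contained both classes")
    # Lowest average rank first: the most consistently top-ranked feature.
    return sorted(range(len(rank_sum)), key=lambda j: rank_sum[j] / used)
```

Averaging ranks over many subsets damps the influence of any single unlucky sample draw, which is the usual motivation for ensemble ranking on small-sample expression data.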
Affiliation(s)
- Shizhi Zhang
- School of Chemistry and Chemical Engineering, Qinghai Minzu University, Xining 810007, P.R. China
- Mingjin Zhang
- School of Chemistry and Chemical Engineering, Qinghai Normal University, Xining 810016, P.R. China
13
Zhang X, Ye Z, Chen J, Qiao F. AMDBNorm: an approach based on distribution adjustment to eliminate batch effects of gene expression data. Brief Bioinform 2021; 23:6485011. [PMID: 34958674 DOI: 10.1093/bib/bbab528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2021] [Revised: 10/16/2021] [Accepted: 11/14/2021] [Indexed: 11/14/2022] Open
Abstract
Batch effects explain a large part of the noise when merging gene expression data. Removing irrelevant variations introduced by batch effects plays an important role in gene expression studies. To obtain reliable differential analysis results, it is necessary to remove the variation caused by technical conditions between different batches while preserving biological variation. Directly merging data that contain batch effects usually leads to a sharp rise in false positives. Although some methods of batch correction have been developed, they have some drawbacks. In this study, we develop a new algorithm, adjustment mean distribution-based normalization (AMDBNorm), which is based on a probability distribution to correct batch effects while preserving biological variation. AMDBNorm addresses the defects of existing batch correction methods. We compared several popular methods of batch correction with AMDBNorm using two real gene expression datasets with batch effects and analyzed the results of batch correction from visual and quantitative perspectives. To ensure the biological variation was well protected, the effects of the batch correction methods were verified by hierarchical cluster analysis. The results showed that the AMDBNorm algorithm could remove batch effects of gene expression data effectively and retain more biological variation than other methods. Our approach provides researchers with reliable data support in the study of differential gene expression analysis and prognostic biomarker selection.
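The abstract does not spell out AMDBNorm's steps, so the sketch below is not that algorithm. It only illustrates the simplest kind of distribution-based correction: shifting and scaling one gene's values in a batch so that their mean and standard deviation match a reference batch. The function name and the toy values are hypothetical, and the population standard deviation is an arbitrary choice.

```python
import statistics


def align_to_reference(values, reference):
    """Shift and scale `values` so their mean and sd match those of `reference`."""
    mu_v, sd_v = statistics.mean(values), statistics.pstdev(values)
    mu_r, sd_r = statistics.mean(reference), statistics.pstdev(reference)
    if sd_v == 0:
        # A constant gene carries no spread to rescale; place it at the reference mean.
        return [mu_r for _ in values]
    return [(v - mu_v) / sd_v * sd_r + mu_r for v in values]


# One gene measured in two batches; batch B is shifted and stretched.
batch_a = [5.0, 6.0, 7.0, 8.0]      # reference batch
batch_b = [15.0, 17.0, 19.0, 21.0]  # batch to correct
corrected_b = align_to_reference(batch_b, batch_a)
```

A plain alignment like this also erases genuine biological differences whenever the batches differ in sample composition, which is the trade-off the abstract describes when it stresses preserving biological variation.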
Affiliation(s)
- Xu Zhang
- School of Mathematics and Statistics, Southwest University, China
- Jing Chen
- School of Science, Southwest University of Science and Technology, China
14
Almars AM, Alwateer M, Qaraad M, Amjad S, Fathi H, Kelany AK, Hussein NK, Elhosseini M. Brain Cancer Prediction Based on Novel Interpretable Ensemble Gene Selection Algorithm and Classifier. Diagnostics (Basel) 2021; 11:1936. [PMID: 34679634 DOI: 10.3390/diagnostics11101936] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/23/2021] [Revised: 10/12/2021] [Accepted: 10/13/2021] [Indexed: 11/17/2022] Open
Abstract
The growth of abnormal cells in the brain causes human brain tumors. Identifying the type of tumor is crucial for the prognosis and treatment of the patient. Data from cancer microarrays typically include few samples with many gene expression levels as features, reflecting the curse of dimensionality and making classifying data from microarrays challenging. In most of the examined studies, cancer classification (malignant and benign) accuracy was examined without disclosing biological information related to the classification process. A new approach was proposed to bridge the gap between cancer classification and the interpretation of the biological studies of the genes implicated in cancer. This study aims to develop a new hybrid model for cancer classification (by using feature selection mRMRe as a key step to improve the performance of classification methods and a distributed hyperparameter optimization for gradient boosting ensemble methods). To evaluate the proposed method, NB, RF, and SVM classifiers have been chosen. In terms of the AUC, sensitivity, and specificity, the optimized CatBoost classifier performed better than the optimized XGBoost in cross-validation with 5, 6, 8, and 10 folds. With an accuracy of 0.91 ± 0.12, the optimized CatBoost classifier is more accurate than the CatBoost classifier without optimization, which achieves 0.81 ± 0.24. By using hybrid algorithms, SVM, RF, and NB automatically become more accurate. Furthermore, SVM and RF achieve the same accuracy (0.97 ± 0.08), which is higher than that of NB (0.91 ± 0.12). Relevant biomedical studies corroborate the selected genes.
15
Ahn S, Grimes T, Datta S. The Analysis of Gene Expression Data Incorporating Tumor Purity Information. Front Genet 2021; 12:642759. [PMID: 34497631 PMCID: PMC8419469 DOI: 10.3389/fgene.2021.642759] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2020] [Accepted: 07/30/2021] [Indexed: 12/03/2022] Open
Abstract
The tumor microenvironment is composed of tumor cells, stroma cells, immune cells, blood vessels, and other associated non-cancerous cells. Gene expression measurements on tumor samples are an average over cells in the microenvironment. However, research questions often seek answers about tumor cells rather than the surrounding non-tumor tissue. Previous studies have suggested that the tumor purity (TP), the proportion of tumor cells in a solid tumor sample, has a confounding effect on differential expression (DE) analysis of high vs. low survival groups. We investigate three ways of incorporating the TP information in the two statistical methods used for analyzing gene expression data, namely, differential network (DN) analysis and DE analysis. Analysis 1 ignores the TP information completely, Analysis 2 uses a truncated sample by removing the low TP samples, and Analysis 3 uses TP as a covariate in the underlying statistical models. We use three gene expression data sets related to three different cancers from the Cancer Genome Atlas (TCGA) for our investigation. The networks from Analysis 2 show a greater amount of differential connectivity between the two networks than those from Analysis 1 in all three cancer datasets. Similarly, Analysis 1 identified more differentially expressed genes than Analysis 2. Results of DN and DE analyses using Analysis 3 were mostly consistent with those of Analysis 1 across three cancers. However, Analysis 3 identified additional cancer-related genes in both DN and DE analyses. Our findings suggest that using TP as a covariate in a linear model is appropriate for DE analysis, but a more robust model is needed for DN analysis. However, because true DN or DE patterns are not known for the empirical datasets, simulated datasets can be used to study the statistical properties of these methods in future studies.
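For the DE case, the covariate adjustment of Analysis 3 can be written as a per-gene linear model. The notation below is an assumption for illustration, since the abstract does not give the model's exact form: $y_{gi}$ is the expression of gene $g$ in sample $i$, $s_i$ indicates the survival group (high vs. low), and $\mathrm{TP}_i$ is the tumor purity of sample $i$.

```latex
y_{gi} = \beta_{g0} + \beta_{g1}\, s_i + \beta_{g2}\, \mathrm{TP}_i + \varepsilon_{gi},
\qquad H_0:\ \beta_{g1} = 0
```

Under this sketch, Analysis 1 corresponds to dropping the $\beta_{g2}\,\mathrm{TP}_i$ term, and Analysis 2 to fitting that reduced model on the high-TP samples only.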
Affiliation(s)
- Somnath Datta
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
16
Yang H, Zhuang Z, Pan W. A graph convolutional neural network for gene expression data analysis with multiple gene networks. Stat Med 2021; 40:5547-5564. [PMID: 34258781 DOI: 10.1002/sim.9140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2020] [Revised: 04/07/2021] [Accepted: 06/21/2021] [Indexed: 02/01/2023]
Abstract
Spectral graph convolutional neural networks (GCN) are proposed to incorporate important information contained in graphs such as gene networks. In a standard spectral GCN, there is only one gene network to describe the relationships among genes. However, for genomic applications, due to condition- or tissue-specific gene function and regulation, multiple gene networks may be available; it is unclear how to apply GCNs to disease classification with multiple networks. Besides, which gene networks may provide more effective prior information for a given learning task is unknown a priori and is not straightforward to discover in many cases. A deep multiple graph convolutional neural network is therefore developed here to meet the challenge. The new approach not only computes a feature of a gene as the weighted average of those of itself and its neighbors through spectral GCNs, but also extracts features from gene-specific expression (or other feature) profiles via a feed-forward neural network (FNN). We also provide two measures, the importance of a given gene and the relative importance score of each gene network, for the genes' and gene networks' contributions, respectively, to the learning task. To evaluate the new method, we conduct real data analyses using several breast cancer and diffuse large B-cell lymphoma datasets and incorporating multiple gene networks obtained from "GIANT 2.0". Compared with the standard FNN, GCN, and random forest, the new method not only yields high classification accuracy but also prioritizes the most important genes confirmed to be highly associated with cancer, strongly suggesting the usefulness of the new method in incorporating multiple gene networks.
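The idea of computing a gene's feature as a weighted average of its own and its neighbors' features can be sketched as one propagation step over a gene network. This is a simplification under stated assumptions: it uses a row-normalised adjacency matrix with self-loops, has no learned weights or nonlinearity, and does not reproduce the paper's spectral filters or its handling of multiple networks. The function name and the three-gene network are hypothetical.

```python
def neighbor_average(adj, features):
    """Replace each node's features by the weighted average over itself and its neighbors.

    adj      : square adjacency matrix (list of lists), adj[i][j] >= 0
    features : one feature vector (list) per node
    """
    n = len(adj)
    dim = len(features[0])
    out = []
    for i in range(n):
        # Add a self-loop so a gene's own feature takes part in the average.
        weights = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        total = sum(weights)
        out.append([
            sum(weights[j] * features[j][k] for j in range(n)) / total
            for k in range(dim)
        ])
    return out


# Toy gene network: a path 0 - 1 - 2, with one expression feature per gene.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0], [2.0], [3.0]]
smoothed = neighbor_average(adj, features)
```

With several gene networks, one would run such a step once per adjacency matrix and let the model weigh the resulting feature sets, which is the role the abstract assigns to the relative importance score of each network.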
Affiliation(s)
- Hu Yang
- School of Information, Central University of Finance and Economics, Beijing, China
- Zhong Zhuang
- Department of EECE, University of Minnesota, Minneapolis, Minnesota, USA
- Wei Pan
- Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
17
Wu M, Yi H, Ma S. Vertical integration methods for gene expression data analysis. Brief Bioinform 2021; 22:bbaa169. [PMID: 32793970 PMCID: PMC8138889 DOI: 10.1093/bib/bbaa169] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/24/2020] [Revised: 06/18/2020] [Accepted: 07/04/2020] [Indexed: 12/12/2022] Open
Abstract
Gene expression data have played an essential role in many biomedical studies. When the number of genes is large and sample size is limited, there is a 'lack of information' problem, leading to low-quality findings. To tackle this problem, both horizontal and vertical data integrations have been developed, where vertical integration methods collectively analyze data on gene expressions as well as their regulators (such as mutations, DNA methylation and miRNAs). In this article, we conduct a selective review of vertical data integration methods for gene expression data. The reviewed methods cover both marginal and joint analysis and supervised and unsupervised analysis. The main goal is to provide a sketch of the vertical data integration paradigm without digging into too many technical details. We also briefly discuss potential pitfalls, directions for future developments and application notes.
Affiliation(s)
- Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics
- Huangdi Yi
- Department of Biostatistics at Yale University
- Shuangge Ma
- Department of Biostatistics at Yale University
18
Wang HC, Chiang CJ, Liu TC, Wu CC, Chen YT, Chang JG, Shieh GS. Immunohistochemical Expression of Five Protein Combinations Revealed as Prognostic Markers in Asian Oral Cancer. Front Genet 2021; 12:643461. [PMID: 33936170 PMCID: PMC8083901 DOI: 10.3389/fgene.2021.643461] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/18/2020] [Accepted: 03/01/2021] [Indexed: 12/24/2022] Open
Abstract
Oral squamous cell carcinoma (OSCC) has a high mortality rate (∼50%), and the 5-year overall survival rate is not optimal. Cyto- and histopathological examination of cancer tissues is the main strategy for diagnosis and treatment. In the present study, we aimed to uncover immunohistochemical (IHC) markers for prognosis in Asian OSCC. From the collected 742 synthetic lethal gene pairs (of various cancer types), we first filtered genes relevant to OSCC, performed 29 IHC stains at different cellular portions and combined these IHC stains into 398 distinct pairs. Next, we identified novel IHC prognostic markers in OSCC in the Taiwanese population from the single and paired IHC staining by univariate Cox regression analysis. Increased nuclear expression of RB1 [RB1(N)↑], CDH3(C)↑-STK17A(N)↑ and FLNA(C)↑-KRAS(C)↑ were associated with survival, but not independent of tumor stage, where C and N denote cytoplasm and nucleus, respectively. Furthermore, multivariate Cox regression analyses revealed that CSNK1E(C)↓-SHC1(N)↓ (P = 5.9 × 10⁻⁵; recommended for clinical use), BRCA1(N)↓-SHC1(N)↓ (P = 0.030), CSNK1E(C)↓-RB1(N)↑ (P = 0.045), [CSNK1E(C)-SHC1(N), FLNA(C)-KRAS(C)] (P = 0.000, rounded to three decimal places) and [BRCA1(N)-SHC1(N), FLNA(C)-KRAS(C)] (P = 0.020) were significant factors of poor prognosis, independent of lymph node metastasis, stage and alcohol consumption. An external dataset from The Cancer Genome Atlas HNSCC cohort confirmed that CDH3↑-STK17A↑ was a significant predictor of poor survival. Our approach identified prognostic markers with components involved in different pathways and revealed IHC marker pairs in which neither single IHC stain was a marker on its own; thus, it improved on the current state of the art for identification of IHC markers.
Affiliation(s)
- Hui-Ching Wang
- Graduate Institute of Clinical Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan; Division of Hematology and Oncology, Department of Internal Medicine, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung, Taiwan
- Ta-Chih Liu
- Department of Hematology-Oncology, Chang Bing Show Chwan Memorial Hospital, Changhua, Taiwan
- Chun-Chieh Wu
- Department of Pathology, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung, Taiwan
- Yi-Ting Chen
- Department of Pathology, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung, Taiwan
- Jan-Gowth Chang
- Epigenome Research Center, China Medical University Hospital, Taichung, Taiwan; Department of Laboratory Medicine, China Medical University Hospital, Taichung, Taiwan; Center for Precision Medicine, China Medical University Hospital, Taichung, Taiwan; School of Medicine, China Medical University, Taichung, Taiwan; Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan
- Grace S Shieh
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan; Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan; Genome and Systems Biology Degree Program, Academia Sinica and National Taiwan University, Taipei, Taiwan; Data Science Degree Program, Academia Sinica and National Taiwan University, Taipei, Taiwan
19
Pirgazi J, Olyaee MH, Khanteymoori A. KFGRNI: A robust method to inference gene regulatory network from time-course gene data based on ensemble Kalman filter. J Bioinform Comput Biol 2021; 19:2150002. [PMID: 33657986 DOI: 10.1142/s0219720021500025] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
Abstract
A central problem of systems biology is the reconstruction of Gene Regulatory Networks (GRNs) from time series data. Although many attempts have been made to design an efficient method for GRN inference, providing an optimal solution is still a challenging task. Noise, a low number of samples, and a high number of nodes are the main reasons for the poor performance of existing methods. The present study applies the ensemble Kalman filter algorithm to model a GRN from gene time series data. The inference of a GRN with p genes is decomposed into p subproblems. In each subproblem, the ensemble Kalman filter algorithm identifies the weight of interactions for each target gene. With the use of the ensemble Kalman filter, the expression pattern of the target gene is predicted from the expression patterns of all the remaining genes. The proposed method is compared with several well-known approaches. The results of the evaluation indicate that the proposed method improves inference accuracy and recovers regulatory relations better from noisy data.
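The decomposition into p subproblems can be sketched as follows. The sketch only shows how each gene in turn becomes the target while all remaining genes become predictors; the ensemble Kalman filter that estimates the interaction weights in each subproblem is not implemented here. The function name and the toy time series are hypothetical.

```python
def split_into_subproblems(expr):
    """Yield one (target_index, predictors, target) triple per gene.

    expr : list of time points, each a list with the expression of p genes
    """
    p = len(expr[0])
    for g in range(p):
        # Series of the target gene across all time points.
        target = [row[g] for row in expr]
        # Expression of all remaining genes at the same time points.
        predictors = [[v for j, v in enumerate(row) if j != g] for row in expr]
        yield g, predictors, target


# Toy time series: two time points, three genes.
expr = [[1.0, 2.0, 3.0],
        [1.5, 2.5, 2.0]]
subproblems = list(split_into_subproblems(expr))
```

Each triple can then be handed to whatever estimator scores the influence of the predictor genes on the target, and the p weight vectors together form the inferred network.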
Affiliation(s)
- Jamshid Pirgazi
- Department of Electrical and Computer Engineering, University of Science and Technology of Mazandaran, Behshahr, Iran
- Mohammad Hossein Olyaee
- Department of Computer Engineering, Engineering Faculty, University of Gonabad, Gonabad, Iran
- Alireza Khanteymoori
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Germany; Department of Computer Engineering, Engineering Faculty, University of Zanjan, Zanjan Province, Iran
20
Zhao M, He W, Tang J, Zou Q, Guo F. A comprehensive overview and critical evaluation of gene regulatory network inference technologies. Brief Bioinform 2021; 22:6128842. [PMID: 33539514 DOI: 10.1093/bib/bbab009] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 09/28/2020] [Revised: 12/11/2020] [Accepted: 01/06/2021] [Indexed: 12/12/2022] Open
Abstract
A gene regulatory network (GRN) is an important mechanism for maintaining life processes, controlling biochemical reactions and regulating compound levels, and it plays an important role in various organisms and systems. Reconstructing GRNs can help us to understand the molecular mechanisms of organisms and to reveal the essential rules of a large number of biological processes and reactions in organisms. To deal with the uncertainty of processing, various outstanding network reconstruction algorithms use specific assumptions that affect prediction accuracy. To study why a certain method is more suitable for a specific research problem or experimental data, we conduct research from model-based, information-based and machine learning-based method classifications. There are obviously different types of computational tools that can be generated to distinguish GRNs. Furthermore, we discuss several classical, representative and latest methods in each category to analyze core ideas, general steps, characteristics, etc. We compare the performance of state-of-the-art GRN reconstruction technologies on simulated networks and real networks under different scaling conditions. Through standardized performance metrics and common benchmarks, we quantitatively evaluate the stability of various methods and the sensitivity of the same algorithm when applied to networks of different scales. The aim of this study is to explore the most appropriate method for a specific GRN, which helps biologists and medical scientists in discovering potential drug targets and identifying cancer biomarkers.
Affiliation(s)
- Mengyuan Zhao
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Wenying He
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Jijun Tang
- University of South Carolina, Tianjin, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
21
McDermott MBA, Wang J, Zhao WN, Sheridan SD, Szolovits P, Kohane I, Haggarty SJ, Perlis RH. Deep Learning Benchmarks on L1000 Gene Expression Data. IEEE/ACM Trans Comput Biol Bioinform 2020; 17:1846-1857. [PMID: 30990190 PMCID: PMC6980363 DOI: 10.1109/tcbb.2019.2910061] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/09/2023]
Abstract
Gene expression data can offer deep, physiological insights beyond the static coding of the genome alone. We believe that realizing this potential requires specialized, high-capacity machine learning methods capable of using underlying biological structure, but the development of such models is hampered by the lack of published benchmark tasks and well characterized baselines. In this work, we establish such benchmarks and baselines by profiling many classifiers against biologically motivated tasks on two curated views of a large, public gene expression dataset (the LINCS corpus) and one privately produced dataset. We provide these two curated views of the public LINCS dataset and our benchmark tasks to enable direct comparisons to future methodological work and help spur deep learning method development on this modality. In addition to profiling a battery of traditional classifiers, including linear models, random forests, decision trees, K nearest neighbor (KNN) classifiers, and feed-forward artificial neural networks (FF-ANNs), we also test a method novel to this data modality: graph convolutional neural networks (GCNNs), which allow us to incorporate prior biological domain knowledge. We find that GCNNs can be highly performant with large datasets, whereas FF-ANNs consistently perform well. Non-neural classifiers are dominated by linear models and KNN classifiers.
22
Saha S, Prasad A, Chatterjee P, Basu S, Nasipuri M. Protein function prediction from dynamic protein interaction network using gene expression data. J Bioinform Comput Biol 2020; 17:1950025. [PMID: 31617461 DOI: 10.1142/s0219720019500252] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/18/2022]
Abstract
Computational prediction of functional annotation of proteins is an uphill task. There is an ever-increasing gap between the functional characterization of protein sequences and the deluge of protein sequences generated by large-scale sequencing projects. Protein interactions are frequently observed to be dynamic, mostly influenced by a change of state or a change in stimuli. Functional characterization of proteins can be inferred from these dynamic interactions with each other. In this work, we have used a dynamic protein-protein interaction network (PPIN), time course gene expression data and protein sequence information for prediction of functional annotation of proteins. During progression of a particular function, it has also been observed that not all the proteins are active at all time points. For unannotated active proteins, our proposed methodology explores the dynamic PPIN consisting of level-1 and level-2 neighboring proteins at different time points, filtered by Damerau-Levenshtein edit distance to estimate the similarity between two protein sequences and by coefficient of variation methods to assess the strength of an edge in a network. Finally, from the filtered dynamic PPIN, at each time point, functional annotations of the level-2 proteins are assigned to the unknown and unannotated active proteins through the level-1 neighbor, following a bottom-up strategy. Our proposed methodology achieves an average precision, recall and F-Score of 0.59, 0.76 and 0.61, respectively, which is significantly higher than the reported state-of-the-art methods.
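The sequence-similarity filter relies on Damerau-Levenshtein edit distance, which counts insertions, deletions, substitutions and transpositions of adjacent characters. The sketch below implements the restricted variant (optimal string alignment); which variant the authors used is not stated in the abstract, so this choice is an assumption, as are the function name and the toy sequences.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i characters of a and the first j of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


# Toy amino-acid fragments: swapping two adjacent residues costs a single edit.
swap_cost = osa_distance("MKV", "KMV")
same_cost = osa_distance("MKTAY", "MKTAY")
```

A filter built on this distance would keep a neighboring protein only when its distance to the query sequence falls under some threshold; the threshold itself is not given in the abstract.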
Affiliation(s)
- Sovan Saha
- Department of Computer Science & Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, 540, Dum Dum Road, Near Dum Dum Jn. Station, Surermath, Kolkata 700074, India
- Abhimanyu Prasad
- Department of Computer Science & Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, 540, Dum Dum Road, Near Dum Dum Jn. Station, Surermath, Kolkata 700074, India
- Piyali Chatterjee
- Department of Computer Science & Engineering, Netaji Subhash Engineering College, Techno City, Panchpota, Garia, Kolkata 700152, India
- Subhadip Basu
- Department of Computer Science & Engineering, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, India
- Mita Nasipuri
- Department of Computer Science & Engineering, Netaji Subhash Engineering College, Techno City, Panchpota, Garia, Kolkata 700152, India
23
Geistlinger L, Csaba G, Santarelli M, Ramos M, Schiffer L, Turaga N, Law C, Davis S, Carey V, Morgan M, Zimmer R, Waldron L. Toward a gold standard for benchmarking gene set enrichment analysis. Brief Bioinform 2020; 22:545-556. [PMID: 32026945 PMCID: PMC7820859 DOI: 10.1093/bib/bbz158] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Received: 08/12/2019] [Revised: 10/11/2019] [Accepted: 11/09/2019] [Indexed: 12/22/2022] Open
Abstract
MOTIVATION Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets. RESULTS We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance. AVAILABILITY http://bioconductor.org/packages/GSEABenchmarkeR. CONTACT ludwig.geistlinger@sph.cuny.edu.
Affiliation(s)
- Ludwig Geistlinger
- Graduate School of Public Health and Health Policy, City University of New York, New York, NY 10027, USA
- Gergely Csaba
- Institute for Implementation Science and Population Health, City University of New York, New York, NY 10027, USA
- Mara Santarelli
- Institute for Bioinformatics, Ludwig-Maximilians-Universität München, 80333 Munich, Germany
- Marcel Ramos
- Roswell Park Cancer Institute, Buffalo, NY 14203, USA
- Lucas Schiffer
- Graduate School of Arts and Sciences, Boston University, Boston, MA 02215, USA
- Nitesh Turaga
- Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria 3052, Australia
- Charity Law
- Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia
- Sean Davis
- Center for Cancer Research, National Cancer Institute, Bethesda, MD 20892, USA
- Levi Waldron
- Graduate School of Public Health and Health Policy, City University of New York, New York, NY 10027, USA
24
Abstract
Individual patient biomarkers have an important role in personalized treatment. Although various high-throughput sequencing technologies are widely used in biological experiments, these are usually conducted only once or a few times for each patient, which makes it a challenging problem to identify biomarkers in individual patients. At present, there is a lack of effective methods to identify biomarkers in individual sample data. Here, we propose a novel method, IBI, to identify biomarkers in individual tumor samples. Experimental results from several tumor data sets showed that the proposed method could effectively find biomarker genes for individual patients, including common biomarkers related to the mechanisms of the development of cancer, which can be used to predict survival and drug response in patients. In summary, these results demonstrate that the proposed method offers a new perspective for analyzing individual samples.
Affiliation(s)
- Jie Li
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Dong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
25
Rotimi SO, Rotimi OA, Salako AA, Jibrin P, Oyelade J, Iweala EEJ. Gene Expression Profiling Analysis Reveals Putative Phytochemotherapeutic Target for Castration-Resistant Prostate Cancer. Front Oncol 2019; 9:714. [PMID: 31428582 PMCID: PMC6687853 DOI: 10.3389/fonc.2019.00714] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 06/02/2019] [Accepted: 07/18/2019] [Indexed: 01/16/2023] Open
Abstract
Prostate cancer is the leading cause of cancer death among men globally, with the development of castration resistance contributing significantly to treatment failure and death. By analyzing the differentially expressed genes between castration-induced regression nadir and castration-resistant regrowth of the prostate, we identified soluble guanylate cyclase 1 subunit alpha as biologically significant in driving castration-resistant prostate cancer. A virtual screening of the modeled protein against 242 experimentally validated anti-prostate cancer phytochemicals revealed potential drug inhibitors. Although the four identified non-synonymous somatic point mutations of the human soluble guanylate cyclase 1 gene could alter its form and ligand binding ability, our analysis identified compounds that could effectively inhibit the mutants together with the wild type. Of the identified phytochemicals, (8′R)-neochrome and (8′S)-neochrome derived from spinach (Spinacia oleracea) showed the highest binding energies against the wild-type and mutant proteins. Our results identified the neochromes and other phytochemicals as leads in pharmacotherapy and as nutraceuticals in the management and prevention of castration-resistant prostate cancers.
Affiliation(s)
- Solomon Oladapo Rotimi
- Department of Biochemistry and Molecular Biology Research Laboratory, Covenant University, Ota, Nigeria
| | | | | | - Paul Jibrin
- Department of Pathology, National Hospital, Abuja, Nigeria
| | - Jelili Oyelade
- Department of Computer and Information Sciences, Covenant University, Ota, Nigeria
| | - Emeka E J Iweala
- Department of Biochemistry and Molecular Biology Research Laboratory, Covenant University, Ota, Nigeria
| |
|
26
|
Abstract
Here we report a bio-statistical/informatics tool, ABioTrans, developed in R for gene expression analysis. The tool allows the user to directly read RNA-Seq data files deposited in the Gene Expression Omnibus or GEO database. Operated using any web browser application, ABioTrans provides easy options for multiple statistical distribution fitting, Pearson and Spearman rank correlations, PCA, k-means and hierarchical clustering, differential expression (DE) analysis, Shannon entropy and noise (square of coefficient of variation) analyses, as well as Gene ontology classifications.
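Two of the quantities named in this abstract have compact definitions, so a small illustration may help. The following is a minimal Python sketch, not ABioTrans code (the tool itself is written in R): it computes the Shannon entropy of a histogram of expression values and the noise as the squared coefficient of variation. The function names, the equal-width binning, and the default of 10 bins are assumptions made for this example; how ABioTrans bins the data is not stated in the abstract.

```python
import math
from statistics import mean, pvariance


def shannon_entropy(values, bins=10):
    """Shannon entropy (in bits) of an equal-width histogram of the values.

    The binning scheme is an assumption of this sketch.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all values identical: a single occupied bin
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def noise_cv2(values):
    """Noise as the squared coefficient of variation: variance / mean**2."""
    m = mean(values)
    return pvariance(values) / (m * m)
```

For instance, `noise_cv2([1, 3])` has mean 2 and population variance 1, giving 0.25.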
Affiliation(s)
- Yutong Zou
- Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
| | - Thuy Tien Bui
- Biotransformation Innovation Platform (BioTrans), Agency for Science, Technology and Research (ASTAR), Singapore, Singapore
| | - Kumar Selvarajoo
- Biotransformation Innovation Platform (BioTrans), Agency for Science, Technology and Research (ASTAR), Singapore, Singapore
| |
|
27
|
Acosta JP, Restrepo S, Henao JD, López-Kleine L. Multivariate Method for Inferential Identification of Differentially Expressed Genes in Gene Expression Experiments. J Comput Biol 2019; 26:866-874. [PMID: 31063414 DOI: 10.1089/cmb.2018.0013] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/30/2022] Open
Abstract
Microarray technology is widely recognized as one of the most important tools when it comes to understanding genetic expression in biological processes. In light of the thousands of gene expression level measurements (including measurements across a number of conditions), identifying differentially expressed genes necessarily implies data mining or large-scale multiple testing procedures. To date, advances with regard to this field have been multivariate-descriptive or inferential-univariate in nature and therefore have important limitations regarding the biological validity of detected genes. In the present article, we present a new multivariate inferential method designed to detect active differentially expressed genes in gene expression data. The proposed method estimates false discovery rates using artificial components. Our method excels when applied to the most common gene expression data structures, providing new insights into differentially expressed genes. The method described herein was programmed in an R-Bioconductor package called acde that has been available since 2015.
Affiliation(s)
- Juan Pablo Acosta
- Statistics Department, Universidad Nacional de Colombia, Bogotá D.C., Colombia
| | - Silvia Restrepo
- Department of Biological Sciences, Universidad de los Andes, Bogotá D.C., Colombia
| | - Juan David Henao
- Faculty of Engineering, Universidad Nacional de Colombia, Bogotá D.C., Colombia
| | | |
|
28
|
Guan L, Luo Q, Liang N, Liu H. A prognostic prediction system for hepatocellular carcinoma based on gene co-expression network. Exp Ther Med 2019; 17:4506-4516. [PMID: 31086582 PMCID: PMC6489019 DOI: 10.3892/etm.2019.7494] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/06/2018] [Accepted: 01/25/2019] [Indexed: 12/11/2022] Open
Abstract
In the present study, gene expression data of hepatocellular carcinoma (HCC) were analyzed by using a multi-step Bioinformatics approach to establish a novel prognostic prediction system. Gene expression profiles were downloaded from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases. The overlapping differentially expressed genes (DEGs) between these two datasets were identified using the limma package in R. Prognostic genes were further identified by Cox regression using the survival package. The significantly co-expressed gene pairs were selected using the R function cor to construct the co-expression network. Functional and module analyses were also performed. Next, a prognostic prediction system was established by Bayes discriminant analysis using the discriminant.bayes function in the e1071 package, which was further validated in another independent GEO dataset. A total of 177 overlapping DEGs were identified from TCGA and the GEO dataset (GSE36376). Furthermore, 161 prognostic genes were selected and the top six were stanniocalcin 2, carbonic anhydrase 12, cell division cycle (CDC) 20, deoxyribonuclease 1 like 3, glucosylceramidase β3 and metallothionein 1G. A gene co-expression network involving 41 upregulated and 52 downregulated genes was constructed. SPC24, endothelial cell specific molecule 1, CDC20, CDCA3, cyclin (CCN) E1 and chromatin licensing and DNA replication factor 1 were significantly associated with cell division, mitotic cell cycle and positive regulation of cell proliferation. CCNB1, CCNE1, CCNB2 and stratifin were clearly associated with the p53 signaling pathway. A prognostic prediction system containing 55 signature genes was established and then validated in the GEO dataset GSE20140. In conclusion, the present study identified a number of prognostic genes and established a prediction system to assess the prognosis of HCC patients.
Affiliation(s)
- Lianyue Guan
- Department of Hepatobiliary-Pancreatic Surgery, China-Japan Union Hospital of Jilin University, Changchun, Jilin 130033, P.R. China
| | - Qiang Luo
- Department of Ultrasound, China-Japan Union Hospital of Jilin University, Changchun, Jilin 130033, P.R. China
| | - Na Liang
- Office of Surgical Nursing, Changchun Medical College, Changchun, Jilin 130000, P.R. China
| | - Hongyu Liu
- Department of Hepatobiliary-Pancreatic Surgery, China-Japan Union Hospital of Jilin University, Changchun, Jilin 130033, P.R. China
| |
|
29
|
Teran Hidalgo SJ, Zhu T, Wu M, Ma S. Overlapping clustering of gene expression data using penalized weighted normalized cut. Genet Epidemiol 2018; 42:796-811. [PMID: 30302823 PMCID: PMC6239939 DOI: 10.1002/gepi.22164] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/12/2018] [Revised: 07/24/2018] [Accepted: 08/28/2018] [Indexed: 02/06/2023]
Abstract
Clustering has been widely conducted in the analysis of gene expression data. For complex diseases, it has played an important role in identifying unknown functions of genes and in serving as the basis of other analyses, among other uses. A common limitation of most existing clustering approaches is the assumption that genes are separated into disjoint clusters. As genes often have multiple functions and thus can belong to more than one functional cluster, the disjoint clustering results can be unsatisfactory. In addition, due to the small sample sizes of genetic profiling studies and other factors, there may not be sufficient evidence to confirm the specific functions of some genes and cluster them definitively into disjoint clusters. In this study, we develop an effective overlapping clustering approach, which takes into account the multiplicity of gene functions and the lack of certainty in practical analysis. A penalized weighted normalized cut (PWNCut) criterion is proposed based on the NCut technique and an $L_2$ norm constraint. It outperforms multiple competitors in simulation. The analysis of The Cancer Genome Atlas (TCGA) data on breast cancer and cervical cancer leads to biologically sensible findings which differ from those using the alternatives. To facilitate implementation, we develop the function pwncut in the R package NCutYX.
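The NCut technique that PWNCut builds on can be illustrated with the standard normalized cut value of a disjoint partition. The Python sketch below evaluates only the plain NCut criterion on a symmetric similarity matrix; it does not implement the penalty, the weighting, or the overlapping assignments of PWNCut, and it is unrelated to the code in the NCutYX package. The function name `ncut` and the list-of-lists matrix layout are assumptions of this example.

```python
def ncut(weights, labels):
    """Normalized cut of a disjoint partition.

    weights: symmetric similarity matrix as a list of lists.
    labels:  cluster label for each node.
    Returns the sum over clusters of cut(A, rest) / assoc(A, all nodes).
    """
    n = len(weights)
    total = 0.0
    for k in set(labels):
        members = [i for i in range(n) if labels[i] == k]
        # assoc: total similarity from cluster members to every node.
        assoc = sum(weights[i][j] for i in members for j in range(n))
        # cut: similarity from cluster members to nodes outside the cluster.
        cut = sum(
            weights[i][j]
            for i in members
            for j in range(n)
            if labels[j] != k
        )
        if assoc > 0:
            total += cut / assoc
    return total
```

A lower value indicates a partition that severs less similarity relative to the total connectivity of each cluster, which is why a grouping that keeps strongly similar genes together scores lower than one that splits them.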
Affiliation(s)
| | - Tingyu Zhu
- Department of Statistics, Xiamen University, Xiamen, China
| | - Mengyun Wu
- Department of Biostatistics, Yale University, New Haven, Connecticut.,School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut
| |
|
30
|
Makhijani RK, Raut SA, Purohit HJ. Fold change based approach for identification of significant network markers in breast, lung and prostate cancer. IET Syst Biol 2018; 12:213-218. [PMID: 30259866 PMCID: PMC8687202 DOI: 10.1049/iet-syb.2018.0012] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/12/2018] [Revised: 04/12/2018] [Accepted: 04/15/2018] [Indexed: 12/17/2022] Open
Abstract
Cancer belongs to a class of highly aggressive diseases and is a leading cause of death in the world. Among more than 100 types of cancer, breast, lung and prostate cancer remain the most common. To identify essential network markers (NMs) and therapeutic targets in these cancers, the authors present a novel approach which uses gene expression data from microarray and RNA-seq platforms and utilises the results from this data to evaluate a protein-protein interaction (PPI) network. Differentially expressed genes (DEGs) are extracted from microarray data using three different statistical methods in R, to produce a consistent set of genes. DEGs are also extracted from RNA-seq data for the same three cancer types. DEG sets found to be common to both platforms are obtained at three fold change (FC) cut-off levels to accurately identify the level of change in expression of these genes in all three cancers. A cancer network is built using PPI data characterising gene sets at log-FC (LFC)>1, LFC>1.5 and LFC>2, and the interconnection between principal hub nodes of these networks is observed. The resulting network of hubs at three FC levels highlights prime NMs with high confidence in multiple cancers, as validated by Gene Ontology functional enrichment and maximal complete subgraphs from CFinder.
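The three cut-off levels mentioned above (LFC>1, LFC>1.5 and LFC>2) amount to a simple threshold filter, which can be sketched as follows. This is an illustrative Python example, not the authors' R workflow. The function names, the pseudocount of 1, and the use of the absolute LFC (so that both up- and down-regulated genes pass) are assumptions; the abstract does not say whether the sign is considered.

```python
import math


def log2_fold_change(case_mean, control_mean, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return math.log2((case_mean + pseudocount) / (control_mean + pseudocount))


def genes_above_cutoffs(lfc_by_gene, cutoffs=(1.0, 1.5, 2.0)):
    """Map each cut-off to the sorted gene names whose |LFC| exceeds it."""
    return {
        c: sorted(g for g, lfc in lfc_by_gene.items() if abs(lfc) > c)
        for c in cutoffs
    }
```

With hypothetical values such as `{"A": 2.5, "B": -1.2, "C": 0.3}`, gene A passes all three cut-offs, gene B passes only the first, and gene C passes none. The gene sets are nested, so each stricter cut-off yields a subset of the previous one.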
Affiliation(s)
- Richa K Makhijani
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur (MS) 440010, India.
| | - Shital A Raut
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur (MS) 440010, India
| | - Hemant J Purohit
- Environmental Genomics Division, National Environmental Engineering Research Institute, Nagpur (MS) 440020, India
| |
|
31
|
Li Y, Bie R, Teran Hidalgo SJ, Qin Y, Wu M, Ma S. Assisted gene expression-based clustering with AWNCut. Stat Med 2018; 37:4386-4403. [PMID: 30094873 DOI: 10.1002/sim.7928] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/13/2018] [Revised: 05/15/2018] [Accepted: 07/05/2018] [Indexed: 01/06/2023]
Abstract
In research on complex diseases, gene expression (GE) data have been extensively used for clustering samples. The clusters so generated can serve as the basis for disease subtype identification, risk stratification, and many other purposes. With the small sample sizes of genetic profiling studies and the noisy nature of GE data, clustering analysis results are often unsatisfactory. In the most recent studies, a prominent trend is to conduct multidimensional profiling, which collects data on GEs and their regulators (copy number alterations, microRNAs, methylation, etc.) on the same subjects. Because of the regulation relationships, regulators contain important information on the properties of GEs. We develop a novel assisted clustering method, which effectively uses regulator information to improve clustering analysis using GE data. To account for the fact that not all GEs are informative, we propose a weighted strategy, where the weights are determined data-dependently and can discriminate informative GEs from noise. The proposed method is built on the NCut technique and effectively realized using a simulated annealing algorithm. Simulations demonstrate that it can outperform multiple direct competitors. In the analysis of TCGA cutaneous melanoma and lung adenocarcinoma data, biologically sensible findings different from the alternatives are made.
Affiliation(s)
- Yang Li
- Center for Applied Statistics, Renmin University of China, Beijing, China.,School of Statistics, Renmin University of China, Beijing, China
| | - Ruofan Bie
- School of Statistics, Renmin University of China, Beijing, China
| | | | - Yichen Qin
- Department of Operations, Business Analytics, and Information Systems, University of Cincinnati, Cincinnati, Ohio
| | - Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China.,Department of Biostatistics, Yale University, New Haven, Connecticut
| | - Shuangge Ma
- School of Statistics, Renmin University of China, Beijing, China.,Department of Biostatistics, Yale University, New Haven, Connecticut
| |
|
32
|
Shahbeig S, Rahideh A, Helfroush MS, Kazemi K. Gene expression feature selection for prostate cancer diagnosis using a two-phase heuristic-deterministic search strategy. IET Syst Biol 2018; 12:162-169. [PMID: 33451186 DOI: 10.1049/iet-syb.2017.0044] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/29/2017] [Revised: 02/19/2018] [Accepted: 03/08/2018] [Indexed: 01/28/2023] Open
Abstract
Here, a two-phase search strategy is proposed to identify the biomarkers in a gene expression data set for prostate cancer diagnosis. A statistical filtering method is initially employed to remove the noisiest data. In the first phase of the search strategy, a multi-objective optimisation based on the binary particle swarm optimisation algorithm tuned by a chaotic method is proposed to select the optimal subset of genes with the minimum number of genes and the maximum classification accuracy. In the second phase of the search strategy, the cache-based modification of the sequential forward floating selection algorithm is used to find the most discriminant genes from the optimal subset of genes selected in the first phase. The results of applying the proposed algorithm to the available challenging prostate cancer data set demonstrate that the proposed algorithm can identify the informative genes such that a classification accuracy, sensitivity, and specificity of 100% are achieved with only nine biomarkers.
Affiliation(s)
- Saleh Shahbeig
- Department of Electrical and Electronics Engineering, Shiraz University of Technology, Shiraz, Iran
| | - Akbar Rahideh
- Department of Electrical and Electronics Engineering, Shiraz University of Technology, Shiraz, Iran
| | | | - Kamran Kazemi
- Department of Electrical and Electronics Engineering, Shiraz University of Technology, Shiraz, Iran
| |
|
33
|
Irimie AI, Braicu C, Cojocneanu R, Magdo L, Onaciu A, Ciocan C, Mehterov N, Dudea D, Buduru S, Berindan-Neagoe I. Differential Effect of Smoking on Gene Expression in Head and Neck Cancer Patients. Int J Environ Res Public Health 2018; 15:ijerph15071558. [PMID: 30041465 PMCID: PMC6069101 DOI: 10.3390/ijerph15071558] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 04/27/2018] [Revised: 07/11/2018] [Accepted: 07/17/2018] [Indexed: 12/13/2022]
Abstract
Smoking is a well-known behavior that has an important negative impact on human health, and is considered to be a significant factor related to the development and progression of head and neck squamous cell carcinomas (HNSCCs). Use of high-dimensional datasets to discern novel HNSCC driver genes related to smoking represents an important challenge. The Cancer Genome Atlas (TCGA) analysis was performed in three co-existing groups of HNSCC in order to assess whether gene expression landscape is affected by tobacco smoking, having quit, or non-smoking status. We identified a set of differentially expressed genes that discriminate between smokers and non-smokers or based on human papilloma virus (HPV)16 status, or the co-occurrence of these two exposome components in HNSCC. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways classification shows that most of the genes are specific to cellular metabolism, emphasizing metabolic detoxification pathways, metabolism of chemical carcinogenesis, or drug metabolism. In the case of HPV16-positive patients it has been demonstrated that the altered genes are related to cellular adhesion and inflammation. The correlation between smoking and the survival rate was not statistically significant. This emphasizes the importance of the complex environmental exposure and genetic factors in order to establish prevention assays and personalized care system for HNSCC, with the potential for being extended to other cancer types.
Affiliation(s)
- Alexandra Iulia Irimie
- Department of Prosthetic Dentistry and Dental Materials, Division Dental Propaedeutics, Aesthetic, Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca, 23 Marinescu Street, Cluj-Napoca 40015, Romania.
| | - Cornelia Braicu
- Research Center for Functional Genomics and Translational Medicine, Iuliu Hatieganu University of Medicine and Pharmacy, 23 Marinescu Street, Cluj-Napoca 40015, Romania.
| | - Roxana Cojocneanu
- Research Center for Functional Genomics and Translational Medicine, Iuliu Hatieganu University of Medicine and Pharmacy, 23 Marinescu Street, Cluj-Napoca 40015, Romania.
| | - Lorand Magdo
- Research Center for Functional Genomics and Translational Medicine, Iuliu Hatieganu University of Medicine and Pharmacy, 23 Marinescu Street, Cluj-Napoca 40015, Romania.
| | - Anca Onaciu
- MEDFUTURE-Research Center for Advanced Medicine, University of Medicine and Pharmacy Iuliu Hatieganu, 23 Marinescu Street, Cluj-Napoca 40015, Romania.
| | - Cristina Ciocan
- MEDFUTURE-Research Center for Advanced Medicine, University of Medicine and Pharmacy Iuliu Hatieganu, 23 Marinescu Street, Cluj-Napoca 40015, Romania.
| | - Nikolay Mehterov
- Department of Medical Biology, Faculty of Medicine, Medical University-Plovdiv, 15-А Vassil Aprilov Blvd., Plovdiv 4000, Bulgaria.
- Technological Center for Emergency Medicine, 15-А Vassil Aprilov Blvd., Plovdiv 4000, Bulgaria.
| | - Diana Dudea
- Department of Prosthetic Dentistry and Dental Materials, Division Dental Propaedeutics, Aesthetic, Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca, 23 Marinescu Street, Cluj-Napoca 40015, Romania.
| | - Smaranda Buduru
- Prosthetics and Dental Materials, Faculty of Dental Medicine, Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca, 32 Clinicilor Street, Cluj-Napoca 400006, Romania.
| | - Ioana Berindan-Neagoe
- Research Center for Functional Genomics and Translational Medicine, Iuliu Hatieganu University of Medicine and Pharmacy, 23 Marinescu Street, Cluj-Napoca 40015, Romania.
- Department of Medical Biology, Faculty of Medicine, Medical University-Plovdiv, 15-А Vassil Aprilov Blvd., Plovdiv 4000, Bulgaria.
- Department of Functional Genomics and Experimental Pathology, The Oncology Institute Ion Chiricuta, Republicii 34th Street, Cluj-Napoca 400015, Romania.
| |
|
34
|
Xia CQ, Han K, Qi Y, Zhang Y, Yu DJ. A Self-Training Subspace Clustering Algorithm under Low-Rank Representation for Cancer Classification on Gene Expression Data. IEEE/ACM Trans Comput Biol Bioinform 2018; 15:1315-1324. [PMID: 28600258 PMCID: PMC5986621 DOI: 10.1109/tcbb.2017.2712607] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 05/04/2023]
Abstract
Accurate identification of cancer types is essential to cancer diagnosis and treatment. Since cancer tissue and normal tissue have different gene expression, gene expression data can be used as an efficient feature source for cancer classification. However, accurate cancer classification directly using original gene expression profiles remains challenging due to the intrinsically high dimensionality of the features and the small number of data samples. We proposed a new self-training subspace clustering algorithm under low-rank representation, called SSC-LRR, for cancer classification on gene expression data. Low-rank representation (LRR) is first applied to extract discriminative features from the high-dimensional gene expression data; the self-training subspace clustering (SSC) method is then used to generate the cancer classification predictions. SSC-LRR was tested on two separate benchmark datasets in comparison with four state-of-the-art classification methods. It generated cancer classification predictions with an overall accuracy of 89.7 percent and a general correlation of 0.920, which are 18.9 and 24.4 percent higher than those of the best control method, respectively. In addition, several genes (RNF114, HLA-DRB5, USP9Y, and PTPN20) were identified by SSC-LRR as new cancer identifiers that deserve further clinical investigation. Overall, the study demonstrated a new sensitive avenue to recognize cancer classifications from large-scale gene expression data.
|
35
|
Sheng Y, Tang J, Ren K, Manor LC, Cao H. Integrative computational approach to evaluate risk genes for postmenopausal osteoporosis. IET Syst Biol 2018; 12:118-122. [PMID: 29745905 PMCID: PMC8687217 DOI: 10.1049/iet-syb.2017.0043] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/25/2017] [Revised: 01/03/2018] [Accepted: 01/19/2018] [Indexed: 02/01/2024] Open
Abstract
In recent years, numerous studies have reported over a hundred genes playing roles in the etiology of postmenopausal osteoporosis (PO). However, many of these candidate genes lacked replication, and results were not always consistent. Here, the authors proposed a computational workflow to curate and evaluate PO-related genes. They integrated large-scale literature knowledge data and gene expression data (PO case/control: 10/10) for the marker evaluation. Pathway enrichment, sub-network enrichment, and gene-gene interaction analysis were conducted to study the pathogenic profile of the candidate genes, with four metrics proposed and validated for each gene. Using the authors' approach, a scalable PO genetic database was developed, including PO-related genes, diseases, pathways, and the supporting references. The PO case/control classification supported the effectiveness of the four proposed metrics, which successfully identified eight well-studied top PO genes (TGFB1, IL6, IL1B, TNF, ESR2, IGF1, HIF1A, and COL1A1) and highlighted one recently reported PO gene (IFNG). The computational biology approach and the PO database developed in this study provide a valuable resource which may facilitate understanding of the genetic profile of PO.
Affiliation(s)
- Yingjun Sheng
- Department of Orthopedics, Tongling People's Hospital, Tongling, Anhui Province 244000, People's Republic of China
| | - Jilei Tang
- Department of Orthopedics, Qidong People's Hospital, Nantong 226200, People's Republic of China
| | - Kewei Ren
- Department of Orthopedics, The Affiliated Jiangyin Hospital of Southeast University Medical School, Jiangyin 214400, People's Republic of China.
| | - Lydia C Manor
- Division of Pediatric Surgery, Children's National Health Systems, Washington DC, 20010, USA
| | - Hongbao Cao
- Department of Biology Product, Elsevier Inc, Rockville, MD, 20852, USA
| |
|
36
|
Li S, Liu X, Li H, Pan H, Acharya A, Deng Y, Yu Y, Haak R, Schmidt J, Schmalz G, Ziebolz D. Integrated analysis of long noncoding RNA-associated competing endogenous RNA network in periodontitis. J Periodontal Res 2018. [PMID: 29516510 DOI: 10.1111/jre.12539] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Indexed: 12/14/2022]
Abstract
BACKGROUND AND OBJECTIVES Long noncoding RNAs (lncRNAs) play critical and complex roles in regulating various biological processes of periodontitis. This bioinformatic study aims to construct a putative competing endogenous RNA (ceRNA) network by integrating lncRNA, miRNA and mRNA expression, based on high-throughput RNA sequencing and microarray data about periodontitis. MATERIAL AND METHODS Data from 1 miRNA and 3 mRNA expression profiles were obtained to construct the lncRNA-associated ceRNA network. Gene Ontology enrichment analysis and pathway analysis were performed using the Gene Ontology website and Kyoto Encyclopedia of Genes and Genomes. A protein-protein interaction network was constructed based on the Search Tool for the retrieval of Interacting Genes/Proteins. Transcription factors (TFs) of differentially expressed genes were identified based on TRANSFAC database and then a regulatory network was constructed. RESULTS Through constructing the dysregulated ceRNA network, 6 genes (HSPA4L, PANK3, YOD1, CTNNBIP1, EVI2B, ITGAL) and 3 miRNAs (miR-125a-3p, miR-200a, miR-142-3p) were detected. Three lncRNAs (MALAT1, TUG1, FGD5-AS1) were found to target both miR-125a-3p and miR-142-3p in this ceRNA network. Protein-protein interaction network analysis identified several hub genes, including VCAM1, ITGA4, UBC, LYN and SSX2IP. Three pathways (cytokine-cytokine receptor, cell adhesion molecules, chemokine signaling pathway) were identified to be overlapping results with the previous bioinformatics studies in periodontitis. Moreover, 2 TFs including FOS and EGR were identified to be involved in the regulatory network of the differentially expressed genes-TFs in periodontitis. CONCLUSION These findings suggest that 6 mRNAs (HSPA4L, PANK3, YOD1, CTNNBIP1, EVI2B, ITGAL), 3 miRNAs (hsa-miR-125a-3p, hsa-miR-200a, hsa-miR-142-3p) and 3 lncRNAs (MALAT1, TUG1, FGD5-AS1) might be involved in the lncRNA-associated ceRNA network of periodontitis. 
This study sought to illuminate further the genetic and epigenetic mechanisms of periodontitis through constructing an lncRNA-associated ceRNA network.
Affiliation(s)
- S Li
- Department of Cariology, Endodontology and Periodontology, University of Leipzig, Leipzig, Germany
| | - X Liu
- Shanghai Genomap Technologies, Yangpu District, Shanghai, China.,College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang Province, China
| | - H Li
- Saxon Incubator for Clinical Translation (SIKT), University of Leipzig, Leipzig, Germany
| | - H Pan
- Department of Orthopedic Surgery, Brigham and Women's Hospital, Harvard Medical School, Harvard University, Boston, MA, USA
| | - A Acharya
- Faculty of Dentistry, University of Hong Kong, Hong Kong, China.,Dr D Y Patil Dental College and Hospital, Dr D Y Patil Vidyapeeth, Pimpri, Pune, India
| | - Y Deng
- Shanghai Genomap Technologies, Yangpu District, Shanghai, China
| | - Y Yu
- Department of Periodontology, The Stomatology Affiliated Hospital of Harbin Medical University, Harbin, Heilongjiang Province, China
| | - R Haak
- Department of Cariology, Endodontology and Periodontology, University of Leipzig, Leipzig, Germany
| | - J Schmidt
- Department of Cariology, Endodontology and Periodontology, University of Leipzig, Leipzig, Germany
| | - G Schmalz
- Department of Cariology, Endodontology and Periodontology, University of Leipzig, Leipzig, Germany
| | - D Ziebolz
- Department of Cariology, Endodontology and Periodontology, University of Leipzig, Leipzig, Germany
| |
|
37
|
Abstract
Objective: Cancer diagnosis is one of the most vital emerging clinical applications of microarray data. Due to the high dimensionality, gene selection is an important step for improving expression data classification performance. There is therefore a need for effective methods to select informative genes for prediction and diagnosis of cancer. The main objective of this research was to derive a heuristic approach to select highly informative genes. Methods: A metaheuristic approach with a Genetic Algorithm with Levy Flight (GA-LV) was applied for classification of cancer genes in microarrays. The experimental results were analyzed with five major cancer gene expression benchmark datasets. Result: GA-LV proved superior to GA and statistical approaches, with 100% accuracy for the Leukemia, Lung and Lymphoma datasets. For the Prostate and Colon datasets, GA-LV was 99.5% and 99.2% accurate, respectively. Conclusion: The experimental results show that the proposed approach is suitable for effective gene selection with all benchmark datasets, removing irrelevant and redundant genes to improve classification accuracy.
Affiliation(s)
- Pyingkodi M
- Department of Computer Applications, Kongu Engineering College Erode, TamilNadu, India.
| | | |
|
38
|
M P, R B, N S. Cancer Detection in Microarray Data Using a Modified Cat Swarm Optimization Clustering Approach. Asian Pac J Cancer Prev 2017; 18:3451-3455. [PMID: 29286618 PMCID: PMC5980909 DOI: 10.22034/apjcp.2017.18.12.3451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022] Open
Abstract
Objective: A better understanding of functional genomics can be obtained by extracting patterns hidden in gene expression data. This could have paramount implications for cancer diagnosis, gene treatments and other domains. Clustering may reveal natural structures and identify interesting patterns in underlying data. The main objective of this research was to derive a heuristic approach to the detection of highly co-expressed genes related to cancer from gene expression data with minimum Mean Squared Error (MSE). Methods: A modified Cat Swarm Optimization (CSO) algorithm using Harmony Search (MCSO-HS) was applied for clustering cancer gene expression data. Experimental results were analyzed using two cancer gene expression benchmark datasets, namely for leukaemia and for breast cancer. Result: MCSO-HS performed better than HS and CSO by 13% and 9%, respectively, on the leukaemia dataset. For the breast cancer dataset, the improvement was 22% and 17%, respectively, in terms of MSE. Conclusion: The results showed MCSO-HS to outperform HS and CSO on both benchmark datasets. To validate the clustering results, this work was tested with internal and external cluster validation indices. This work also points to biological validation of clusters with gene ontology in terms of function, process and component.
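The MSE objective used to compare the clustering algorithms above can be made concrete with a short example. The Python sketch below assumes the common definition of clustering MSE, namely the mean squared Euclidean distance between each point and the centroid of its assigned cluster; the abstract does not spell out the exact formula, and the function name `clustering_mse` is hypothetical. It evaluates a given assignment only and does not implement MCSO-HS, CSO or HS.

```python
def clustering_mse(points, labels):
    """Mean squared Euclidean distance from each point to its cluster centroid."""
    clusters = {}
    for p, k in zip(points, labels):
        clusters.setdefault(k, []).append(p)

    squared_error = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid: coordinate-wise mean of the cluster members.
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        squared_error += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in members
        )
    return squared_error / len(points)
```

An optimizer of the kind described would search over assignments (or centroids) for the one that minimizes this value.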
Affiliation(s)
- Pandi M
- Department of Computer Science and Engineering, Bannari Amman Institute of Technology, Sathyamangalam, Erode, India.
|
39
|
Liu J, Wang X, Cheng Y, Zhang L. Tumor gene expression data classification via sample expansion-based deep learning. Oncotarget 2017; 8:109646-109660. [PMID: 29312636 PMCID: PMC5752549 DOI: 10.18632/oncotarget.22762] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 08/31/2017] [Accepted: 10/29/2017] [Indexed: 12/15/2022]
Abstract
Because tumors are seriously harmful to human health, effective diagnostic measures are urgently needed for tumor therapy. Early detection of tumors is particularly important for better treatment of patients. A notable issue is how to effectively discriminate tumor samples from normal ones. Many classification methods, such as Support Vector Machines (SVMs), have been proposed for tumor classification. Recently, deep learning has achieved satisfactory performance in classification tasks in many areas. However, deep learning is rarely applied to tumor classification because gene expression data provide too few training samples. In this paper, a Sample Expansion method is proposed to address the problem. Inspired by the idea of the Denoising Autoencoder (DAE), a large number of samples are obtained by randomly cleaning partially corrupted input many times. The expanded samples not only maintain the merits of corrupted data in DAE but also, to a certain extent, address the problem of insufficient training samples of gene expression data. Since Stacked Autoencoder (SAE) and Convolutional Neural Network (CNN) models show excellent performance in classification tasks, the applicability of SAE and 1-dimensional CNN (1DCNN) to gene expression data is analyzed. Finally, two deep learning models, Sample Expansion-Based SAE (SESAE) and Sample Expansion-Based 1DCNN (SE1DCNN), are designed to carry out tumor gene expression data classification using the expanded samples. Experimental studies indicate that SESAE and SE1DCNN are very effective in tumor classification.
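The abstract gives no implementation details for the Sample Expansion step. The sketch below shows one plausible reading under a stated assumption: each training profile is copied several times, and in each copy a random subset of features is set to zero (the masking noise commonly used with denoising autoencoders). The function name `expand_samples`, the corruption rate, and the toy profile are all hypothetical.

```python
import random


def expand_samples(sample, label, n_copies=5, corruption_rate=0.2, seed=0):
    """Return n_copies randomly corrupted versions of one expression profile.

    Each copy zeroes a random subset of features (masking noise, an
    assumption here); the class label is carried over unchanged.
    """
    rng = random.Random(seed)
    expanded = []
    for _ in range(n_copies):
        corrupted = [0.0 if rng.random() < corruption_rate else value
                     for value in sample]
        expanded.append((corrupted, label))
    return expanded


# Toy expression profile with made-up values, for illustration only.
profile = [2.1, 0.4, 3.3, 1.7, 0.9, 2.8]
augmented = expand_samples(profile, label="tumor", n_copies=4)
```

Applied to every sample in a small training set, this multiplies the number of training examples by `n_copies` while keeping the labels fixed.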
Affiliation(s)
- Jian Liu
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Xuesong Wang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Yuhu Cheng
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Lin Zhang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
|
40
|
Liu M, Fan X, Fang K, Zhang Q, Ma S. Integrative sparse principal component analysis of gene expression data. Genet Epidemiol 2017; 41:844-865. [PMID: 29114920 DOI: 10.1002/gepi.22089] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/05/2017] [Revised: 10/03/2017] [Accepted: 10/04/2017] [Indexed: 12/16/2022]
Abstract
In the analysis of gene expression data, dimension reduction techniques have been extensively adopted. The most popular one is perhaps the PCA (principal component analysis). To generate more reliable and more interpretable results, the SPCA (sparse PCA) technique has been developed. With the "small sample size, high dimensionality" characteristic of gene expression data, the analysis results generated from a single dataset are often unsatisfactory. Under contexts other than dimension reduction, integrative analysis techniques, which jointly analyze the raw data of multiple independent datasets, have been developed and shown to outperform "classic" meta-analysis and other multidatasets techniques and single-dataset analysis. In this study, we conduct integrative analysis by developing the iSPCA (integrative SPCA) method. iSPCA achieves the selection and estimation of sparse loadings using a group penalty. To take advantage of the similarity across datasets and generate more accurate results, we further impose contrasted penalties. Different penalties are proposed to accommodate different data conditions. Extensive simulations show that iSPCA outperforms the alternatives under a wide spectrum of settings. The analysis of breast cancer and pancreatic cancer data further shows iSPCA's satisfactory performance.
Affiliation(s)
- Mengque Liu
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Xinyan Fan
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Kuangnan Fang
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Qingzhao Zhang
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China; Wang Yanan Institute of Economics Studies, Xiamen University, Xiamen, China
- Shuangge Ma
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China; Wang Yanan Institute of Economics Studies, Xiamen University, Xiamen, China; Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
|
41
|
Abstract
This article reports a new clustering method, based on the k-means algorithm, for high-dimensional gene expression data. The proposed approach uses bidirectional penalties to constrain both the number of clusters and the cluster centroids, so that it simultaneously determines the unknown number of clusters and handles the large amount of noise in gene expression data. Numeric studies indicate that this algorithm not only performs better in clustering but is also comparable to other approaches in its ability to obtain the correct number of clusters and the correct signal features. Finally, we apply the proposed approach to two benchmark gene expression datasets. These analyses again indicate that the proposed algorithm performs well in clustering high-dimensional gene expression data with an unknown number of clusters.
Affiliation(s)
- Hu Yang
- School of Information, Central University of Finance and Economics, Beijing, China
- Xiaoqin Liu
- The National Center for Register-Based Research, Aarhus University, Aarhus, Denmark
|
42
|
Abstract
Breast cancer is a common malignancy among women with a rising incidence. Our intention was to detect transcription factors (TFs) for a deeper understanding of the underlying mechanisms of breast cancer. Integrated analysis of gene expression datasets of breast cancer was performed. Then, functional annotation of differentially expressed genes (DEGs) was conducted, including Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment. Furthermore, TFs were identified and a global transcriptional regulatory network was constructed. Seven publicly available GEO datasets were obtained, and a set of 1196 DEGs was identified (460 up-regulated and 736 down-regulated). Functional annotation results showed that cell cycle was the most significantly enriched pathway, which is consistent with the fact that the cell cycle is closely related to various tumors. Fifty-three differentially expressed TFs were identified, and the regulatory networks consisted of 817 TF-target interactions between 46 TFs and 602 DEGs in the context of breast cancer. The top 10 TFs covering the most downstream DEGs were SOX10, NFATC2, ZNF354C, ARID3A, BRCA1, FOXO3, GATA3, ZEB1, HOXA5 and EGR1. The transcriptional regulatory networks could enable a better understanding of the regulatory mechanisms of breast cancer pathology and provide an opportunity for the development of potential therapy.
Affiliation(s)
- Hongyan Zang
- Department of Breast Surgery, Yantaishan Hospital, Yantai, China
- Ning Li
- Department of Human Anatomy, School of Basic Medicine, Shandong University of Traditional Chinese Medicine, Jinan, China
- Yuling Pan
- Department of Human Anatomy, School of Basic Medicine, Shandong University of Traditional Chinese Medicine, Jinan, China
- Jingguang Hao
- Department of Breast Surgery, Yantaishan Hospital, Yantai, China
|
43
|
Lustgarten JL, Balasubramanian JB, Visweswaran S, Gopalakrishnan V. Learning Parsimonious Classification Rules from Gene Expression Data Using Bayesian Networks with Local Structure. Data. 2017;2:5. [PMID: 28331847 PMCID: PMC5358670 DOI: 10.3390/data2010005] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022]
Abstract
The comprehensibility of good predictive models learned from high-dimensional gene expression data is attractive because it can lead to biomarker discovery. Several good classifiers provide comparable predictive performance but differ in their abilities to summarize the observed data. We extend a Bayesian Rule Learning (BRL-GSS) algorithm, previously shown to be a significantly better predictor than other classical approaches in this domain. It searches a space of Bayesian networks using a decision tree representation of its parameters with global constraints, and infers a set of IF-THEN rules. The number of parameters, and therefore the number of rules, is combinatorial in the number of predictor variables in the model. We relax these global constraints to a more generalizable local structure (BRL-LSS). BRL-LSS entails a more parsimonious set of rules because it does not have to generate all combinatorial rules. The search space of local structures is much richer than the space of global structures. We design BRL-LSS with the same worst-case time complexity as BRL-GSS while exploring a richer and more complex model space. We measure predictive performance using Area Under the ROC Curve (AUC) and Accuracy. We measure model parsimony by noting the average number of rules and variables needed to describe the observed data. We evaluate the predictive and parsimony performance of BRL-GSS, BRL-LSS and the state-of-the-art C4.5 decision tree algorithm, using 10-fold cross-validation on ten microarray gene-expression diagnostic datasets. In these experiments, we observe that BRL-LSS is similar to BRL-GSS in terms of predictive performance, while generating a much more parsimonious set of rules to explain the same observed data. BRL-LSS also needs fewer variables than C4.5 to explain the data with similar predictive performance. We also conduct a feasibility study to demonstrate the general applicability of our BRL methods to the newer RNA sequencing gene-expression data.
|
44
|
Oyelade J, Isewon I, Oladipupo F, Aromolaran O, Uwoghiren E, Ameh F, Achas M, Adebiyi E. Clustering Algorithms: Their Application to Gene Expression Data. Bioinform Biol Insights 2016; 10:237-253. [PMID: 27932867 PMCID: PMC5135122 DOI: 10.4137/bbi.s38316] [Citation(s) in RCA: 64] [Impact Index Per Article: 8.0] [Received: 05/18/2016] [Revised: 09/05/2016] [Accepted: 09/09/2016] [Indexed: 12/17/2022]
Abstract
Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a major opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the number of genes involved increase the challenges of comprehending and interpreting the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. Therefore, the use of clustering techniques is a first step toward addressing these challenges, and is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. The clustering of gene expression data has proven useful in revealing the natural structure inherent in gene expression data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. The other benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable to gene expression data in order to discover and provide useful knowledge of the appropriate clustering technique that will guarantee stability and a high degree of accuracy in its analysis procedure.
Affiliation(s)
- Jelili Oyelade
- Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Itunuoluwa Isewon
- Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Funke Oladipupo
- Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Olufemi Aromolaran
- Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Efosa Uwoghiren
- Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Faridah Ameh
- Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Moses Achas
- Department of Computer Science and Information Technology, Bells University of Technology, Ota, Ogun State, Nigeria
- Ezekiel Adebiyi
- Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
|
45
|
Chang JG, Chen CC, Wu YY, Che TF, Huang YS, Yeh KT, Shieh GS, Yang PC. Uncovering synthetic lethal interactions for therapeutic targets and predictive markers in lung adenocarcinoma. Oncotarget 2016; 7:73664-73680. [PMID: 27655641 PMCID: PMC5342006 DOI: 10.18632/oncotarget.12046] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 03/18/2016] [Accepted: 08/24/2016] [Indexed: 12/28/2022]
Abstract
Two genes are called synthetic lethal (SL) if their simultaneous mutation leads to cell death, but mutation of either individual does not. Targeting SL partners of mutated cancer genes can selectively kill cancer cells, but leave normal cells intact. We present an integrated approach to uncover SL gene pairs as novel therapeutic targets of lung adenocarcinoma (LADC). Of 24 predicted SL pairs, PARP1-TP53 was validated by RNAi knockdown to have synergistic toxicity in H1975 and invasive CL1-5 LADC cells; additionally FEN1-RAD54B, BRCA1-TP53, BRCA2-TP53 and RB1-TP53 were consistent with the literature. While metastasis remains a bottleneck in cancer treatment and inhibitors of PARP1 have been developed, this result may have therapeutic potential for LADC, in which TP53 is commonly mutated. We also demonstrated that silencing PARP1 enhanced the cell death induced by the platinum-based chemotherapy drug carboplatin in lung cancer cells (CL1-5 and H1975). IHC of RAD54B↑, BRCA1↓-RAD54B↑, FEN1(N)↑-RAD54B↑ and PARP1↑-RAD54B↑ were shown to be prognostic markers for 131 Asian LADC patients, and all markers except BRCA1↓-RAD54B↑ were further confirmed by three independent gene expression data sets (a total of 426 patients) including The Cancer Genome Atlas (TCGA) cohort of LADC. Importantly, we identified POLB-TP53 and POLB as predictive markers for the TCGA cohort (230 subjects), independent of age and stage. Thus, POLB and POLB-TP53 may be used to stratify future non-Asian LADC patients for therapeutic strategies.
Affiliation(s)
- Jan-Gowth Chang
- Department of Laboratory Medicine and Epigenome Research Center, China Medical University Hospital, China Medical University, Taichung, Taiwan
- Chia-Cheng Chen
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
- Yi-Ying Wu
- Graduate Institute of Clinical Medicine, College of Medicine, National Cheng Kung University, Tainan, Taiwan
- Ting-Fang Che
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan
- Yi-Syuan Huang
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
- Kun-Tu Yeh
- Department of Pathology, Changhua Christian Hospital, Changhua, Taiwan
- Department of Pathology, School of Medicine, Chung Shan Medical University, Taichung, Taiwan
- Grace S. Shieh
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
- Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
- Genome and Systems Biology Degree Program, Academia Sinica and National Taiwan University, Taipei, Taiwan
- Pan-Chyr Yang
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan
- Center of Genomic Medicine, National Taiwan University, Taipei, Taiwan
- Department of Internal Medicine, National Taiwan University Hospital, Taipei, Taiwan
|
46
|
Abstract
In recent years several methods have been proposed to assign pairwise mechanism-based similarity scores to human diseases. Despite their differences in approach and performance, these methods work in a somewhat similar manner: first a set of biomolecules (genes, proteins, chemicals, etc.) is associated with each disease, and then a measure is defined to calculate the similarity between the sets assigned to a pair of diseases. Since the similarity score between two diseases is defined based on the underlying molecular processes, a high score may hint at a shared cause, and therefore a similar treatment, for both diseases. This is of great practical importance especially when a rare or newly-discovered disease, for which limited information is available, is found to be related to a disease with a known treatment. Thus, in this mini-review we briefly discuss the recently developed methods for computing mechanism-based disease-disease similarities.
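The two-step scheme described above (associate a set of biomolecules with each disease, then score the overlap between two sets) can be sketched with the Jaccard index. This is one simple set-overlap measure chosen for illustration; the abstract does not name a specific measure, and the disease labels and gene sets below are made up and do not represent real disease associations.

```python
def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical gene sets, for illustration only.
disease_genes = {
    "disease_x": {"TP53", "BRCA1", "EGFR", "KRAS"},
    "disease_y": {"TP53", "BRCA1", "MYC"},
}

score = jaccard_similarity(disease_genes["disease_x"],
                           disease_genes["disease_y"])
print(score)  # 2 shared genes out of 5 distinct genes -> 0.4
```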
Affiliation(s)
- Mehdi B Hamaneh
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Yi-Kuo Yu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
|
47
|
Song B, Du J, Deng N, Ren JC, Shu ZB. Comparative analysis of gene expression profiles of gastric cardia adenocarcinoma and gastric non-cardia adenocarcinoma. Oncol Lett 2016; 12:3866-3874. [PMID: 27895742 PMCID: PMC5104197 DOI: 10.3892/ol.2016.5161] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/24/2015] [Accepted: 08/04/2016] [Indexed: 12/17/2022]
Abstract
In the present study, gene expression profiles were analyzed to identify the molecular mechanisms underlying gastric cardia adenocarcinoma (GCA) and gastric non-cardia adenocarcinoma (GNCA). A gene expression dataset (accession number GSE29272) was downloaded from Gene Expression Omnibus, and consisted of 62 GCA samples and 62 normal controls, as well as 72 GNCA samples and 72 normal controls. A differential analysis was performed using the Linear Models for Microarray Data package in R. The two groups of differentially expressed genes (DEGs) were compared to obtain common and unique DEGs. Functional enrichment analysis was conducted for the DEGs using the Database for Annotation, Visualization and Integrated Discovery. Protein-protein interaction (PPI) networks were constructed for the DEGs with information from the Search Tool for the Retrieval of Interacting Genes. Subnetworks were extracted from the whole network with Cytoscape. Compared with the control, 284 and 268 genes were differentially expressed in GCA and GNCA, respectively, of which 194 DEGs were common between GCA and GNCA. Common DEGs [e.g., claudin (CLDN)7, CLDN4 and CLDN3] were associated with cell adhesion and digestion. GCA-unique DEGs [e.g., MAD1 mitotic arrest deficient like 1, cyclin (CCN)B1, CCNB2 and CCNE1] were associated with the cell cycle and the regulation of cell proliferation, while GNCA-unique DEGs (e.g., GATA binding protein 6 and hyaluronoglucosaminidase 1) were implicated in cell death. A PPI network with 141 nodes and 446 edges was obtained, from which two subnetworks were extracted. Genes [e.g., fibronectin 1, collagen type I α2 chain (COL1A2) and COL1A1] from the two subnetworks were implicated in extracellular matrix organization. These common DEGs could advance our understanding of the etiology of gastric cancer, while the unique DEGs in GCA and GNCA could better define the properties of specific cancers and provide potential biomarkers for diagnosis, prognosis or therapy.
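The comparison of the two DEG lists is plain set arithmetic, sketched below. The gene identifiers are placeholders generated only to reproduce the counts reported in the abstract (284 GCA DEGs, 268 GNCA DEGs, 194 in common); the helper name `split_degs` is hypothetical.

```python
def split_degs(degs_a, degs_b):
    """Return (common, unique to a, unique to b) for two DEG collections."""
    a, b = set(degs_a), set(degs_b)
    return a & b, a - b, b - a


# Placeholder identifiers that only reproduce the reported set sizes.
shared_ids = {f"gene_{i}" for i in range(194)}
gca_degs = shared_ids | {f"gca_only_{i}" for i in range(284 - 194)}
gnca_degs = shared_ids | {f"gnca_only_{i}" for i in range(268 - 194)}

common, gca_unique, gnca_unique = split_degs(gca_degs, gnca_degs)
print(len(common), len(gca_unique), len(gnca_unique))  # 194 90 74
```

The reported totals therefore imply 90 GCA-unique and 74 GNCA-unique DEGs.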
Affiliation(s)
- Bin Song
- Department of Gastrointestinal Surgery, China-Japan Union Hospital, Jilin University, Changchun, Jilin 130033, P.R. China
- Juan Du
- Second Department of Internal Medicine, The Tumor Hospital of Jilin, Changchun, Jilin 130012, P.R. China
- Neng Deng
- Department of Gastrointestinal Surgery, China-Japan Union Hospital, Jilin University, Changchun, Jilin 130033, P.R. China
- Ji-Chen Ren
- Second Department of Internal Medicine, The Tumor Hospital of Jilin, Changchun, Jilin 130012, P.R. China
- Zhen-Bo Shu
- Department of Gastrointestinal Surgery, China-Japan Union Hospital, Jilin University, Changchun, Jilin 130033, P.R. China
|
48
|
Muetze T, Goenawan IH, Wiencko HL, Bernal-Llinares M, Bryan K, Lynn DJ. Contextual Hub Analysis Tool (CHAT): A Cytoscape app for identifying contextually relevant hubs in biological networks. F1000Res 2016; 5:1745. [PMID: 27853512 PMCID: PMC5105880 DOI: 10.12688/f1000research.9118.1] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Accepted: 08/26/2016] [Indexed: 07/30/2023]
Abstract
Highly connected nodes (hubs) in biological networks are topologically important to the structure of the network and have also been shown to be preferentially associated with a range of phenotypes of interest. The relative importance of a hub node, however, can change depending on the biological context. Here, we report a Cytoscape app, the Contextual Hub Analysis Tool (CHAT), which enables users to easily construct and visualize a network of interactions from a gene or protein list of interest, integrate contextual information, such as gene expression or mass spectrometry data, and identify hub nodes that are more highly connected to contextual nodes (e.g. genes or proteins that are differentially expressed) than expected by chance. In a case study, we use CHAT to construct a network of genes that are differentially expressed in Dengue fever, a viral infection. CHAT was used to identify and compare contextual and degree-based hubs in this network. The top 20 degree-based hubs were enriched in pathways related to the cell cycle and cancer, which is likely due to the fact that proteins involved in these processes tend to be highly connected in general. In comparison, the top 20 contextual hubs were enriched in pathways commonly observed in a viral infection including pathways related to the immune response to viral infection. This analysis shows that such contextual hubs are considerably more biologically relevant than degree-based hubs and that analyses which rely on the identification of hubs solely based on their connectivity may be biased towards nodes that are highly connected in general rather than in the specific context of interest. Availability: CHAT is available for Cytoscape 3.0+ and can be installed via the Cytoscape App Store (http://apps.cytoscape.org/apps/chat).
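The abstract does not state which test is used to decide whether a hub has more contextual neighbours than expected by chance. A common choice for this kind of over-representation question is the upper tail of the hypergeometric distribution, sketched here as an assumption; the function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` nodes without replacement
    from `population` nodes, of which `successes` are contextual."""
    total = comb(population, draws)
    upper = min(draws, successes)
    tail = sum(comb(successes, k) * comb(population - successes, draws - k)
               for k in range(observed, upper + 1))
    return tail / total


# Hypothetical hub: 20 neighbours among 500 other nodes, 50 of which are
# contextual (e.g. differentially expressed); 8 neighbours are contextual.
p_value = hypergeom_upper_tail(population=500, successes=50,
                               draws=20, observed=8)
```

A small `p_value` would mark the hub as contextually enriched, whereas a hub that is merely well connected overall would not stand out under this test.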
Affiliation(s)
- Tanja Muetze
- EMBL Australia Biomedical Informatics Group, Infection & Immunity Theme, South Australian Medical and Health Research Institute, Adelaide, Australia
- Ivan H. Goenawan
- EMBL Australia Biomedical Informatics Group, Infection & Immunity Theme, South Australian Medical and Health Research Institute, Adelaide, Australia
- Heather L. Wiencko
- Animal and Bioscience Research Department, Animal and Grassland Research and Innovation Centre, Teagasc, Meath, Ireland
- Manuel Bernal-Llinares
- EMBL Australia Biomedical Informatics Group, Infection & Immunity Theme, South Australian Medical and Health Research Institute, Adelaide, Australia
- Kenneth Bryan
- EMBL Australia Biomedical Informatics Group, Infection & Immunity Theme, South Australian Medical and Health Research Institute, Adelaide, Australia
- David J. Lynn
- EMBL Australia Biomedical Informatics Group, Infection & Immunity Theme, South Australian Medical and Health Research Institute, Adelaide, Australia
- School of Medicine, Flinders University, Bedford Park, Australia
|
49
|
Muetze T, Goenawan IH, Wiencko HL, Bernal-Llinares M, Bryan K, Lynn DJ. Contextual Hub Analysis Tool (CHAT): A Cytoscape app for identifying contextually relevant hubs in biological networks. F1000Res 2016; 5:1745. [PMID: 27853512 PMCID: PMC5105880 DOI: 10.12688/f1000research.9118.2] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Accepted: 08/26/2016] [Indexed: 01/21/2023]
Abstract
Highly connected nodes (hubs) in biological networks are topologically important to the structure of the network and have also been shown to be preferentially associated with a range of phenotypes of interest. The relative importance of a hub node, however, can change depending on the biological context. Here, we report a Cytoscape app, the Contextual Hub Analysis Tool (CHAT), which enables users to easily construct and visualize a network of interactions from a gene or protein list of interest, integrate contextual information, such as gene expression or mass spectrometry data, and identify hub nodes that are more highly connected to contextual nodes (e.g. genes or proteins that are differentially expressed) than expected by chance. In a case study, we use CHAT to construct a network of genes that are differentially expressed in Dengue fever, a viral infection. CHAT was used to identify and compare contextual and degree-based hubs in this network. The top 20 degree-based hubs were enriched in pathways related to the cell cycle and cancer, which is likely due to the fact that proteins involved in these processes tend to be highly connected in general. In comparison, the top 20 contextual hubs were enriched in pathways commonly observed in a viral infection including pathways related to the immune response to viral infection. This analysis shows that such contextual hubs are considerably more biologically relevant than degree-based hubs and that analyses which rely on the identification of hubs solely based on their connectivity may be biased towards nodes that are highly connected in general rather than in the specific context of interest. Availability: CHAT is available for Cytoscape 3.0+ and can be installed via the Cytoscape App Store (http://apps.cytoscape.org/apps/chat).
Affiliation(s)
- Tanja Muetze
- EMBL Australia Biomedical Informatics Group, Infection & Immunity Theme, South Australian Medical and Health Research Institute, Adelaide, Australia
- Ivan H Goenawan
- EMBL Australia Biomedical Informatics Group, Infection & Immunity Theme, South Australian Medical and Health Research Institute, Adelaide, Australia
- Heather L Wiencko
- Animal and Bioscience Research Department, Animal and Grassland Research and Innovation Centre, Teagasc, Meath, Ireland
- Manuel Bernal-Llinares
- EMBL Australia Biomedical Informatics Group, Infection & Immunity Theme, South Australian Medical and Health Research Institute, Adelaide, Australia
- Kenneth Bryan
- EMBL Australia Biomedical Informatics Group, Infection & Immunity Theme, South Australian Medical and Health Research Institute, Adelaide, Australia
- David J Lynn
- EMBL Australia Biomedical Informatics Group, Infection & Immunity Theme, South Australian Medical and Health Research Institute, Adelaide, Australia; School of Medicine, Flinders University, Bedford Park, Australia
|
50
|
Abstract
Cluster analysis plays a foremost role in identifying groups of genes that show similar behavior under a set of experimental conditions. Several clustering algorithms have been proposed for identifying gene behaviors and to understand their significance. The principal aim of this work is to develop an intelligent rough clustering technique, which will efficiently remove the irrelevant dimensions in a high-dimensional space and obtain appropriate meaningful clusters. This paper proposes a novel biclustering technique that is based on rough set theory. The proposed algorithm uses correlation coefficient as a similarity measure to simultaneously cluster both the rows and columns of a gene expression data matrix and mean squared residue to generate the initial biclusters. Furthermore, the biclusters are refined to form the lower and upper boundaries by determining the membership of the genes in the clusters using mean squared residue. The algorithm is illustrated with yeast gene expression data and the experiment proves the effectiveness of the method. The main advantage is that it overcomes the problem of selection of initial clusters and also the restriction of one object belonging to only one cluster by allowing overlapping of biclusters.
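The mean squared residue mentioned above is usually defined, following the standard biclustering formulation, as the average of squared residues a_ij - (row mean)_i - (column mean)_j + (overall mean) over a submatrix. The sketch below computes that quantity in plain Python; it assumes this standard definition and is not the authors' implementation. The toy matrices are made up.

```python
def mean_squared_residue(matrix):
    """Mean squared residue of a rectangular submatrix (list of equal-length rows)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = matrix[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


# Rows that differ only by a constant shift form a coherent bicluster.
coherent = [[1.0, 2.0, 3.0],
            [2.0, 3.0, 4.0],
            [4.0, 5.0, 6.0]]
# A checkerboard pattern has no additive row/column structure.
incoherent = [[1.0, 0.0],
              [0.0, 1.0]]

msr_coherent = mean_squared_residue(coherent)
msr_incoherent = mean_squared_residue(incoherent)
```

A score near zero indicates that the genes in the submatrix rise and fall together across the selected conditions, which is why low-residue submatrices are natural candidates for initial biclusters.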
|