1
Stable feature selection utilizing Graph Convolutional Neural Network and Layer-wise Relevance Propagation for biomarker discovery in breast cancer. Artif Intell Med 2024; 151:102840. [PMID: 38658129 DOI: 10.1016/j.artmed.2024.102840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Revised: 03/05/2024] [Accepted: 03/10/2024] [Indexed: 04/26/2024]
Abstract
High-throughput technologies are becoming increasingly important in discovering prognostic biomarkers and in identifying novel drug targets. With Mammaprint, Oncotype DX, and many other prognostic molecular signatures, breast cancer is one of the paradigmatic examples of the utility of high-throughput data for delivering prognostic biomarkers that can be represented in the form of a rather short gene list. Such gene lists can be obtained as a set of features (genes) that are important for the decisions of a Machine Learning (ML) method applied to high-dimensional gene expression data. Several studies have identified predictive gene lists for patient prognosis in breast cancer, but these lists are unstable and have only a few genes in common. Instability of feature selection impedes biological interpretability: genes that are relevant for cancer pathology should be members of any predictive gene list obtained for the same clinical type of patients. Stability and interpretability of selected features can be improved by including information on molecular networks in ML methods. Graph Convolutional Neural Network (GCNN) is a contemporary deep learning approach applicable to gene expression data structured by a prior knowledge molecular network. Layer-wise Relevance Propagation (LRP) and SHapley Additive exPlanations (SHAP) are methods to explain individual decisions of deep learning models. We used both GCNN+LRP and GCNN+SHAP techniques to construct feature sets by aggregating individual explanations. We suggest a methodology to systematically and quantitatively analyze the stability, the impact on the classification performance, and the interpretability of the selected feature sets. We used this methodology to compare GCNN+LRP to GCNN+SHAP and to more classical ML-based feature selection approaches.
Utilizing a large breast cancer gene expression dataset, we show that, while feature selection with SHAP is useful in applications where selected features have to be impactful for classification performance, among all studied methods GCNN+LRP delivers the most stable (reproducible) and interpretable gene lists.
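The notion of stable (reproducible) gene lists can be made concrete with a small sketch. The abstract does not say which stability measure the authors use, so the mean pairwise Jaccard index below is an assumption chosen for illustration, and the gene lists are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene lists: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(gene_lists):
    """Mean pairwise Jaccard index over gene lists from repeated runs."""
    pairs = list(combinations(gene_lists, 2))
    if not pairs:
        raise ValueError("need at least two gene lists")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical gene lists selected on three resampled training sets.
runs = [
    ["ESR1", "ERBB2", "MKI67", "TP53"],
    ["ESR1", "ERBB2", "MKI67", "BRCA1"],
    ["ESR1", "ERBB2", "AURKA", "TP53"],
]
print(round(selection_stability(runs), 3))
```

A value near 1 would mean that nearly the same genes are selected on every resample; a value near 0 would mean the lists barely overlap.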
2
Weighted Combination of Łukasiewicz implication and Fuzzy Jaccard similarity in Hybrid Ensemble Framework (WCLFJHEF) for Gene Selection. Comput Biol Med 2024; 170:107981. [PMID: 38262204 DOI: 10.1016/j.compbiomed.2024.107981] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Revised: 01/02/2024] [Accepted: 01/12/2024] [Indexed: 01/25/2024]
Abstract
A framework for gene selection in cancer is developed by introducing fuzzy Jaccard similarity (FJS) and combining it with Łukasiewicz implication through weights in a hybrid ensemble framework. The method is called weighted combination of Łukasiewicz implication and fuzzy Jaccard similarity in hybrid ensemble framework (WCLFJHEF). While the fuzziness in Jaccard similarity is incorporated by using the existing Gödel fuzzy logic, the weights are obtained by maximizing the average F-score of selected genes in classifying the cancer patients. The patients are first divided into different clusters, based on the number of patient groups, using average linkage agglomerative clustering and a new score, called WCLFJ (weighted combination of Łukasiewicz implication and fuzzy Jaccard similarity). The genes are then selected from each cluster separately using filter-based Relief-F and wrapper-based SVMRFE (Support Vector Machine with Recursive Feature Elimination). A gene (feature) pool is created by considering the union of selected features for all the clusters. A set of informative genes is selected from the pool using the sequential backward floating search (SBFS) algorithm. Patients are then classified using Naïve Bayes (NB) and Support Vector Machine (SVM) separately, using the selected genes, and the related F-scores are calculated. The weights in WCLFJ are then updated iteratively to maximize the average F-score obtained from the results of the classifier. The effectiveness of WCLFJHEF is demonstrated on six gene expression datasets. The average values of accuracy, F-score, recall, precision and MCC over all the datasets are 95%, 94%, 94%, 94%, and 90%, respectively. The explainability of the selected genes is shown using SHapley Additive exPlanations (SHAP) values and this information is further used to rank them.
The relevance of the selected gene set is biologically validated using KEGG Pathway, Gene Ontology (GO), and the existing literature. It is seen that the genes that are selected by WCLFJHEF are candidates for genomic alterations in the various cancer types. The source code of WCLFJHEF is available at http://www.isical.ac.in/~shubhra/WCLFJHEF.html.
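The two ingredients named in this abstract can be illustrated as follows. The Łukasiewicz implication and the min/max (Gödel) form of fuzzy Jaccard similarity are standard definitions; how the paper aggregates and weights them is not given in the abstract, so `weighted_score`, its single weight `w`, and the example vectors are hypothetical.

```python
def lukasiewicz_implication(a, b):
    """Lukasiewicz implication I(a, b) = min(1, 1 - a + b) for a, b in [0, 1]."""
    return min(1.0, 1.0 - a + b)


def fuzzy_jaccard(x, y):
    """Fuzzy Jaccard similarity with Goedel min/max as intersection/union."""
    inter = sum(min(a, b) for a, b in zip(x, y))
    union = sum(max(a, b) for a, b in zip(x, y))
    return inter / union if union > 0 else 1.0


def weighted_score(x, y, w):
    """Hypothetical weighted combination (not the paper's exact WCLFJ score)."""
    # Assumption: aggregate the implication symmetrically and average over genes.
    equiv = sum(
        min(lukasiewicz_implication(a, b), lukasiewicz_implication(b, a))
        for a, b in zip(x, y)
    ) / len(x)
    return w * equiv + (1.0 - w) * fuzzy_jaccard(x, y)


# Hypothetical expression profiles of two patients, scaled to [0, 1].
p = [0.2, 0.8, 0.5]
q = [0.4, 0.6, 0.5]
print(round(fuzzy_jaccard(p, q), 3), round(weighted_score(p, q, 0.5), 3))
```

Both functions assume expression values already scaled to [0, 1], since they act as fuzzy membership degrees.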
3
A Data-Distribution and Successive Spline Points based discretization approach for evolving gene regulatory networks from scRNA-Seq time-series data using Cartesian Genetic Programming. Biosystems 2024; 236:105126. [PMID: 38278505 DOI: 10.1016/j.biosystems.2024.105126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Revised: 11/18/2023] [Accepted: 01/19/2024] [Indexed: 01/28/2024]
Abstract
The inference of gene regulatory networks (GRNs) is a widely addressed problem in Systems Biology. GRNs can be modeled as Boolean networks, which is the simplest approach for this task. However, Boolean models need binarized data. Several approaches have been developed for the discretization of gene expression data (GED). Also, the advance of data extraction technologies, such as single-cell RNA-Sequencing (scRNA-Seq), provides a new vision of gene expression and brings new challenges for dealing with its specificities, such as a large occurrence of zero data. This work proposes a new discretization approach for dealing with scRNA-Seq time-series data, named Distribution and Successive Spline Points Discretization (DSSPD), which considers the data distribution and a proper preprocessing step. Here, Cartesian Genetic Programming (CGP) is used to infer GRNs using the results of DSSPD. The proposal is compared with CGP with the standard data handling and five state-of-the-art algorithms on curated models and experimental data. The results show that the proposal improves the results of CGP in all tested cases and outperforms the state-of-the-art algorithms in most cases.
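To make the link between binarized data and Boolean network models concrete, here is a minimal sketch. The median threshold is a deliberately naive stand-in and is not the DSSPD method, whose details are not given in the abstract; the three-gene rule set is hypothetical.

```python
from statistics import median


def binarize_by_median(series):
    """Naive discretization: 1 if a value exceeds the series median, else 0."""
    threshold = median(series)
    return [1 if value > threshold else 0 for value in series]


def step(state, rules):
    """Synchronous Boolean network update: every gene reads the old state."""
    return {gene: int(rule(state)) for gene, rule in rules.items()}


# Hypothetical regulatory rules for three genes.
rules = {
    "A": lambda s: s["C"],
    "B": lambda s: s["A"] and not s["C"],
    "C": lambda s: not s["B"],
}

# Zero-heavy expression series, as is typical of scRNA-Seq data.
print(binarize_by_median([0, 0, 3, 5, 0, 8]))
print(step({"A": 1, "B": 0, "C": 0}, rules))
```

With many zeros in a series, a median threshold can collapse to zero and mark every non-zero value as active, which is one reason the data distribution matters for discretization.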
4
SurvIAE: Survival prediction with Interpretable Autoencoders from Diffuse Large B-Cells Lymphoma gene expression data. Computer Methods and Programs in Biomedicine 2024; 244:107966. [PMID: 38091844 DOI: 10.1016/j.cmpb.2023.107966] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 11/24/2023] [Accepted: 12/01/2023] [Indexed: 01/26/2024]
Abstract
BACKGROUND In Diffuse Large B-Cell Lymphoma (DLBCL), several methodologies are emerging to derive novel biomarkers to be incorporated in the risk assessment. We realized a pipeline that relies on autoencoders (AE) and Explainable Artificial Intelligence (XAI) to stratify prognosis and derive a gene-based signature. METHODS AE was exploited to learn an unsupervised representation of the gene expression (GE) from three publicly available datasets, each with its own technology. Multi-layer perceptron (MLP) was used to classify prognosis from latent representation. GE data were preprocessed as normalized, scaled, and standardized. Four different AE architectures (Large, Medium, Small and Extra Small) were compared to find the most suitable for GE data. The joint AE-MLP classified patients on six different outcomes: overall survival at 12, 36, 60 months and progression-free survival (PFS) at 12, 36, 60 months. XAI techniques were used to derive a gene-based signature aimed at refining the Revised International Prognostic Index (R-IPI) risk, which was validated in a fourth independent publicly available dataset. We named our tool SurvIAE: Survival prediction with Interpretable AE. RESULTS From the latent space of AEs, we observed that scaled and standardized data reduced the batch effect. SurvIAE models outperformed R-IPI with Matthews Correlation Coefficient up to 0.42 vs. 0.18 for the validation-set (PFS36) and to 0.30 vs. 0.19 for the test-set (PFS60). We selected the SurvIAE-Small-PFS36 as the best model and, from its gene signature, we stratified patients in three risk groups: R-IPI Poor patients with High levels of GAB1, R-IPI Poor patients with Low levels of GAB1 or R-IPI Good/Very Good patients with Low levels of GPR132, and R-IPI Good/Very Good patients with High levels of GPR132. CONCLUSIONS SurvIAE showed the potential to derive a gene signature with translational purpose in DLBCL. 
The pipeline was made publicly available and can be reused for other pathologies.
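The Matthews Correlation Coefficient (MCC) reported above can be computed from a binary confusion matrix as sketched below. The formula is the standard one; the counts in the example are hypothetical and are not taken from the paper.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for a 100-patient validation set.
print(round(mcc(tp=40, tn=30, fp=10, fn=20), 3))
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, which makes it more informative than accuracy when outcome classes are imbalanced.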
5
MDWGAN-GP: data augmentation for gene expression data based on multiple discriminator WGAN-GP. BMC Bioinformatics 2023; 24:427. [PMID: 37957576 PMCID: PMC10644641 DOI: 10.1186/s12859-023-05558-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Accepted: 11/06/2023] [Indexed: 11/15/2023] Open
Abstract
BACKGROUND Although gene expression data play significant roles in biological and medical studies, their applications are hampered by the difficulty and high expense of gathering them through biological experiments. Generating high-quality gene expression data with computational methods is therefore an urgent problem. WGAN-GP, a generative adversarial network-based method, has been successfully applied in augmenting gene expression data. However, mode collapse or over-fitting may take place for small training samples because the method adopts only one discriminator. RESULTS In this study, an improved data augmentation approach, MDWGAN-GP, a generative adversarial network model with multiple discriminators, is proposed. In addition, a novel method is devised for enriching training samples based on a linear graph convolutional network. Extensive experiments were implemented on real biological data. CONCLUSIONS The experimental results demonstrate that, compared with other state-of-the-art methods, MDWGAN-GP produces higher-quality generated gene expression data in most cases.
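For reference, the single-discriminator objective that WGAN-GP is generally understood to minimize is written out below. The symbols (discriminator D, real distribution P_r, generator distribution P_g, penalty weight λ) are the conventional ones and are not taken from this abstract, which also does not state how MDWGAN-GP combines its multiple discriminators.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The last term is the gradient penalty: it pushes the gradient norm of D toward 1 at points interpolated between real and generated samples.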
6
CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data. Journal of King Saud University. Computer and Information Sciences 2023; 35:101731. [PMID: 38567001 PMCID: PMC7615789 DOI: 10.1016/j.jksuci.2023.101731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Aim Gene expression data is typically high dimensional with a limited number of samples and contains many features that are unrelated to the disease of interest. Existing unsupervised feature selection algorithms primarily focus on the significance of features in maintaining the data structure while not taking into account the redundancy among features. Determining the appropriate number of significant features is another challenge. Method In this paper, we propose a clustering-guided unsupervised feature selection (CGUFS) algorithm for gene expression data that addresses these problems. Our proposed algorithm introduces three improvements over existing algorithms. For the problem that existing clustering algorithms require artificially specifying the number of clusters, we propose an adaptive k-value strategy to assign appropriate pseudo-labels to each sample by iteratively updating a change function. For the problem that existing algorithms fail to consider the redundancy among features, we propose a feature grouping strategy to group highly redundant features. For the problem that the existing algorithms cannot filter the redundant features, we propose an adaptive filtering strategy to determine the feature combinations to be retained by calculating the potentially effective features and potentially redundant features of each feature group. Result Experimental results show that the average accuracy (ACC) and Matthews correlation coefficient (MCC) indexes of the C4.5 classifier on the optimal features selected by the CGUFS algorithm reach 74.37% and 63.84%, respectively, significantly superior to the existing algorithms. Conclusion Similarly, the average ACC and MCC indexes of the Adaboost classifier on the optimal features selected by the CGUFS algorithm are significantly superior to the existing algorithms. In addition, statistical experiment results show significant differences between the CGUFS algorithm and the existing algorithms.
7
Inferring circadian gene regulatory relationships from gene expression data with a hybrid framework. BMC Bioinformatics 2023; 24:362. [PMID: 37752445 PMCID: PMC10521455 DOI: 10.1186/s12859-023-05458-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2022] [Accepted: 08/30/2023] [Indexed: 09/28/2023] Open
Abstract
BACKGROUND The central biological clock governs numerous facets of mammalian physiology, including sleep, metabolism, and immune system regulation. Understanding gene regulatory relationships is crucial for unravelling the mechanisms that underlie various cellular biological processes. While it is possible to infer circadian gene regulatory relationships from time-series gene expression data, relying solely on correlation-based inference may not provide sufficient information about causation. Moreover, gene expression data often have high dimensions but a limited number of observations, posing challenges in their analysis. METHODS In this paper, we introduce a new hybrid framework, referred to as Circadian Gene Regulatory Framework (CGRF), to infer circadian gene regulatory relationships from gene expression data of rats. The framework addresses the challenges of high-dimensional data by combining the fuzzy C-means clustering algorithm with dynamic time warping distance. Through this approach, we efficiently identify the clusters of genes related to the target gene. To determine the significance of genes within a specific cluster, we employ the Wilcoxon signed-rank test. Subsequently, we use a dynamic vector autoregressive method to analyze the selected significant gene expression profiles and reveal directed causal regulatory relationships based on partial correlation. CONCLUSION The proposed CGRF framework offers a comprehensive and efficient solution for understanding circadian gene regulation. Circadian gene regulatory relationships are inferred from the gene expression data of rats based on the Aanat target gene. The results show that genes Pde10a, Atp7b, Prok2, Per1, Rhobtb3 and Dclk1 stand out, which have been known to be essential for the regulation of circadian activity. The potential relationships between genes Tspan15, Eprs, Eml5 and Fsbp with a circadian rhythm need further experimental research.
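The dynamic time warping (DTW) distance that CGRF pairs with fuzzy C-means can be sketched with the classic dynamic-programming recurrence. This is the textbook formulation with an absolute-difference local cost, an assumption here because the abstract does not specify the variant used; the example series are hypothetical.

```python
def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) dynamic time warping distance."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = best alignment cost of x[:i] against y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y[j-1] is reused
                cost[i][j - 1],      # y advances, x[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Two hypothetical expression profiles with the same shape but shifted timing.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Unlike a point-by-point Euclidean distance, DTW tolerates profiles that share a shape but are shifted or stretched in time, which suits rhythmic expression data.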
8
ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest. BMC Bioinformatics 2023; 24:289. [PMID: 37468832 DOI: 10.1186/s12859-023-05412-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Accepted: 07/13/2023] [Indexed: 07/21/2023] Open
Abstract
BACKGROUND Cancer subtype classification is helpful for personalized cancer treatment. Although some approaches have been developed to classify cancer subtypes based on high-dimensional gene expression data, it is difficult to obtain satisfactory classification results. Meanwhile, some cancers have been well studied and classified into subtypes that are adopted by most researchers. Hence, this a priori knowledge is significant for further identifying new meaningful subtypes. RESULTS In this paper, we present ForestSubtype, a combined parallel random forest and autoencoder approach for cancer subtype identification based on high-dimensional gene expression data. ForestSubtype first adopts the parallel RF and the a priori knowledge of cancer subtypes to train a module and extract significant candidate features. Second, ForestSubtype uses a random forest as the base module and ten parallel random forests to compute each feature weight and rank them separately. Then, the intersection of the features with the larger weights output by the ten parallel random forests is taken as our subsequent candidate features. Third, ForestSubtype uses an autoencoder to condense the selected features into two-dimensional data. Fourth, ForestSubtype utilizes k-means++ to obtain new cancer subtype identification results. In this paper, the breast cancer gene expression data obtained from The Cancer Genome Atlas are used for training and validation, and an independent breast cancer dataset from the Molecular Taxonomy of Breast Cancer International Consortium is used for testing. Additionally, we use two other cancer datasets for validating the generalizability of ForestSubtype. ForestSubtype outperforms the other two methods in terms of the distribution of clusters and internal and external metric results. The open-source code is available at https://github.com/lffyd/ForestSubtype.
CONCLUSIONS Our work shows that the combination of high-dimensional gene expression data and parallel random forests and autoencoder, guided by a priori knowledge, can identify new subtypes more effectively than existing methods of cancer subtype classification.
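The step of intersecting the highest-weighted features across parallel random forests can be sketched as below. The weight dictionaries stand in for feature importances from separately trained forests; the gene names, the values, and the cutoff `k` are hypothetical, and the abstract does not say how the "larger weights" threshold is actually chosen.

```python
def top_k(weights, k):
    """Names of the k features with the largest weights."""
    return set(sorted(weights, key=weights.get, reverse=True)[:k])


def consensus_features(weight_maps, k):
    """Features that appear in the top k of every forest."""
    top_sets = [top_k(weights, k) for weights in weight_maps]
    return set.intersection(*top_sets) if top_sets else set()


# Hypothetical importances from three forests (the paper uses ten).
forest_weights = [
    {"g1": 0.50, "g2": 0.30, "g3": 0.10, "g4": 0.10},
    {"g1": 0.40, "g2": 0.10, "g3": 0.35, "g4": 0.15},
    {"g1": 0.45, "g2": 0.30, "g3": 0.20, "g4": 0.05},
]
print(consensus_features(forest_weights, k=2))
```

Taking the intersection rather than the union keeps only features that every forest ranks highly, trading recall for robustness to the randomness of any single forest.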
9
Differential Gene Expression Data Analysis of ASD Using Random Forest. Stud Health Technol Inform 2023; 302:1047-1051. [PMID: 37203578 DOI: 10.3233/shti230344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/20/2023]
Abstract
Autism spectrum disorder (ASD) is a developmental disability caused by differences in the brain regions. Analysis of differential expression (DE) of transcriptomic data allows for genome-wide analysis of gene expression changes related to ASD. De novo mutations may play a vital role in ASD, but the list of genes involved is still far from complete. Differentially expressed genes (DEGs) are treated as candidate biomarkers, and a small set of DEGs might be identified as biomarkers using either biological knowledge or data-driven approaches like machine learning and statistical analysis. In this study, we employed a machine learning-based approach to identify the differential gene expression between ASD and Typical Development (TD). The gene expression data of 15 ASD and 15 TD subjects were obtained from the NCBI GEO database. Initially, we extracted the data and used a standard pipeline to pre-process the data. Further, Random Forest (RF) was used to discriminate genes between ASD and TD. We identified the top 10 prominent differential genes and compared them with the statistical test results. Our results show that the proposed RF model yields 5-fold cross-validation accuracy, sensitivity and specificity of 96.67%. Further, we obtained precision and F-measure scores of 97.5% and 96.57%, respectively. Moreover, we found 34 unique DEG chromosomal locations having influential contributions in identifying ASD from TD. We have also identified chr3:113322718-113322659 as the most significant contributing chromosomal location in discriminating ASD and TD. Our machine learning-based method of refining DE analysis is promising for finding biomarkers from gene expression profiles and prioritizing DEGs. Moreover, the top 10 gene signatures for ASD reported in our study may facilitate the development of reliable diagnosis and prognosis biomarkers for screening ASD.
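The performance measures quoted in this abstract follow from a binary confusion matrix, as sketched below. The counts in the example are hypothetical and chosen only to match the 15 ASD and 15 TD sample sizes; they do not reproduce the reported figures, which are cross-validation results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary metrics; ASD is treated as the positive class."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on ASD samples
    specificity = ratio(tn, tn + fp)  # recall on TD samples
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical outcome: one ASD and one TD sample misclassified out of 30.
for name, value in classification_metrics(tp=14, tn=14, fp=1, fn=1).items():
    print(f"{name}: {value:.4f}")
```

With only 30 samples, a single misclassification shifts accuracy by more than three percentage points, so estimates at this scale carry wide uncertainty.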
10
GeneViT: Gene Vision Transformer with Improved DeepInsight for cancer classification. Comput Biol Med 2023; 155:106643. [PMID: 36803792 DOI: 10.1016/j.compbiomed.2023.106643] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/14/2022] [Revised: 01/03/2023] [Accepted: 02/05/2023] [Indexed: 02/09/2023]
Abstract
Analysis of gene expression data is crucial for disease prognosis and diagnosis. Gene expression data has high redundancy and noise that brings challenges in extracting disease information. Over the past decade, several conventional machine learning and deep learning models have been developed for classification of diseases using gene expressions. In recent years, vision transformer networks have shown promising performance in many fields due to their powerful attention mechanism that provides a better insight into the data characteristics. However, these network models have not been explored for gene expression analysis. In this paper, a method for classifying cancerous gene expression is presented that uses a Vision transformer. The proposed method first performs dimensionality reduction using a stacked autoencoder followed by an Improved DeepInsight algorithm that converts the data into image format. The data is then fed to the vision transformer for building the classification model. Performance of the proposed classification model is evaluated on ten benchmark datasets having binary classes or multiple classes. Its performance is also compared with nine existing classification models. The experimental results demonstrate that the proposed model outperforms existing methods. The t-SNE plots demonstrate the distinctive feature learning property of the model.
11
A functional gene module identification algorithm in gene expression data based on genetic algorithm and gene ontology. BMC Genomics 2023; 24:76. [PMID: 36797662 PMCID: PMC9936134 DOI: 10.1186/s12864-023-09157-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2022] [Accepted: 01/31/2023] [Indexed: 02/18/2023] Open
Abstract
Since genes do not function individually, the gene module is considered an important tool for interpreting gene expression profiles. In order to consider both functional similarity and expression similarity in module identification, GMIGAGO, a functional Gene Module Identification algorithm based on Genetic Algorithm and Gene Ontology, was proposed in this work. GMIGAGO is an overlapping gene module identification algorithm, which mainly includes two stages: In the first stage (initial identification of gene modules), Improved Partitioning Around Medoids Based on Genetic Algorithm (PAM-GA) is used for the initial clustering on gene expression profiling, and traditional gene co-expression modules can be obtained. Only similarity of expression levels is considered at this stage. In the second stage (optimization of functional similarity within gene modules), Genetic Algorithm for Functional Similarity Optimization (FSO-GA) is used to optimize gene modules based on gene ontology, and functional similarity within gene modules can be improved. Without loss of generality, we compared GMIGAGO with state-of-the-art gene module identification methods on six gene expression datasets, and GMIGAGO identified the gene modules with the highest functional similarity (much higher than state-of-the-art algorithms). GMIGAGO was applied in BRCA, THCA, HNSC, COVID-19, Stem, and Radiation datasets, and it identified some interesting modules which performed important biological functions. The hub genes in these modules could be used as potential targets for diseases or radiation protection. In summary, GMIGAGO has excellent performance in mining molecular mechanisms, and it can also identify potential biomarkers for individual precision therapy.
12
DM-MOGA: a multi-objective optimization genetic algorithm for identifying disease modules of non-small cell lung cancer. BMC Bioinformatics 2023; 24:13. [PMID: 36624376 PMCID: PMC9830734 DOI: 10.1186/s12859-023-05136-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2022] [Accepted: 01/04/2023] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND Constructing molecular interaction networks from microarray data and then identifying disease module biomarkers can provide insight into the underlying pathogenic mechanisms of non-small cell lung cancer. A promising approach for identifying disease modules in the network is community detection. RESULTS In order to identify disease modules from gene co-expression networks, a community detection method is proposed based on multi-objective optimization genetic algorithm with decomposition. The method is named DM-MOGA and possesses two highlights. First, the boundary correction strategy is designed for the modules obtained in the process of local module detection and pre-simplification. Second, during the evolution, we introduce Davies-Bouldin index and clustering coefficient as fitness functions which are improved and migrated to weighted networks. In order to identify modules that are more relevant to diseases, the above strategies are designed to consider the network topology of genes and the strength of connections with other genes at the same time. Experimental results of different gene expression datasets of non-small cell lung cancer demonstrate that the core modules obtained by DM-MOGA are more effective than those obtained by several other advanced module identification methods. CONCLUSIONS The proposed method identifies disease-relevant modules by optimizing two novel fitness functions to simultaneously consider the local topology of each gene and its connection strength with other genes. The association of the identified core modules with lung cancer has been confirmed by pathway and gene ontology enrichment analysis.
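One of the two quantities used as a fitness function here, the clustering coefficient, can be illustrated in its basic unweighted form. The paper works with an improved version adapted to weighted networks whose definition is not given in the abstract, so the sketch below shows only the classic local coefficient on a hypothetical toy graph.

```python
def local_clustering(adjacency, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Count each connected neighbor pair once by requiring u < v.
    links = sum(
        1 for u in neighbors for v in neighbors if u < v and v in adjacency[u]
    )
    return 2 * links / (k * (k - 1))


# Hypothetical undirected co-expression graph as adjacency sets.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
print({gene: round(local_clustering(graph, gene), 3) for gene in sorted(graph)})
```

A gene whose neighbors are densely interconnected gets a coefficient near 1, which is the kind of local topology a module detection method tries to reward.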
13
Improved Regularized Multi-class Logistic Regression for Gene Classification with Optimal Kernel PCA and HC Algorithm. Advances in Experimental Medicine and Biology 2023; 1424:273-279. [PMID: 37486504 DOI: 10.1007/978-3-031-31982-2_31] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/25/2023]
Abstract
A significant challenge in high-dimensional and big data analysis is related to the classification and prediction of the variables of interest. Massive genetic datasets are complex. Gene expression datasets are enriched with useful genes that are associated with specific diseases such as cancer. In this study, we used two gene expression datasets from the Gene Expression Omnibus and preprocessed them before classification. We used optimal kernel principal component analysis, in which the optimal kernel function was chosen, for dataset dimensionality reduction and extraction of the most important features. The gene sets with a high validity index were collected using a combined hierarchical clustering and optimal kernel principal component analysis (KHC-RLR) algorithm. Logistic regression is one of the most common methods for classification, and it has been shown to be a useful classification approach for gene expression data analysis. In this study, we used multi-class logistic regression to classify the collected gene sets. We found that ordinary logistic regression caused a major overfitting problem; therefore, we used regularized multi-class logistic regression to classify the gene sets. The proposed KHC-RLR algorithm showed high performance and satisfactory accuracy measures.
14
A survey on gene expression data analysis using deep learning methods for cancer diagnosis. Progress in Biophysics and Molecular Biology 2023; 177:1-13. [PMID: 35988771 DOI: 10.1016/j.pbiomolbio.2022.08.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 05/27/2022] [Revised: 08/09/2022] [Accepted: 08/12/2022] [Indexed: 02/07/2023]
Abstract
Gene expression data is biological data from which meaningful hidden information about genes can be extracted. This gene information is used for disease diagnosis, especially in cancer treatment, based on the variations in gene expression levels. DNA microarray is an efficient method for gene expression classification and prediction of specific types of cancer. Due to the abundance of computing power, deep learning (DL) has become a widespread technique in the healthcare sector. A gene expression dataset has a limited number of samples but a large number of features. Data augmentation is needed for gene expression datasets to overcome the dimensionality problem in gene data. It is a technique for generating synthetic samples to increase the diversity of data. Deep learning methods are designed to learn and extract features from raw input data in the form of multidimensional arrays. This paper reviews the existing research on deep learning techniques such as Feed Forward Neural Network (FFN), Convolutional Neural Network (CNN), Autoencoder (AE) and Recurrent Neural Network (RNN) for the classification and prediction of cancer and its types through gene expression data analysis.
15
A binary biclustering algorithm based on the adjacency difference matrix for gene expression data analysis. BMC Bioinformatics 2022; 23:381. [PMID: 36123637 PMCID: PMC9484244 DOI: 10.1186/s12859-022-04842-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/07/2022] [Accepted: 07/14/2022] [Indexed: 11/20/2022] Open
Abstract
Biclustering is an effective tool for processing gene expression datasets. Biclustering methods process two kinds of data matrices: binary and non-binary. A binary matrix is usually converted from pre-processed gene expression data, which can effectively reduce the interference from noise and abnormal data, and is then processed using a biclustering algorithm. However, biclustering algorithms for binary data have a poor balance between running time and performance. In this paper, we propose a new biclustering algorithm for binary data, called the Adjacency Difference Matrix Binary Biclustering algorithm (AMBB), to address this drawback. The AMBB algorithm constructs the adjacency matrix based on the adjacency difference values, and the submatrix obtained by continuously updating the adjacency difference matrix is called a bicluster. The adjacency matrix allows genes that undergo similar reactions under different conditions to be grouped into clusters, which is important for subsequent gene analysis. Meanwhile, experiments on synthetic and real datasets visually demonstrate that the AMBB algorithm has high practicability.
16
Constrained Fourier estimation of short-term time-series gene expression data reduces noise and improves clustering and gene regulatory network predictions. BMC Bioinformatics 2022; 23:330. [PMID: 35945515 PMCID: PMC9364503 DOI: 10.1186/s12859-022-04839-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2021] [Accepted: 07/12/2022] [Indexed: 01/15/2023]
Abstract
BACKGROUND Biological data suffer from noise that is inherent in the measurements. This is particularly true for time-series gene expression measurements. Nevertheless, in order to explore cellular dynamics, scientists employ such noisy measurements in predictive and clustering tools. However, noisy data can not only obscure the genes' temporal patterns; applying predictive and clustering tools to noisy data may also yield inconsistent, and potentially incorrect, results. RESULTS To reduce the noise of short-term (< 48 h) time-series expression data, we relied on the three basic temporal patterns of gene expression: waves, impulses and sustained responses. We constrained the estimation of the true signals to these patterns by estimating the parameters of first and second-order Fourier functions and using the nonlinear least-squares trust-region optimization technique. Our approach lowered the noise in at least 85% of synthetic time-series expression data, significantly more than the spline method ([Formula: see text]). When the data contained a higher signal-to-noise ratio, our method allowed downstream network component analyses to calculate consistent and accurate predictions, particularly when the noise variance was high. Conversely, these tools led to erroneous results from untreated noisy data. Our results suggest that at least 5-7 time points are required to efficiently de-noise logarithmic scaled time-series expression data. Investing in sampling additional time points provides little benefit to clustering and prediction accuracy. CONCLUSIONS Our constrained Fourier de-noising method helps to cluster noisy gene expression and interpret dynamic gene networks more accurately. The benefit of noise reduction is large and can constitute the difference between a successful application and a failing one.
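For orientation, first- and second-order Fourier functions fitted by least squares are commonly written as below. The symbols (a_0, a_k, b_k, the base frequency \omega, and the parameter vector \theta) are notational assumptions rather than the paper's own, and the pattern-specific constraints that the abstract mentions are not reproduced here.

```latex
\begin{aligned}
f_1(t) &= a_0 + a_1\cos(\omega t) + b_1\sin(\omega t),\\
f_2(t) &= f_1(t) + a_2\cos(2\omega t) + b_2\sin(2\omega t),\\
\hat{\theta} &= \arg\min_{\theta}\sum_{i=1}^{n}\bigl(y_i - f(t_i;\theta)\bigr)^2 .
\end{aligned}
```

Here y_i is the measured expression at time point t_i. With only 5-7 time points, f_2 already uses six free parameters (including \omega), which is one plausible reason constraints on the fit matter.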
17
tensorGSEA: Detecting Differential Pathways in Type 2 Diabetes via Tensor-Based Data Reconstruction. Interdiscip Sci 2022; 14:520-531. [PMID: 35195883 DOI: 10.1007/s12539-022-00506-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2021] [Revised: 01/24/2022] [Accepted: 02/07/2022] [Indexed: 06/14/2023]
Abstract
Detecting significant signaling pathways in disease progression highlights the dysfunctions and pathogenic mechanisms of complex disease development. Since tensor decomposition has been proven effective for multi-dimensional data representation and reconstruction, differences between original and tensor-processed data are expected to extract crucial information and differential indication. This paper provides a tensor-based gene set enrichment analysis, called tensorGSEA, based on a data reconstruction method to identify relevant significant pathways during disease development. As a proof-of-concept study, we identify the differential pathways of diabetes in rats. Specifically, we first arrange gene expression profiles of each documented pathway as tensors with three dimensions: genes, samples, and periods. Then we compress tensors into core tensors with lower ranks. The pathways with lower reconstruction rates are obtained after reconstructing gene expression profiles in another state via these cores. Thus, differences underlying pathways are extracted by cross-state data reconstruction between controls and diseases. The experiments reveal several critical pathways with diabetes-specific functions which otherwise cannot be identified by alternative methods. Our proposed tensorGSEA is efficient in evaluating pathways by achieving their empirical statistical significance, respectively. The classification experiments demonstrate that the selected pathways can be implemented as biomarkers to identify the diabetic state. The code of tensorGSEA is available at https://github.com/zhxr37/tensorGSEA .
18
Identification of Alzheimer associated differentially expressed gene through microarray data and transfer learning-based image analysis. Neurosci Lett 2022; 766:136357. [PMID: 34808269 DOI: 10.1016/j.neulet.2021.136357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2021] [Accepted: 11/16/2021] [Indexed: 11/28/2022]
Abstract
Major factors contribute to mental stress and enhance the progression of late-onset Alzheimer's disease (AD). The factors that lead to neurodegeneration, such as tau protein hyperphosphorylation and increased amyloid-beta production, can be mimicked in animal stress models. The present study identifies differentially expressed genes (DEGs) data and its corresponding predictive image analysis in rat models. The gene expression profile of GSE72062, GSE85162, GSE143951 and GSE85238 was downloaded from NCBI, GEO archive to analyse DEGs. Functional enrichment and pathway relationship networks, gene signal, protein interaction and micro-RNA interaction DEGs networks were constructed and investigated. The image analysis of histopathological slides of rat brain images corresponding to AD microarray-based DEGs profile was undertaken using the convolution neural networks (ConvNets) model. Enrichment of network in terms of GO concluded with 10 DEGs, namely ARHGAP32, GNA11, NR5A1, GNAT3, FOSL1, HELZ2, NMUR2, BDKRB1, RPL3L and RPL39L as potential gene targets to control neurodegeneration and progression of sporadic AD. The image analysis of AD microarray-based DEGs profile builds a successful predictive model of 89% and 61% training and test accuracy with a minimum of 2.480% loss using transfer learning, VGG16 model. Interestingly, the ARHGAP32 gene, a Rho GTPase activating class, was identified to have a functional relationship with two significant genes BCL2 and MMP9, that are well explored in AD. The current investigation upgrades the traditional pre-clinical AD research using microarray data analysis and ConvNets. The model successfully predicts DEG from histopathology slides of rat brain samples, paving the way for image analysis to determine the underlying molecular makeup of the test samples.
19
Drug repositioning based on gene expression data for human HER2-positive breast cancer. Arch Biochem Biophys 2021; 712:109043. [PMID: 34597657 DOI: 10.1016/j.abb.2021.109043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2021] [Revised: 09/09/2021] [Accepted: 09/21/2021] [Indexed: 10/20/2022]
Abstract
Human epidermal growth factor receptor 2 (HER2)-positive breast cancer represents approximately 15-30% of all invasive breast cancers. Despite recent advances in therapeutic practice for the HER2 subtype, drug resistance and tumor recurrence remain major problems. Drug discovery is a long and difficult process, so the aim of this study is to find potential new applications for existing therapeutic agents. Gene expression data for breast invasive carcinoma were retrieved from The Cancer Genome Atlas (TCGA) database. The normal and tumor samples were analyzed using the Linear Models for Microarray Data (LIMMA) R package to find the differentially expressed genes (DEGs). These genes were used as input to the library of integrated network-based cellular signatures (LINCS) L1000CDS2 software, which suggested 24 repurposed drugs. According to the obtained results, some of these drugs, including vorinostat, mocetinostat, alvocidib, CGP-60474, BMS-387032, AT-7519, and curcumin, have significant functional similarity and structural correlation with FDA-approved breast cancer drugs. In the drug-target network, which consisted of the repurposed drugs and their target genes, the aforementioned drugs had the highest degrees. Moreover, the experimental approach verified curcumin as an effective therapeutic agent for HER2-positive breast cancer. Hence, our work suggests that some repurposed drugs identified from gene expression data can be considered potential drugs for the treatment of HER2-positive breast cancer.
20
An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples. PeerJ Comput Sci 2021; 7:e671. [PMID: 34616883 PMCID: PMC8459790 DOI: 10.7717/peerj-cs.671] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/18/2020] [Accepted: 07/20/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND Machine learning is one kind of machine intelligence technique that learns from data and detects inherent patterns from large, complex datasets. Due to this capability, machine learning techniques are widely used in medical applications, especially where large-scale genomic and proteomic data are used. Cancer classification based on bio-molecular profiling data is a very important topic for medical applications since it improves the diagnostic accuracy of cancer and enables a successful culmination of cancer treatments. Hence, machine learning techniques are widely used in cancer detection and prognosis. METHODS In this article, a new ensemble machine learning classification model named Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC) is proposed which can handle class imbalance problem and high dimensionality of microarray datasets. This model first generates a number of bootstrapped datasets from the original training data where the oversampling procedure is applied to handle the class imbalance problem. The proposed MFSAC method is then applied to each of these bootstrapped datasets to generate sub-datasets, each of which contains a subset of the most relevant/informative attributes of the original dataset. The MFSAC method is a feature selection technique combining multiple filters with a new supervised attribute clustering algorithm. Then for every sub-dataset, a base classifier is constructed separately, and finally, the predictive accuracy of these base classifiers is combined using the majority voting technique forming the MFSAC-based ensemble classifier. Also, a number of most informative attributes are selected as important features based on their frequency of occurrence in these sub-datasets. RESULTS To assess the performance of the proposed MFSAC-EC model, it is applied on different high-dimensional microarray gene expression datasets for cancer sample classification. 
The proposed model is compared with well-known existing models to establish its effectiveness with respect to other models. From the experimental results, it has been found that the generalization performance/testing accuracy of the proposed classifier is significantly better compared to other well-known existing models. Apart from that, it has been also found that the proposed model can identify many important attributes/biomarker genes.
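The bootstrap, majority-voting and frequency-based selection steps described in this abstract can be sketched as follows. This is a minimal illustration only: the function names and the toy threshold classifiers are hypothetical, and the MFSAC filtering and supervised attribute clustering steps themselves are not reproduced.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrapped dataset)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(base_classifiers, sample):
    """Combine the base classifiers' labels for one sample by majority vote."""
    return majority_vote([clf(sample) for clf in base_classifiers])


def select_frequent_features(feature_subsets, k):
    """Keep the k features that occur most often across the sub-datasets."""
    counts = Counter(f for subset in feature_subsets for f in subset)
    return [name for name, _ in counts.most_common(k)]


# Hypothetical base classifiers: each thresholds a different view of the sample.
base = [
    lambda x: "tumor" if x[0] > 0.5 else "normal",
    lambda x: "tumor" if x[1] > 0.5 else "normal",
    lambda x: "tumor" if sum(x) / len(x) > 0.5 else "normal",
]
label = ensemble_predict(base, (0.9, 0.2, 0.7))
top_features = select_frequent_features(
    [["g1", "g2"], ["g1", "g3"], ["g1", "g2"]], k=2
)
```

In this toy run two of the three base classifiers vote "tumor", so the ensemble label is "tumor", and g1 and g2 are the most frequently selected features.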
21
Correlation-based joint feature screening for semi-competing risks outcomes with application to breast cancer data. Stat Methods Med Res 2021; 30:2428-2446. [PMID: 34519231 DOI: 10.1177/09622802211037071] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/17/2022]
Abstract
Ultrahigh-dimensional gene features are often collected in modern cancer studies in which the number of gene features p is extremely larger than sample size n. While gene expression patterns have been shown to be related to patients' survival in microarray-based gene expression studies, one has to deal with the challenges of ultrahigh-dimensional genetic predictors for survival predicting and genetic understanding of the disease in precision medicine. The problem becomes more complicated when two types of survival endpoints, distant metastasis-free survival and overall survival, are of interest in the study and outcome data can be subject to semi-competing risks due to the fact that distant metastasis-free survival is possibly censored by overall survival but not vice versa. Our focus in this paper is to extract important features, which have great impacts on both distant metastasis-free survival and overall survival jointly, from massive gene expression data in the semi-competing risks setting. We propose a model-free screening method based on the ranking of the correlation between gene features and the joint survival function of two endpoints. The method accounts for the relationship between two endpoints in a simply defined utility measure that is easy to understand and calculate. We show its favorable theoretical properties such as the sure screening and ranking consistency, and evaluate its finite sample performance through extensive simulation studies. Finally, an application to classifying breast cancer data clearly demonstrates the utility of the proposed method in practice.
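The general idea of correlation-ranked screening can be sketched as below. Note the substitution: the paper ranks features by their correlation with the joint survival function of two censored endpoints, whereas this sketch uses plain Pearson correlation against a single uncensored outcome. The function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_features(features, outcome, d):
    """Rank features by |correlation| with the outcome and keep the top d.

    features: dict mapping feature name -> list of values, one per sample.
    """
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], outcome)),
        reverse=True,
    )
    return ranked[:d]


# Toy data: g1 tracks the outcome, g3 is strongly anti-correlated, g2 is weak.
toy = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.0, -1.0, 1.0, -1.0],
    "g3": [4.0, 3.0, 2.0, 1.5],
}
kept = screen_features(toy, [1.0, 2.0, 3.0, 4.0], d=2)
```

Screening of this kind reduces p to a manageable d before any model is fitted, which is why it is used when p is far larger than n.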
22
Cancer diagnosis using generative adversarial networks based on deep learning from imbalanced data. Comput Biol Med 2021; 135:104540. [PMID: 34153791 DOI: 10.1016/j.compbiomed.2021.104540] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/22/2020] [Revised: 05/14/2021] [Accepted: 05/26/2021] [Indexed: 11/19/2022]
Abstract
BACKGROUND AND OBJECTIVE Cancer is a serious global disease due to its high mortality, and the key to effective treatment is accurate diagnosis. However, limited by sampling difficulty and actual sample size in clinical practice, data imbalance is a common problem in cancer diagnosis, while most conventional classification methods assume balanced data distribution. Therefore, addressing the imbalanced learning problem to improve the predictive performance of cancer diagnosis is significant. METHODS In the study, we dissect the data imbalance prevalent in cancer gene expression data and present an improved deep learning based Wasserstein generative adversarial network (WGAN) model, which provides a reliable training progress indicator and deeply explores the characteristics of data. The WGAN generates new samples from the minority class and solves the imbalance problem at the data level. RESULTS We analyze three publicly available data sets on RNA-seq of three kinds of cancer using the proposed WGAN and compare the results with those from two commonly adopted sampling methods. According to the results, through addressing the data imbalance problem, the balanced data distribution and the expanding sample size increase the prediction accuracy in all three data sets. CONCLUSIONS Therefore, the proposed WGAN method is superior in solving the imbalanced learning problem of gene expression data, providing significantly better prediction performance in cancer diagnosis.
23
Protein functional module identification method combining topological features and gene expression data. BMC Genomics 2021; 22:423. [PMID: 34103008 PMCID: PMC8185953 DOI: 10.1186/s12864-021-07620-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2021] [Accepted: 04/08/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND The study of protein complexes and protein functional modules has become an important method to further understand the mechanism and organization of life activities. Clustering algorithms that analyze the information contained in protein-protein interaction (PPI) networks are effective ways to explore the characteristics of protein functional modules. RESULTS This paper conducts an intensive study of the problems of low recognition efficiency and noise in the overlapping structure of protein functional modules, based on the topological characteristics of the PPI network. We develop ECTG, a protein functional module recognition method based on topological features and gene expression data for protein complex identification. CONCLUSIONS By using the similarity of gene expression patterns, the algorithm can effectively remove the noise reflected in the topological structure characteristic values computed on the PPI network, and it also makes proper use of the information hidden in the gene expression data. The experimental results show that the ECTG algorithm can detect protein functional modules better.
24
Overview of Gene Regulatory Network Inference Based on Differential Equation Models. Curr Protein Pept Sci 2021; 21:1054-1059. [PMID: 32053072 DOI: 10.2174/1389203721666200213103350] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/22/2019] [Revised: 11/22/2019] [Accepted: 01/09/2020] [Indexed: 11/22/2022]
Abstract
Reconstruction of gene regulatory networks (GRNs) plays an important role in understanding the complexity, functionality and pathways of biological systems, which could support the design of new drugs for diseases. Because differential equation models are flexible and robust, these models have been utilized to identify biochemical reactions and gene regulatory networks. This paper investigates differential equation models for reverse engineering gene regulatory networks. We introduce three kinds of differential equation models: ordinary differential equations (ODE), time-delayed differential equations (TDDE) and stochastic differential equations (SDE). ODE models include linear ODE, nonlinear ODE and S-system models. We also discuss the evolutionary algorithms that are utilized to search for the optimal structures and parameters of differential equation models. This investigation could provide a comprehensive understanding of differential equation models, and lead to the discovery of novel differential equation models.
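For reference, the model families named in this abstract are commonly written in the following textbook forms, where x_i(t) is the expression level of gene i. The notation (a_{ij}, \alpha_i, \beta_i, g_{ij}, h_{ij}, \tau_{ij}, \sigma_i, W_i) is a conventional choice and an assumption here, not necessarily that of the review.

```latex
\begin{aligned}
\text{linear ODE:}\quad & \frac{dx_i}{dt} = \sum_{j=1}^{n} a_{ij}\,x_j(t),\\
\text{S-system:}\quad & \frac{dx_i}{dt} = \alpha_i \prod_{j=1}^{n} x_j^{\,g_{ij}} - \beta_i \prod_{j=1}^{n} x_j^{\,h_{ij}},\\
\text{TDDE:}\quad & \frac{dx_i}{dt} = \sum_{j=1}^{n} a_{ij}\,x_j(t-\tau_{ij}),\\
\text{SDE:}\quad & dx_i = f_i(\mathbf{x})\,dt + \sigma_i\,dW_i(t).
\end{aligned}
```

In each case a nonzero coefficient linking gene j to gene i (a_{ij}, g_{ij} or h_{ij}) is read as a regulatory edge, so inferring the network amounts to estimating which coefficients are nonzero.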
25
Learning gene regulatory networks using gaussian process emulator and graphical LASSO. J Bioinform Comput Biol 2021; 19:2150007. [PMID: 33930997 DOI: 10.1142/s0219720021500074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Much research effort has focused on learning gene regulatory networks (GRNs) from gene expression data to understand the functional basis of a living organism. Under the assumption that the joint distribution of the gene expressions of interest is a multivariate normal distribution, such networks can be constructed by assessing the nonzero elements of the inverse covariance matrix, the so-called precision matrix or concentration matrix. Because this considers only pairwise linear correlations, it may not reflect the true connectivity between genes. To relax this restrictive constraint, we employ the Gaussian process (GP) model, which is well known as a computationally efficient non-parametric Bayesian machine learning technique. GPs belong to a class of methods known as kernel machines, which can be used to approximate complex problems by tuning their hyperparameters. In fact, the GP makes it possible to use the capacity and potential of different kernels in constructing the precision matrix and GRNs. In this paper, we first choose a GP with an appropriate kernel to learn the considered GRNs from the observed genetic data, and then estimate the kernel hyperparameters using a rule-of-thumb technique. Using these hyperparameters, we can also control the degree of sparseness in the precision matrix. We then obtain a kernel-based precision matrix, similar to GLASSO, to construct a kernel-based GRN. Applying this approach to different species of Drosophila fly, rather than simply assuming a multivariate normal distribution, yields GRNs with high performance, and the GPs perform much better than the multivariate Gaussian distribution assumption.
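Reading a network off the nonzero entries of a precision matrix, as described above for the multivariate normal case, can be sketched as follows. The precision matrix is a made-up 3-gene example and the function names are hypothetical; estimating the matrix itself (GLASSO or the kernel-based variant) is not shown.

```python
import math


def partial_correlations(theta):
    """Partial correlations from a precision matrix: rho_ij = -theta_ij / sqrt(theta_ii * theta_jj)."""
    p = len(theta)
    return [
        [
            1.0 if i == j else -theta[i][j] / math.sqrt(theta[i][i] * theta[j][j])
            for j in range(p)
        ]
        for i in range(p)
    ]


def network_edges(theta, names, tol=1e-8):
    """List (gene_a, gene_b, partial correlation) for every nonzero off-diagonal entry."""
    rho = partial_correlations(theta)
    return [
        (names[i], names[j], rho[i][j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if abs(theta[i][j]) > tol
    ]


# Hypothetical sparse precision matrix for three genes A, B, C.
theta = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -0.5],
    [0.0, -0.5, 1.0],
]
edges = network_edges(theta, ["A", "B", "C"])
```

The zero entry between A and C means the two genes are conditionally independent given B under the Gaussian assumption, so no A-C edge is drawn.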
26
An efficient ensemble method for missing value imputation in microarray gene expression data. BMC Bioinformatics 2021; 22:188. [PMID: 33849444 PMCID: PMC8045198 DOI: 10.1186/s12859-021-04109-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/30/2020] [Accepted: 03/29/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND Genomics data analysis has been widely used to study disease genes and drug targets. However, missing values in genomics datasets pose a significant problem that severely hinders the use of genomics data. Current imputation methods based on a single learner often exploit less of the known genomic data information and thus lose imputation performance. RESULTS In this study, multiple single imputation methods are combined into one imputation method by ensemble learning. In the ensemble method, bootstrap sampling is applied for predictions of missing values by each component method, and these predictions are weighted and summed to produce the final prediction. The optimal weights are learned from known gene data by minimizing a cost function of the imputation error, and the expression for the optimal weights is derived in closed form. Additionally, the performance of the ensemble method is analytically investigated in terms of the sum of squared regression errors. The proposed method is simulated on several typical genomic datasets and compared with state-of-the-art imputation methods at different noise levels, sample sizes and data missing rates. Experimental results show that the proposed method achieves improved imputation performance in terms of imputation accuracy, robustness and generalization. CONCLUSION The ensemble method possesses superior imputation performance since it makes use of known data information more efficiently for missing data imputation by integrating diverse imputation methods and learning the integration weights in a data-driven way.
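To illustrate what a closed-form weight for combining imputers can look like, the sketch below treats the simplest case: two component methods whose weights sum to one, fitted by least squares on entries whose true values are known. This two-learner setup, the sum-to-one constraint and all names are assumptions for illustration; the abstract does not give the paper's actual cost function or weight expression.

```python
def optimal_weight(truth, pred_a, pred_b):
    """Weight w on method A (1 - w on B) minimising sum((y - (w*a + (1-w)*b))**2).

    Setting the derivative to zero gives
    w = sum((y - b) * (a - b)) / sum((a - b)**2).
    """
    num = sum((y - b) * (a - b) for y, a, b in zip(truth, pred_a, pred_b))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if den == 0:
        return 0.5  # the two methods agree everywhere, so any weight is optimal
    return num / den


def combine(pred_a, pred_b, w):
    """Weighted sum of the two methods' predictions."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]


# Known values and two imputers with opposite biases (toy numbers).
truth = [1.0, 2.0, 3.0]
method_a = [1.2, 2.2, 3.2]  # overestimates by 0.2
method_b = [0.6, 1.6, 2.6]  # underestimates by 0.4
w = optimal_weight(truth, method_a, method_b)
blended = combine(method_a, method_b, w)
```

Because the two toy imputers err in opposite directions, the learned weight of about 2/3 on method A cancels both biases and the blend recovers the known values.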
27
Explaining decisions of graph convolutional neural networks: patient-specific molecular subnetworks responsible for metastasis prediction in breast cancer. Genome Med 2021; 13:42. [PMID: 33706810 PMCID: PMC7953710 DOI: 10.1186/s13073-021-00845-7] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 08/26/2020] [Accepted: 02/05/2021] [Indexed: 12/19/2022]
Abstract
Background Contemporary deep learning approaches show cutting-edge performance in a variety of complex prediction tasks. Nonetheless, the application of deep learning in healthcare remains limited since deep learning methods are often considered as non-interpretable black-box models. However, the machine learning community made recent elaborations on interpretability methods explaining data point-specific decisions of deep learning techniques. We believe that such explanations can assist the need in personalized precision medicine decisions via explaining patient-specific predictions. Methods Layer-wise Relevance Propagation (LRP) is a technique to explain decisions of deep learning methods. It is widely used to interpret Convolutional Neural Networks (CNNs) applied on image data. Recently, CNNs started to extend towards non-Euclidean domains like graphs. Molecular networks are commonly represented as graphs detailing interactions between molecules. Gene expression data can be assigned to the vertices of these graphs. In other words, gene expression data can be structured by utilizing molecular network information as prior knowledge. Graph-CNNs can be applied to structured gene expression data, for example, to predict metastatic events in breast cancer. Therefore, there is a need for explanations showing which part of a molecular network is relevant for predicting an event, e.g., distant metastasis in cancer, for each individual patient. Results We extended the procedure of LRP to make it available for Graph-CNN and tested its applicability on a large breast cancer dataset. We present Graph Layer-wise Relevance Propagation (GLRP) as a new method to explain the decisions made by Graph-CNNs. We demonstrate a sanity check of the developed GLRP on a hand-written digits dataset and then apply the method on gene expression data. 
We show that GLRP provides patient-specific molecular subnetworks that largely agree with clinical knowledge and identify common as well as novel, and potentially druggable, drivers of tumor progression. Conclusions The developed method could be potentially highly useful on interpreting classification results in the context of different omics data and prior knowledge molecular networks on the individual patient level, as for example in precision medicine approaches or a molecular tumor board. Supplementary Information The online version contains supplementary material available at (10.1186/s13073-021-00845-7).
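As background for how LRP redistributes a prediction onto inputs, here is a minimal sketch of the generic epsilon rule for a single dense layer. It is not the graph-convolution-specific GLRP procedure developed in the paper; the function name, layer sizes and numbers are illustrative assumptions.

```python
def lrp_epsilon(activations, weights, biases, relevance_out, eps=1e-6):
    """Propagate relevance through one dense layer with the LRP-epsilon rule.

    activations:   inputs a_i to the layer (length n_in)
    weights:       weights[i][j] connects input i to output j
    biases:        bias b_j per output
    relevance_out: relevance R_j assigned to each output
    Returns R_i = sum_j a_i * w_ij / (z_j + eps * sign(z_j)) * R_j.
    """
    n_in, n_out = len(activations), len(relevance_out)
    z = [
        sum(activations[i] * weights[i][j] for i in range(n_in)) + biases[j]
        for j in range(n_out)
    ]
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        # The stabiliser keeps the denominator away from zero.
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n_in):
            relevance_in[i] += activations[i] * weights[i][j] / denom * relevance_out[j]
    return relevance_in


# Two inputs (e.g. two genes), two outputs, zero biases.
a = [1.0, 2.0]
W = [[0.5, -1.0], [0.25, 1.0]]
R_out = [1.0, 0.5]
R_in = lrp_epsilon(a, W, [0.0, 0.0], R_out)
```

With zero biases the rule approximately conserves relevance, so the input relevances sum to roughly the same total as the output relevances; per-input scores of this kind are what get mapped back onto network vertices.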
28
High performance logistic regression for privacy-preserving genome analysis. BMC Med Genomics 2021; 14:23. [PMID: 33472626 PMCID: PMC7818577 DOI: 10.1186/s12920-020-00869-9] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 03/02/2020] [Accepted: 12/30/2020] [Indexed: 11/30/2022]
Abstract
Background In biomedical applications, valuable data is often split between owners who cannot openly share the data because of privacy regulations and concerns. Training machine learning models on the joint data without violating privacy is a major technology challenge that can be addressed by combining techniques from machine learning and cryptography. When collaboratively training machine learning models with the cryptographic technique named secure multi-party computation, the price paid for keeping the data of the owners private is an increase in computational cost and runtime. A careful choice of machine learning techniques, algorithmic and implementation optimizations are a necessity to enable practical secure machine learning over distributed data sets. Such optimizations can be tailored to the kind of data and Machine Learning problem at hand. Methods Our setup involves secure two-party computation protocols, along with a trusted initializer that distributes correlated randomness to the two computing parties. We use a gradient descent based algorithm for training a logistic regression like model with a clipped ReLu activation function, and we break down the algorithm into corresponding cryptographic protocols. Our main contributions are a new protocol for computing the activation function that requires neither secure comparison protocols nor Yao’s garbled circuits, and a series of cryptographic engineering optimizations to improve the performance. Results For our largest gene expression data set, we train a model that requires over 7 billion secure multiplications; the training completes in about 26.90 s in a local area network. The implementation in this work is a further optimized version of the implementation with which we won first place in Track 4 of the iDASH 2019 secure genome analysis competition. 
Conclusions In this paper, we present a secure logistic regression training protocol and its implementation, with a new subprotocol to securely compute the activation function. To the best of our knowledge, we present the fastest existing secure multi-party computation implementation for training logistic regression models on high dimensional genome data distributed across a local area network.
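To make the "logistic regression like model with a clipped ReLu activation" concrete, the sketch below trains such a model in plaintext with batch gradient descent. It contains no cryptography at all. The specific clipping form min(1, max(0, z + 0.5)) and the update rule (prediction minus label, times input) are assumptions chosen for illustration, and the function names and toy data are hypothetical.

```python
def clipped_relu(z):
    """Piecewise-linear stand-in for the sigmoid: 0 below -0.5, 1 above 0.5, linear between."""
    return min(1.0, max(0.0, z + 0.5))


def predict(w, b, x):
    """Model output in [0, 1] for one sample x."""
    return clipped_relu(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train(X, y, lr=0.1, epochs=200):
    """Batch gradient descent using the logistic-regression-style error (prediction - label)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# One-feature toy problem: low values are class 0, high values are class 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = train(X, y)
```

A piecewise-linear activation needs only comparisons, additions and multiplications, which is the usual motivation for preferring it over the exponential in the sigmoid when every operation must run under secure computation.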
29
Biomarker discovery for predicting spontaneous preterm birth from gene expression data by regularized logistic regression. Comput Struct Biotechnol J 2020; 18:3434-3446. [PMID: 33294138 PMCID: PMC7689379 DOI: 10.1016/j.csbj.2020.10.028] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 06/28/2020] [Revised: 10/24/2020] [Accepted: 10/25/2020] [Indexed: 01/23/2023]
Abstract
In this work, we provide a computational method of regularized logistic regression for discovering biomarkers of spontaneous preterm birth (SPTB) from gene expression data. The successful identification of SPTB biomarkers will greatly benefit the interference of infant gestational age for reducing the risks of pregnant women and preemies. In recent years, various feature selection approaches have been proposed for identifying the subset of meaningful genes that can accurately classify disease samples from controls. Here, we comprehensively summarize regularized logistic regression with seven effective penalties developed for the selection of strongly indicative genes of SPTB from microarray data. We compare their properties and assess their classification performance on multiple datasets. The results show that the elastic net, lasso, L1/2 and SCAD penalties perform better than the others and can be successfully used to identify biomarkers of SPTB. In particular, we perform a functional enrichment analysis on these biomarkers and construct a logistic regression classifier based on them. The classifier generates a preterm risk score (PRS) as an indicator for predicting SPTB. Based on the trained predictor, we verify the identified biomarkers on an independent dataset. The biomarkers achieve an AUC value of 0.933 in the SPTB classification. The results demonstrate the effectiveness and efficiency of this strategy of biomarker discovery with regularized logistic regression. The proposed method of discovering biomarkers for SPTB can be readily extended to other complex diseases.
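Four of the penalties named above can be written per coefficient as in the following sketch. The formulas are the standard textbook forms (elastic net in the glmnet-style parameterisation, SCAD with the customary a = 3.7); whether the paper uses exactly these parameterisations is an assumption, and the function names are hypothetical.

```python
def lasso(beta, lam):
    """L1 penalty: lam * |beta|."""
    return lam * abs(beta)


def elastic_net(beta, lam, alpha=0.5):
    """Mix of L1 and L2: lam * (alpha * |beta| + (1 - alpha) / 2 * beta**2)."""
    return lam * (alpha * abs(beta) + (1 - alpha) * 0.5 * beta ** 2)


def l_half(beta, lam):
    """L1/2 penalty: lam * |beta|**0.5."""
    return lam * abs(beta) ** 0.5


def scad(beta, lam, a=3.7):
    """SCAD penalty: L1 near zero, quadratic transition, constant for large |beta|."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


# Penalty paid by a small and a large coefficient under each scheme (lam = 1).
small = {f.__name__: f(0.5, 1.0) for f in (lasso, elastic_net, l_half, scad)}
large = {f.__name__: f(10.0, 1.0) for f in (lasso, elastic_net, l_half, scad)}
```

Comparing the two dictionaries shows the practical difference: lasso and elastic net keep charging large coefficients more, while SCAD flattens out, which is the property usually cited for its lower bias on strong predictors.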
30
Impact of data preprocessing on cell-type clustering based on single-cell RNA-seq data. BMC Bioinformatics 2020; 21:440. [PMID: 33028196 PMCID: PMC7541255 DOI: 10.1186/s12859-020-03797-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/20/2020] [Accepted: 10/01/2020] [Indexed: 01/08/2023]
Abstract
Background Advances in single-cell RNA-seq technology have led to great opportunities for the quantitative characterization of cell types, and many clustering algorithms have been developed based on single-cell gene expression. However, we found that different data preprocessing methods show quite different effects on clustering algorithms. Moreover, there is no specific preprocessing method that is applicable to all clustering algorithms, and even for the same clustering algorithm, the best preprocessing method depends on the input data. Results We designed a graph-based algorithm, SC3-e, specifically for discriminating the best data preprocessing method for SC3, which is currently the most widely used clustering algorithm for single cell clustering. When tested on eight frequently used single-cell RNA-seq data sets, SC3-e always accurately selects the best data preprocessing method for SC3 and therefore greatly enhances the clustering performance of SC3. Conclusion The SC3-e algorithm is practically powerful for discriminating the best data preprocessing method, and therefore largely enhances the performance of cell-type clustering of SC3. It is expected to play a crucial role in the related studies of single-cell clustering, such as the studies of human complex diseases and discoveries of new cell types.
|
31
|
LogicNet: probabilistic continuous logics in reconstructing gene regulatory networks. BMC Bioinformatics 2020; 21:318. [PMID: 32690031 PMCID: PMC7372900 DOI: 10.1186/s12859-020-03651-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/16/2019] [Accepted: 07/10/2020] [Indexed: 11/10/2022] Open
Abstract
Background Gene Regulatory Networks (GRNs) have previously been studied using Boolean/multi-state logics. While gene expression values are usually scaled into the range [0, 1], these GRN inference methods apply a threshold to discretize the data, resulting in a loss of information. Most studies apply fuzzy logics to infer the logical gene-gene interactions from continuous data. However, all these approaches require an a priori known network structure. Results Here, by introducing a new probabilistic logic for continuous data, we propose a novel logic-based approach (called LogicNet) for the simultaneous reconstruction of the GRN structure and identification of the logics among the regulatory genes from continuous gene expression data. In contrast to the previous approaches, LogicNet does not require an a priori known network structure to infer the logics. The proposed probabilistic logic is superior to the existing fuzzy logics and is more relevant to biological contexts. The performance of LogicNet is superior to that of several Mutual Information-based and regression-based tools for reconstructing GRNs. Conclusions LogicNet reconstructs GRNs and logic functions without requiring prior knowledge of the network structure. Moreover, in another application, LogicNet can be applied to detect logic functions from known regulatory gene-target interactions. We also conclude that computational modeling of the logical interactions among the regulatory genes significantly improves the GRN reconstruction accuracy.
|
32
|
Sparse relative risk regression models. Biostatistics 2020; 21:e131-e147. [PMID: 30380025 PMCID: PMC7868056 DOI: 10.1093/biostatistics/kxy060] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/13/2017] [Revised: 09/20/2018] [Accepted: 09/24/2018] [Indexed: 11/15/2022] Open
Abstract
Clinical studies where patients are routinely screened for many genomic features are becoming more routine. In principle, this holds the promise of being able to find genomic signatures for a particular disease. In particular, cancer survival is thought to be closely linked to the genomic constitution of the tumor. Discovering such signatures will be useful in the diagnosis of the patient, may be used for treatment decisions and, perhaps, even the development of new treatments. However, genomic data are typically noisy and high-dimensional, not rarely outstripping the number of patients included in the study. Regularized survival models have been proposed to deal with such scenarios. These methods typically induce sparsity by means of a coincidental match of the geometry of the convex likelihood and a (near) non-convex regularizer. The disadvantages of such methods are that they are typically non-invariant to scale changes of the covariates, they struggle with highly correlated covariates, and they have a practical problem of determining the amount of regularization. In this article, we propose an extension of the differential geometric least angle regression method for sparse inference in relative risk regression models. A software implementation of our method is available on github (https://github.com/LuigiAugugliaro/dgcox).
|
33
|
Comparison of pathway and gene-level models for cancer prognosis prediction. BMC Bioinformatics 2020; 21:76. [PMID: 32111152 PMCID: PMC7048092 DOI: 10.1186/s12859-020-3423-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 09/24/2019] [Accepted: 02/17/2020] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND Cancer prognosis prediction is valuable for patients and clinicians because it allows them to appropriately manage care. A promising direction for improving the performance and interpretation of expression-based predictive models involves the aggregation of gene-level data into biological pathways. While many studies have used pathway-level predictors for cancer survival analysis, a comprehensive comparison of pathway-level and gene-level prognostic models has not been performed. To address this gap, we characterized the performance of penalized Cox proportional hazard models built using either pathway- or gene-level predictors for the cancers profiled in The Cancer Genome Atlas (TCGA) and pathways from the Molecular Signatures Database (MSigDB). RESULTS When analyzing TCGA data, we found that pathway-level models are more parsimonious, more robust, more computationally efficient and easier to interpret than gene-level models with similar predictive performance. For example, both pathway-level and gene-level models have an average Cox concordance index of ~ 0.85 for the TCGA glioma cohort, however, the gene-level model has twice as many predictors on average, the predictor composition is less stable across cross-validation folds and estimation takes 40 times as long as compared to the pathway-level model. When the complex correlation structure of the data is broken by permutation, the pathway-level model has greater predictive performance while still retaining superior interpretative power, robustness, parsimony and computational efficiency relative to the gene-level models. For example, the average concordance index of the pathway-level model increases to 0.88 while the gene-level model falls to 0.56 for the TCGA glioma cohort using survival times simulated from uncorrelated gene expression data. 
CONCLUSION The results of this study show that when the correlations among gene expression values are low, pathway-level analyses can yield better predictive performance, greater interpretative power, more robust models and less computational cost relative to a gene-level model. When correlations among genes are high, a pathway-level analysis provides equivalent predictive power compared to a gene-level analysis while retaining the advantages of interpretability, robustness and computational efficiency.
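The concordance index quoted in this abstract can be made concrete with a small sketch. The function below computes Harrell's C for right-censored survival data in plain Python; the function name, argument layout, and toy inputs are illustrative assumptions, and the study itself may have used a different estimator or software package.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk.

    times  -- observed follow-up times
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    risks  -- predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's observed time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and values such as the ~0.85 and 0.56 reported above fall between those extremes.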
|
34
|
PCA via joint graph Laplacian and sparse constraint: Identification of differentially expressed genes and sample clustering on gene expression data. BMC Bioinformatics 2019; 20:716. [PMID: 31888433 PMCID: PMC6936054 DOI: 10.1186/s12859-019-3229-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022] Open
Abstract
Background In recent years, identification of differentially expressed genes and sample clustering have become hot topics in bioinformatics. Principal Component Analysis (PCA) is a widely used method for gene expression data. However, it has two limitations. First, the geometric structure hidden in the data, e.g., the pair-wise distance between data points, has not been explored, although this information can facilitate sample clustering. Second, the Principal Components (PCs) determined by PCA are dense, which makes them hard to interpret, whereas only a few genes are related to the cancer. Identifying a handful of differentially expressed genes and finding new cancer biomarkers is of great significance for the early diagnosis and treatment of cancer. Results In this study, a new method, gLSPCA, is proposed to integrate both a graph Laplacian and a sparse constraint into PCA. On the one hand, gLSPCA improves the clustering accuracy by exploring the internal geometric structure of the data; on the other hand, it identifies differentially expressed genes by imposing a sparsity constraint on the PCs. Conclusions Experiments with gLSPCA and comparisons with existing methods, including Z-SPCA, GPower, PathSPCA, SPCArt, and gLPCA, are performed on real datasets of both pancreatic cancer (PAAD) and head & neck squamous carcinoma (HNSC). The results demonstrate that gLSPCA is effective in identifying differentially expressed genes and in sample clustering. In addition, the applications of gLSPCA to these datasets provide several new clues for the exploration of causative factors of PAAD and HNSC.
|
35
|
Multi-cancer samples clustering via graph regularized low-rank representation method under sparse and symmetric constraints. BMC Bioinformatics 2019; 20:718. [PMID: 31888442 PMCID: PMC6936083 DOI: 10.1186/s12859-019-3231-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 12/24/2022] Open
Abstract
Background Identifying different types of cancer based on gene expression data has become a hotspot in bioinformatics research. Clustering gene expression data from multiple cancers into their own classes is an important solution. However, the high dimensionality and small sample sizes of gene expression data, together with the noise in the data, make data mining and research difficult. Although there are many effective and feasible methods to deal with this problem, the possibility remains that these methods are flawed. Results In this paper, we propose the graph regularized low-rank representation under symmetric and sparse constraints (sgLRR) method, in which we introduce graph regularization based on manifold learning and symmetric sparse constraints into the traditional low-rank representation (LRR). In the sgLRR method, the symmetric and sparse constraints alleviate the effect of raw data noise on the low-rank representation. Further, the sgLRR method preserves the important intrinsic local geometrical structures of the raw data by introducing graph regularization. We apply this method to cluster multi-cancer samples based on gene expression data, which improves the clustering quality. First, the gene expression data are decomposed by the sgLRR method, yielding a lowest-rank representation matrix that is symmetric and sparse. Then, an affinity matrix is constructed and the multi-cancer samples are clustered with a spectral clustering algorithm, i.e., normalized cuts (Ncuts). Conclusions A series of comparative experiments demonstrates that the sgLRR method based on low-rank representation has a great advantage and remarkable performance in the clustering of multi-cancer samples.
|
36
|
Utilizing Molecular Network Information via Graph Convolutional Neural Networks to Predict Metastatic Event in Breast Cancer. Stud Health Technol Inform 2019; 267:181-186. [PMID: 31483271 DOI: 10.3233/shti190824] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 11/15/2022]
Abstract
Gene expression data is commonly available in cancer research and provides a snapshot of the molecular status of a specific tumor tissue. This high-dimensional data can be analyzed for diagnoses, prognoses, and to suggest treatment options. Machine learning based methods are widely used for such analysis. Recently, a set of deep learning techniques was successfully applied in different domains including bioinformatics. One of these prominent techniques are convolutional neural networks (CNN). Currently, CNNs are extending to non-Euclidean domains like graphs. Molecular networks are commonly represented as graphs detailing interactions between molecules. Gene expression data can be assigned to the vertices of these graphs, and the edges can depict interactions, regulations and signal flow. In other words, gene expression data can be structured by utilizing molecular network information as prior knowledge. Here, we applied graph CNN to gene expression data of breast cancer patients to predict the occurrence of metastatic events. To structure the data we utilized a protein-protein interaction network. We show that the graph CNN exploiting the prior knowledge is able to provide classification improvements for the prediction of metastatic events compared to existing methods.
|
37
|
Abstract
Inferring gene regulatory networks from expression data is a very challenging problem that has raised the interest of the scientific community. Different algorithms have been proposed to try to solve this issue, but it has been shown that different methods have some particular biases and strengths, and none of them is the best across all types of data and datasets. As a result, the idea of aggregating various network inferences through a consensus mechanism naturally arises. In this chapter, a common framework to standardize already proposed consensus methods is presented, and based on this framework different proposals are introduced and analyzed in two different scenarios: Homogeneous and Heterogeneous. The first scenario reflects situations where the networks to be aggregated are rather similar because they are obtained with inference algorithms working on the same data, whereas the second scenario deals with very diverse networks because various sources of data are used to generate the individual networks. A procedure for combining multiple network inference algorithms is analyzed in a systematic way. The results show that there is a very significant difference between these two scenarios, and that the best way to combine networks in the Heterogeneous scenario is not the most commonly used. We show in particular that aggregation in the Heterogeneous scenario can be very beneficial if the individual networks are combined with our new proposed method ScaleLSum.
|
38
|
Differentially expressed genes between systemic sclerosis and rheumatoid arthritis. Hereditas 2019; 156:17. [PMID: 31178673 PMCID: PMC6549285 DOI: 10.1186/s41065-019-0091-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/23/2019] [Accepted: 05/10/2019] [Indexed: 12/23/2022] Open
Abstract
Background Evidence is accumulating to characterise the key differences between systemic sclerosis (SSc) and rheumatoid arthritis (RA), which are similar but distinct systemic autoimmune diseases. However, the differences at the genetic level are not yet clear. Therefore, the aim of the present study was to identify key differential genes between patients with SSc and RA. Methods The Gene Expression Omnibus database was used to identify differentially expressed genes (DEGs) between SSc and RA biopsies. The DEGs were then functionally annotated using Gene Ontology (GO) terms and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways with the Database for Annotation, Visualization and Integrated Discovery (DAVID) tools. A protein–protein interaction (PPI) network was constructed with Cytoscape software. The Molecular Complex Detection (MCODE) plugin was also used to evaluate the biological importance of the constructed gene modules. Results A total of 13,556 DEGs were identified between the five SSc patients and seven RA patients, including 13,465 up-regulated genes and 91 down-regulated genes. Interestingly, the most significantly enriched GO terms of up- and down-regulated genes were related to extracellular involvement and immune activity, respectively, and the top six highly enriched KEGG pathways were related to the same processes. In the PPI network, the top 10 hub nodes and top four modules harboured the most relevant genes contributing to the differences between SSc and RA, including key genes such as IL6, EGF, JUN, FGF2, BMP2, FOS, BMP4, LRRK2, CTNNB1, EP300, CD79, and CXCL13. Conclusions These genes such as IL6, EGF, JUN, FGF2, BMP2, FOS, BMP4, LRRK2, CTNNB1, EP300, CD79, and CXCL13 can serve as new targets for focused research on the distinct molecular pathogenesis of SSc and RA. Furthermore, these genes could serve as potential biomarkers for differential diagnoses or therapeutic targets for treatment. 
Electronic supplementary material The online version of this article (10.1186/s41065-019-0091-y) contains supplementary material, which is available to authorized users.
|
39
|
A Novel Method for Identifying the Potential Cancer Driver Genes Based on Molecular Data Integration. Biochem Genet 2019; 58:16-39. [PMID: 31115714 DOI: 10.1007/s10528-019-09924-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 01/06/2019] [Accepted: 05/02/2019] [Indexed: 12/17/2022]
Abstract
The identification of the cancer driver genes is essential for personalized therapy. The mutation frequency of most driver genes is in the middle (2-20%) or even lower range, which makes it difficult to find the driver genes with low-frequency mutations. Other forms of genomic aberrations, such as copy number variations (CNVs) and epigenetic changes, may also reflect cancer progression. In this work, a method for identifying the potential cancer driver genes (iPDG) based on molecular data integration is proposed. DNA copy number variation, somatic mutation, and gene expression data of matched cancer samples are integrated. In combination with the method of iKEEG, the "key genes" of cancer are identified, and the change in their expression levels is used for auxiliary evaluation of whether the mutated genes are potential drivers. For a mutated gene, the concept of mutational effect is defined, which takes into account the effects of copy number variation, mutation gene itself, and its neighbor genes. The method mainly includes two steps: the first step is data preprocessing. First, DNA copy number variation and somatic mutation data are integrated. Then, the integrated data are mapped to a given interaction network, and the diffusion kernel is used to form the mutation effect matrix. The second step is to obtain the key genes by using the iKGGE method, and construct the connection matrix by means of the gene expression data of the key genes and mutation impact matrix of the mutated genes. Experiments on TCGA breast cancer and Glioblastoma multiforme datasets demonstrate that iPDG is effective not only to identify the known cancer driver genes but also to discover the rare potential driver genes. When measured by functional enrichment analysis, we find that these genes are clearly associated with these two types of cancers.
|
40
|
Statistical characterization and classification of colon microarray gene expression data using multiple machine learning paradigms. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019; 176:173-193. [PMID: 31200905 DOI: 10.1016/j.cmpb.2019.04.008] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 12/09/2018] [Revised: 02/28/2019] [Accepted: 04/08/2019] [Indexed: 02/08/2023]
Abstract
OBJECTIVE Colon microarray data are a repository of thousands of gene expression values of varying strength for each cancer cell. It is necessary to detect which genes are responsible for cancer growth. This study presents an exhaustive comparison of different machine learning (ML) systems that serves two major purposes: (a) identification of high-risk differential genes using statistical tests and (b) development of an ML strategy for predicting cancer genes. METHODS Four statistical tests, namely Wilcoxon sign rank sum (WCSRS), t-test, Kruskal-Wallis (KW), and F-test, were adapted for cancerous gene identification using their p-values. The extracted gene set was used to classify cancer patients using ten classifiers, namely linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), naïve Bayes (NB), Gaussian process classification (GPC), support vector machine (SVM), artificial neural network (ANN), logistic regression (LR), decision tree (DT), Adaboost (AB), and random forest (RF). Performance was then evaluated using cross-validation protocols and standardized metrics, viz. accuracy (ACC) and area under the curve (AUC). RESULTS The colon cancer dataset consists of 2000 genes from 62 patients (40 cancer vs. 22 control). The overall mean ACC of our ML system using all four statistical tests and all ten classifiers was 90.50%. The ML system showed an ACC of 99.81% using a combination of the WCSRS test and an RF-based classifier. This is an improvement of 8% over previously published values in the literature. CONCLUSIONS The RF-based model with statistical tests for detection of high-risk genes showed the best performance for accurate cancer classification in multi-center clinical trials.
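The filter step described in this abstract (score each gene with a statistical test, then keep the strongest) can be sketched as follows. This is a minimal illustration using Welch's t-statistic only; the gene names and expression values are invented, the ranking uses |t| rather than the p-values mentioned in the abstract, and the other three tests and the downstream classifiers are not shown.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

def rank_genes(expression, labels):
    """Return gene names sorted by decreasing |t| between label 1 and label 0."""
    scores = {}
    for gene, values in expression.items():
        cases = [v for v, y in zip(values, labels) if y == 1]
        controls = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(welch_t(cases, controls))
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (invented for illustration): three genes, six samples.
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "gene_a": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],  # clearly separated groups
    "gene_b": [2.0, 3.0, 4.0, 2.0, 3.0, 5.0],  # nearly identical groups
    "gene_c": [4.0, 5.0, 6.0, 2.0, 3.0, 4.0],  # moderately separated groups
}
ranking = rank_genes(expression, labels)
```

The top-ranked genes would then be passed to a classifier; to avoid optimistic accuracy estimates, the ranking is normally recomputed inside each cross-validation fold.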
|
41
|
Active learning using rough fuzzy classifier for cancer prediction from microarray gene expression data. J Biomed Inform 2019; 92:103136. [PMID: 30802546 DOI: 10.1016/j.jbi.2019.103136] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 09/23/2018] [Revised: 12/02/2018] [Accepted: 02/13/2019] [Indexed: 10/27/2022]
Abstract
Cancer classification from microarray gene expression data is an important area of research in computational biology and bioinformatics. Traditional supervised techniques often fail to produce the desired accuracy because the number of clinically labeled patterns is very small. In such situations, active learning can play an important role: it computationally selects only the few most informative (confusing) samples to be labeled by experts and adds them to the training set, which in turn can improve the accuracy of the prediction. In this work, a novel active learning method using a rough-fuzzy classifier (ALRFC) is proposed for cancer sample classification using gene expression data. The proposed technique can handle the uncertainty, overlap, and indiscernibility usually present in the subtype classes of gene expression data. The proposed algorithm is tested on different publicly available benchmark cancer datasets, and its performance is compared with that of three other active learning techniques, one semi-supervised classification algorithm, and two (non-active) supervised counterpart learning techniques in terms of prediction accuracy, precision, recall, F1-measure and kappa. The experimental results establish the superiority of the proposed method for cancer prediction over the other state-of-the-art techniques. The statistical significance of the better results achieved by the proposed method (in comparison to the other methods) is also confirmed by paired t-test results for most of the datasets.
|
42
|
Abstract
The surge of public disease and drug-related data availability has facilitated the application of computational methodologies to transform drug discovery. In the current chapter, we outline and detail the various resources and tools one can leverage in order to perform such analyses. We further describe in depth the in silico workflows of two recent studies that have identified possible novel indications of existing drugs. Lastly, we delve into the caveats and considerations of this process to enable other researchers to perform rigorous computational drug discovery experiments of their own.
|
43
|
Abstract
In gene expression studies, missing values are a common problem with important consequences for the interpretation of the final data (Satija et al., Nat Biotechnol 33(5):495, 2015). Numerous bioinformatics examination tools are used for cancer prediction, including the data set matrix (Bailey et al., Cell 173(2):371-385, 2018); thus, it is necessary to resolve the problem of missing-values imputation. This chapter presents a review of the research on missing-values imputation approaches for gene expression data. By using local and global correlation of the data, we were able to focus mostly on the differences between the algorithms. We classified the algorithms as global, hybrid, local, or knowledge-based techniques. Additionally, this chapter presents suitable assessments of the different approaches. The purpose of this review is to focus on developments in the current techniques for scientists rather than applying different or newly developed algorithms with identical functional goals. The aim was to adapt the algorithms to the characteristics of the data.
|
44
|
Identifying condition specific key genes from basal-like breast cancer gene expression data. Comput Biol Chem 2018; 78:367-374. [PMID: 30655072 DOI: 10.1016/j.compbiolchem.2018.12.022] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 11/09/2018] [Revised: 12/28/2018] [Accepted: 12/28/2018] [Indexed: 11/24/2022]
Abstract
Mining patterns of co-expressed genes across a subset of conditions helps to narrow down the search space for the analysis of gene expression data. Identifying condition-specific key genes from large-scale gene expression data is a challenging task. A condition-specific key gene signifies the functional behavior of a group of co-expressed genes across a subset of conditions and can act as a biomarker of disease. In this paper, we propose a novel approach for identifying condition-specific key genes in Basal-Like Breast Cancer (BLBC) using a biclustering algorithm and a Gene Co-expression Network (GCN). The proposed approach has two stages. In the first stage, significant biclusters are extracted with the help of the 'runibic' biclustering algorithm. The second stage identifies condition-specific key genes from the extracted significant biclusters with the help of the GCN. Using a difference matrix and a gene correlation matrix, we constructed a biologically meaningful and statistically strong GCN. We also present the proposed approach with the help of a process diagram and demonstrate the procedure on bicluster number 93 (Bic93). In the experimental results, 95% and 85% of the extracted biclusters were found to be biologically significant at p-values less than 0.05 and 0.01, respectively. We compared the proposed approach with a Weighted Gene Co-expression Network Analysis (WGCNA) based approach. In this comparison, our approach performed effectively and extracted biologically significant biclusters. It also identified condition-specific key genes that cannot be extracted using the WGCNA based approach. Some of the important known key genes identified are PIK3CA, SHC3, ERBB2, SHC4, PTOV1, STAG1, and ZNF215. These key genes can be used as diagnostic and prognostic biomarkers for BLBC after rigorous analysis.
The identified conditions specific key genes can be helpful to reduce the analysis time and increase the accuracy of further research such as biomarker identification, drug target discovery etc.
|
45
|
Laplacian regularized low-rank representation for cancer samples clustering. Comput Biol Chem 2018; 78:504-509. [PMID: 30528509 DOI: 10.1016/j.compbiolchem.2018.11.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 10/25/2018] [Accepted: 11/07/2018] [Indexed: 12/18/2022]
Abstract
Clustering of cancer samples based on biomolecular data has become an important tool for cancer classification. The recognition of cancer types is of great importance for cancer treatment. In this paper, in order to improve the accuracy of cancer recognition, we propose to use Laplacian regularized Low-Rank Representation (LLRR) to cluster cancer samples based on genomic data. In the LLRR method, the high-dimensional genomic data are approximately treated as samples drawn from a combination of several low-rank subspaces. The purpose of the LLRR method is to seek the lowest-rank representation matrix based on a dictionary. Because a manifold-based Laplacian regularization is introduced into LLRR, it can, unlike the Low-Rank Representation (LRR) method, capture the intrinsic local structure of high-dimensional observation data well in addition to the global geometric structure. Moreover, in LLRR the original data themselves are selected as the dictionary, so the lowest-rank representation is effectively an expression of the similarity between the samples. Therefore, according to the low-rank representation matrix, samples with high similarity are considered to come from the same subspace and are grouped into one class. The experimental results on real genomic data illustrate that the LLRR method, compared with LRR and MLLRR, is more robust to noise, has a better ability to learn the inherent subspace structure of the data, and achieves remarkable performance in the clustering of cancer samples.
|
46
|
A semi-supervised deep learning method based on stacked sparse auto-encoder for cancer prediction using RNA-seq data. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2018; 166:99-105. [PMID: 30415723 DOI: 10.1016/j.cmpb.2018.10.004] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 03/31/2018] [Revised: 08/25/2018] [Accepted: 10/01/2018] [Indexed: 06/09/2023]
Abstract
BACKGROUND AND OBJECTIVE Cancer has become a complex health problem due to its high mortality. Over the past few decades, with the rapid development of the high-throughput sequencing technology and the application of various machine learning methods, remarkable progress in cancer research has been made based on gene expression data. At the same time, a growing amount of high-dimensional data has been generated, such as RNA-seq data, which calls for superior machine learning methods able to deal with mass data effectively in order to make accurate treatment decision. METHODS In this paper, we present a semi-supervised deep learning strategy, the stacked sparse auto-encoder (SSAE) based classification, for cancer prediction using RNA-seq data. The proposed SSAE based method employs the greedy layer-wise pre-training and a sparsity penalty term to help capture and extract important information from the high-dimensional data and then classify the samples. RESULTS We tested the proposed SSAE model on three public RNA-seq data sets of three types of cancers and compared the prediction performance with several commonly-used classification methods. The results indicate that our approach outperforms the other methods for all the three cancer data sets in various metrics. CONCLUSIONS The proposed SSAE based semi-supervised deep learning model shows its promising ability to process high-dimensional gene expression data and is proved to be effective and accurate for cancer prediction.
|
47
|
Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data. Genome Biol 2018; 19:172. [PMID: 30359297 PMCID: PMC6203272 DOI: 10.1186/s13059-018-1536-8] [Citation(s) in RCA: 88] [Impact Index Per Article: 14.7] [Received: 03/06/2018] [Accepted: 09/11/2018] [Indexed: 01/24/2023] Open
Abstract
Identifying co-expressed gene clusters can provide evidence for genetic or physical interactions. Thus, co-expression clustering is a routine step in large-scale analyses of gene expression data. We show that commonly used clustering methods produce results that substantially disagree and that do not match the biological expectations of co-expressed gene clusters. We present clust, a method that solves these problems by extracting clusters matching the biological expectations of co-expressed genes and outperforms widely used methods. Additionally, clust can simultaneously cluster multiple datasets, enabling users to leverage the large quantity of public expression data for novel comparative analysis. Clust is available at https://github.com/BaselAbujamous/clust.
|
48
|
A multi-objective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies. BioData Min 2018; 11:16. [PMID: 30100924 PMCID: PMC6081857 DOI: 10.1186/s13040-018-0178-4] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 05/05/2017] [Accepted: 07/29/2018] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND Biologists aim to understand the genetic background of diseases, metabolic disorders or any other genetic condition. Microarrays are one of the main high-throughput technologies for collecting information about the behaviour of genetic information under different conditions. Clustering is one of the main techniques used to analyse these data; it aims to find groups of genes that share some criterion, such as a similar expression profile. However, the problem of finding groups is normally multidimensional, making it necessary to approach clustering as a multi-objective problem in which various cluster validity indexes are optimised simultaneously. These indexes are usually based on criteria like compactness and separation, which may not be sufficient since they cannot guarantee the generation of clusters that have both similar expression patterns and biological coherence. METHOD We propose a Multi-Objective Clustering algorithm Guided by a-Priori Biological Knowledge (MOC-GaPBK) to find clusters of genes with high levels of co-expression, biological coherence, and also good compactness and separation. Cluster quality indexes are used to simultaneously optimise gene relationships at the expression level and in biological functionality. Our proposal also includes intensification and diversification strategies to improve the search process. RESULTS The effectiveness of the proposed algorithm is demonstrated on four publicly available datasets. Comparative studies of the use of different objective functions and other widely used microarray clustering techniques are reported. Statistical, visual and biological significance tests are carried out to show the superiority of the proposed algorithm.
CONCLUSIONS Integrating a-priori biological knowledge into a multi-objective approach and using intensification and diversification strategies allow the proposed algorithm to find solutions with higher quality than other microarray clustering techniques available in the literature in terms of co-expression, biological coherence, compactness and separation.
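The multi-objective framing in the abstract above means that a candidate clustering is kept only if no other candidate is at least as good on every quality index and strictly better on one. A minimal sketch of that Pareto-dominance filter follows; it is not the MOC-GaPBK algorithm itself, the function names are hypothetical, and the score tuples are illustrative placeholders (higher is better), not results from the paper.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b on every objective
    and strictly better on at least one (all objectives maximised)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(solutions):
    """Return the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


# Hypothetical (co-expression score, biological-coherence score) pairs
# for four candidate clusterings.
candidates = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
front = pareto_front(candidates)
```

Here `(0.4, 0.4)` is dropped because `(0.5, 0.5)` is better on both objectives, while the other three candidates represent different trade-offs and all remain on the front.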
|
49
|
Effects of Fibronectin 1 on Cell Proliferation, Senescence and Apoptosis of Human Glioma Cells Through the PI3K/AKT Signaling Pathway. Cell Physiol Biochem 2018; 48:1382-1396. [PMID: 30048971 DOI: 10.1159/000492096] [Citation(s) in RCA: 71] [Impact Index Per Article: 11.8] [Received: 05/05/2017] [Accepted: 02/02/2018] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND/AIMS The current study aimed to investigate how fibronectin 1 (FN1) influences the cell cycle, senescence and apoptosis in human glioma cells through the PI3K/AKT signaling pathway. METHODS Differentially expressed genes (DEGs) were identified based on gene expression data (GSE12657, GSE15824 and GSE45921 datasets) and probe annotation files from Gene Expression Omnibus. The DEGs were analyzed by gene ontology (GO) enrichment analysis and by Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis. The positive expression of the FN1 protein was detected by immunohistochemistry. The glioma cell lines U251 and T98G were selected and assigned into blank, negative control (NC) and siRNA-FN1 groups. A dual luciferase reporter gene assay was used to investigate the effects of FN1 on transcriptional activity through the PI3K/AKT signaling pathway. An MTT assay was applied for the detection of cell proliferation, while flow cytometry was employed for cell cycle stage and cellular apoptosis detection. β-galactosidase staining was utilized to detect cellular senescence, a scratch test was applied to evaluate cell migration, and a transwell assay was used to analyze cell invasion. Western blotting and qRT-PCR methods were used to detect the protein and mRNA expression levels, respectively, of the FN1 gene and the related genes in the PI3K/AKT pathway (PI3K, AKT and PTEN), the cell cycle (pRb, CDK4 and Cyclin D1) and cell senescence (p16 and p21) among the collected tissues and cells. RESULTS GSE12657 profiling revealed FN1 to be the most upregulated gene in glioma. In the GSE12657 and GSE15824 datasets, FN1 gene expression was higher in glioma tissues than in normal tissues. GO enrichment analysis and KEGG pathway enrichment analysis indicated that FN1 is involved in the synthesis of extracellular matrix (ECM) components and the PI3K/AKT signaling pathway.
The experiments verified the role of the FN1 gene in regulating the PI3K/AKT signaling pathway: silencing the FN1 gene was found to inhibit cell proliferation, promote cell apoptosis and senescence, and reduce migration and invasion through the down-regulation of FN1 gene expression and disruption of the PI3K/AKT signaling pathway. CONCLUSION The findings of this study provide evidence highlighting the prominent role played by FN1 in stimulating glioma growth, invasion, and survival through the activation of the PI3K/AKT signaling pathway.
|
50
|
An Ensemble Framework Coping with Instability in the Gene Selection Process. Interdiscip Sci 2018; 10:12-23. [PMID: 29313209 DOI: 10.1007/s12539-017-0274-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 08/22/2017] [Revised: 11/06/2017] [Accepted: 11/08/2017] [Indexed: 11/29/2022]
Abstract
This paper proposes an ensemble framework for gene selection, which is aimed at addressing the instability problems present in the gene filtering task. The complex process of gene selection from gene expression data is unstable because different filter methods find different informative gene subsets. This makes it difficult for experts to identify significant genes. The instability of results can come from filter methods, gene classifier methods, different datasets of the same disease and multiple valid groups of biomarkers. Even though there is a large number of proposals, the complexity imposed by this problem remains a challenge today. This work proposes a framework involving five stages of gene filtering to discover biomarkers for diagnosis and classification tasks. This framework performs a process of stable feature selection, addressing the problems above and, thus, providing a more suitable and reliable solution for clinical and research purposes. Our proposal involves a process of multistage gene filtering, in which several ensemble strategies for gene selection were added in such a way that different classifiers simultaneously assess gene subsets to counter instability. Firstly, we apply an ensemble of recent gene selection methods to obtain diversity in the genes found (stability according to filter methods). Next, we apply an ensemble of known classifiers to filter genes relevant to all classifiers at a time (stability according to classification methods). The results were evaluated on two different datasets of the same disease (pancreatic ductal adenocarcinoma), in search of stability according to the disease, with promising outcomes.
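The two ideas in the abstract above, combining the gene lists of several selection methods and judging how stable they are, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's five-stage framework: stability is measured here as the mean pairwise Jaccard index, consensus is a simple vote count, and the function names and gene labels (`G1` to `G5`) are hypothetical placeholders.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene lists (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(gene_lists):
    """Average Jaccard index over all pairs of gene lists."""
    pairs = list(combinations(gene_lists, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def consensus_genes(gene_lists, min_votes):
    """Genes selected by at least min_votes of the methods, sorted by name."""
    counts = {}
    for genes in gene_lists:
        for gene in set(genes):
            counts[gene] = counts.get(gene, 0) + 1
    return sorted(g for g, c in counts.items() if c >= min_votes)


# Hypothetical outputs of three different filter methods.
filter_outputs = [
    ["G1", "G2", "G3"],
    ["G2", "G3", "G4"],
    ["G2", "G3", "G5"],
]
stability = mean_pairwise_jaccard(filter_outputs)
shared = consensus_genes(filter_outputs, min_votes=3)
```

Setting `min_votes` to the number of methods keeps only genes that every method selected, which mirrors the abstract's notion of genes "relevant to all classifiers at a time"; a lower threshold trades stability for coverage.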
|