1
Liu F, Yang Y, Xu XS, Yuan M. MESBC: A novel mutually exclusive spectral biclustering method for cancer subtyping. Comput Biol Chem 2024; 109:108009. [PMID: 38219419 DOI: 10.1016/j.compbiolchem.2023.108009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 12/22/2023] [Accepted: 12/24/2023] [Indexed: 01/16/2024]
Abstract
Many soft biclustering algorithms have been developed and applied to various biological and biomedical data analyses. However, few mutually exclusive (hard) biclustering algorithms have been proposed, which could better identify disease or molecular subtypes with survival significance based on genomic or transcriptomic data. In this study, we developed a novel mutually exclusive spectral biclustering (MESBC) algorithm based on spectral method to detect mutually exclusive biclusters. MESBC simultaneously detects relevant features (genes) and corresponding conditions (patients) subgroups and, therefore, automatically uses the signature features for each subtype to perform the clustering. Extensive simulations revealed that MESBC provided superior accuracy in detecting pre-specified biclusters compared with the non-negative matrix factorization (NMF) and Dhillon's algorithm, particularly in very noisy data. Further analysis of the algorithm on real datasets obtained from the TCGA database showed that MESBC provided more accurate (i.e., smaller p-value) overall survival prediction in patients with lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) cancers when compared to the existing, gold-standard subtypes for lung cancers (integrative clustering). Furthermore, MESBC detected several genes with significant prognostic value in both LUAD and LUSC patients. External validation on an independent, unseen GEO dataset of LUAD showed that MESBC-derived clusters based on TCGA data still exhibited clear biclustering patterns and consistent, outstanding prognostic predictability, demonstrating robust generalizability of MESBC. Therefore, MESBC could potentially be used as a risk stratification tool to optimize the treatment for the patient, improve the selection of patients for clinical trials, and contribute to the development of novel therapeutic agents.
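The abstract above uses Dhillon's algorithm as one of its baselines. As a rough illustration of that baseline only (not of MESBC itself), the following is a minimal sketch of Dhillon-style spectral bipartitioning of a non-negative matrix into two co-clusters. The function name `spectral_bipartition`, the toy matrix, and the zero-threshold split (used here in place of the usual k-means step) are assumptions made for this example.

```python
import numpy as np


def spectral_bipartition(A):
    """Split rows and columns of a non-negative matrix into two co-clusters.

    Simplified two-cluster variant: the final split thresholds the scaled
    second singular vectors at zero rather than running k-means.
    """
    A = np.asarray(A, dtype=float)
    d1 = A.sum(axis=1)  # row sums (assumed strictly positive)
    d2 = A.sum(axis=0)  # column sums (assumed strictly positive)
    # Normalised matrix D1^{-1/2} A D2^{-1/2}
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, _, Vt = np.linalg.svd(An, full_matrices=False)
    # The second left and right singular vectors carry the bipartition
    z_rows = U[:, 1] / np.sqrt(d1)
    z_cols = Vt[1, :] / np.sqrt(d2)
    return (z_rows > 0).astype(int), (z_cols > 0).astype(int)


# Toy data (hypothetical): two 3x2 blocks of ones on a 0.1 background
A = np.full((6, 4), 0.1)
A[:3, :2] = 1.0
A[3:, 2:] = 1.0
row_labels, col_labels = spectral_bipartition(A)
print(row_labels, col_labels)
```

Rows and columns that receive the same label form one co-cluster. Which block gets label 0 or 1 depends on the sign convention of the SVD, so only the grouping is meaningful.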
Affiliation(s)
- Fengrong Liu
  - Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
- Yaning Yang
  - Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
- Min Yuan
  - School of Public Health Administration, Anhui Medical University, Hefei 230032, China
2
Xu X, Zhang S, Guo J, Xin T. Biclustering of Log Data: Insights from a Computer-Based Complex Problem Solving Assessment. J Intell 2024; 12:10. [PMID: 38248908 PMCID: PMC10817361 DOI: 10.3390/jintelligence12010010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2023] [Revised: 12/17/2023] [Accepted: 01/12/2024] [Indexed: 01/23/2024] [Open Access]
Abstract
Computer-based assessments provide the opportunity to collect a new source of behavioral data related to the problem-solving process, known as log file data. To understand the behavioral patterns that can be uncovered from these process data, many studies have employed clustering methods. In contrast to one-mode clustering algorithms, this study utilized biclustering methods, enabling simultaneous classification of test takers and features extracted from log files. By applying the biclustering algorithms to the "Ticket" task in the PISA 2012 CPS assessment, we evaluated the potential of biclustering algorithms in identifying and interpreting homogeneous biclusters from the process data. Compared with one-mode clustering algorithms, the biclustering methods could uncover clusters of individuals who are homogeneous on a subset of feature variables, holding promise for gaining fine-grained insights into students' problem-solving behavior patterns. Empirical results revealed that specific subsets of features played a crucial role in identifying biclusters. Additionally, the study explored the utilization of biclustering on both the action sequence data and timing data, and the inclusion of time-based features enhanced the understanding of students' action sequences and scores in the context of the analysis.
Affiliation(s)
- Xin Xu
  - Collaborative Innovation Center of Assessment for Basic Education Quality, Beijing Normal University, Beijing 100875, China
- Susu Zhang
  - Departments of Psychology and Statistics, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA
- Jinxin Guo
  - College of Science, Minzu University of China, Beijing 100081, China
- Tao Xin
  - Collaborative Innovation Center of Assessment for Basic Education Quality, Beijing Normal University, Beijing 100875, China
  - School of Educational Science, Anhui Normal University, Wuhu 241000, China
3
Abstract
BACKGROUND Biclustering is increasingly used in biomedical data analysis, recommendation tasks, and text mining domains, with hundreds of biclustering algorithms proposed. When assessing the performance of these algorithms, more than real datasets are required as they do not offer a solid ground truth. Synthetic data surpass this limitation by producing reference solutions to be compared with the found patterns. However, generating synthetic datasets is challenging since the generated data must ensure reproducibility, pattern representativity, and real data resemblance. RESULTS We propose G-Bic, a dataset generator conceived to produce synthetic benchmarks for the normative assessment of biclustering algorithms. Beyond expanding on aspects of pattern coherence, data quality, and positioning properties, it further handles specificities related to mixed-type datasets and time-series data. G-Bic has the flexibility to replicate real data regularities from diverse domains. We provide the default configurations to generate reproducible benchmarks to evaluate and compare diverse aspects of biclustering algorithms. Additionally, we discuss empirical strategies to simulate the properties of real data. CONCLUSION G-Bic is a parametrizable generator for biclustering analysis, offering a solid means to assess biclustering solutions according to internal and external metrics robustly.
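To make the idea of a synthetic benchmark with a known reference solution concrete, here is a minimal sketch that plants one constant bicluster in uniform noise and returns the ground truth alongside the data. It is not G-Bic's interface: the function `plant_constant_bicluster`, its parameters, and the noise model are all hypothetical choices for this example.

```python
import random


def plant_constant_bicluster(n_rows, n_cols, bic_rows, bic_cols, value, seed=0):
    """Return (matrix, truth): uniform noise in [0, 1) with one planted constant bicluster."""
    rng = random.Random(seed)  # a fixed seed makes the benchmark reproducible
    matrix = [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]
    for i in bic_rows:
        for j in bic_cols:
            matrix[i][j] = value  # overwrite noise with the planted pattern
    truth = {"rows": sorted(bic_rows), "cols": sorted(bic_cols)}
    return matrix, truth


data, truth = plant_constant_bicluster(
    6, 5, bic_rows=[1, 3], bic_cols=[0, 2, 4], value=5.0, seed=42
)
# Generating again with the same seed yields the same benchmark
again, _ = plant_constant_bicluster(
    6, 5, bic_rows=[1, 3], bic_cols=[0, 2, 4], value=5.0, seed=42
)
```

An algorithm's output can then be scored against `truth`, which is what real datasets cannot offer.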
Affiliation(s)
- Eduardo N Castanho
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
- João P Lobo
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
- Rui Henriques
  - INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1900-001, Lisbon, Portugal
- Sara C Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
4
Chu HM, Kong XZ, Liu JX, Zheng CH, Zhang H. A New Binary Biclustering Algorithm Based on Weight Adjacency Difference Matrix for Analyzing Gene Expression Data. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:2802-2809. [PMID: 37285246 DOI: 10.1109/tcbb.2023.3283801] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Biclustering algorithms are essential for processing gene expression data. However, to process the dataset, most biclustering algorithms require preprocessing the data matrix into a binary matrix. Regrettably, this type of preprocessing may introduce noise or cause information loss in the binary matrix, which would reduce the biclustering algorithm's ability to effectively obtain the optimal biclusters. In this paper, we propose a new preprocessing method named Mean-Standard Deviation (MSD) to resolve the problem. Additionally, we introduce a new biclustering algorithm called Weight Adjacency Difference Matrix Binary Biclustering (W-AMBB) to effectively process datasets containing overlapping biclusters. The basic idea is to create a weighted adjacency difference matrix by applying weights to a binary matrix that is derived from the data matrix. This allows us to identify genes with significant associations in sample data by efficiently identifying similar genes that respond to specific conditions. Furthermore, the performance of the W-AMBB algorithm was tested on both synthetic and real datasets and compared with other classical biclustering methods. The experiment results demonstrate that the W-AMBB algorithm is significantly more robust than the compared biclustering methods on the synthetic dataset. Additionally, the results of the GO enrichment analysis show that the W-AMBB method possesses biological significance on real datasets.
5
Cao S, Chang W, Wan C, Lu X, Dang P, Zhou X, Zhu H, Chen J, Li B, Zang Y, Wang Y, Zhang C. Pipeline for Characterizing Alternative Mechanisms (PCAM) based on bi-clustering to study colorectal cancer heterogeneity. Comput Struct Biotechnol J 2023; 21:2160-2171. [PMID: 37013005 PMCID: PMC10066523 DOI: 10.1016/j.csbj.2023.03.028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2022] [Revised: 03/08/2023] [Accepted: 03/16/2023] [Indexed: 03/19/2023] [Open Access]
Abstract
The cells of colorectal cancer (CRC) in their microenvironment experience constant stress, leading to dysregulated activity in the tumor niche. As a result, cancer cells acquire alternative pathways in response to the changing microenvironment, posing significant challenges for the design of effective cancer treatment strategies. While computational studies on high-throughput omics data have advanced our understanding of CRC subtypes, characterizing the heterogeneity of this disease remains remarkably complex. Here, we present a novel computational Pipeline for Characterizing Alternative Mechanisms (PCAM) based on biclustering to gain a more detailed understanding of cancer heterogeneity. Our application of PCAM to large-scale CRC transcriptomics datasets suggests that PCAM can generate a wealth of information leading to new biological understanding and predictive markers of alternative mechanisms. Our key findings include: 1) A comprehensive collection of alternative pathways in CRC, associated with biological and clinical factors. 2) Full annotation of detected alternative mechanisms, including their enrichment in known pathways and associations with various clinical outcomes. 3) A mechanistic relationship between known clinical subtypes and outcomes on a consensus map, visualized by the presence of alternative mechanisms. 4) Several potential novel alternative drug resistance mechanisms for Oxaliplatin, 5-Fluorouracil, and FOLFOX, some of which were validated on independent datasets. We believe that gaining a deeper understanding of alternative mechanisms is a critical step towards characterizing the heterogeneity of CRC. 
The hypotheses generated by PCAM, along with the comprehensive collection of biologically and clinically associated alternative pathways in CRC, could provide valuable insights into the underlying mechanisms driving cancer progression and drug resistance, which could aid in the development of more effective cancer therapies and guide experimental design towards more targeted and personalized treatment strategies. The computational pipeline of PCAM is available in GitHub (https://github.com/changwn/BC-CRC).
6
Yelugam R, Brito da Silva LE, Wunsch II DC. Topological biclustering ARTMAP for identifying within bicluster relationships. Neural Netw 2023; 160:34-49. [PMID: 36621169 DOI: 10.1016/j.neunet.2022.12.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2022] [Revised: 10/31/2022] [Accepted: 12/14/2022] [Indexed: 12/24/2022]
Abstract
Biclustering is a powerful tool for exploratory data analysis in domains such as social networking, data reduction, and differential gene expression studies. Topological learning identifies connected regions that are difficult to find using other traditional clustering methods and produces a graphical representation. Therefore, to improve the quality of biclustering and module extraction, this work combines the adaptive resonance theory (ART)-based methods of biclustering ARTMAP (BARTMAP) and topological ART (TopoART), to produce TopoBARTMAP. The latter inherits the ability to detect topological associations while performing data reduction. The capabilities of TopoBARTMAP were benchmarked using 35 real world cancer datasets and contrasted with other (bi)clustering methods, where it showed a statistically significant improvement over the other assessed methods on ordered and shuffled data experiments. In experiments with 12 synthetic datasets, the method was observed to perform better at identifying constant, scale, shift, and shift scale type biclusters. The produced graphical representation was refined to represent gene bicluster associations and was assessed on the NCBI GSE89116 dataset containing expression levels of 39,326 probes sampled over 38 observations.
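The constant, scale, and shift bicluster types mentioned in the abstract above differ in how a coherence score treats them. As an illustration, here is a minimal sketch of the mean squared residue (the Cheng and Church coherence score), which is zero for constant and shift patterns but not for scale patterns. It is a generic measure, not necessarily the one used to evaluate TopoBARTMAP; the function name and toy matrices are assumptions for this example.

```python
def mean_squared_residue(sub):
    """Mean squared residue of a submatrix given as a list of equal-length rows."""
    n, m = len(sub), len(sub[0])
    row_means = [sum(row) / m for row in sub]
    col_means = [sum(sub[i][j] for i in range(n)) / n for j in range(m)]
    overall = sum(row_means) / n
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n)
        for j in range(m)
    )
    return total / (n * m)


base = [1.0, 2.0, 4.0]
constant = [[3.0] * 3 for _ in range(3)]                   # every cell equal
shift = [[x + s for x in base] for s in (0.0, 1.0, 5.0)]   # rows differ by an offset
scale = [[x * s for x in base] for s in (1.0, 2.0, 3.0)]   # rows differ by a factor

msr_constant = mean_squared_residue(constant)
msr_shift = mean_squared_residue(shift)
msr_scale = mean_squared_residue(scale)
```

Because the residue removes additive row and column effects only, a scale pattern needs a different score (or a log transform of positive data) to register as coherent.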
Affiliation(s)
- Raghu Yelugam
  - Applied Computational Intelligence Laboratory, Missouri University of Science and Technology, Rolla, MO, USA
- Donald C Wunsch II
  - Applied Computational Intelligence Laboratory, Missouri University of Science and Technology, Rolla, MO, USA; National Science Foundation, ECCS Division, USA
7
Karisani N, Platt DE, Basu S, Parida L. Topology and redescriptions detect multiple alternative biological pathways from clinical phenotypes. Exp Biol Med (Maywood) 2022; 247:2015-2024. [PMID: 36398440 PMCID: PMC9679317 DOI: 10.1177/15353702221126671] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022] [Open Access]
Abstract
Biological pathways play a crucial role in the properties of diseases and are important in drug discovery. Identifying the logical relationships among distinctive phenotypic clusters could reveal possible connections to the underlying pathways. However, this process is challenging since clinical phenotypes are often available through unstructured electronic health records. Moreover, in the absence of a standardized questionnaire, there could be bias among physicians toward selecting certain medical terms. In this article, we develop an efficient pipeline to address these challenges and help practitioners to reveal the pathways associated with the disease. We use topological data analysis and redescriptions and propose a pipeline of four phases: (1) pre-processing the clinical notes to extract the salient concepts, (2) constructing a feature space of the patients to characterize the extracted concepts, (3) leveraging the topological properties to distill the available knowledge and visualize the extracted features, and finally, (4) investigating the bias in the clinical notes of the selected features and identify possible pathways. Our experiments on a publicly available dataset of COVID-19 clinical notes testify that our pipeline can indeed extract meaningful pathways.
Affiliation(s)
- Daniel E Platt
  - IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA
- Saugata Basu
  - Purdue University, West Lafayette, IN 47907, USA
- Laxmi Parida
  - IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA
8
Alexandre L, Costa RS, Henriques R. DISA tool: Discriminative and informative subspace assessment with categorical and numerical outcomes. PLoS One 2022; 17:e0276253. [PMID: 36260602 DOI: 10.1371/journal.pone.0276253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2022] [Accepted: 10/03/2022] [Indexed: 11/19/2022] [Open Access]
Abstract
Pattern discovery and subspace clustering play a central role in the biological domain, supporting for instance putative regulatory module discovery from omics data for both descriptive and predictive ends. In the presence of target variables (e.g. phenotypes), regulatory patterns should further satisfy delineate discriminative power properties, well-established in the presence of categorical outcomes, yet largely disregarded for numerical outcomes, such as risk profiles and quantitative phenotypes. DISA (Discriminative and Informative Subspace Assessment), a Python software package, is proposed to evaluate patterns in the presence of numerical outcomes using well-established measures together with a novel principle able to statistically assess the correlation gain of the subspace against the overall space. Results confirm the possibility to soundly extend discriminative criteria towards numerical outcomes without the drawbacks well-associated with discretization procedures. Results from four case studies confirm the validity and relevance of the proposed methods, further unveiling critical directions for research on biotechnology and biomedicine. Availability: DISA is freely available at https://github.com/JupitersMight/DISA under the MIT license.
9
Abstract
BACKGROUND The effectiveness of biclustering, simultaneous clustering of rows and columns in a data matrix, was shown in gene expression data analysis. Several researchers recognize its potential in other research areas. Nevertheless, the last two decades have witnessed the development of a significant number of biclustering algorithms targeting gene expression data analysis and a lack of consistent studies exploring the capacities of biclustering outside this traditional application domain. RESULTS This work evaluates the potential use of biclustering in fMRI time series data, targeting the Region × Time dimensions by comparing seven state-of-the-art biclustering and three traditional clustering algorithms on artificial and real data. It further proposes a methodology for biclustering evaluation beyond gene expression data analysis. The results, which discuss the use of different search strategies on both artificial and real fMRI time series, showed the superiority of exhaustive biclustering approaches, which obtained the most homogeneous biclusters. However, their high computational costs are a challenge, and further work is needed for the efficient use of biclustering in fMRI data analysis. CONCLUSIONS This work pinpoints avenues for the use of biclustering in spatio-temporal data analysis, in particular neuroscience applications. The proposed evaluation methodology showed evidence of the effectiveness of biclustering in finding local patterns in fMRI time series data. Further work is needed regarding scalability to promote the application in real scenarios.
Affiliation(s)
- Helena Aidos
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
- Sara C Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
10
Abstract
In biomedical data analysis, clustering is commonly conducted. Biclustering analysis conducts clustering in both the sample and covariate dimensions and can more comprehensively describe data heterogeneity. In most of the existing biclustering analyses, scalar measurements are considered. In this study, motivated by time-course gene expression data and other examples, we take the "natural next step" and consider the biclustering analysis of functionals under which, for each covariate of each sample, a function (to be exact, its values at discrete measurement points) is present. We develop a doubly penalized fusion approach, which includes a smoothness penalty for estimating functionals and, more importantly, a fusion penalty for clustering. Statistical properties are rigorously established, providing the proposed approach a strong ground. We also develop an effective ADMM algorithm and accompanying R code. Numerical analysis, including simulations, comparisons, and the analysis of two time-course gene expression data, demonstrates the practical effectiveness of the proposed approach.
Affiliation(s)
- Kuangnan Fang
  - Department of Statistics and Data Science, School of Economics, Xiamen University, China
- Yuanxing Chen
  - Department of Statistics and Data Science, School of Economics, Xiamen University, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, United States of America
- Qingzhao Zhang (corresponding author)
  - MOE Key Laboratory of Econometrics, Department of Statistics and Data Science, School of Economics, Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University, China
11
Chang H, Zhang H, Zhang T, Su L, Qin QM, Li G, Li X, Wang L, Zhao T, Zhao E, Zhao H, Liu Y, Stacey G, Xu D. A Multi-Level Iterative Bi-Clustering Method for Discovering miRNA Co-regulation Network of Abiotic Stress Tolerance in Soybeans. Front Plant Sci 2022; 13:860791. [PMID: 35463453 PMCID: PMC9021755 DOI: 10.3389/fpls.2022.860791] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/23/2022] [Accepted: 02/24/2022] [Indexed: 06/14/2023]
Abstract
Although growing evidence shows that microRNA (miRNA) regulates plant growth and development, miRNA regulatory networks in plants are not well understood. Current experimental studies cannot characterize miRNA regulatory networks on a large scale. This information gap provides an excellent opportunity to employ computational methods for global analysis and generate valuable models and hypotheses. To address this opportunity, we collected miRNA-target interactions (MTIs) and used MTIs from Arabidopsis thaliana and Medicago truncatula to predict homologous MTIs in soybeans, resulting in 80,235 soybean MTIs in total. A multi-level iterative bi-clustering method was developed to identify 483 soybean miRNA-target regulatory modules (MTRMs). Furthermore, we collected soybean miRNA expression data and corresponding gene expression data in response to abiotic stresses. By clustering these data, 37 MTRMs related to abiotic stresses were identified, including stress-specific MTRMs and shared MTRMs. These MTRMs have gene ontology (GO) enrichment in resistance response, iron transport, positive growth regulation, etc. Our study predicts soybean MTRMs and miRNA-GO networks under different stresses, and provides miRNA targeting hypotheses for experimental analyses. The method can be applied to other biological processes and other plants to elucidate miRNA co-regulation mechanisms.
Affiliation(s)
- Haowu Chang
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Hao Zhang
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Tianyue Zhang
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Lingtao Su
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, China
- Qing-Ming Qin
  - College of Plant Sciences and Key Laboratory of Zoonosis Research, Ministry of Education, Jilin University, Jilin, China
- Guihua Li
  - College of Plant Sciences and Key Laboratory of Zoonosis Research, Ministry of Education, Jilin University, Jilin, China
- Xueqing Li
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Li Wang
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Tianheng Zhao
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Enshuang Zhao
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Hengyi Zhao
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Yuanning Liu
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Gary Stacey
  - Division of Plant Sciences and Technology, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Dong Xu
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
12
Zeng Z, Jiang X, Pan Z, Zhou R, Lin Z, Tang Y, Cui Y, Zhang E, Cao Z. Highly expressed centromere protein L indicates adverse survival and associates with immune infiltration in hepatocellular carcinoma. Aging (Albany NY) 2021; 13:22802-22829. [PMID: 34607313 PMCID: PMC8544325 DOI: 10.18632/aging.203574] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/10/2021] [Accepted: 09/11/2021] [Indexed: 12/12/2022]
Abstract
BACKGROUND Hepatocellular carcinoma (HCC) is characterized by rapid progression, high recurrence rate and poor prognosis. Early prediction of prognosis and immunotherapy efficacy is of great significance for improving the survival of HCC patients. However, there is still no reliable predictor at present. This study aimed to explore the role of centromere protein L (CENPL) in predicting prognosis and its association with immune infiltration in HCC. METHODS The expression of CENPL was identified by analyzing The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) data. The association between CENPL expression and clinicopathological features was investigated by the Wilcoxon signed-rank test or Kruskal-Wallis test and logistic regression. The role of CENPL in prognosis was examined via the Kaplan-Meier method and log-rank test as well as univariate and multivariate Cox regression analysis. In addition, using the TIMER and GEPIA databases, we investigated the correlation between CENPL level and immunocytes and immunocyte markers, and the prognosis-related methylation sites in CENPL were identified by MethSurv. RESULTS CENPL was highly expressed in HCC samples. Increased CENPL was prominently associated with unfavorable survival, and may be an independent prognostic factor of worse overall survival (OS), disease-specific survival (DSS), disease-free interval (DFI), and progression-free interval (PFI). Additionally, CENPL expression was significantly correlated with immune cell infiltration and some markers. CENPL also contained a methylation site that was notably related to poor prognosis. CONCLUSIONS Elevated CENPL may be a promising prognostic marker and is associated with immune infiltration in HCC.
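The Kaplan-Meier method named in the abstract above estimates a survival curve from follow-up times with censoring. The following is a minimal sketch of the product-limit estimator on made-up data. The function name `kaplan_meier` and the toy follow-up times are assumptions for illustration and have nothing to do with the study's TCGA or GEO cohorts.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed event.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone who leaves the risk set at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk  # product-limit step
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Hypothetical follow-up times with censoring flags
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Comparing two such curves, for example high versus low expression groups, is what the log-rank test then formalizes.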
Affiliation(s)
- Zhili Zeng
  - The First School of Clinical Medicine, Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 510405, PR China
- Xiao Jiang
  - The First School of Clinical Medicine, Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 510405, PR China
- Zhibin Pan
  - Foshan Hospital of Traditional Chinese Medicine, Guangzhou University of Chinese Medicine, Foshan 528000, Guangdong, PR China
- Ruisheng Zhou
  - The First School of Clinical Medicine, Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 510405, PR China
- Zhuangteng Lin
  - Department of Medical Technologic, The First Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 518000, PR China
- Ying Tang
  - Department of Oncology, Lingnan Medical Research Center of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 510405, PR China
  - Department of Oncology, The First Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 518000, PR China
- Ying Cui
  - Department of Psychiatry, The Third Affiliated Hospital of Guangzhou Medical University, Guangdong 510150, PR China
- Enxin Zhang
  - Department of Oncology, The First Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 518000, PR China
  - Department of Oncology, Shenzhen Hospital of Guangzhou University of Chinese Medicine, Shenzhen 518000, Guangdong, PR China
- Zebiao Cao
  - The First School of Clinical Medicine, Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 510405, PR China
13
Maâtouk O, Ayadi W, Bouziri H, Duval B. Evolutionary Local Search Algorithm for the biclustering of gene expression data based on biological knowledge. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107177] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/22/2022]
14
Melton S, Ramanathan S. Discovering a sparse set of pairwise discriminating features in high-dimensional data. Bioinformatics 2021; 37:202-212. [PMID: 32730566 PMCID: PMC8599814 DOI: 10.1093/bioinformatics/btaa690] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/27/2020] [Revised: 06/30/2020] [Accepted: 07/23/2020] [Indexed: 11/14/2022] [Open Access]
Abstract
MOTIVATION Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions. RESULTS We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace. AVAILABILITY AND IMPLEMENTATION https://github.com/smelton/SMD. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Samuel Melton
  - Applied Mathematics, Harvard University, Cambridge, MA 02138, USA
- Sharad Ramanathan
  - Applied Physics, John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA
  - Department of Stem Cell and Regenerative Biology, Cambridge, MA 02138, USA
  - Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA
|
15
|
Gerniers A, Bricard O, Dupont P. MicroCellClust: mining rare and highly specific subpopulations from single-cell expression data. Bioinformatics 2021; 37:3220-3227. [PMID: 33830183 PMCID: PMC8504615 DOI: 10.1093/bioinformatics/btab239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2020] [Revised: 03/10/2021] [Accepted: 04/07/2021] [Indexed: 11/24/2022]
Abstract
Motivation Identifying rare subpopulations of cells is a critical step in extracting knowledge from single-cell expression data, especially when the available data is limited and rare subpopulations contain only a few cells. In this paper, we present a data mining method to identify small subpopulations of cells that present highly specific expression profiles. This objective is formalized as a constrained optimization problem that jointly identifies a small group of cells and a corresponding subset of specific genes. The proposed method extends the max-sum submatrix problem to yield genes that are, for instance, highly expressed inside a small number of cells, but have a low expression in the remaining ones. Results We show through controlled experiments on scRNA-seq data that the MicroCellClust method achieves a high F1 score to identify rare subpopulations of artificially planted human T cells. The effectiveness of MicroCellClust is confirmed as it reveals a subpopulation of CD4 T cells with a specific phenotype from breast cancer samples, and a subpopulation linked to a specific stage in the cell cycle from breast cancer samples as well. Finally, three rare subpopulations in mouse embryonic stem cells are also identified with MicroCellClust. These results illustrate that the proposed method outperforms typical alternatives at identifying small subsets of cells with highly specific expression profiles. Availability and implementation The R and Scala implementation of MicroCellClust is freely available on GitHub, at https://github.com/agerniers/MicroCellClust/. The data underlying this article are available on Zenodo, at https://dx.doi.org/10.5281/zenodo.4580332. Supplementary information Supplementary data are available at Bioinformatics online.
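To make the underlying optimization concrete, here is a minimal sketch of the plain max-sum submatrix problem that the abstract says MicroCellClust extends. It is not the MicroCellClust objective itself (the abstract does not give its constraints or penalty terms), and the function name and toy matrix are hypothetical. The sketch relies on one observation: once a set of columns is fixed, the best rows are exactly those whose sum over these columns is positive, so a brute-force search over column subsets is enough for tiny matrices.

```python
from itertools import combinations


def max_sum_submatrix(matrix):
    """Brute-force max-sum submatrix for a small list-of-lists matrix.

    Returns (score, rows, cols). Exponential in the number of columns,
    so this is only meant for illustration on toy inputs.
    """
    n_cols = len(matrix[0])
    best = (0.0, (), ())
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            # Sum of each row restricted to the chosen columns.
            row_sums = [sum(row[j] for j in cols) for row in matrix]
            # For fixed columns, keeping exactly the positive rows is optimal.
            rows = tuple(i for i, s in enumerate(row_sums) if s > 0)
            score = float(sum(row_sums[i] for i in rows))
            if score > best[0]:
                best = (score, rows, cols)
    return best


# Hypothetical toy expression matrix (rows = genes, columns = cells).
toy = [
    [5, 4, -3],
    [6, 3, -2],
    [-4, -5, 1],
]
score, rows, cols = max_sum_submatrix(toy)
print(score, rows, cols)
```

On this toy input the search selects the first two rows and the first two columns, where all the large positive entries sit.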
Affiliation(s)
- Alexander Gerniers
  - ICTEAM/INGI/Artificial Intelligence and Algorithms group, UCLouvain, Louvain-la-Neuve, 1348, Belgium
- Orian Bricard
  - de Duve Institute, UCLouvain, Brussels, 1200, Belgium
- Pierre Dupont
  - ICTEAM/INGI/Artificial Intelligence and Algorithms group, UCLouvain, Louvain-la-Neuve, 1348, Belgium
|
16
|
Zhang J, Liu L, Xu T, Zhang W, Zhao C, Li S, Li J, Rao N, Le TD. miRSM: an R package to infer and analyse miRNA sponge modules in heterogeneous data. RNA Biol 2021; 18:2308-2320. [PMID: 33822666 DOI: 10.1080/15476286.2021.1905341] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/14/2022]
Abstract
In molecular biology, microRNA (miRNA) sponges are RNA transcripts which compete with other RNA transcripts for binding with miRNAs. Research has shown that miRNA sponges have a fundamental impact on tissue development and disease progression. Generally, to achieve a specific biological function, miRNA sponges tend to form modules or communities in a biological system. Until now, however, there is still a lack of tools to aid researchers to infer and analyse miRNA sponge modules from heterogeneous data. To fill this gap, we develop an R/Bioconductor package, miRSM, for facilitating the procedure of inferring and analysing miRNA sponge modules. miRSM provides a collection of 50 co-expression analysis methods to identify gene co-expression modules (which are candidate miRNA sponge modules), four module discovery methods to infer miRNA sponge modules and seven modular analysis methods for investigating miRNA sponge modules. miRSM will enable researchers to quickly apply new datasets to infer and analyse miRNA sponge modules, and will consequently accelerate the research on miRNA sponges.
Affiliation(s)
- Junpeng Zhang
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
  - School of Engineering, Dali University, Dali, Yunnan, China
- Lin Liu
  - UniSA STEM, University of South Australia, Mawson Lakes, SA, Australia
- Taosheng Xu
  - Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, Anhui, China
- Wu Zhang
  - School of Agriculture and Biological Sciences, Dali University, Dali, Yunnan, China
- Chunwen Zhao
  - School of Engineering, Dali University, Dali, Yunnan, China
- Sijing Li
  - School of Engineering, Dali University, Dali, Yunnan, China
- Jiuyong Li
  - UniSA STEM, University of South Australia, Mawson Lakes, SA, Australia
- Nini Rao
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Thuc Duy Le
  - UniSA STEM, University of South Australia, Mawson Lakes, SA, Australia
|
17
|
Liang L, Zhu K, Tao J, Lu S. ORN: Inferring patient-specific dysregulation status of pathway modules in cancer with OR-gate Network. PLoS Comput Biol 2021; 17:e1008792. [PMID: 33819263 DOI: 10.1371/journal.pcbi.1008792] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/22/2020] [Revised: 04/15/2021] [Accepted: 02/15/2021] [Indexed: 01/26/2023]
Abstract
Pathway-level understanding of cancer plays a key role in precision oncology. However, the current amount of high-throughput data cannot support the elucidation of full pathway topology. In this study, instead of directly learning the pathway network, we adapted the probabilistic OR gate to model the modular structure of pathways and regulon. The resulting model, OR-gate Network (ORN), can simultaneously infer pathway modules of somatic alterations, patient-specific pathway dysregulation status, and downstream regulon. In a trained ORN, the differentially expressed genes (DEGs) in each tumour can be explained by somatic mutations perturbing a pathway module. Furthermore, the ORN handles one of the most important properties of pathway perturbation in tumours, the mutual exclusivity. We have applied the ORN to lower-grade glioma (LGG) samples and liver hepatocellular carcinoma (LIHC) samples in TCGA and breast cancer samples from METABRIC. Both datasets have shown abnormal pathway activities related to immune response and cell cycles. In LGG samples, ORN identified pathway modules closely related to glioma development and revealed two pathways closely related to patient survival. We had similar results with LIHC samples. Additional results from the METABRIC datasets showed that ORN could characterize critical mechanisms of cancer and connect them to less studied somatic mutations (e.g., BAP1, MIR604, MICAL3, and telomere activities), which may generate novel hypotheses for targeted therapy.
Cellular functions are carried out by a set of gene products. Mutation of a single gene is often sufficient to disrupt certain biological functions and promote tumorigenesis. Therefore, genes participating in the same function are less likely to mutate in the same sample. This phenomenon is called “mutual exclusivity”. In this study, our algorithm (ORN) has utilized this property to identify gene-level mutations that affect similar biological functions. It also considers mutations’ impact on mRNA expression. Functional modules identified by ORN tend to be mutually exclusive while causing similar differential expression profiles. When we applied ORN to lower-grade glioma and liver cancer datasets, we identified gene modules significantly related to patient survival. Furthermore, across different types of cancer, ORN has connected well-known cancer driver mutations with genes whose functions remain unclear. These connections, once validated, can generate novel hypotheses for biologists to further investigate cancer mechanisms and develop targeted therapy.
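The abstract does not spell out how ORN parameterizes its probabilistic OR gate, so the following is only a sketch of the textbook noisy-OR gate, shown to illustrate why an OR-type model fits mutual exclusivity. The function name, the `strengths` values, and the `leak` parameter are assumptions made for this example and are not taken from the paper.

```python
def noisy_or(active, strengths, leak=0.0):
    """Textbook noisy-OR: probability that a module is perturbed.

    active    -- 0/1 flags, one per candidate mutation in the module
    strengths -- probability that each mutation alone perturbs the module
    leak      -- probability of perturbation with no mutation present
    """
    p_unperturbed = 1.0 - leak
    for flag, strength in zip(active, strengths):
        if flag:
            # Each active mutation independently fails to perturb
            # the module with probability (1 - strength).
            p_unperturbed *= 1.0 - strength
    return 1.0 - p_unperturbed


# Hypothetical module with three member genes.
strengths = [0.9, 0.9, 0.9]
one_hit = noisy_or([1, 0, 0], strengths)
two_hits = noisy_or([1, 1, 0], strengths)
print(one_hit, two_hits)
```

Under these assumed values a single mutation already gives a perturbation probability of 0.9, and a second mutation in the same module raises it only to about 0.99. A second hit adds little, which is the intuition behind genes of one function rarely being mutated in the same sample.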
|
18
|
Marcos-Zambrano LJ, Karaduzovic-Hadziabdic K, Loncar Turukalo T, Przymus P, Trajkovik V, Aasmets O, Berland M, Gruca A, Hasic J, Hron K, Klammsteiner T, Kolev M, Lahti L, Lopes MB, Moreno V, Naskinova I, Org E, Paciência I, Papoutsoglou G, Shigdel R, Stres B, Vilne B, Yousef M, Zdravevski E, Tsamardinos I, Carrillo de Santa Pau E, Claesson MJ, Moreno-Indias I, Truu J. Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment. Front Microbiol 2021; 12:634511. [PMID: 33737920 PMCID: PMC7962872 DOI: 10.3389/fmicb.2021.634511] [Citation(s) in RCA: 113] [Impact Index Per Article: 37.7] [Received: 11/27/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022]
Abstract
The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. 
The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.
Affiliation(s)
- Laura Judith Marcos-Zambrano
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Piotr Przymus
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Vladimir Trajkovik
  - Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Oliver Aasmets
  - Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
  - Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Magali Berland
  - Université Paris-Saclay, INRAE, MGP, Jouy-en-Josas, France
- Aleksandra Gruca
  - Department of Computer Networks and Systems, Silesian University of Technology, Gliwice, Poland
- Jasminka Hasic
  - University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Karel Hron
  - Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
- Mikhail Kolev
  - South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
- Leo Lahti
  - Department of Computing, University of Turku, Turku, Finland
- Marta B. Lopes
  - NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal
  - Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
- Victor Moreno
  - Oncology Data Analytics Program, Catalan Institute of Oncology (ICO), Barcelona, Spain
  - Colorectal Cancer Group, Institut de Recerca Biomedica de Bellvitge (IDIBELL), Barcelona, Spain
  - Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
  - Department of Clinical Sciences, Faculty of Medicine, University of Barcelona, Barcelona, Spain
- Irina Naskinova
  - South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
- Elin Org
  - Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Inês Paciência
  - EPIUnit – Instituto de Saúde Pública da Universidade do Porto, Porto, Portugal
- Rajesh Shigdel
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Blaz Stres
  - Group for Microbiology and Microbial Biotechnology, Department of Animal Science, University of Ljubljana, Ljubljana, Slovenia
- Baiba Vilne
  - Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
- Malik Yousef
  - Department of Information Systems, Zefat Academic College, Zefat, Israel
  - Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
- Eftim Zdravevski
  - Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Marcus J. Claesson
  - School of Microbiology & APC Microbiome Ireland, University College Cork, Cork, Ireland
- Isabel Moreno-Indias
  - Unidad de Gestión Clínica de Endocrinología y Nutrición, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain
  - Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
- Jaak Truu
  - Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
|
19
|
Liang L, Zhu K, Lu S. BEM: Mining Coregulation Patterns in Transcriptomics via Boolean Matrix Factorization. Bioinformatics 2020; 36:4030-4037. [PMID: 31913438 DOI: 10.1093/bioinformatics/btz977] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/15/2019] [Revised: 11/21/2019] [Accepted: 01/02/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION The matrix factorization is an important way to analyze coregulation patterns in transcriptomic data, which can reveal the tumor signal perturbation status and subtype classification. However, current matrix factorization methods do not provide clear bicluster structure. Furthermore, these algorithms are based on the assumption of linear combination, which may not be sufficient to capture the coregulation patterns. RESULTS We presented a new algorithm for Boolean matrix factorization (BMF) via expectation maximization (BEM). BEM is more aligned with the molecular mechanism of transcriptomic coregulation and can scale to matrix with over 100 million data points. Synthetic experiments showed that BEM outperformed other BMF methods in terms of reconstruction error. Real-world application demonstrated that BEM is applicable to all kinds of transcriptomic data, including bulk RNA-seq, single-cell RNA-seq and spatial transcriptomic datasets. Given appropriate binarization, BEM was able to extract coregulation patterns consistent with disease subtypes, cell types or spatial anatomy. AVAILABILITY AND IMPLEMENTATION Python source code of BEM is available on https://github.com/LifanLiang/EM_BMF. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
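As a reminder of what Boolean matrix factorization reconstructs, and how it differs from a linear combination, the sketch below computes the Boolean product of two binary factor matrices and counts mismatches against a binarized data matrix. It illustrates the general BMF definition only, not the expectation-maximization procedure of BEM; all names and the toy matrices are hypothetical.

```python
def boolean_product(A, B):
    """Boolean matrix product: out[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    inner = len(B)
    n_cols = len(B[0])
    return [
        [int(any(row[k] and B[k][j] for k in range(inner))) for j in range(n_cols)]
        for row in A
    ]


def reconstruction_error(X, A, B):
    """Number of entries where the Boolean product of A and B differs from X."""
    R = boolean_product(A, B)
    return sum(x != r for x_row, r_row in zip(X, R) for x, r in zip(x_row, r_row))


# Hypothetical factors: 3 genes x 2 patterns, and 2 patterns x 3 samples.
A = [[1, 0],
     [1, 1],
     [0, 1]]
B = [[1, 1, 0],
     [0, 1, 1]]
X = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 0]]  # last entry deviates from the product of A and B

print(boolean_product(A, B))
print(reconstruction_error(X, A, B))
```

Note the middle row: a gene that belongs to both patterns gets a 1 in the overlapping sample rather than a 2, which is where the Boolean product departs from an ordinary matrix product.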
Affiliation(s)
- Lifan Liang
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206-3701, USA
- Kunju Zhu
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206-3701, USA
  - Department of Central Lab., Clinical Medicine Research Institute, Jinan University, Guangzhou, Guangdong 51063, China
- Songjian Lu
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206-3701, USA
|
20
|
Xie J, Ma A, Zhang Y, Liu B, Cao S, Wang C, Xu J, Zhang C, Ma Q. QUBIC2: a novel and robust biclustering algorithm for analyses and interpretation of large-scale RNA-Seq data. Bioinformatics 2020; 36:1143-1149. [PMID: 31503285 PMCID: PMC8215922 DOI: 10.1093/bioinformatics/btz692] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 04/11/2019] [Revised: 08/05/2019] [Accepted: 09/05/2019] [Indexed: 01/31/2023]
Abstract
MOTIVATION The biclustering of large-scale gene expression data holds promising potential for detecting condition-specific functional gene modules (i.e. biclusters). However, existing methods do not adequately address a comprehensive detection of all significant bicluster structures and have limited power when applied to expression data generated by RNA-Sequencing (RNA-Seq), especially single-cell RNA-Seq (scRNA-Seq) data, where massive zero and low expression values are observed. RESULTS We present a new biclustering algorithm, QUalitative BIClustering algorithm Version 2 (QUBIC2), which is empowered by: (i) a novel left-truncated mixture of Gaussian model for an accurate assessment of multimodality in zero-enriched expression data, (ii) a fast and efficient dropouts-saving expansion strategy for functional gene modules optimization using information divergency and (iii) a rigorous statistical test for the significance of all the identified biclusters in any organism, including those without substantial functional annotations. QUBIC2 demonstrated considerably improved performance in detecting biclusters compared to five other widely used algorithms on various benchmark datasets from E. coli, human and simulated data. QUBIC2 also showcased robust and superior performance on gene expression data generated by microarray, bulk RNA-Seq and scRNA-Seq. AVAILABILITY AND IMPLEMENTATION The source code of QUBIC2 is freely available at https://github.com/OSU-BMBL/QUBIC2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Juan Xie
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Anjun Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Yu Zhang
  - Colleges of Computer Science and Technology, Jilin University, Changchun 130012, China
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan 250100, China
- Sha Cao
  - Department of Biostatistics, Indiana University, School of Medicine, Indianapolis, IN 46202, USA
- Cankun Wang
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Jennifer Xu
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Chi Zhang
  - Department of Medical & Molecular Genetics, Indiana University, School of Medicine, Indianapolis, IN 46202, USA
- Qin Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
|
21
|
Abstract
BACKGROUND Transcriptome analysis aims at gaining insight into cellular processes through discovering gene expression patterns across various experimental conditions. Biclustering is a standard approach to discover gene subsets with similar expression across subgroups of samples to be identified. The result is a set of biclusters, each forming a specific submatrix of rows (e.g. genes) and columns (e.g. samples). Relevant biclusters can, however, be missed when, due to the presence of a few outliers, they lack the assumed homogeneity of expression values among a few gene/sample combinations. The Max-Sum SubMatrix problem addresses this issue by looking at highly expressed subsets of genes and of samples, without enforcing such homogeneity. RESULTS We present here the K-CPGC algorithm to identify K relevant submatrices. Our main contribution is to show that this approach outperforms biclustering algorithms at identifying several gene subsets representative of specific subgroups of samples. Experiments are conducted on 35 gene expression datasets from human tissues and yeast samples. We report comparative results with those obtained by several biclustering algorithms, including CCA, xMOTIFs, ISA, QUBIC, Plaid and Spectral. Gene enrichment analysis demonstrates the benefits of the proposed approach to identify more statistically significant gene subsets. The most significant Gene Ontology terms identified with K-CPGC are shown to be consistent with the controlled conditions of each dataset. This analysis supports the biological relevance of the identified gene subsets. An additional contribution is the statistical validation protocol proposed here to assess the relative performances of biclustering algorithms and of the proposed method. It relies on a Friedman test and Hochberg's sequential procedure to report critical differences of ranks among all algorithms.
CONCLUSIONS We propose here the K-CPGC method, a computationally efficient algorithm to identify K max-sum submatrices in a large gene expression matrix. Comparisons show that it identifies more significantly enriched subsets of genes and specific subgroups of samples which are easily interpretable by biologists. Experiments also show its ability to identify more reliable GO terms. These results illustrate the benefits of the proposed approach in terms of interpretability and of biological enrichment quality. Open implementation of this algorithm is available as an R package.
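The validation protocol mentioned in the abstract starts from a Friedman test over per-dataset ranks. The sketch below computes the standard Friedman chi-square statistic (without tie correction) from a table of scores, with one row per dataset and one column per algorithm. The scores are invented for illustration, higher is assumed to be better, and the Hochberg step-up procedure that would follow is not shown.

```python
def average_ranks(scores):
    """Rank scores within one dataset (1 = best, i.e. highest score).

    Tied scores share the average of the ranks they span.
    """
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos..end
        for j in order[pos:end + 1]:
            ranks[j] = shared
        pos = end + 1
    return ranks


def friedman_statistic(table):
    """Friedman chi-square statistic (no tie correction) and mean ranks."""
    n = len(table)       # number of datasets (blocks)
    k = len(table[0])    # number of algorithms (treatments)
    mean_ranks = [0.0] * k
    for row in table:
        for j, rank in enumerate(average_ranks(row)):
            mean_ranks[j] += rank / n
    spread = sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    chi2 = 12.0 * n / (k * (k + 1)) * spread
    return chi2, mean_ranks


# Hypothetical enrichment scores: 4 datasets x 3 algorithms.
scores = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.95, 0.70, 0.75],
    [0.90, 0.85, 0.80],
]
chi2, mean_ranks = friedman_statistic(scores)
print(chi2, mean_ranks)
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom. If the null hypothesis of equal ranks is rejected, a post hoc procedure such as Hochberg's is then applied to the pairwise rank differences.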
Affiliation(s)
- Vincent Branders
  - Université catholique de Louvain - ICTEAM/INGI - Machine Learning Group, Place Sainte Barbe 2, Louvain-la-Neuve, 1348 Belgium
- Pierre Schaus
  - Université catholique de Louvain - ICTEAM/INGI - Machine Learning Group, Place Sainte Barbe 2, Louvain-la-Neuve, 1348 Belgium
- Pierre Dupont
  - Université catholique de Louvain - ICTEAM/INGI - Machine Learning Group, Place Sainte Barbe 2, Louvain-la-Neuve, 1348 Belgium
|
22
|
Orzechowski P, Boryczko K, Moore JH. Scalable biclustering - the future of big data exploration? Gigascience 2019; 8:5524762. [PMID: 31251324 PMCID: PMC6598466 DOI: 10.1093/gigascience/giz078] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 04/16/2019] [Revised: 06/07/2019] [Accepted: 06/11/2019] [Indexed: 02/07/2023]
Abstract
Biclustering is a technique of discovering local similarities within data. For many years the complexity of the methods and parallelization issues limited its application to big data problems. With the development of novel scalable methods, biclustering has finally started to close this gap. In this paper we discuss the caveats of biclustering and present its current challenges and guidelines for practitioners. We also try to explain why biclustering may soon become one of the standards for big data analytics.
Affiliation(s)
- Patryk Orzechowski
  - Institute for Biomedical Informatics, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA
  - Department of Automatics and Robotics, AGH University of Science and Technology, al. A. Mickiewicza 30, Kraków 30-059, Poland
- Krzysztof Boryczko
  - Department of Computer Science, AGH University of Science and Technology, al. A. Mickiewicza 30, Kraków 30-059, Poland
- Jason H Moore
  - Institute for Biomedical Informatics, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA
|