1. Jain N, Ghosh S, Ghosh A. A parameter free relative density based biclustering method for identifying non-linear feature relations. Heliyon 2024; 10:e34736. [PMID: 39157398 PMCID: PMC11327522 DOI: 10.1016/j.heliyon.2024.e34736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2023] [Revised: 07/09/2024] [Accepted: 07/16/2024] [Indexed: 08/20/2024] Open
Abstract
Existing biclustering algorithms often depend on assumptions such as monotonicity or linearity of feature relations for finding biclusters. Though a few algorithms overcome this problem using density-based methods, they tend to miss many biclusters because they use global criteria for identifying dense regions. The proposed method, PF-RelDenBi, uses local variations in marginal and joint densities for each pair of features to find the subset of observations forming the basis of the relation between them. It then finds the set of features connected by a common set of observations using a non-linear feature relation index, resulting in a bicluster. This approach allows us to find biclusters based on feature relations, even if the relations are non-linear or non-monotonic. Additionally, the proposed method does not require the user to provide any parameters, allowing its application to datasets from different domains. To study the behaviour of PF-RelDenBi on datasets with different properties, experiments were carried out on sixteen simulated datasets and the performance was compared with eleven state-of-the-art algorithms. The proposed method is seen to produce better results for most of the simulated datasets. Experiments were also conducted with five benchmark datasets, on which biclusters were detected using PF-RelDenBi. For the first two datasets, the detected biclusters were used to generate additional features that improved classification performance. For the other three datasets, the performance of PF-RelDenBi was compared with the eleven state-of-the-art methods in terms of accuracy, NMI and ARI. The proposed method is seen to detect biclusters with greater accuracy. The proposed technique has also been applied to the COVID-19 dataset to identify some demographic features that are likely to affect the spread of COVID-19.
Affiliation(s)
- Namita Jain: Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
- Susmita Ghosh: Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
- Ashish Ghosh: International Institute of Information Technology, Bhubaneswar 751003, India
2. Castanho EN, Aidos H, Madeira SC. Biclustering data analysis: a comprehensive survey. Brief Bioinform 2024; 25:bbae342. [PMID: 39007596 PMCID: PMC11247412 DOI: 10.1093/bib/bbae342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 05/16/2024] [Accepted: 07/01/2024] [Indexed: 07/16/2024] Open
Abstract
Biclustering, the simultaneous clustering of rows and columns of a data matrix, has proved its effectiveness in bioinformatics due to its capacity to produce local instead of global models, evolving from a key technique used in gene expression data analysis into one of the most used approaches for pattern discovery and identification of biological modules, used in both descriptive and predictive learning tasks. This survey presents a comprehensive overview of biclustering. It proposes an updated taxonomy for its fundamental components (bicluster, biclustering solution, biclustering algorithms, and evaluation measures) and applications. We unify scattered concepts in the literature with new definitions to accommodate the diversity of data types (such as tabular, network, and time series data) and the specificities of biological and biomedical data domains. We further propose a pipeline for biclustering data analysis and discuss practical aspects of incorporating biclustering in real-world applications. We highlight prominent application domains, particularly in bioinformatics, and identify typical biclusters to illustrate the analysis output. Moreover, we discuss important aspects to consider when choosing, applying, and evaluating a biclustering algorithm. We also relate biclustering with other data mining tasks (clustering, pattern mining, classification, triclustering, N-way clustering, and graph mining). Thus, it provides theoretical and practical guidance on biclustering data analysis, demonstrating its potential to uncover actionable insights from complex datasets.
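To make the definition in this abstract concrete, the following minimal sketch treats a bicluster as a pair (row subset, column subset) of a data matrix and extracts the corresponding submatrix. The helper name `extract_bicluster`, the toy matrix, and the planted constant pattern are illustrative assumptions and are not taken from the survey.

```python
import numpy as np


def extract_bicluster(data, rows, cols):
    """Return the submatrix of `data` selected by a row subset and a column subset."""
    return data[np.ix_(rows, cols)]


# Toy data matrix: 6 observations (rows) by 5 features (columns).
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 5))

# Hypothetical bicluster: rows 1, 3, 4 share a pattern only on columns 0 and 2.
rows, cols = [1, 3, 4], [0, 2]
data[np.ix_(rows, cols)] = 5.0  # plant a constant local pattern

block = extract_bicluster(data, rows, cols)
```

The pattern is local in the sense the abstract describes: it holds only for the selected rows on the selected columns, so a clustering of whole rows over all columns would not necessarily isolate it.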
Affiliation(s)
- Eduardo N Castanho: LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Helena Aidos: LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Sara C Madeira: LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
3. Chen S, Zhang L, Liu H. Biclustering for Epi-Transcriptomic Co-functional Analysis. Methods Mol Biol 2024; 2822:293-309. [PMID: 38907925 DOI: 10.1007/978-1-0716-3918-4_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/24/2024]
Abstract
Dynamic and reversible N6-methyladenosine (m6A) modifications are associated with many essential cellular functions as well as physiological and pathological phenomena. In-depth study of m6A co-functional patterns in epi-transcriptomic data may help to understand its complex regulatory mechanisms. In this chapter, we describe several biclustering mining algorithms for epi-transcriptomic data to discover potential co-functional patterns. The concepts and computational methods discussed in this chapter will be particularly useful for researchers working in related fields. We also aim to introduce new deep learning techniques into the field of co-functional analysis of epi-transcriptomic data.
Affiliation(s)
- Shutao Chen: School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Lin Zhang: School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Hui Liu: School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
4. Chu HM, Kong XZ, Liu JX, Zheng CH, Zhang H. A New Binary Biclustering Algorithm Based on Weight Adjacency Difference Matrix for Analyzing Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:2802-2809. [PMID: 37285246 DOI: 10.1109/tcbb.2023.3283801] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Biclustering algorithms are essential for processing gene expression data. However, to process the dataset, most biclustering algorithms require preprocessing the data matrix into a binary matrix. Regrettably, this type of preprocessing may introduce noise or cause information loss in the binary matrix, which would reduce the biclustering algorithm's ability to effectively obtain the optimal biclusters. In this paper, we propose a new preprocessing method named Mean-Standard Deviation (MSD) to resolve the problem. Additionally, we introduce a new biclustering algorithm called Weight Adjacency Difference Matrix Binary Biclustering (W-AMBB) to effectively process datasets containing overlapping biclusters. The basic idea is to create a weighted adjacency difference matrix by applying weights to a binary matrix that is derived from the data matrix. This allows us to identify genes with significant associations in sample data by efficiently identifying similar genes that respond to specific conditions. Furthermore, the performance of the W-AMBB algorithm was tested on both synthetic and real datasets and compared with other classical biclustering methods. The experiment results demonstrate that the W-AMBB algorithm is significantly more robust than the compared biclustering methods on the synthetic dataset. Additionally, the results of the GO enrichment analysis show that the W-AMBB method possesses biological significance on real datasets.
5. Zhang F, Zhang Y, Hou T, Ren F, Liu X, Zhao R, Zhang X. Screening of Genes Related to Breast Cancer Prognosis Based on the DO-UniBIC Method. Am J Med Sci 2022; 364:333-342. [DOI: 10.1016/j.amjms.2022.04.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2021] [Revised: 10/04/2021] [Accepted: 04/08/2022] [Indexed: 11/01/2022]
6. Zhang L, Chen S, Ma J, Liu Z, Liu H. REW-ISA V2: A Biclustering Method Fusing Homologous Information for Analyzing and Mining Epi-Transcriptome Data. Front Genet 2021; 12:654820. [PMID: 34122508 PMCID: PMC8194299 DOI: 10.3389/fgene.2021.654820] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/17/2021] [Accepted: 04/28/2021] [Indexed: 01/08/2023] Open
Abstract
Background: Previous studies have shown that N6-methyladenosine (m6A) is related to many life processes and physiological and pathological phenomena. However, the specific regulatory mechanism of m6A sites at the systematic level is not clear. Therefore, mining the RNA co-methylation patterns in the epi-transcriptome data is expected to explain the specific regulation mechanism of m6A. Methods: Considering that the epi-transcriptome data contains homologous information (the genes corresponding to the m6A sites and the cell lines corresponding to the experimental conditions), rational use of this information will help reveal the regulatory mechanism of m6A. Therefore, based on the RNA expression weighted iterative signature algorithm (REW-ISA), we have fused homologous information and developed the REW-ISA V2 algorithm. Results: REW-ISA V2 was applied to the MERIP-seq data to find potential local function blocks (LFBs), where sites are hyper-methylated simultaneously across the specific conditions. REW-ISA V2 obtained fifteen LFBs. Compared with the most advanced biclustering algorithm, the LFBs obtained by REW-ISA V2 have greater biological significance. Further biological analysis showed that these LFBs were highly correlated with some signal pathways and m6A methyltransferase. Conclusion: REW-ISA V2 fuses homologous information to mine co-methylation patterns in the epi-transcriptome data, in which sites are co-methylated under specific conditions.
Affiliation(s)
- Lin Zhang: Engineering Research Center of Intelligent Control for Underground Space, China University of Mining and Technology, Ministry of Education, Xuzhou, China; School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Shutao Chen: Engineering Research Center of Intelligent Control for Underground Space, China University of Mining and Technology, Ministry of Education, Xuzhou, China; School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Jiani Ma: Engineering Research Center of Intelligent Control for Underground Space, China University of Mining and Technology, Ministry of Education, Xuzhou, China; School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Zhaoyang Liu: Engineering Research Center of Intelligent Control for Underground Space, China University of Mining and Technology, Ministry of Education, Xuzhou, China; School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Hui Liu: Engineering Research Center of Intelligent Control for Underground Space, China University of Mining and Technology, Ministry of Education, Xuzhou, China; School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
7. Oh M, Kim K, Sun H. Covariance thresholding to detect differentially co-expressed genes from microarray gene expression data. J Bioinform Comput Biol 2021; 18:2050002. [PMID: 32336254 DOI: 10.1142/s021972002050002x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Gene set analysis aims to identify differentially expressed or co-expressed genes within a biological pathway between two experimental conditions, so that it can eventually reveal biological processes and pathways involved in disease development. In the last few decades, various statistical and computational methods have been proposed to improve the statistical power of gene set analysis. In recent years, much attention has been paid to differentially co-expressed genes, since they can be disease-related genes even without a significant difference in average expression levels between two conditions. In this paper, we propose a new statistical method to identify differentially co-expressed genes from microarray gene expression data. The proposed method first estimates co-expression levels of paired genes using covariance regularization by thresholding, and then evaluates the significance of the difference in covariance estimates between two conditions. Through extensive simulation studies, we demonstrated that the proposed method is more powerful than the existing mainstream methods for detecting co-expressed genes. We also applied it to various microarray gene expression datasets related to mutant p53 transcriptional activity, and to epithelium and stroma in breast cancer.
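The two-step idea in this abstract (regularize each condition's covariance by thresholding, then test the difference between conditions) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the thresholding rule, the test statistic, or the testing procedure, so the hard-thresholding rule, the sum of squared off-diagonal differences, the permutation test, and all function names below are hypothetical choices, not the authors' method.

```python
import numpy as np


def thresholded_cov(x, tau):
    """Sample covariance with off-diagonal entries smaller than tau in magnitude set to zero.

    Hard thresholding is an assumed rule; the diagonal (variances) is always kept.
    """
    s = np.cov(x, rowvar=False)  # rows are samples, columns are genes
    keep = np.abs(s) >= tau
    np.fill_diagonal(keep, True)
    return np.where(keep, s, 0.0)


def diff_statistic(x1, x2, tau):
    """Assumed statistic: sum of squared off-diagonal differences of the two estimates."""
    d = thresholded_cov(x1, tau) - thresholded_cov(x2, tau)
    upper = np.triu_indices_from(d, k=1)
    return float(np.sum(d[upper] ** 2))


def permutation_pvalue(x1, x2, tau, n_perm=200, seed=0):
    """Assumed significance test: permute condition labels and recompute the statistic."""
    rng = np.random.default_rng(seed)
    observed = diff_statistic(x1, x2, tau)
    pooled = np.vstack([x1, x2])
    n1 = x1.shape[0]
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])
        stat = diff_statistic(pooled[idx[:n1]], pooled[idx[n1:]], tau)
        count += int(stat >= observed)
    return (count + 1) / (n_perm + 1)
```

In this sketch `x1` and `x2` would be samples-by-genes expression matrices for one gene set under the two conditions, and `tau` is a tuning parameter that would need to be chosen, for example by cross-validation.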
Affiliation(s)
- Mingyu Oh: Department of Statistics, Pusan National University, Busan, 46241, Korea
- Kipoong Kim: Department of Statistics, Pusan National University, Busan, 46241, Korea
- Hokeun Sun: Department of Statistics, Pusan National University, Busan, 46241, Korea
8. Orzechowski P, Moore JH. EBIC: an open source software for high-dimensional and big data analyses. Bioinformatics 2020; 35:3181-3183. [PMID: 30649199 DOI: 10.1093/bioinformatics/btz027] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 08/14/2018] [Revised: 11/17/2018] [Accepted: 01/12/2019] [Indexed: 01/07/2023] Open
Abstract
MOTIVATION: In this paper, we present an open source package with the latest release of Evolutionary-based BIClustering (EBIC), a next-generation biclustering algorithm for mining genetic data. The major contribution of this paper is adding full support for multiple graphics processing units (GPUs), which makes it possible to run large genomic data mining analyses efficiently. Multiple enhancements to the first release of the algorithm include integration with R and Bioconductor, and an option to exclude missing values from the analysis. RESULTS: Evolutionary-based BIClustering was applied to datasets of different sizes, including a large DNA methylation dataset with 436 444 rows. For the largest dataset we observed over 6.6-fold speedup in computation time on a cluster of eight GPUs compared to running the method on a single GPU. This proves high scalability of the method. AVAILABILITY AND IMPLEMENTATION: The latest version of EBIC can be downloaded from http://github.com/EpistasisLab/ebic. Installation and usage instructions are also available online. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
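As a quick reading aid for the speedup figure quoted in this abstract, the arithmetic below converts a 6.6-fold speedup on eight GPUs into parallel efficiency. The second function additionally assumes Amdahl's law, a modelling assumption that the abstract does not make, to estimate the serial fraction such a speedup would imply. Function names are illustrative.

```python
def parallel_efficiency(speedup, n_units):
    """Fraction of ideal linear speedup achieved on n_units processors."""
    return speedup / n_units


def amdahl_serial_fraction(speedup, n_units):
    """Serial fraction f implied by Amdahl's law, S = 1 / (f + (1 - f) / n).

    Solving for f gives f = (n / S - 1) / (n - 1). Assumes the Amdahl model holds.
    """
    return (n_units / speedup - 1.0) / (n_units - 1.0)


# Figures from the abstract: 6.6-fold speedup on eight GPUs versus one GPU.
eff = parallel_efficiency(6.6, 8)        # 0.825, i.e. 82.5% of linear scaling
serial = amdahl_serial_fraction(6.6, 8)  # about 0.03 under the Amdahl assumption
```

Because the abstract reports "over" 6.6-fold, these values are a lower bound on the efficiency and an upper bound on the implied serial fraction.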
Affiliation(s)
- Patryk Orzechowski: Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA; Department of Automatics and Robotics, AGH University of Science and Technology, Krakow, Poland
- Jason H Moore: Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA
9. Maind A, Raut S. Mining conditions specific hub genes from RNA-Seq gene-expression data via biclustering and their application to drug discovery. IET Syst Biol 2020; 13:194-203. [PMID: 31318337 PMCID: PMC8687431 DOI: 10.1049/iet-syb.2018.5058] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 12/18/2022] Open
Abstract
Gene‐expression data is widely used in clinical research. It represents the expression levels of thousands of genes across various experimental conditions simultaneously. Mining conditions specific hub genes from gene‐expression data is a challenging task. Conditions specific hub genes signify the functional behaviour of a bicluster across the subset of conditions and can act as prognostic or diagnostic markers of diseases. In this study, the authors have introduced a new approach for identifying conditions specific hub genes from RNA‐Seq data using a biclustering algorithm. The proposed approach combines the efficient ‘runibic’ biclustering algorithm with the concepts of the gene co‐expression network and the protein–protein interaction network to obtain better performance. The results show that the proposed approach extracts biologically significant conditions specific hub genes which play an important role in various biological processes and pathways. These conditions specific hub genes can be used as prognostic or diagnostic biomarkers. Conditions specific hub genes will be helpful to reduce the analysis time and increase the accuracy of further research. The authors also summarise the application of the proposed approach to the drug discovery process.
Affiliation(s)
- Ankush Maind: Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, Maharashtra, India
- Shital Raut: Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, Maharashtra, India
10. Ge SX, Son EW, Yao R. iDEP: an integrated web application for differential expression and pathway analysis of RNA-Seq data. BMC Bioinformatics 2018; 19:534. [PMID: 30567491 PMCID: PMC6299935 DOI: 10.1186/s12859-018-2486-6] [Citation(s) in RCA: 955] [Impact Index Per Article: 136.4] [Received: 05/17/2018] [Accepted: 11/12/2018] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND: RNA-seq is widely used for transcriptomic profiling, but the bioinformatics analysis of resultant data can be time-consuming and challenging, especially for biologists. We aim to streamline the bioinformatic analyses of gene-level data by developing a user-friendly, interactive web application for exploratory data analysis, differential expression, and pathway analysis. RESULTS: iDEP (integrated Differential Expression and Pathway analysis) seamlessly connects 63 R/Bioconductor packages, 2 web services, and comprehensive annotation and pathway databases for 220 plant and animal species. The workflow can be reproduced by downloading customized R code and related pathway files. As an example, we analyzed an RNA-Seq dataset of lung fibroblasts with Hoxa1 knockdown and revealed the possible roles of SP1 and E2F1 and their target genes, including microRNAs, in blocking G1/S transition. In another example, our analysis shows that in mouse B cells without functional p53, ionizing radiation activates the MYC pathway and its downstream genes involved in cell proliferation, ribosome biogenesis, and non-coding RNA metabolism. In wildtype B cells, radiation induces p53-mediated apoptosis and DNA repair while suppressing the target genes of MYC and E2F1, and leads to growth and cell cycle arrest. iDEP helps unveil the multifaceted functions of p53 and the possible involvement of several microRNAs such as miR-92a, miR-504, and miR-30a. In both examples, we validated known molecular pathways and generated novel, testable hypotheses. CONCLUSIONS: Combining comprehensive analytic functionalities with massive annotation databases, iDEP (http://ge-lab.org/idep/) enables biologists to easily translate transcriptomic and proteomic data into actionable insights.
Affiliation(s)
- Steven Xijin Ge: Department of Mathematics and Statistics, South Dakota State University, Box 2225, Brookings, SD 57007 USA
- Eun Wo Son: Department of Mathematics and Statistics, South Dakota State University, Box 2225, Brookings, SD 57007 USA
- Runan Yao: Department of Mathematics and Statistics, South Dakota State University, Box 2225, Brookings, SD 57007 USA