1
Baruah B, Dutta MP, Banerjee S, Bhattacharyya DK. EnsemBic: An effective ensemble of biclustering to identify potential biomarkers of esophageal squamous cell carcinoma. Comput Biol Chem 2024; 110:108090. [PMID: 38759483 DOI: 10.1016/j.compbiolchem.2024.108090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Revised: 03/28/2024] [Accepted: 04/29/2024] [Indexed: 05/19/2024]
Abstract
The development of functionally enriched and biologically competent biclustering algorithms is essential for extracting hidden information from massive biological datasets. This paper presents a novel biclustering ensemble called EnsemBic based on p-value, which calculates the functional similarity of genetic associations. To validate the effectiveness and robustness of EnsemBic, we use three well-known biclustering techniques, viz. Laplace Prior, iBBiG, and xMotif, to implement EnsemBic and compare them using different leading parameters. We observe that EnsemBic outperforms its competing algorithms in several prominent functional and biological measures. Next, the biclusters obtained from EnsemBic are used to identify potential biomarkers of Esophageal Squamous Cell Carcinoma (ESCC) by exploring topological and biological relevance with reference to the elite genes obtained from GeneCards. Finally, we discover that the genes F2RL3, APPL1, CALM1, IFNGR1, LPAR1, ANGPT2, ARPC2, CGN, CLDN7, ATP6V1C2, CEACAM1, FTL, PLAU, PSMB4, and EPHB2 carry both the topological and biological significance of previously established ESCC elite genes. Therefore, we declare the aforementioned genes as potential biomarkers of ESCC.
Affiliation(s)
- Bikash Baruah
- Dept. of Computer Science and Engineering, NIT Arunachal Pradesh, India
- Manash P Dutta
- Dept. of Computer Science & Information Technology, Cotton University, Guwahati, Assam, India.
- Dhruba K Bhattacharyya
- Dept. of Computer Science and Engineering, Tezpur University, School of Engineering, Tezpur, India
2
Jia X, Yin Z, Peng Y. Gene differential co-expression analysis of male infertility patients based on statistical and machine learning methods. Front Microbiol 2023; 14:1092143. [PMID: 36778885 PMCID: PMC9911419 DOI: 10.3389/fmicb.2023.1092143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 01/11/2023] [Indexed: 01/28/2023] Open
Abstract
Male infertility has always been one of the important factors affecting the infertility of couples of gestational age. The reasons that affect male infertility include living habits, hereditary factors, etc. Identifying the genetic causes of male infertility can help us understand the biology of male infertility, as well as the diagnosis of genetic testing and the determination of clinical treatment options. While current research has made significant progress in the genes that cause sperm defects in men, genetic studies of sperm content defects are still lacking. This article is based on a dataset of gene expression data on the X chromosome in patients with azoospermia, mild and severe oligospermia. Due to the difference in the degree of disease between patients and the possible difference in genetic causes, common classical clustering methods such as k-means, hierarchical clustering, etc. cannot effectively identify samples (realize simultaneous clustering of samples and features). In this paper, we use machine learning and various statistical methods, such as the hypergeometric distribution, Gibbs sampling, and the Fisher test, together with the gene interaction network, for cluster analysis of gene expression data of male infertility patients; this approach has certain advantages compared with existing methods. The cluster results were identified by differential co-expression analysis of gene expression data in male infertility patients, and the model recognition clusters were analyzed by multiple gene enrichment methods, showing different degrees of enrichment in various enzyme activities, cancer, virus-related, ATP and ADP production, and other pathways. At the same time, as this paper is an unsupervised analysis of genetic factors of male infertility patients, we constructed a simulated data set, in which the clustering results have been determined, which can be used to measure the effect of discriminant model recognition. Through comparison, we find that the proposed model has a better identification effect.
3
Rodrigues P, Costa RS, Henriques R. Enrichment analysis on regulatory subspaces: A novel direction for the superior description of cellular responses to SARS-CoV-2. Comput Biol Med 2022; 146:105443. [PMID: 35533463 PMCID: PMC9040465 DOI: 10.1016/j.compbiomed.2022.105443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2021] [Revised: 03/13/2022] [Accepted: 03/20/2022] [Indexed: 12/16/2022]
Abstract
STATEMENT Enrichment analysis of cell transcriptional responses to SARS-CoV-2 infection from biclustering solutions yields broader coverage and superior enrichment of GO terms and KEGG pathways against alternative state-of-the-art machine learning solutions, thus aiding knowledge extraction. MOTIVATION AND METHODS The comprehensive understanding of the impacts of SARS-CoV-2 virus on infected cells is still incomplete. This work aims at comparing the role of state-of-the-art machine learning approaches in the study of cell regulatory processes affected and induced by the SARS-CoV-2 virus using transcriptomic data from both infectable cell lines available in public databases and in vivo samples. In particular, we assess the relevance of clustering, biclustering and predictive modeling methods for functional enrichment. Statistical principles to handle scarcity of observations, high data dimensionality, and complex gene interactions are further discussed. In particular, and without loss of generalization ability, the proposed methods are applied to study the differential regulatory response of lung cell lines to SARS-CoV-2 (α-variant) against RSV, IAV (H1N1), and HPIV3 viruses. RESULTS Gathered results show that, although clustering and predictive algorithms aid classic stances to functional enrichment analysis, more recent pattern-based biclustering algorithms significantly improve the number and quality of enriched GO terms and KEGG pathways with controlled false positive risks. Additionally, a comparative analysis of these results is performed to identify potential pathophysiological characteristics of COVID-19. These are further compared to those identified by other authors for the same virus as well as related ones such as SARS-CoV-1. The findings are particularly relevant given the lack of other works utilizing more complex machine learning algorithms within this context.
Affiliation(s)
- Pedro Rodrigues
- IDMEC, Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal; INESC-ID and Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal
- Rafael S Costa
- IDMEC, Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal; LAQV-REQUIMTE, DQ, NOVA School of Science and Technology, Caparica, Portugal
- Rui Henriques
- INESC-ID and Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal.
4
Mandal K, Sarmah R, Bhattacharyya DK. POPBic: Pathway-Based Order Preserving Biclustering Algorithm Towards the Analysis of Gene Expression Data. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:2659-2670. [PMID: 32175872 DOI: 10.1109/tcbb.2020.2980816] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
To understand the underlying biological mechanisms of gene expression data, it is important to discover the groups of genes that have similar expression patterns under certain subsets of conditions. Biclustering algorithms have been effective in analyzing large-scale gene expression data. Recently, traditional biclustering has been improved by introducing biological knowledge along with the expression data during the biclustering process. In this paper, we propose the Pathway-based Order Preserving Biclustering (POPBic) algorithm by incorporating Kyoto Encyclopedia of Genes and Genomes (KEGG), based on the hypothesis that two genes sharing similar pathways are likely to be similar. The basic principle of the POPBic approach is to apply the concept of Longest Common Subsequence between a pair of genes which have a high number of common pathways. The algorithm identifies the expression patterns from data using two major steps: (i) selection of significant seed genes and (ii) extraction of biclusters. We perform exhaustive experimentation with the POPBic algorithm using a synthetic dataset to evaluate the bicluster model, finding its robustness in the presence of noise and identifying overlapping biclusters. We demonstrate that POPBic is able to discover biologically significant biclusters for four cancer microarray gene expression datasets. POPBic has been found to perform consistently well in comparison to its closest competitors.
5
Alexandre L, Costa RS, Santos LL, Henriques R. Mining Pre-Surgical Patterns Able to Discriminate Post-Surgical Outcomes in the Oncological Domain. IEEE J Biomed Health Inform 2021; 25:2421-2434. [PMID: 33687853 DOI: 10.1109/jbhi.2021.3064786] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/08/2022]
Abstract
Understanding the individualized risks of undertaking surgical procedures is essential to personalize preparatory, intervention and post-care protocols for minimizing post-surgical complications. This knowledge is key in oncology given the nature of interventions, the fragile profile of patients with comorbidities and cytotoxic drug exposure, and the possible cancer recurrence. Despite its relevance, the discovery of discriminative patterns of post-surgical risk is hampered by major challenges: i) the unique physiological and demographic profile of individuals, as well as their differentiated post-surgical care; ii) the high-dimensionality and heterogeneous nature of available biomedical data, combining non-identically distributed risk factors, clinical and molecular variables; iii) the need to generalize, given that tumors have significant histopathological differences and individuals undergo unique surgical procedures; iv) the need to focus on non-trivial patterns of post-surgical risk, while guaranteeing their statistical significance and discriminative power; and v) the lack of interpretability and actionability of current approaches. Biclustering, the discovery of groups of individuals correlated on subsets of variables, has unique properties of interest, being positioned to satisfy the aforementioned challenges. In this context, this work proposes a structured view on why, when and how to apply biclustering to mine discriminative patterns of post-surgical risk with guarantees of usability, a subject that remains unexplored to date. These patterns offer a comprehensive view on how the patient profile, cancer histopathology and entailed surgical procedures determine: i) post-surgical complications, ii) survival, and iii) hospitalization needs. The gathered results confirm the role of biclustering in comprehensively finding interpretable, actionable and statistically significant patterns of post-surgical risk. The found patterns are already assisting healthcare professionals at IPO-Porto to establish specialized pre-habilitation protocols and bedside care.
6
Maâtouk O, Ayadi W, Bouziri H, Duval B. Evolutionary Local Search Algorithm for the biclustering of gene expression data based on biological knowledge. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107177] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/22/2022]
7
Nam JH, Couch D, da Silveira WA, Yu Z, Chung D. PALMER: improving pathway annotation based on the biomedical literature mining with a constrained latent block model. BMC Bioinformatics 2020; 21:432. [PMID: 33008309 PMCID: PMC7532116 DOI: 10.1186/s12859-020-03756-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/16/2019] [Accepted: 09/16/2020] [Indexed: 11/23/2022] Open
Abstract
Background In systems biology, it is of great interest to identify previously unreported associations between genes. Recently, biomedical literature has been considered as a valuable resource for this purpose. While classical clustering algorithms have popularly been used to investigate associations among genes, they are not tuned for the literature mining data and are also based on strong assumptions, which are often violated in this type of data. For example, these approaches often assume homogeneity and independence among observations. However, these assumptions are often violated due to both redundancies in functional descriptions and biological functions shared among genes. Latent block models can be alternatives in this case but they also often show suboptimal performance, especially when signals are weak. In addition, they do not allow the use of valuable prior biological knowledge, such as that available in existing databases. Results In order to address these limitations, here we propose PALMER, a constrained latent block model that allows the identification of indirect relationships among genes based on the biomedical literature mining data. By automatically associating relevant Gene Ontology terms, PALMER facilitates biological interpretation of novel findings without laborious downstream analyses. PALMER also allows researchers to utilize prior biological knowledge about known gene-pathway relationships to guide identification of gene–gene associations. We evaluated PALMER with simulation studies and applications to studies of pathway-modulating genes relevant to cancer signaling pathways, while utilizing biological pathway annotations available in the KEGG database as prior knowledge. Conclusions We showed that PALMER outperforms traditional latent block models and it provides reliable identification of novel gene–gene associations by utilizing prior biological knowledge, especially when signals are weak in the biomedical literature mining dataset. We believe that PALMER and its relevant user-friendly software will be powerful tools that can be used to improve existing pathway annotations and identify novel pathway-modulating genes.
Affiliation(s)
- Jin Hyun Nam
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, USA; School of Pharmacy, Sungkyunkwan University, Suwon, Republic of Korea
- Daniel Couch
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, USA
- Zhenning Yu
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, USA
- Dongjun Chung
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA.
8
Xie J, Ma A, Fennell A, Ma Q, Zhao J. It is time to apply biclustering: a comprehensive review of biclustering applications in biological and biomedical data. Brief Bioinform 2020; 20:1449-1464. [PMID: 29490019 DOI: 10.1093/bib/bby014] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 11/15/2017] [Revised: 01/16/2018] [Indexed: 12/12/2022] Open
Abstract
Biclustering is a powerful data mining technique that allows clustering of rows and columns, simultaneously, in a matrix-format data set. It was first applied to gene expression data in 2000, aiming to identify co-expressed genes under a subset of all the conditions/samples. During the past 17 years, tens of biclustering algorithms and tools have been developed to enhance the ability to make sense out of large data sets generated in the wake of high-throughput omics technologies. These algorithms and tools have been applied to a wide variety of data types, including but not limited to, genomes, transcriptomes, exomes, epigenomes, phenomes and pharmacogenomes. However, there is still a considerable gap between biclustering methodology development and comprehensive data interpretation, mainly because of the lack of knowledge for the selection of appropriate biclustering tools and further supporting computational techniques in specific studies. Here, we first deliver a brief introduction to the existing biclustering algorithms and tools in public domain, and then systematically summarize the basic applications of biclustering for biological data and more advanced applications of biclustering for biomedical data. This review will assist researchers to effectively analyze their big data and generate valuable biological knowledge and novel insights with higher efficiency.
9
Jose JM, Yilmaz E, Magalhães J, Castells P, Ferro N, Silva MJ, Martins F. Moving from Formal Towards Coherent Concept Analysis: Why, When and How. Lecture Notes in Computer Science 2020. [PMCID: PMC7148255 DOI: 10.1007/978-3-030-45439-5_19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Formal concept analysis has been largely applied to explore taxonomic relationships and derive ontologies from text collections. Despite its recognized relevance, it generally misses relevant concept associations and suffers from the need to learn from Boolean space models. Biclustering, the discovery of coherent concept associations (subsets of documents correlated on subsets of terms and topics), is here suggested to address the aforementioned problems. This work proposes a structured view on why, when and how to apply biclustering for concept analysis, a subject remaining largely unexplored up to date. Gathered results from a large text collection confirm the relevance of biclustering to find less-trivial, yet actionable and statistically significant concept associations.
10
Nepomuceno JA, Troncoso A, Nepomuceno-Chamorro IA, Aguilar-Ruiz JS. Pairwise gene GO-based measures for biclustering of high-dimensional expression data. BioData Min 2018; 11:4. [PMID: 29610579 PMCID: PMC5872503 DOI: 10.1186/s13040-018-0165-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 10/16/2017] [Accepted: 03/01/2018] [Indexed: 11/15/2022] Open
Abstract
Background Biclustering algorithms search for groups of genes that share the same behavior under a subset of samples in gene expression data. Nowadays, the biological knowledge available in public repositories can be used to drive these algorithms to find biclusters composed of functionally coherent groups of genes. On the other hand, a distance among genes can be defined according to their information stored in Gene Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each pair of genes which establishes their functional similarity. A scatter search-based algorithm that optimizes a merit function that integrates GO information is studied in this paper. This merit function uses a term that addresses the information through a GO measure. Results The effect of two possible different gene pairwise GO measures on the performance of the algorithm is analyzed. Firstly, three well-known yeast datasets with approximately one thousand genes are studied. Secondly, a group of human datasets related to clinical data of cancer is also explored by the algorithm. Most of these data are high-dimensional datasets composed of a huge number of genes. The resultant biclusters reveal groups of genes linked by the same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters shows their relevance from a cancer disease perspective. Conclusions It can be concluded that the integration of biological information improves the performance of the biclustering process. The two different GO measures studied show an improvement in the results obtained for the yeast dataset. However, if datasets are composed of a huge number of genes, only one of them really improves the algorithm performance. This second case constitutes a clear option to explore interesting datasets from a clinical point of view.
Affiliation(s)
- Juan A Nepomuceno
- Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, Avd. Reina Mercedes s/n, Seville, 41012 Spain
- Alicia Troncoso
- Área de Informática, Universidad Pablo de Olavide, Ctra. Utrera km. 1, Seville, 41013 Spain
- Isabel A Nepomuceno-Chamorro
- Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, Avd. Reina Mercedes s/n, Seville, 41012 Spain
- Jesús S Aguilar-Ruiz
- Área de Informática, Universidad Pablo de Olavide, Ctra. Utrera km. 1, Seville, 41013 Spain
11
Houari A, Ayadi W, Ben Yahia S. A new FCA-based method for identifying biclusters in gene expression data. Int J Mach Learn Cybern 2018. [DOI: 10.1007/s13042-018-0794-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/27/2022]
12
Henriques R, Madeira SC. BSig: evaluating the statistical significance of biclustering solutions. Data Min Knowl Discov 2017. [DOI: 10.1007/s10618-017-0521-2] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Indexed: 11/29/2022]
13
Henriques R, Ferreira FL, Madeira SC. BicPAMS: software for biological data analysis with pattern-based biclustering. BMC Bioinformatics 2017; 18:82. [PMID: 28153040 PMCID: PMC5290636 DOI: 10.1186/s12859-017-1493-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 03/22/2016] [Accepted: 01/21/2017] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Biclustering has been largely applied for the unsupervised analysis of biological data, being recognised today as a key technique to discover putative modules in both expression data (subsets of genes correlated in subsets of conditions) and network data (groups of coherently interconnected biological entities). However, given its computational complexity, only recent breakthroughs on pattern-based biclustering enabled efficient searches without the restrictions that state-of-the-art biclustering algorithms place on the structure and homogeneity of biclusters. As a result, pattern-based biclustering provides the unprecedented opportunity to discover non-trivial yet meaningful biological modules with putative functions, whose coherency and tolerance to noise can be tuned and made problem-specific. METHODS To enable the effective use of pattern-based biclustering by the scientific community, we developed BicPAMS (Biclustering based on PAttern Mining Software), software that: 1) makes available state-of-the-art pattern-based biclustering algorithms (BicPAM (Henriques and Madeira, Alg Mol Biol 9:27, 2014), BicNET (Henriques and Madeira, Alg Mol Biol 11:23, 2016), BicSPAM (Henriques and Madeira, BMC Bioinforma 15:130, 2014), BiC2PAM (Henriques and Madeira, Alg Mol Biol 11:1-30, 2016), BiP (Henriques and Madeira, IEEE/ACM Trans Comput Biol Bioinforma, 2015), DeBi (Serin and Vingron, AMB 6:1-12, 2011) and BiModule (Okada et al., IPSJ Trans Bioinf 48(SIG5):39-48, 2007)); 2) consistently integrates their dispersed contributions; 3) further explores additional accuracy and efficiency gains; and 4) makes available graphical and application programming interfaces. RESULTS Results on both synthetic and real data confirm the relevance of BicPAMS for biological data analysis, highlighting its essential role for the discovery of putative modules with non-trivial yet biologically significant functions from expression and network data. CONCLUSIONS BicPAMS is the first biclustering tool offering the possibility to: 1) parametrically customize the structure, coherency and quality of biclusters; 2) analyze large-scale biological networks; and 3) tackle the restrictive assumptions placed by state-of-the-art biclustering algorithms. These contributions are shown to be key for an adequate, complete and user-assisted unsupervised analysis of biological data. SOFTWARE BicPAMS and its tutorial are available at http://www.bicpams.com.
Affiliation(s)
- Rui Henriques
- INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
- Sara C. Madeira
- INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal