1
Patrício A, Costa RS, Henriques R. Pattern-centric transformation of omics data grounded on discriminative gene associations aids predictive tasks in TCGA while ensuring interpretability. Biotechnol Bioeng 2024; 121:2881-2892. [PMID: 38859573 DOI: 10.1002/bit.28758] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2023] [Revised: 02/07/2024] [Accepted: 05/18/2024] [Indexed: 06/12/2024]
Abstract
The increasing prevalence of omics data sources is pushing the study of regulatory mechanisms underlying complex diseases such as cancer. However, the vast quantities of molecular features produced and the inherent interplay between them lead to a level of complexity that hampers both descriptive and predictive tasks, requiring custom-built algorithms that can extract relevant information from these sources of data. We propose a transformation that moves data centered on molecules (e.g., transcripts and proteins) to a new data space focused on putative regulatory modules given by statistically relevant co-expression patterns. To this end, the proposed transformation extracts patterns from the data through biclustering and uses them to create new variables with guarantees of interpretability and discriminative power. The transformation is shown to achieve dimensionality reductions of up to 99% and increase predictive performance of various classifiers across multiple omics layers. Results suggest that omics data transformations from gene-centric to pattern-centric data support both prediction tasks and human interpretation, notably contributing to precision medicine applications.
Affiliation(s)
- André Patrício
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
  - LAQV-REQUIMTE, Department of Chemistry, NOVA School of Science and Technology, NOVA University Lisbon, Caparica, Portugal
- Rafael S Costa
  - LAQV-REQUIMTE, Department of Chemistry, NOVA School of Science and Technology, NOVA University Lisbon, Caparica, Portugal
- Rui Henriques
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
2
Castanho EN, Aidos H, Madeira SC. Biclustering data analysis: a comprehensive survey. Brief Bioinform 2024; 25:bbae342. [PMID: 39007596 PMCID: PMC11247412 DOI: 10.1093/bib/bbae342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 05/16/2024] [Accepted: 07/01/2024] [Indexed: 07/16/2024] Open
Abstract
Biclustering, the simultaneous clustering of rows and columns of a data matrix, has proved its effectiveness in bioinformatics due to its capacity to produce local instead of global models, evolving from a key technique used in gene expression data analysis into one of the most used approaches for pattern discovery and identification of biological modules, used in both descriptive and predictive learning tasks. This survey presents a comprehensive overview of biclustering. It proposes an updated taxonomy for its fundamental components (bicluster, biclustering solution, biclustering algorithms, and evaluation measures) and applications. We unify scattered concepts in the literature with new definitions to accommodate the diversity of data types (such as tabular, network, and time series data) and the specificities of biological and biomedical data domains. We further propose a pipeline for biclustering data analysis and discuss practical aspects of incorporating biclustering in real-world applications. We highlight prominent application domains, particularly in bioinformatics, and identify typical biclusters to illustrate the analysis output. Moreover, we discuss important aspects to consider when choosing, applying, and evaluating a biclustering algorithm. We also relate biclustering with other data mining tasks (clustering, pattern mining, classification, triclustering, N-way clustering, and graph mining). Thus, it provides theoretical and practical guidance on biclustering data analysis, demonstrating its potential to uncover actionable insights from complex datasets.
Affiliation(s)
- Eduardo N Castanho
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Helena Aidos
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Sara C Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
3
Castanho EN, Lobo JP, Henriques R, Madeira SC. G-bic: generating synthetic benchmarks for biclustering. BMC Bioinformatics 2023; 24:457. [PMID: 38053078 PMCID: PMC10698934 DOI: 10.1186/s12859-023-05587-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/29/2023] [Accepted: 11/28/2023] [Indexed: 12/07/2023] Open
Abstract
BACKGROUND Biclustering is increasingly used in biomedical data analysis, recommendation tasks, and text mining domains, with hundreds of biclustering algorithms proposed. When assessing the performance of these algorithms, more than real datasets are required as they do not offer a solid ground truth. Synthetic data surpass this limitation by producing reference solutions to be compared with the found patterns. However, generating synthetic datasets is challenging since the generated data must ensure reproducibility, pattern representativity, and real data resemblance. RESULTS We propose G-Bic, a dataset generator conceived to produce synthetic benchmarks for the normative assessment of biclustering algorithms. Beyond expanding on aspects of pattern coherence, data quality, and positioning properties, it further handles specificities related to mixed-type datasets and time-series data. G-Bic has the flexibility to replicate real data regularities from diverse domains. We provide the default configurations to generate reproducible benchmarks to evaluate and compare diverse aspects of biclustering algorithms. Additionally, we discuss empirical strategies to simulate the properties of real data. CONCLUSION G-Bic is a parametrizable generator for biclustering analysis, offering a solid means to robustly assess biclustering solutions according to internal and external metrics.
Affiliation(s)
- Eduardo N Castanho
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
- João P Lobo
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
- Rui Henriques
  - INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1900-001, Lisbon, Portugal
- Sara C Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
4
DISA tool: Discriminative and informative subspace assessment with categorical and numerical outcomes. PLoS One 2022; 17:e0276253. [PMID: 36260602 PMCID: PMC9581374 DOI: 10.1371/journal.pone.0276253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2022] [Accepted: 10/03/2022] [Indexed: 11/19/2022] Open
Abstract
Pattern discovery and subspace clustering play a central role in the biological domain, supporting for instance putative regulatory module discovery from omics data for both descriptive and predictive ends. In the presence of target variables (e.g. phenotypes), regulatory patterns should further satisfy delineated discriminative power properties, well-established in the presence of categorical outcomes, yet largely disregarded for numerical outcomes, such as risk profiles and quantitative phenotypes. DISA (Discriminative and Informative Subspace Assessment), a Python software package, is proposed to evaluate patterns in the presence of numerical outcomes using well-established measures together with a novel principle able to statistically assess the correlation gain of the subspace against the overall space. Results confirm the possibility to soundly extend discriminative criteria towards numerical outcomes without the drawbacks commonly associated with discretization procedures. Results from four case studies confirm the validity and relevance of the proposed methods, further unveiling critical directions for research on biotechnology and biomedicine. Availability: DISA is freely available at https://github.com/JupitersMight/DISA under the MIT license.
5
Rodrigues P, Costa RS, Henriques R. Enrichment analysis on regulatory subspaces: A novel direction for the superior description of cellular responses to SARS-CoV-2. Comput Biol Med 2022; 146:105443. [PMID: 35533463 PMCID: PMC9040465 DOI: 10.1016/j.compbiomed.2022.105443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2021] [Revised: 03/13/2022] [Accepted: 03/20/2022] [Indexed: 12/16/2022]
Abstract
STATEMENT Enrichment analysis of cell transcriptional responses to SARS-CoV-2 infection from biclustering solutions yields broader coverage and superior enrichment of GO terms and KEGG pathways against alternative state-of-the-art machine learning solutions, thus aiding knowledge extraction. MOTIVATION AND METHODS The comprehensive understanding of the impacts of SARS-CoV-2 virus on infected cells is still incomplete. This work aims at comparing the role of state-of-the-art machine learning approaches in the study of cell regulatory processes affected and induced by the SARS-CoV-2 virus using transcriptomic data from both infectable cell lines available in public databases and in vivo samples. In particular, we assess the relevance of clustering, biclustering and predictive modeling methods for functional enrichment. Statistical principles to handle scarcity of observations, high data dimensionality, and complex gene interactions are further discussed. In particular, and without loss of generality, the proposed methods are applied to study the differential regulatory response of lung cell lines to SARS-CoV-2 (α-variant) against RSV, IAV (H1N1), and HPIV3 viruses. RESULTS Gathered results show that, although clustering and predictive algorithms aid classic stances to functional enrichment analysis, more recent pattern-based biclustering algorithms significantly improve the number and quality of enriched GO terms and KEGG pathways with controlled false positive risks. Additionally, a comparative analysis of these results is performed to identify potential pathophysiological characteristics of COVID-19. These are further compared to those identified by other authors for the same virus as well as related ones such as SARS-CoV-1. The findings are particularly relevant given the lack of other works utilizing more complex machine learning algorithms within this context.
Affiliation(s)
- Pedro Rodrigues
  - IDMEC, Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal; INESC-ID and Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal
- Rafael S Costa
  - IDMEC, Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal; LAQV-REQUIMTE, DQ, NOVA School of Science and Technology, Caparica, Portugal
- Rui Henriques
  - INESC-ID and Instituto Superior Tecnico, Universidade de Lisboa, Lisbon, Portugal
6
Abstract
Sensors deployed within water distribution systems collect consumption data that enable the application of data analysis techniques to extract essential information. Time series clustering has been traditionally applied for modeling end-user water consumption profiles to aid water management. However, its effectiveness is limited by the diversity and local nature of consumption patterns. In addition, existing techniques cannot adequately handle changes in household composition, disruptive events (e.g., vacations), and consumption dynamics at different time scales. In this context, biclustering approaches provide a natural alternative to detect groups of end-users with coherent consumption profiles during local time periods while addressing the aforementioned limitations. This work discusses when, why and how to apply biclustering techniques for water consumption data analysis, and further proposes a methodology to this end. To the best of our knowledge, this is the first work introducing biclustering to water consumption data analysis. Results on data from a real-world water distribution system—Quinta do Lago, Portugal—confirm the potentialities of the proposed approach for pattern discovery with guarantees of statistical significance and robustness that entities can rely on for strategic planning.
7
Castanho EN, Aidos H, Madeira SC. Biclustering fMRI time series: a comparative study. BMC Bioinformatics 2022; 23:192. [PMID: 35606701 PMCID: PMC9126639 DOI: 10.1186/s12859-022-04733-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/27/2021] [Accepted: 05/13/2022] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND The effectiveness of biclustering, simultaneous clustering of rows and columns in a data matrix, was shown in gene expression data analysis. Several researchers recognize its potentialities in other research areas. Nevertheless, the last two decades have witnessed the development of a significant number of biclustering algorithms targeting gene expression data analysis and a lack of consistent studies exploring the capacities of biclustering outside this traditional application domain. RESULTS This work evaluates the potential use of biclustering in fMRI time series data, targeting the Region × Time dimensions by comparing seven state-of-the-art biclustering and three traditional clustering algorithms on artificial and real data. It further proposes a methodology for biclustering evaluation beyond gene expression data analysis. The results, which cover the use of different search strategies in both artificial and real fMRI time series, showed the superiority of exhaustive biclustering approaches, which obtained the most homogeneous biclusters. However, their high computational costs are a challenge, and further work is needed for the efficient use of biclustering in fMRI data analysis. CONCLUSIONS This work pinpoints avenues for the use of biclustering in spatio-temporal data analysis, in particular neuroscience applications. The proposed evaluation methodology showed evidence of the effectiveness of biclustering in finding local patterns in fMRI time series data. Further work is needed regarding scalability to promote the application in real scenarios.
Affiliation(s)
- Helena Aidos
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
- Sara C. Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
8
Chang H, Zhang H, Zhang T, Su L, Qin QM, Li G, Li X, Wang L, Zhao T, Zhao E, Zhao H, Liu Y, Stacey G, Xu D. A Multi-Level Iterative Bi-Clustering Method for Discovering miRNA Co-regulation Network of Abiotic Stress Tolerance in Soybeans. FRONTIERS IN PLANT SCIENCE 2022; 13:860791. [PMID: 35463453 PMCID: PMC9021755 DOI: 10.3389/fpls.2022.860791] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/23/2022] [Accepted: 02/24/2022] [Indexed: 06/14/2023]
Abstract
Although growing evidence shows that microRNA (miRNA) regulates plant growth and development, miRNA regulatory networks in plants are not well understood. Current experimental studies cannot characterize miRNA regulatory networks on a large scale. This information gap provides an excellent opportunity to employ computational methods for global analysis and generate valuable models and hypotheses. To address this opportunity, we collected miRNA-target interactions (MTIs) and used MTIs from Arabidopsis thaliana and Medicago truncatula to predict homologous MTIs in soybeans, resulting in 80,235 soybean MTIs in total. A multi-level iterative bi-clustering method was developed to identify 483 soybean miRNA-target regulatory modules (MTRMs). Furthermore, we collected soybean miRNA expression data and corresponding gene expression data in response to abiotic stresses. By clustering these data, 37 MTRMs related to abiotic stresses were identified, including stress-specific MTRMs and shared MTRMs. These MTRMs have gene ontology (GO) enrichment in resistance response, iron transport, positive growth regulation, etc. Our study predicts soybean MTRMs and miRNA-GO networks under different stresses, and provides miRNA targeting hypotheses for experimental analyses. The method can be applied to other biological processes and other plants to elucidate miRNA co-regulation mechanisms.
Affiliation(s)
- Haowu Chang
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Hao Zhang
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Tianyue Zhang
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Lingtao Su
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, China
- Qing-Ming Qin
  - College of Plant Sciences and Key Laboratory of Zoonosis Research, Ministry of Education, Jilin University, Jilin, China
- Guihua Li
  - College of Plant Sciences and Key Laboratory of Zoonosis Research, Ministry of Education, Jilin University, Jilin, China
- Xueqing Li
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Li Wang
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Tianheng Zhao
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Enshuang Zhao
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Hengyi Zhao
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Yuanning Liu
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Gary Stacey
  - Division of Plant Sciences and Technology, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Dong Xu
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
9
Abstract
Biclustering is a two-dimensional data analysis technique that, applied to a matrix, searches for a subset of rows and columns that intersect to produce a submatrix with given, expected features. Such an approach requires different methods from those of typical classification or regression tasks. In recent years it has become possible to express biclustering goals in the form of Boolean reasoning. This paper presents a new, heuristic approach to bicluster induction in binary data.
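To make the definition in this abstract concrete, the following is a minimal sketch of what "a subset of rows and columns that intersect to produce a submatrix with given, expected features" can mean for binary data. It is not the heuristic proposed in the paper: the function name `is_constant_bicluster`, the toy matrix `data`, and the choice of "all cells equal 1" as the expected feature are assumptions made purely for illustration.

```python
import numpy as np

def is_constant_bicluster(matrix, rows, cols, value=1):
    """Return True if every cell at the intersection of `rows` and `cols`
    equals `value` (hypothetical helper; the 'expected feature' here is
    simply a constant submatrix)."""
    sub = matrix[np.ix_(rows, cols)]  # submatrix at the row/column intersection
    return bool(np.all(sub == value))

# Toy binary matrix: 3 rows (objects) x 4 columns (attributes).
data = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
])

print(is_constant_bicluster(data, [0, 1], [0, 1, 3]))  # True: all ones
print(is_constant_bicluster(data, [0, 2], [0, 1]))     # False: data[2, 0] == 0
```

Checking a candidate is cheap; the hard part that biclustering algorithms address is searching the exponential space of row and column subsets for candidates that pass such a check.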
10
Alexandre L, Costa RS, Henriques R. DI2: prior-free and multi-item discretization of biological data and its applications. BMC Bioinformatics 2021; 22:426. [PMID: 34496758 PMCID: PMC8425008 DOI: 10.1186/s12859-021-04329-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/07/2021] [Accepted: 08/23/2021] [Indexed: 11/24/2022] Open
Abstract
Background A considerable number of data mining approaches for biomedical data analysis, including state-of-the-art associative models, require a form of data discretization. Although diverse discretization approaches have been proposed, they generally work under a strict set of statistical assumptions which are arguably insufficient to handle the diversity and heterogeneity of clinical and molecular variables within a given dataset. In addition, although an increasing number of symbolic approaches in bioinformatics are able to assign multiple items to values occurring near discretization boundaries for superior robustness, there are no reference principles on how to perform multi-item discretizations. Results In this study, an unsupervised discretization method, DI2, for variables with arbitrarily skewed distributions is proposed. Statistical tests applied to assess differences in performance confirm that DI2 generally outperforms well-established discretization methods with statistical significance. Within classification tasks, DI2 displays either competitive or superior levels of predictive accuracy, particularly for classifiers able to accommodate border values. Conclusions This work proposes a new unsupervised method for data discretization, DI2, that takes into account the underlying data regularities, the presence of outlier values disrupting expected regularities, as well as the relevance of border values. DI2 is available at https://github.com/JupitersMight/DI2 Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04329-8.
Affiliation(s)
- Leonardo Alexandre
  - IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1049-001, Lisbon, Portugal
  - INESC-ID, Lisbon, Portugal
  - Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
- Rafael S Costa
  - IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1049-001, Lisbon, Portugal
  - LAQV-REQUIMTE, DQ, NOVA School of Science and Technology, Universidade NOVA de Lisboa, 2829-516, Caparica, Portugal
- Rui Henriques
  - INESC-ID, Lisbon, Portugal
  - Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
11
Alexandre L, Costa RS, Santos LL, Henriques R. Mining Pre-Surgical Patterns Able to Discriminate Post-Surgical Outcomes in the Oncological Domain. IEEE J Biomed Health Inform 2021; 25:2421-2434. [PMID: 33687853 DOI: 10.1109/jbhi.2021.3064786] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
Abstract
Understanding the individualized risks of undertaking surgical procedures is essential to personalize preparatory, intervention and post-care protocols for minimizing post-surgical complications. This knowledge is key in oncology given the nature of interventions, the fragile profile of patients with comorbidities and cytotoxic drug exposure, and the possible cancer recurrence. Despite its relevance, the discovery of discriminative patterns of post-surgical risk is hampered by major challenges: i) the unique physiological and demographic profile of individuals, as well as their differentiated post-surgical care; ii) the high-dimensionality and heterogeneous nature of available biomedical data, combining non-identically distributed risk factors, clinical and molecular variables; iii) the need to generalize, given that tumors have significant histopathological differences and individuals undergo unique surgical procedures; iv) the need to focus on non-trivial patterns of post-surgical risk, while guaranteeing their statistical significance and discriminative power; and v) the lack of interpretability and actionability of current approaches. Biclustering, the discovery of groups of individuals correlated on subsets of variables, has unique properties of interest, being positioned to address the aforementioned challenges. In this context, this work proposes a structured view on why, when and how to apply biclustering to mine discriminative patterns of post-surgical risk with guarantees of usability, a subject remaining unexplored to date. These patterns offer a comprehensive view on how the patient profile, cancer histopathology and entailed surgical procedures determine: i) post-surgical complications, ii) survival, and iii) hospitalization needs. The gathered results confirm the role of biclustering in comprehensively finding interpretable, actionable and statistically significant patterns of post-surgical risk. The found patterns are already assisting healthcare professionals at IPO-Porto to establish specialized pre-habilitation protocols and bedside care.
12
Understanding the Impacts of the COVID-19 Pandemic on Public Transportation Travel Patterns in the City of Lisbon. SUSTAINABILITY 2021. [DOI: 10.3390/su13158342] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/14/2022]
Abstract
The ongoing COVID-19 pandemic is creating disruptive changes in urban mobility that may compromise the sustainability of the public transportation system. As a result, cities worldwide face the need to integrate data from different transportation modes to dynamically respond to changing conditions. This article combines statistical views with machine learning advances to comprehensively explore changing urban mobility dynamics within multimodal public transportation systems from user trip records. In particular, we retrieve discriminative traffic patterns with order-preserving coherence to model disruptions to demand expectations across geographies and show their utility to describe changing mobility dynamics with strict guarantees of statistical significance, interpretability and actionability. This methodology is applied to comprehensively trace the changes to the urban mobility patterns in the city of Lisbon brought by the current COVID-19 pandemic. To this end, we consider passenger trip data gathered from the three major public transportation modes: subway, bus, and tramways. The gathered results comprehensively reveal novel travel patterns within the city, such as imbalanced demand distribution towards the city peripheries, going far beyond simplistic localized changes to the magnitude of traffic demand. This work offers a novel methodological contribution with a solid statistical ground for the spatiotemporal assessment of actionable mobility changes and provides essential insights for other cities and public transport operators facing similar mobility challenges.
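The "order-preserving coherence" mentioned in this abstract can be illustrated with a small check: a set of rows is order-preserving on a set of columns when every row ranks those columns in the same order, regardless of magnitude. The sketch below is an illustration under stated assumptions, not the article's method: the helper `is_order_preserving`, the toy matrix `trips`, and the tie-handling rule (ties broken by column position through a stable sort) are all hypothetical.

```python
import numpy as np

def is_order_preserving(matrix, rows, cols):
    """Return True if all selected rows induce the same ordering of the
    selected columns (hypothetical helper; ties are broken by column
    position via a stable sort, which is an assumption of this sketch)."""
    sub = matrix[np.ix_(rows, cols)]
    orders = np.argsort(sub, axis=1, kind="stable")  # column ranking per row
    return bool(np.all(orders == orders[0]))

# Toy demand matrix: rows could be stations, columns time periods.
trips = np.array([
    [10, 30, 20],
    [ 1,  5,  3],
    [ 7,  2,  9],
])

print(is_order_preserving(trips, [0, 1], [0, 1, 2]))  # True: same ranking, different scale
print(is_order_preserving(trips, [0, 2], [0, 1, 2]))  # False: row 2 ranks the columns differently
```

Rows 0 and 1 differ by an order of magnitude yet share the same rise-and-fall shape, which is the kind of pattern a magnitude-based comparison would miss.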
13
Maâtouk O, Ayadi W, Bouziri H, Duval B. Evolutionary Local Search Algorithm for the biclustering of gene expression data based on biological knowledge. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107177] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/22/2022]
14
Mandal K, Sarmah R, Bhattacharyya DK, Kalita JK, Borah B. Rank-preserving biclustering algorithm: a case study on miRNA breast cancer. Med Biol Eng Comput 2021; 59:989-1004. [PMID: 33840048 DOI: 10.1007/s11517-020-02271-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/19/2019] [Accepted: 09/15/2020] [Indexed: 10/21/2022]
Abstract
Effective biomarkers aid in the early diagnosis and monitoring of breast cancer and thus play an important role in the treatment of patients suffering from the disease. Growing evidence indicates that alteration of expression levels of miRNA is one of the principal causes of cancer. We analyze breast cancer miRNA data to discover a list of biclusters as well as breast cancer miRNA biomarkers which can help to better understand this critical disease and make important clinical decisions for treatment and diagnosis. In this paper, we propose a pattern-based parallel biclustering algorithm termed Rank-Preserving Biclustering (RPBic). The key strategy is to identify rank-preserved rows under a subset of columns based on a modified version of all substrings common subsequence (ALCS) framework. To illustrate the effectiveness of the RPBic algorithm, we consider synthetic datasets and show that RPBic outperforms relevant biclustering algorithms in terms of relevance and recovery. For breast cancer data, we identify 68 biclusters and establish that they have strong clinical characteristics among the samples. The differentially co-expressed miRNAs are found to be involved in KEGG cancer related pathways. Moreover, we identify frequency-based biomarkers (hsa-miR-410, hsa-miR-483-5p) and network-based biomarkers (hsa-miR-454, hsa-miR-137) which we validate to have strong connectivity with breast cancer. The source code and the datasets used can be found at http://agnigarh.tezu.ernet.in/~rosy8/Bioinformatics_RPBic_Data.rar .
Affiliation(s)
- Koyel Mandal
  - Department of Computer Science and Engineering, Tezpur University, Assam, India
- Rosy Sarmah
  - Department of Computer Science and Engineering, Tezpur University, Assam, India
- Jugal Kumar Kalita
  - Department of Computer Science, University of Colorado, Colorado Springs, CO, USA
- Bhogeswar Borah
  - Department of Computer Science and Engineering, Tezpur University, Assam, India
15
Hashem T, Rashidi L, Kulik L, Bailey J. PRESS: A personalised approach for mining top-k groups of objects with subspace similarity. DATA KNOWL ENG 2020. [DOI: 10.1016/j.datak.2020.101833] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
16
Xie J, Ma A, Fennell A, Ma Q, Zhao J. It is time to apply biclustering: a comprehensive review of biclustering applications in biological and biomedical data. Brief Bioinform 2020; 20:1449-1464. [PMID: 29490019 DOI: 10.1093/bib/bby014] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 11/15/2017] [Revised: 01/16/2018] [Indexed: 12/12/2022] Open
Abstract
Biclustering is a powerful data mining technique that allows clustering of rows and columns, simultaneously, in a matrix-format data set. It was first applied to gene expression data in 2000, aiming to identify co-expressed genes under a subset of all the conditions/samples. During the past 17 years, tens of biclustering algorithms and tools have been developed to enhance the ability to make sense out of large data sets generated in the wake of high-throughput omics technologies. These algorithms and tools have been applied to a wide variety of data types, including but not limited to, genomes, transcriptomes, exomes, epigenomes, phenomes and pharmacogenomes. However, there is still a considerable gap between biclustering methodology development and comprehensive data interpretation, mainly because of the lack of knowledge for the selection of appropriate biclustering tools and further supporting computational techniques in specific studies. Here, we first deliver a brief introduction to the existing biclustering algorithms and tools in the public domain, and then systematically summarize the basic applications of biclustering for biological data and more advanced applications of biclustering for biomedical data. This review will help researchers to analyze their big data effectively and to generate valuable biological knowledge and novel insights with higher efficiency.
|
17
|
Jose JM, Yilmaz E, Magalhães J, Castells P, Ferro N, Silva MJ, Martins F. Moving from Formal Towards Coherent Concept Analysis: Why, When and How. Lecture Notes in Computer Science 2020. [PMCID: PMC7148255 DOI: 10.1007/978-3-030-45439-5_19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/30/2022]
Abstract
Formal concept analysis has been largely applied to explore taxonomic relationships and derive ontologies from text collections. Despite its recognized relevance, it generally misses relevant concept associations and suffers from the need to learn from Boolean space models. Biclustering, the discovery of coherent concept associations (subsets of documents correlated on subsets of terms and topics), is here suggested to address the aforementioned problems. This work proposes a structured view on why, when and how to apply biclustering for concept analysis, a subject that remains largely unexplored to date. Results gathered from a large text collection confirm the relevance of biclustering to find less-trivial, yet actionable and statistically significant concept associations.
|
18
|
Bottarelli L, Bicego M, Denitto M, Di Pierro A, Farinelli A, Mengoni R. Biclustering with a quantum annealer. Soft Comput 2018. [DOI: 10.1007/s00500-018-3034-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/25/2022]
|
19
|
Alzahrani M, Kuwahara H, Wang W, Gao X. Gracob: a novel graph-based constant-column biclustering method for mining growth phenotype data. Bioinformatics 2018; 33:2523-2531. [PMID: 28379298 PMCID: PMC5870648 DOI: 10.1093/bioinformatics/btx199] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 12/02/2016] [Accepted: 04/03/2017] [Indexed: 11/24/2022]
Abstract
Motivation: Growth phenotype profiling of genome-wide gene-deletion strains over stress conditions can offer a clear picture that the essentiality of genes depends on environmental conditions. Systematically identifying groups of genes from such high-throughput data that share similar patterns of conditional essentiality and dispensability under various environmental conditions can elucidate how genetic interactions of the growth phenotype are regulated in response to the environment. Results: We first demonstrate that detecting such ‘co-fit’ gene groups can be cast as a less well-studied problem in biclustering, i.e. constant-column biclustering. Despite significant advances in biclustering techniques, very few were designed for mining in growth phenotype data. Here, we propose Gracob, a novel, efficient graph-based method that casts and solves the constant-column biclustering problem as a maximal clique finding problem in a multipartite graph. We compared Gracob with a large collection of widely used biclustering methods that cover different types of algorithms designed to detect different types of biclusters. Gracob showed superior performance on finding co-fit genes over all the existing methods on both a variety of synthetic data sets with a wide range of settings, and three real growth phenotype datasets for E. coli, proteobacteria and yeast. Availability and Implementation: Our program is freely available for download at http://sfb.kaust.edu.sa/Pages/Software.aspx. Supplementary information: Supplementary data are available at Bioinformatics online.
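To make the constant-column pattern concrete, the following minimal sketch checks whether a chosen set of rows and columns forms a constant-column bicluster. It illustrates only the pattern definition; it does not implement Gracob's maximal-clique search on a multipartite graph. The function name, the tolerance parameter `tol`, and the toy matrix are assumptions made for illustration.

```python
def is_constant_column_bicluster(matrix, rows, cols, tol=0.0):
    """Return True if, in every selected column, the selected rows differ by at most `tol`."""
    if not rows or not cols:
        return False  # an empty selection is not treated as a bicluster here
    for c in cols:
        values = [matrix[r][c] for r in rows]
        if max(values) - min(values) > tol:
            return False
    return True


# Toy growth-phenotype matrix (rows = genes, columns = conditions); values are made up.
toy = [
    [1, 0, 2, 5],
    [1, 0, 2, 7],
    [3, 0, 2, 5],
]
exact = is_constant_column_bicluster(toy, rows=[0, 1], cols=[0, 1, 2])
broken = is_constant_column_bicluster(toy, rows=[0, 1, 2], cols=[0, 1, 2])
tolerant = is_constant_column_bicluster(toy, rows=[0, 1], cols=[3], tol=2)
```

Genes 0 and 1 share the same value in each of the first three conditions, so they form a constant-column bicluster there; adding gene 2 breaks the pattern in column 0. The `tol` argument is one simple way to admit noisy values and is not taken from the paper.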
Affiliation(s)
- Majed Alzahrani
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal, 23955-6900, Saudi Arabia
- Hiroyuki Kuwahara
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal, 23955-6900, Saudi Arabia
- Wei Wang
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA
- Xin Gao
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal, 23955-6900, Saudi Arabia
|
20
|
Houari A, Ayadi W, Ben Yahia S. A new FCA-based method for identifying biclusters in gene expression data. Int J Mach Learn Cyb 2018. [DOI: 10.1007/s13042-018-0794-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/27/2022]
|
21
|
Henriques R, Madeira SC. BSig: evaluating the statistical significance of biclustering solutions. Data Min Knowl Discov 2017. [DOI: 10.1007/s10618-017-0521-2] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Indexed: 11/29/2022]
|
22
|
Henriques R, Ferreira FL, Madeira SC. BicPAMS: software for biological data analysis with pattern-based biclustering. BMC Bioinformatics 2017; 18:82. [PMID: 28153040 PMCID: PMC5290636 DOI: 10.1186/s12859-017-1493-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 03/22/2016] [Accepted: 01/21/2017] [Indexed: 12/21/2022]
Abstract
BACKGROUND: Biclustering has been largely applied for the unsupervised analysis of biological data, being recognised today as a key technique to discover putative modules in both expression data (subsets of genes correlated in subsets of conditions) and network data (groups of coherently interconnected biological entities). However, given its computational complexity, only recent breakthroughs on pattern-based biclustering enabled efficient searches without the restrictions that state-of-the-art biclustering algorithms place on the structure and homogeneity of biclusters. As a result, pattern-based biclustering provides the unprecedented opportunity to discover non-trivial yet meaningful biological modules with putative functions, whose coherency and tolerance to noise can be tuned and made problem-specific. METHODS: To enable the effective use of pattern-based biclustering by the scientific community, we developed BicPAMS (Biclustering based on PAttern Mining Software), software that: 1) makes available state-of-the-art pattern-based biclustering algorithms (BicPAM (Henriques and Madeira, Alg Mol Biol 9:27, 2014), BicNET (Henriques and Madeira, Alg Mol Biol 11:23, 2016), BicSPAM (Henriques and Madeira, BMC Bioinforma 15:130, 2014), BiC2PAM (Henriques and Madeira, Alg Mol Biol 11:1-30, 2016), BiP (Henriques and Madeira, IEEE/ACM Trans Comput Biol Bioinforma, 2015), DeBi (Serin and Vingron, AMB 6:1-12, 2011) and BiModule (Okada et al., IPSJ Trans Bioinf 48(SIG5):39-48, 2007)); 2) consistently integrates their dispersed contributions; 3) further explores additional accuracy and efficiency gains; and 4) makes available graphical and application programming interfaces. RESULTS: Results on both synthetic and real data confirm the relevance of BicPAMS for biological data analysis, highlighting its essential role for the discovery of putative modules with non-trivial yet biologically significant functions from expression and network data. CONCLUSIONS: BicPAMS is the first biclustering tool offering the possibility to: 1) parametrically customize the structure, coherency and quality of biclusters; 2) analyze large-scale biological networks; and 3) tackle the restrictive assumptions placed by state-of-the-art biclustering algorithms. These contributions are shown to be key for an adequate, complete and user-assisted unsupervised analysis of biological data. SOFTWARE: BicPAMS and its tutorial are available at http://www.bicpams.com.
Affiliation(s)
- Rui Henriques
- INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
- Sara C. Madeira
- INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
|
23
|
Veroneze R, Banerjee A, Von Zuben FJ. Enumerating all maximal biclusters in numerical datasets. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2016.10.029] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/26/2022]
|
24
|
Henriques R, Madeira SC. BiC2PAM: constraint-guided biclustering for biological data analysis with domain knowledge. Algorithms Mol Biol 2016; 11:23. [PMID: 27651825 PMCID: PMC5024481 DOI: 10.1186/s13015-016-0085-5] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 02/02/2016] [Accepted: 08/16/2016] [Indexed: 11/10/2022]
Abstract
Background: Biclustering has been largely used in biological data analysis, enabling the discovery of putative functional modules from omic and network data. Despite the recognized importance of incorporating domain knowledge to guide biclustering and guarantee a focus on relevant and non-trivial biclusters, this possibility has not yet been comprehensively addressed. This results from the fact that the majority of existing algorithms are only able to deliver sub-optimal solutions with restrictive assumptions on the structure, coherency and quality of biclustering solutions, thus preventing the up-front satisfaction of knowledge-driven constraints. Interestingly, in recent years, a clearer understanding of the synergies between pattern mining and biclustering gave rise to a new class of algorithms, termed pattern-based biclustering algorithms. These algorithms, able to efficiently discover flexible biclustering solutions with optimality guarantees, are thus positioned as good candidates for knowledge incorporation. In this context, this work aims to bridge the current lack of solid views on the use of background knowledge to guide (pattern-based) biclustering tasks. Methods: This work extends (pattern-based) biclustering algorithms to guarantee the satisfiability of constraints derived from background knowledge and to effectively explore efficiency gains from their incorporation. In this context, we first show the relevance of constraints with succinct, (anti-)monotone and convertible properties for the analysis of expression data and biological networks. We further show how pattern-based biclustering algorithms can be adapted to effectively prune the search space in the presence of such constraints, as well as be guided in the presence of biological annotations. Relying on these contributions, we propose BiClustering with Constraints using PAttern Mining (BiC2PAM), an extension of the BicPAM and BicNET biclustering algorithms. Results: Experimental results on biological data demonstrate the importance of incorporating knowledge within biclustering to foster efficiency and enable the discovery of non-trivial biclusters with heightened biological relevance. Conclusions: This work provides the first comprehensive view and sound algorithm for biclustering biological data with constraints derived from user expectations, knowledge repositories and/or literature.
|
25
|
Moore EJ, Bourlai T. Expectation Maximization of Frequent Patterns, a Specific, Local, Pattern-Based Biclustering Algorithm for Biological Datasets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:812-824. [PMID: 26701897 DOI: 10.1109/tcbb.2015.2510011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
Currently, binary biclustering algorithms are too slow and non-specific to handle biological datasets that have a large number of attributes, which is essential for the computational biology problem of microarray analysis. Specialized computers may be needed to execute an algorithm, and may fail to produce a solution, due to its large resource needs. The biclusters also include too many false positives, the type I error, which hinders biological discovery. We propose an algorithm that can analyze datasets with a large attribute set at different densities, and can operate on a laptop, which makes it accessible to practitioners. EMFP produces biclusters that have a very low Root Mean Squared Error and false positive rate, with very few type II errors. Our binary biclustering algorithm is a hybrid, axis-parallel, pattern-based algorithm that finds multiple, non-overlapping, near-constant, deterministic, binary submatrices, with a variable confidence threshold, and the novel use of local density comparisons versus the standard global threshold. EMFP introduces a new and intuitive way to calculate internal measures for binary biclustering methods. We also introduce a framework to ease comparison with other algorithms, and compare to both binary and general biclustering algorithms using two real and 80 synthetic databases.
|
26
|
Henriques R, Madeira SC. BicNET: Flexible module discovery in large-scale biological networks using biclustering. Algorithms Mol Biol 2016; 11:14. [PMID: 27213009 PMCID: PMC4875761 DOI: 10.1186/s13015-016-0074-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 12/11/2015] [Accepted: 04/22/2016] [Indexed: 02/08/2023]
Abstract
BACKGROUND: Despite the recognized importance of module discovery in biological networks to enhance our understanding of complex biological systems, existing methods generally suffer from two major drawbacks. First, there is a focus on modules where biological entities are strongly connected, leading to the discovery of trivial/well-known modules and to the inaccurate exclusion of biological entities with subtler yet relevant roles. Second, there is a generalized intolerance towards different forms of noise, including uncertainty associated with less-studied biological entities (in the context of literature-driven networks) and experimental noise (in the context of data-driven networks). Although state-of-the-art biclustering algorithms are able to discover modules with varying coherency and robustness to noise, their application for the discovery of non-dense modules in biological networks has been poorly explored and it is further challenged by efficiency bottlenecks. METHODS: This work proposes Biclustering NETworks (BicNET), a biclustering algorithm to discover non-trivial yet coherent modules in weighted biological networks with heightened efficiency. Three major contributions are provided. First, we motivate the relevance of discovering network modules given by constant, symmetric, plaid and order-preserving biclustering models. Second, we propose an algorithm to discover these modules and to robustly handle noisy and missing interactions. Finally, we provide new searches to tackle time and memory bottlenecks by effectively exploring the inherent structural sparsity of network data. RESULTS: Results in synthetic network data confirm the soundness, efficiency and superiority of BicNET. The application of BicNET on protein interaction and gene interaction networks from yeast, E. coli and Human reveals new modules with heightened biological significance. CONCLUSIONS: BicNET is, to our knowledge, the first method enabling the efficient unsupervised analysis of large-scale network data for the discovery of coherent modules with parameterizable homogeneity.
Affiliation(s)
- Rui Henriques
- INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
- Sara C. Madeira
- INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
|
27
|
Tu X, Wang Y, Zhang M, Wu J. Using Formal Concept Analysis to Identify Negative Correlations in Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:380-391. [PMID: 27045834 DOI: 10.1109/tcbb.2015.2443805] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]
Abstract
Recently, many biological studies reported that two groups of genes tend to show negatively correlated or opposite expression tendencies in many biological processes or pathways. The negative correlation between genes may imply an important biological mechanism. In this study, we proposed an FCA-based negative correlation algorithm (NCFCA) that can effectively identify opposite expression tendencies between two gene groups in gene expression data. After applying it to expression data of cell cycle-regulated genes in yeast, we found that six minichromosome maintenance (MCM) family genes showed an expression tendency opposite to that of eight core histone family genes. Furthermore, we confirmed that the negative correlation expression pattern between these two families may be conserved in the cell cycle. Finally, we discussed the reasons underlying the negative correlation of the six MCM family genes with the eight core histone family genes. Our results revealed that negative correlation is an important and potential mechanism that maintains the balance of biological systems by repressing some genes while inducing others. It can thus provide new understanding of gene expression and regulation, the causes of diseases, etc.
|
28
|
Pontes B, Giráldez R, Aguilar-Ruiz JS. Biclustering on expression data: A review. J Biomed Inform 2015; 57:163-80. [PMID: 26160444 DOI: 10.1016/j.jbi.2015.06.028] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.0] [Received: 01/22/2015] [Revised: 06/22/2015] [Accepted: 06/30/2015] [Indexed: 11/28/2022]
Abstract
Biclustering has become a popular technique for the study of gene expression data, especially for discovering functionally related gene sets under different subsets of experimental conditions. Most biclustering approaches use a measure or cost function that determines the quality of biclusters. In such cases, the development of both a suitable heuristic and a good measure for guiding the search is essential for discovering interesting biclusters in an expression matrix. Nevertheless, not all existing biclustering approaches base their search on evaluation measures for biclusters. There exists a diverse set of biclustering tools that follow different strategies and algorithmic concepts which guide the search towards meaningful results. In this paper we present an extensive survey of biclustering approaches, classifying them into two categories according to whether or not they use evaluation metrics within the search method: biclustering algorithms based on evaluation measures and non-metric-based biclustering algorithms. In both cases, they have been classified according to the type of meta-heuristics on which they are based.
Affiliation(s)
- Beatriz Pontes
- Department of Languages and Computer Systems, University of Seville, Seville, Spain
- Raúl Giráldez
- School of Engineering, Pablo de Olavide University, Seville, Spain
|
29
|
Henriques R, Madeira SC. Biclustering with Flexible Plaid Models to Unravel Interactions between Biological Processes. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015; 12:738-752. [PMID: 26357312 DOI: 10.1109/tcbb.2014.2388206] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 06/05/2023]
Abstract
Genes can participate in multiple biological processes at a time and thus their expression can be seen as a composition of the contributions from the active processes. Biclustering under a plaid assumption allows the modeling of interactions between transcriptional modules or biclusters (subsets of genes with coherence across subsets of conditions) by assuming an additive composition of contributions in their overlapping areas. Despite the biological interest of plaid models, few biclustering algorithms consider plaid effects and, when they do, they place restrictions on the allowed types and structures of biclusters, and suffer from robustness problems by seizing exact additive matchings. We propose BiP (Biclustering using Plaid models), a biclustering algorithm with relaxations to allow expression levels to change in overlapping areas according to biologically meaningful assumptions (weighted and noise-tolerant composition of contributions). BiP can be used over existing biclustering solutions (seizing their benefits) as it is able to recover excluded areas due to unaccounted plaid effects and detect noisy areas non-explained by a plaid assumption, thus producing an explanatory model of overlapping transcriptional activity. Experiments on synthetic data support BiP's efficiency and effectiveness. The learned models from expression data unravel meaningful and non-trivial functional interactions between biological processes associated with putative regulatory modules.
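The plaid assumption described in this abstract, an additive composition of bicluster contributions in overlapping areas, can be sketched as follows. This shows only the exact additive case, not BiP's weighted and noise-tolerant relaxations. The function name, the bicluster tuple layout, and the contribution values are hypothetical.

```python
def compose_plaid(n_rows, n_cols, biclusters, background=0.0):
    """Build a matrix in which the contributions of overlapping biclusters add up."""
    matrix = [[background] * n_cols for _ in range(n_rows)]
    for rows, cols, contribution in biclusters:
        for r in rows:
            for c in cols:
                matrix[r][c] += contribution
    return matrix


# Two made-up biclusters given as (rows, columns, contribution); they overlap at cell (1, 1).
b1 = ([0, 1], [0, 1], 2.0)
b2 = ([1, 2], [1, 2], 3.0)
m = compose_plaid(3, 3, [b1, b2])
```

Cell (1, 1) belongs to both biclusters and therefore receives 2.0 + 3.0 = 5.0, while cells covered by a single bicluster keep that bicluster's contribution alone.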
|
30
|
Henriques R, Madeira SC. Pattern-Based Biclustering with Constraints for Gene Expression Data Analysis. Progress in Artificial Intelligence 2015. [DOI: 10.1007/978-3-319-23485-4_34] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|