1
|
Waqar M, Ayub M. A personalized reinforcement learning recommendation algorithm using bi-clustering techniques. PLoS One 2025; 20:e0315533. [PMID: 39977407 PMCID: PMC11841880 DOI: 10.1371/journal.pone.0315533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2024] [Accepted: 11/26/2024] [Indexed: 02/22/2025]
Abstract
Recommender systems (RSs) have become a core component of many online platforms, helping users find relevant information in abundant digital data. Traditional RSs often generate static recommendations, which may not adapt well to changing user preferences. To address this problem, we propose a novel reinforcement learning (RL) recommendation algorithm that gives personalized recommendations by adapting to changing user preferences. However, a significant drawback of RL-based recommendation systems is that they are computationally expensive. Moreover, these systems often fail to extract local patterns residing within the dataset, which may result in low-quality recommendations. The proposed work uses a biclustering technique to create an efficient environment for RL agents, thus reducing computation cost and enabling the generation of dynamic recommendations. Additionally, biclustering is used to find locally associated patterns in the dataset, which further improves the efficiency of the RL agent's learning process. The proposed work experiments with eight state-of-the-art biclustering algorithms to identify the appropriate one for the given recommendation task. This integration of biclustering and reinforcement learning addresses key gaps in the existing literature. Moreover, we introduce a novel strategy to predict item ratings within the RL framework. The proposed algorithm is evaluated on three movie-domain datasets, namely ML100K, ML-latest-small and FilmTrust. These diverse datasets were chosen to ensure reliable examination across various scenarios. Given the dynamic nature of RL, evaluation metrics such as personalization, diversity, intra-list similarity and novelty are used to measure the diversity of recommendations. This investigation is motivated by the need for recommender systems that can dynamically adjust to changes in customer preferences. Results show that the proposed algorithm performs promisingly when compared with existing state-of-the-art recommendation techniques.
Affiliation(s)
- Muhammad Waqar
  - Department of Software Engineering, University of Engineering and Technology, Taxila, Pakistan
- Mubbashir Ayub
  - Department of Software Engineering, University of Engineering and Technology, Taxila, Pakistan
|
2
|
Zhang J, Wei X, Zhao C, Yang H. Protocol to infer and analyze miRNA sponge modules in heterogeneous data using miRSM 2.0. STAR Protoc 2024; 5:103317. [PMID: 39292559 PMCID: PMC11424997 DOI: 10.1016/j.xpro.2024.103317] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2024] [Revised: 08/06/2024] [Accepted: 08/23/2024] [Indexed: 09/20/2024]
Abstract
MicroRNA (miRNA) sponges synergistically modulate physiological and pathological processes in the form of modules or clusters. Here, we present a protocol for inferring and analyzing miRNA sponge modules in heterogeneous data using the R package miRSM 2.0. We describe steps for identifying gene modules, inferring miRNA sponge modules at multi-sample and single-sample levels, and performing modular analysis. From the perspective of computational biology, miRSM 2.0 has the potential to advance our understanding of the role of miRNA sponges in diseases. For complete details on the use and execution of this protocol, please refer to Zhang et al.1,2,3.
Affiliation(s)
- Junpeng Zhang
  - School of Engineering, Dali University, Yunnan 671003, China
- Xuemei Wei
  - School of Engineering, Dali University, Yunnan 671003, China
- Chunwen Zhao
  - School of Engineering, Dali University, Yunnan 671003, China
- Haolin Yang
  - School of Engineering, Dali University, Yunnan 671003, China
|
3
|
Pividori M, Sadeeq S, Krishnan A, Stranger BE, Gignoux CR. Uncovering hidden gene-trait patterns through biclustering analysis of the UK Biobank. bioRxiv: the preprint server for biology 2024:2024.11.08.622657. [PMID: 39605717 PMCID: PMC11601405 DOI: 10.1101/2024.11.08.622657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
The growing availability of genome-wide association studies (GWAS) and large-scale biobanks provides an unprecedented opportunity to explore the genetic basis of complex traits and diseases. However, with this vast amount of data comes the challenge of interpreting numerous associations across thousands of traits, especially given the high polygenicity and pleiotropy underlying complex phenotypes. Traditional clustering methods, which identify global patterns in data, lack the resolution to capture overlapping associations relevant to subsets of traits or genes. Consequently, there is a critical need for innovative analytic approaches capable of revealing local, biologically meaningful patterns that could advance our understanding of trait comorbidities and gene-trait interactions. Here, we applied BiBit, a biclustering algorithm, to transcriptome-wide association study (TWAS) results from PhenomeXcan, a large resource of gene-trait associations derived from the UK Biobank. BiBit allows simultaneous grouping of traits and genes, identifying biclusters that represent local, overlapping associations. Our analyses uncovered biologically interpretable patterns, including asthma-related biclusters enriched for immune-related gene sets, connections between eye traits and blood pressure, and associations between dietary traits, high cholesterol, and specific loci on chromosome 19. These biclusters highlight gene-trait relationships and patterns of trait co-occurrence that may otherwise be obscured by traditional methods. Our findings demonstrate that biclustering can provide a nuanced view of the genetic architecture of complex traits, offering insights into pleiotropy and disease mechanisms. By enabling the exploration of complex, overlapping patterns within biobank-scale datasets, this approach provides a valuable framework for advancing research on genetic associations, comorbidities, and polygenic traits.
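The idea of biclusters as local, overlapping groups in a binary gene-trait matrix can be illustrated with a small sketch. This is not the BiBit implementation and not the analysis from the preprint: it is a simplified, pattern-based search written for illustration, and the function name `pattern_biclusters`, its parameters, and the toy matrix are all assumptions. Each pair of rows is intersected with a bitwise AND to obtain a candidate column pattern, and every row containing that pattern joins the bicluster.

```python
from itertools import combinations


def pattern_biclusters(rows, min_rows=2, min_cols=2):
    """Find all-ones submatrices seeded by pairs of rows (illustrative only).

    rows: list of equal-length lists of 0/1 values.
    Returns a set of (row_indices, column_indices) tuples.
    """
    n_cols = len(rows[0])
    # Encode each row as an integer bit mask: bit j is set when column j is 1.
    masks = [sum(1 << j for j, v in enumerate(r) if v) for r in rows]
    found = set()
    for a, b in combinations(range(len(rows)), 2):
        # Columns shared by the two seed rows form the candidate pattern.
        pattern = masks[a] & masks[b]
        cols = tuple(j for j in range(n_cols) if (pattern >> j) & 1)
        if len(cols) < min_cols:
            continue
        # Every row that contains the whole pattern belongs to the bicluster.
        members = tuple(i for i, m in enumerate(masks) if (m & pattern) == pattern)
        if len(members) >= min_rows:
            found.add((members, cols))
    return found


# Hypothetical toy data: rows stand in for genes, columns for traits.
toy = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
]
biclusters = pattern_biclusters(toy)
```

On the toy matrix, rows 0 and 3 appear in all three biclusters found, which is the kind of overlap that a global clustering of rows alone cannot express.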
Affiliation(s)
- Milton Pividori
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA; Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Suraju Sadeeq
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA
- Arjun Krishnan
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA; Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Barbara E. Stranger
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA
- Christopher R. Gignoux
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, USA; Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
|
4
|
Castanho EN, Aidos H, Madeira SC. Biclustering data analysis: a comprehensive survey. Brief Bioinform 2024; 25:bbae342. [PMID: 39007596 PMCID: PMC11247412 DOI: 10.1093/bib/bbae342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 05/16/2024] [Accepted: 07/01/2024] [Indexed: 07/16/2024]
Abstract
Biclustering, the simultaneous clustering of rows and columns of a data matrix, has proved its effectiveness in bioinformatics due to its capacity to produce local instead of global models, evolving from a key technique used in gene expression data analysis into one of the most used approaches for pattern discovery and identification of biological modules, used in both descriptive and predictive learning tasks. This survey presents a comprehensive overview of biclustering. It proposes an updated taxonomy for its fundamental components (bicluster, biclustering solution, biclustering algorithms, and evaluation measures) and applications. We unify scattered concepts in the literature with new definitions to accommodate the diversity of data types (such as tabular, network, and time series data) and the specificities of biological and biomedical data domains. We further propose a pipeline for biclustering data analysis and discuss practical aspects of incorporating biclustering in real-world applications. We highlight prominent application domains, particularly in bioinformatics, and identify typical biclusters to illustrate the analysis output. Moreover, we discuss important aspects to consider when choosing, applying, and evaluating a biclustering algorithm. We also relate biclustering with other data mining tasks (clustering, pattern mining, classification, triclustering, N-way clustering, and graph mining). Thus, it provides theoretical and practical guidance on biclustering data analysis, demonstrating its potential to uncover actionable insights from complex datasets.
Affiliation(s)
- Eduardo N Castanho
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Helena Aidos
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Sara C Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
|
5
|
Castanho EN, Lobo JP, Henriques R, Madeira SC. G-bic: generating synthetic benchmarks for biclustering. BMC Bioinformatics 2023; 24:457. [PMID: 38053078 PMCID: PMC10698934 DOI: 10.1186/s12859-023-05587-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/29/2023] [Accepted: 11/28/2023] [Indexed: 12/07/2023]
Abstract
BACKGROUND: Biclustering is increasingly used in biomedical data analysis, recommendation tasks, and text mining domains, with hundreds of biclustering algorithms proposed. When assessing the performance of these algorithms, more than real datasets are required as they do not offer a solid ground truth. Synthetic data surpass this limitation by producing reference solutions to be compared with the found patterns. However, generating synthetic datasets is challenging since the generated data must ensure reproducibility, pattern representativity, and real data resemblance.
RESULTS: We propose G-Bic, a dataset generator conceived to produce synthetic benchmarks for the normative assessment of biclustering algorithms. Beyond expanding on aspects of pattern coherence, data quality, and positioning properties, it further handles specificities related to mixed-type datasets and time-series data. G-Bic has the flexibility to replicate real data regularities from diverse domains. We provide the default configurations to generate reproducible benchmarks to evaluate and compare diverse aspects of biclustering algorithms. Additionally, we discuss empirical strategies to simulate the properties of real data.
CONCLUSION: G-Bic is a parametrizable generator for biclustering analysis, offering a solid means to robustly assess biclustering solutions according to internal and external metrics.
Affiliation(s)
- Eduardo N Castanho
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
- João P Lobo
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
- Rui Henriques
  - INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1900-001, Lisbon, Portugal
- Sara C Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
|
6
|
Sriwastava BK, Halder AK, Basu S, Chakraborti T. RUBic: rapid unsupervised biclustering. BMC Bioinformatics 2023; 24:435. [PMID: 37974081 PMCID: PMC10655409 DOI: 10.1186/s12859-023-05534-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Accepted: 10/16/2023] [Indexed: 11/19/2023]
Abstract
Biclustering of biologically meaningful binary information is essential in many applications related to drug discovery, like protein-protein interactions and gene expressions. However, for robust performance in recently emerging large health datasets, it is important for new biclustering algorithms to be scalable and fast. We present a rapid unsupervised biclustering (RUBic) algorithm that achieves this objective with a novel encoding and search strategy. RUBic significantly reduces the computational overhead on both synthetic and experimental datasets with respect to several state-of-the-art biclustering algorithms. In 100 synthetic binary datasets, our method took [Formula: see text] s to extract 494,872 biclusters. In the human PPI database of size [Formula: see text], our method generates 1840 biclusters in [Formula: see text] s. On a central nervous system embryonic tumor gene expression dataset of size 712,940, our algorithm takes 101 min to produce 747,069 biclusters, while the recent competing algorithms take significantly more time to produce the same result. RUBic is also evaluated on five different gene expression datasets and shows a significant speed-up in execution time with respect to existing approaches when extracting significant KEGG-enriched biclusters. RUBic can operate in two modes, base and flex: base mode generates maximal biclusters, whereas flex mode generates fewer biclusters, and does so faster, based on their biological significance with respect to KEGG pathways. The code is available at https://github.com/CMATERJU-BIOINFO/RUBic for academic use only.
Affiliation(s)
- Brijesh K Sriwastava
  - Computer Science and Engineering Department, Government College of Engineering and Leather Technology, Kolkata, India
- Anup Kumar Halder
  - Faculty of Mathematics and Information Sciences, Warsaw University of Technology, Warsaw, Poland
  - CeNT, University of Warsaw, Warsaw, Poland
- Subhadip Basu
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
|
7
|
Chu HM, Kong XZ, Liu JX, Zheng CH, Zhang H. A New Binary Biclustering Algorithm Based on Weight Adjacency Difference Matrix for Analyzing Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:2802-2809. [PMID: 37285246 DOI: 10.1109/tcbb.2023.3283801] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Biclustering algorithms are essential for processing gene expression data. However, to process the dataset, most biclustering algorithms require preprocessing the data matrix into a binary matrix. Regrettably, this type of preprocessing may introduce noise or cause information loss in the binary matrix, which would reduce the biclustering algorithm's ability to effectively obtain the optimal biclusters. In this paper, we propose a new preprocessing method named Mean-Standard Deviation (MSD) to resolve the problem. Additionally, we introduce a new biclustering algorithm called Weight Adjacency Difference Matrix Binary Biclustering (W-AMBB) to effectively process datasets containing overlapping biclusters. The basic idea is to create a weighted adjacency difference matrix by applying weights to a binary matrix that is derived from the data matrix. This allows us to identify genes with significant associations in sample data by efficiently identifying similar genes that respond to specific conditions. Furthermore, the performance of the W-AMBB algorithm was tested on both synthetic and real datasets and compared with other classical biclustering methods. The experiment results demonstrate that the W-AMBB algorithm is significantly more robust than the compared biclustering methods on the synthetic dataset. Additionally, the results of the GO enrichment analysis show that the W-AMBB method possesses biological significance on real datasets.
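The binarization step that the abstract refers to can be sketched as follows. The abstract does not define the MSD rule, so this is only one plausible mean-and-standard-deviation threshold, assumed here for illustration: an entry becomes 1 when it exceeds its row mean plus `k` row standard deviations. The function name `binarize_rows` and the parameter `k` are hypothetical and should not be read as the published method.

```python
from statistics import mean, pstdev


def binarize_rows(matrix, k=1.0):
    """Binarize each row with an assumed threshold of mean + k * std.

    This is an illustrative rule, not the MSD definition from the paper.
    """
    binary = []
    for row in matrix:
        # Population standard deviation of the row's expression values.
        threshold = mean(row) + k * pstdev(row)
        binary.append([1 if value > threshold else 0 for value in row])
    return binary


# Hypothetical expression values: rows are genes, columns are conditions.
expression = [
    [1.0, 1.0, 1.0, 9.0],
    [2.0, 4.0, 6.0, 8.0],
]
strict = binarize_rows(expression)          # k = 1.0
lenient = binarize_rows(expression, k=0.0)  # threshold at the row mean
```

With `k = 1.0` both rows map to the same binary pattern even though their profiles differ, while `k = 0.0` separates them. This is the kind of information loss from thresholding that the abstract describes.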
|
8
|
Chu HM, Liu JX, Zhang K, Zheng CH, Wang J, Kong XZ. A binary biclustering algorithm based on the adjacency difference matrix for gene expression data analysis. BMC Bioinformatics 2022; 23:381. [PMID: 36123637 PMCID: PMC9484244 DOI: 10.1186/s12859-022-04842-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/07/2022] [Accepted: 07/14/2022] [Indexed: 11/20/2022]
Abstract
Biclustering algorithms are an effective tool for processing gene expression datasets. Biclustering methods process two kinds of data matrices: binary and non-binary. A binary matrix is usually converted from pre-processed gene expression data, which can effectively reduce the interference from noise and abnormal data, and is then processed using a biclustering algorithm. However, biclustering algorithms for binary data strike a poor balance between running time and performance. In this paper, we propose a new biclustering algorithm for binary data, called the Adjacency Difference Matrix Binary Biclustering algorithm (AMBB), to address this drawback. The AMBB algorithm constructs the adjacency matrix based on the adjacency difference values, and the submatrix obtained by continuously updating the adjacency difference matrix is called a bicluster. The adjacency matrix allows genes that undergo similar reactions under different conditions to be grouped into clusters, which is important for subsequent gene analysis. Meanwhile, experiments on synthetic and real datasets visually demonstrate that the AMBB algorithm has high practicability.
Affiliation(s)
- He-Ming Chu
  - School of Computer Science, Qufu Normal University, Rizhao, 276826, China
- Jin-Xing Liu
  - School of Computer Science, Qufu Normal University, Rizhao, 276826, China
- Ke Zhang
  - Department of Oncology, Rizhao People's Hospital, Rizhao, 276826, China
- Chun-Hou Zheng
  - School of Computer Science, Qufu Normal University, Rizhao, 276826, China
- Juan Wang
  - School of Computer Science, Qufu Normal University, Rizhao, 276826, China
- Xiang-Zhen Kong
  - School of Computer Science, Qufu Normal University, Rizhao, 276826, China
|
9
|
Chang H, Zhang H, Zhang T, Su L, Qin QM, Li G, Li X, Wang L, Zhao T, Zhao E, Zhao H, Liu Y, Stacey G, Xu D. A Multi-Level Iterative Bi-Clustering Method for Discovering miRNA Co-regulation Network of Abiotic Stress Tolerance in Soybeans. Frontiers in Plant Science 2022; 13:860791. [PMID: 35463453 PMCID: PMC9021755 DOI: 10.3389/fpls.2022.860791] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/23/2022] [Accepted: 02/24/2022] [Indexed: 06/14/2023]
Abstract
Although growing evidence shows that microRNA (miRNA) regulates plant growth and development, miRNA regulatory networks in plants are not well understood. Current experimental studies cannot characterize miRNA regulatory networks on a large scale. This information gap provides an excellent opportunity to employ computational methods for global analysis and generate valuable models and hypotheses. To address this opportunity, we collected miRNA-target interactions (MTIs) and used MTIs from Arabidopsis thaliana and Medicago truncatula to predict homologous MTIs in soybeans, resulting in 80,235 soybean MTIs in total. A multi-level iterative bi-clustering method was developed to identify 483 soybean miRNA-target regulatory modules (MTRMs). Furthermore, we collected soybean miRNA expression data and corresponding gene expression data in response to abiotic stresses. By clustering these data, 37 MTRMs related to abiotic stresses were identified, including stress-specific MTRMs and shared MTRMs. These MTRMs have gene ontology (GO) enrichment in resistance response, iron transport, positive growth regulation, etc. Our study predicts soybean MTRMs and miRNA-GO networks under different stresses, and provides miRNA targeting hypotheses for experimental analyses. The method can be applied to other biological processes and other plants to elucidate miRNA co-regulation mechanisms.
Affiliation(s)
- Haowu Chang
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Hao Zhang
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Tianyue Zhang
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Lingtao Su
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, China
- Qing-Ming Qin
  - College of Plant Sciences and Key Laboratory of Zoonosis Research, Ministry of Education, Jilin University, Jilin, China
- Guihua Li
  - College of Plant Sciences and Key Laboratory of Zoonosis Research, Ministry of Education, Jilin University, Jilin, China
- Xueqing Li
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Li Wang
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Tianheng Zhao
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Enshuang Zhao
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Hengyi Zhao
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
- Yuanning Liu
  - Key Laboratory of Symbol Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Jilin, China
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Gary Stacey
  - Division of Plant Sciences and Technology, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Dong Xu
  - Department of Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
|
10
|
Liu H, Zou J, Ravishanker N. Biclustering high‐frequency financial time series based on information theory. Stat Anal Data Min 2022. [DOI: 10.1002/sam.11581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Affiliation(s)
- Haitao Liu
  - Data Science Program, Worcester Polytechnic Institute, Worcester, MA, USA
- Jian Zou
  - Data Science Program, Worcester Polytechnic Institute, Worcester, MA, USA
  - Department of Mathematical Sciences, Worcester Polytechnic Institute, Worcester, MA, USA
|
11
|
Qian S, Liu H, Yuan X, Wei W, Chen S, Yan H. Row and Column Structure-Based Biclustering for Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1117-1129. [PMID: 32894722 DOI: 10.1109/tcbb.2020.3022085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Due to the development of high-throughput technologies for gene analysis, the biclustering method has attracted much attention. However, existing methods have problems with high time and space complexity. This paper proposes a biclustering method, called Row and Column Structure-based Biclustering (RCSBC), with low time and space complexity to find checkerboard patterns within microarray data. First, the paper describes the structure of a bicluster by using the structure of rows and columns. Second, the paper chooses the representative rows and columns with two algorithms. Finally, the gene expression data are biclustered on the space spanned by the representative rows and columns. To the best of our knowledge, this paper is the first to exploit the relationship between the row/column structure of a gene expression matrix and the structure of biclusters. Both synthetic datasets and real-life gene expression datasets are used to validate the effectiveness of our method. The experimental results show that RCSBC outperforms the state-of-the-art algorithms in both clustering accuracy and time/space complexity. This study offers new insights into biclustering large-scale gene expression data without loading the whole dataset into memory.
|
12
|
Baruah B, Dutta MP, Bhattacharyya DK. Identification of ESCC potential biomarkers using biclustering algorithms. Gene Reports 2022. [DOI: 10.1016/j.genrep.2022.101563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/07/2023]
|
13
|
Mandal K, Sarmah R, Bhattacharyya DK, Kalita JK, Borah B. Rank-preserving biclustering algorithm: a case study on miRNA breast cancer. Med Biol Eng Comput 2021; 59:989-1004. [PMID: 33840048 DOI: 10.1007/s11517-020-02271-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/19/2019] [Accepted: 09/15/2020] [Indexed: 10/21/2022]
Abstract
Effective biomarkers aid in the early diagnosis and monitoring of breast cancer and thus play an important role in the treatment of patients suffering from the disease. Growing evidence indicates that alteration of expression levels of miRNA is one of the principal causes of cancer. We analyze breast cancer miRNA data to discover a list of biclusters as well as breast cancer miRNA biomarkers, which can help to better understand this critical disease and to make important clinical decisions for treatment and diagnosis. In this paper, we propose a pattern-based parallel biclustering algorithm termed Rank-Preserving Biclustering (RPBic). The key strategy is to identify rank-preserved rows under a subset of columns based on a modified version of the all substrings common subsequence (ALCS) framework. To illustrate the effectiveness of the RPBic algorithm, we consider synthetic datasets and show that RPBic outperforms relevant biclustering algorithms in terms of relevance and recovery. For breast cancer data, we identify 68 biclusters and establish that they have strong clinical characteristics among the samples. The differentially co-expressed miRNAs are found to be involved in KEGG cancer-related pathways. Moreover, we identify frequency-based biomarkers (hsa-miR-410, hsa-miR-483-5p) and network-based biomarkers (hsa-miR-454, hsa-miR-137), which we validate to have strong connectivity with breast cancer. The source code and the datasets used can be found at http://agnigarh.tezu.ernet.in/~rosy8/Bioinformatics_RPBic_Data.rar .
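The notion of rank-preserved rows under a subset of columns can be made concrete with a short sketch. It is not RPBic and does not use the ALCS framework: the column subset is supplied by hand instead of being discovered, and the helper names `rank_signature` and `rank_preserved_rows` are assumptions made for illustration. Rows are grouped when their values induce the same ordering of the chosen columns, regardless of magnitude.

```python
def rank_signature(row, cols):
    """Return the chosen columns ordered by the row's values.

    Ties are broken by column index so the signature is deterministic.
    """
    return tuple(sorted(cols, key=lambda j: (row[j], j)))


def rank_preserved_rows(matrix, cols):
    """Group row indices whose values rank the columns in `cols` identically."""
    groups = {}
    for i, row in enumerate(matrix):
        groups.setdefault(rank_signature(row, cols), []).append(i)
    # Keep only groups with at least two rows; a single row is not a pattern.
    return [rows for rows in groups.values() if len(rows) > 1]


# Hypothetical expression values: rows are miRNAs, columns are samples.
data = [
    [1, 5, 3, 9],
    [10, 50, 30, 2],
    [7, 2, 4, 1],
    [0.1, 0.9, 0.5, 0.0],
]
on_three_columns = rank_preserved_rows(data, [0, 1, 2])
on_all_columns = rank_preserved_rows(data, [0, 1, 2, 3])
```

Rows 0, 1 and 3 share one ordering on the first three columns despite very different scales, but adding the fourth column splits row 0 from the group. This shows why such patterns are sought on column subsets and not on the full matrix.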
Affiliation(s)
- Koyel Mandal
  - Department of Computer Science and Engineering, Tezpur University, Assam, India
- Rosy Sarmah
  - Department of Computer Science and Engineering, Tezpur University, Assam, India
- Jugal Kumar Kalita
  - Department of Computer Science, University of Colorado, Colorado Springs, CO, USA
- Bhogeswar Borah
  - Department of Computer Science and Engineering, Tezpur University, Assam, India
|
14
|
Zolotareva O, Khakabimamaghani S, Isaeva OI, Chervontseva Z, Savchik A, Ester M. Identification of differentially expressed gene modules in heterogeneous diseases. Bioinformatics 2020; 37:1691-1698. [PMID: 33325506 DOI: 10.1093/bioinformatics/btaa1038] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/05/2020] [Revised: 11/25/2020] [Accepted: 12/02/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Identification of differentially expressed genes is necessary for unraveling disease pathogenesis. This task is complicated by the fact that many diseases are heterogeneous at the molecular level and samples representing distinct disease subtypes may demonstrate different patterns of dysregulation. Biclustering methods are capable of identifying genes that follow a similar expression pattern only in a subset of samples and hence can consider disease heterogeneity. However, identifying biologically significant and reproducible sets of genes and samples remains challenging for the existing tools. Many recent studies have shown that the integration of gene expression and protein interaction data improves the robustness of prediction and classification and advances biomarker discovery.
RESULTS: Here we present DESMOND, a new method for identification of Differentially ExpreSsed gene MOdules iN Diseases. DESMOND performs network-constrained biclustering on gene expression data and identifies gene modules - connected sets of genes up- or down-regulated in subsets of samples. We applied DESMOND on expression profiles of samples from two large breast cancer cohorts and have shown that the capability of DESMOND to incorporate protein interactions allows identifying the biologically meaningful gene and sample subsets and improves the reproducibility of the results.
AVAILABILITY: https://github.com/ozolotareva/DESMOND.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Olga Zolotareva
- International Research Training Group "Computational Methods for the Analysis of the Diversity and Dynamics of Genomes" and Genome Informatics, Faculty of Technology and Center for Biotechnology, Bielefeld University, Germany; Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
| | | | - Olga I Isaeva
- Center of Life Sciences, Skolkovo Institute of Science and Technology, Russia; BostonGene LLC, Lincoln, Massachusetts, USA; Divisions of Molecular Oncology & Immunology, Tumor Biology & Immunology, and Molecular Carcinogenesis, The Netherlands Cancer Institute, Amsterdam, the Netherlands
| | - Zoe Chervontseva
- Center of Life Sciences, Skolkovo Institute of Science and Technology, Russia; A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences (RAS), Moscow, Russia
| | - Alexey Savchik
- A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences (RAS), Moscow, Russia
| | - Martin Ester
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada; Vancouver Prostate Centre, Vancouver, BC, Canada
| |
|
15
|
Zhang T, Chang H, Zhang B, Liu S, Zhao T, Zhao E, Zhao H, Zhang H. Transboundary Pathogenic microRNA Analysis Framework for Crop Fungi Driven by Biological Big Data and Artificial Intelligence Model. Comput Biol Chem 2020; 89:107401. [PMID: 33068919 DOI: 10.1016/j.compbiolchem.2020.107401] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2020] [Revised: 09/19/2020] [Accepted: 10/05/2020] [Indexed: 12/13/2022]
Abstract
Plant fungal diseases such as rice blast, tomato gray mold and potato late blight have long affected the world's agricultural production and economy. Recent studies have shown that fungal pathogens transmit microRNA as an effector to host plants for infection. However, bioassay-based verification is time-consuming and challenging, and it is difficult to analyze from a global perspective. With the accumulation of fungal and plant-related data, data analysis methods can be used to analyze pathogenic fungal microRNA further. Based on microRNA expression data of fungal pathogens before and after infecting plants, this paper discusses the selection strategy for sample data, the extraction strategy for pathogenic fungal microRNA, the prediction strategy for fungal pathogenic microRNA target genes, the bicluster-based functional analysis strategy for fungal pathogenic microRNA, and experimental verification methods. A general analysis pipeline based on machine learning and a bicluster-based function module is proposed for plant-fungal pathogenic microRNA. The pipeline is applied to the infection process of Magnaporthe oryzae and the infection process of potato late blight, and the results verify its feasibility. It can be extended to other relevant crop pathogen research, providing a new idea for fungal research on plant diseases, and can be used as a reference for understanding the interaction between fungi and plants.
Affiliation(s)
- Tianyue Zhang
- College of Computer Science and Technology, Jilin University, China
| | - Haowu Chang
- College of Computer Science and Technology, Jilin University, China
| | - Borui Zhang
- Columbia Independent School, Columbia, MO, USA
| | - Sifei Liu
- College of Computer Science and Technology, Jilin University, China
| | - Tianheng Zhao
- College of Computer Science and Technology, Jilin University, China
| | - Enshuang Zhao
- College of Computer Science and Technology, Jilin University, China
| | - Hengyi Zhao
- College of Computer Science and Technology, Jilin University, China
| | - Hao Zhang
- College of Computer Science and Technology, Jilin University, China.
| |
|
16
|
Li Z, Chang C, Kundu S, Long Q. Bayesian generalized biclustering analysis via adaptive structured shrinkage. Biostatistics 2020; 21:610-624. [PMID: 30596887 PMCID: PMC7307984 DOI: 10.1093/biostatistics/kxy081] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2018] [Revised: 09/18/2018] [Accepted: 11/21/2018] [Indexed: 12/13/2022] Open
Abstract
Biclustering techniques can identify local patterns of a data matrix by clustering feature space and sample space at the same time. Various biclustering methods have been proposed and successfully applied to the analysis of gene expression data. While existing biclustering methods have many desirable features, most of them are developed for continuous data and few of them can efficiently handle -omics data of various types, for example, binomial data as in single nucleotide polymorphism data or negative binomial data as in RNA-seq data. In addition, none of the existing methods can utilize biological information such as that from functional genomics or proteomics. Recent work has shown that incorporating biological information can improve variable selection and prediction performance in analyses such as linear regression and multivariate analysis. In this article, we propose a novel Bayesian biclustering method that can handle multiple data types including Gaussian, Binomial, and Negative Binomial. In addition, our method uses a Bayesian adaptive structured shrinkage prior that enables feature selection guided by existing biological information. Our simulation studies and application to multi-omics datasets demonstrate robust and superior performance of the proposed method, compared to other existing biclustering methods.
Affiliation(s)
- Ziyi Li
- Department of Biostatistics and Bioinformatics, Emory University, 1518 Clifton Road, NE, Atlanta, GA, USA
| | - Changgee Chang
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, PA, USA
| | - Suprateek Kundu
- Department of Biostatistics and Bioinformatics, Emory University, 1518 Clifton Road, NE, Atlanta, GA, USA
| | - Qi Long
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, PA, USA
| |
|
17
|
Knowledge Visualizations to Inform Decision Making for Improving Food Accessibility and Reducing Obesity Rates in the United States. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2020; 17:ijerph17041263. [PMID: 32079089 PMCID: PMC7068274 DOI: 10.3390/ijerph17041263] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/03/2020] [Revised: 01/26/2020] [Accepted: 02/07/2020] [Indexed: 12/15/2022]
Abstract
The aim of this article is to promote the use of knowledge visualization frameworks in the creation and transfer of complex public health knowledge. The accessibility to healthy food items is an example of complex public health knowledge. The United States Department of Agriculture Food Access Research Atlas (FARA) dataset contains 147 variables for 72,864 census tracts and includes 16 food accessibility variables with binary values (0 or 1). Using four-digit and 16-digit binary patterns, we have developed data analytical procedures to group the 72,864 U.S. census tracts into eight and forty groups respectively. This value-added FARA dataset facilitated the design and production of interactive knowledge visualizations that have a collective purpose of knowledge transfer and specific functions including new insights on food accessibility and obesity rates in the United States. The knowledge visualizations of the binary patterns could serve as an integrated explanation and prediction system to help answer why and what-if questions on food accessibility, nutritional inequality and nutrition therapy for diabetic care at varying geographic units. In conclusion, the approach of knowledge visualizations could inform coordinated multi-level decision making for improving food accessibility and reducing chronic diseases in locations defined by patterns of food access measures.
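The pattern-based grouping described in this abstract can be illustrated with a minimal Python sketch. The flag names, tract identifiers and values below are hypothetical and are not taken from the FARA dataset; the sketch only shows how records that share the same binary pattern end up in one group.

```python
from collections import defaultdict


def group_by_pattern(records, flag_names):
    """Group record ids by the binary pattern formed from the chosen 0/1 flags."""
    groups = defaultdict(list)
    for rec_id, flags in records.items():
        # Concatenate the selected flags into a string such as "1001".
        pattern = "".join(str(flags[name]) for name in flag_names)
        groups[pattern].append(rec_id)
    return dict(groups)


# Hypothetical toy data: three tracts, four binary accessibility flags.
flag_names = ["a", "b", "c", "d"]
records = {
    "tract1": {"a": 1, "b": 0, "c": 0, "d": 1},
    "tract2": {"a": 1, "b": 0, "c": 0, "d": 1},
    "tract3": {"a": 0, "b": 0, "c": 0, "d": 0},
}
groups = group_by_pattern(records, flag_names)
```

With four flags there are at most 16 possible patterns, so the number of groups that actually appears depends on which combinations occur in the data.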
|
18
|
Yoon S, Nguyen HCT, Jo W, Kim J, Chi SM, Park J, Kim SY, Nam D. Biclustering analysis of transcriptome big data identifies condition-specific microRNA targets. Nucleic Acids Res 2019; 47:e53. [PMID: 30820547 PMCID: PMC6511842 DOI: 10.1093/nar/gkz139] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2018] [Accepted: 02/19/2019] [Indexed: 12/26/2022] Open
Abstract
We present a novel approach to identify human microRNA (miRNA) regulatory modules (mRNA targets and relevant cell conditions) by biclustering a large collection of mRNA fold-change data for sequence-specific targets. Bicluster targets were assessed using validated messenger RNA (mRNA) targets and exhibited on average 17.0% (median 19.4%) improved gain in certainty (sensitivity + specificity). The net gain was further increased up to 32.0% (median 33.4%) by incorporating functional networks of targets. We analyzed cancer-specific biclusters and found that the PI3K/Akt signaling pathway is strongly enriched with targets of a few miRNAs in breast cancer and diffuse large B-cell lymphoma. Indeed, five independent prognostic miRNAs were identified, and repression of bicluster targets and pathway activity by miR-29 was experimentally validated. In total, 29 898 biclusters for 459 human miRNAs were collected in the BiMIR database where biclusters are searchable for miRNAs, tissues, diseases, keywords and target genes.
Affiliation(s)
- Sora Yoon
- School of Life Sciences, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
| | - Hai C T Nguyen
- School of Life Sciences, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
| | - Woobeen Jo
- School of Life Sciences, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
| | - Jinhwan Kim
- School of Life Sciences, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
| | - Sang-Mun Chi
- School of Computer Science and Engineering, Kyungsung University, Busan 48434, Republic of Korea
| | - Jiyoung Park
- School of Life Sciences, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
| | - Seon-Young Kim
- Department of Functional Genomics, University of Science and Technology (UST), Daejeon 34141, Republic of Korea; Genome Editing Research Center, Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon 34141, Republic of Korea
| | - Dougu Nam
- School of Life Sciences, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea; Department of Mathematical Sciences, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
| |
|
19
|
Maâtouk O, Ayadi W, Bouziri H, Duval B. Evolutionary biclustering algorithms: an experimental study on microarray data. Soft comput 2019. [DOI: 10.1007/s00500-018-3394-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
20
|
|
21
|
González-Domínguez J, Expósito RR. ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems. PLoS One 2018; 13:e0194361. [PMID: 29608567 PMCID: PMC5880350 DOI: 10.1371/journal.pone.0194361] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2017] [Accepted: 03/01/2018] [Indexed: 11/18/2022] Open
Abstract
Biclustering techniques are gaining attention in the analysis of large-scale datasets as they identify two-dimensional submatrices where both rows and columns are correlated. In this work we present ParBiBit, a parallel tool to accelerate the search for interesting biclusters in binary datasets, which are very popular in different fields such as genetics, marketing or text mining. It is based on the state-of-the-art sequential Java tool BiBit, which has been proven accurate by several studies, especially in scenarios that result in many large biclusters. ParBiBit uses the same methodology as BiBit (grouping the binary information into patterns) and provides the same results. Nevertheless, our tool significantly improves performance thanks to an efficient implementation based on C++11 that includes support for threads and MPI processes in order to exploit the compute capabilities of modern distributed-memory systems, which provide several multicore CPU nodes interconnected through a network. Our performance evaluation with 18 representative input datasets on two different eight-node systems shows that our tool is significantly faster than the original BiBit. Source code in C++ and MPI running on Linux systems as well as a reference manual are available at https://sourceforge.net/projects/parbibit/.
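The idea of grouping binary information into patterns can be sketched in a few lines of Python. This is a simplified, sequential illustration under stated assumptions, not ParBiBit or BiBit code: it assumes that a pattern is the bitwise AND of a pair of rows and that a bicluster collects every row containing that pattern. The function name and the `min_rows` and `min_cols` thresholds are hypothetical.

```python
from itertools import combinations


def bibit_sketch(rows, min_rows=2, min_cols=2):
    """Enumerate all-ones biclusters in a binary matrix given as integer bitmasks."""
    seen = set()
    biclusters = []
    for i, j in combinations(range(len(rows)), 2):
        # Candidate pattern: columns that are 1 in both rows of the pair.
        pattern = rows[i] & rows[j]
        if bin(pattern).count("1") < min_cols or pattern in seen:
            continue
        seen.add(pattern)
        # Every row that contains the whole pattern joins the bicluster.
        members = [r for r, bits in enumerate(rows) if bits & pattern == pattern]
        if len(members) >= min_rows:
            biclusters.append((members, pattern))
    return biclusters


# Toy 4x4 binary matrix, one integer bitmask per row.
rows = [0b1101, 0b1100, 0b0111, 0b1110]
result = bibit_sketch(rows)
```

Each pair of rows is processed independently of the others, which is the kind of workload that lends itself to distribution across threads and processes.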
Affiliation(s)
| | - Roberto R. Expósito
- Grupo de Arquitectura de Computadores, Universidade da Coruña, A Coruña, Spain
| |
|
22
|
Kléma J, Malinka F, Železný F. Semantic biclustering for finding local, interpretable and predictive expression patterns. BMC Genomics 2017. [PMID: 29513193 PMCID: PMC5657082 DOI: 10.1186/s12864-017-4132-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
Background One of the major challenges in the analysis of gene expression data is to identify local patterns composed of genes showing coherent expression across subsets of experimental conditions. Such patterns may provide an understanding of underlying biological processes related to these conditions. This understanding can further be improved by providing concise characterizations of the genes and situations delimiting the pattern. Results We propose a method called semantic biclustering with the aim of detecting interpretable rectangular patterns in binary data matrices. As usual in biclustering, we seek homogeneous submatrices; however, we also require that the included elements can be jointly described in terms of semantic annotations pertaining to both rows (genes) and columns (samples). To find such interpretable biclusters, we explore two strategies. The first endows an existing biclustering algorithm with the semantic ingredients. The other is based on rule and tree learning known from machine learning. Conclusions The two alternatives are tested in experiments with two Drosophila melanogaster gene expression datasets. Both strategies are shown to detect sets of compact biclusters with semantic descriptions that also remain largely valid for unseen (testing) data. This desirable generalization aspect is more pronounced in the strategy stemming from conventional biclustering, although this is traded off against the complexity of the descriptions (number of ontology terms employed), which, on the other hand, is lower for the alternative strategy.
Affiliation(s)
- Jiří Kléma
- Department of Computer Science, Czech Technical University in Prague, Karlovo náměstí 13, 121 35, Prague 2, Czech Republic.
| | - František Malinka
- Department of Computer Science, Czech Technical University in Prague, Karlovo náměstí 13, 121 35, Prague 2, Czech Republic
| | - Filip Železný
- Department of Computer Science, Czech Technical University in Prague, Karlovo náměstí 13, 121 35, Prague 2, Czech Republic
| |
|
23
|
Martella F, Alfò M. A finite mixture approach to joint clustering of individuals and multivariate discrete outcomes. J STAT COMPUT SIM 2017. [DOI: 10.1080/00949655.2017.1322593] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
Affiliation(s)
- Francesca Martella
- Dipartimento di Scienze Statistiche, Sapienza Università di Roma, Rome, Italy
| | - Marco Alfò
- Dipartimento di Scienze Statistiche, Sapienza Università di Roma, Rome, Italy
| |
|
24
|
Padilha VA, Campello RJGB. A systematic comparative evaluation of biclustering techniques. BMC Bioinformatics 2017; 18:55. [PMID: 28114903 PMCID: PMC5259837 DOI: 10.1186/s12859-017-1487-1] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2016] [Accepted: 01/14/2017] [Indexed: 11/27/2022] Open
Abstract
BACKGROUND Biclustering techniques are capable of simultaneously clustering rows and columns of a data matrix. These techniques became very popular for the analysis of gene expression data, since a gene can take part in multiple biological pathways, which in turn can be active only under specific experimental conditions. Several biclustering algorithms have been developed in recent years. In order to provide guidance regarding their choice, a few comparative studies were conducted and reported in the literature. In these studies, however, the performances of the methods were evaluated through external measures that have more recently been shown to have undesirable properties. Furthermore, they considered a limited number of algorithms and datasets. RESULTS We conducted a broader comparative study involving seventeen algorithms, which were run on three synthetic data collections and two real data collections with a more representative number of datasets. For the experiments with synthetic data, five different experimental scenarios were studied: different levels of noise, different numbers of implanted biclusters, different levels of symmetric bicluster overlap, different levels of asymmetric bicluster overlap and different bicluster sizes, for which the results were assessed with more suitable external measures. For the experiments with real datasets, the results were assessed by gene set enrichment and clustering accuracy. CONCLUSIONS We observed that each algorithm achieved satisfactory results in some of the biclustering tasks in which it was investigated. The choice of the best algorithm for a given application thus depends on the task at hand and the types of patterns that one wants to detect.
Affiliation(s)
- Victor A. Padilha
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, SP, Brazil
| | - Ricardo J. G. B. Campello
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, SP, Brazil
- College of Science and Engineering, James Cook University, Townsville, QLD, Australia
| |
|
25
|
Moore EJ, Bourlai T. Expectation Maximization of Frequent Patterns, a Specific, Local, Pattern-Based Biclustering Algorithm for Biological Datasets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:812-824. [PMID: 26701897 DOI: 10.1109/tcbb.2015.2510011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Currently, binary biclustering algorithms are too slow and non-specific to handle biological datasets that have a large number of attributes, which is essential for the computational biology problem of microarray analysis. Specialized computers may be needed to execute an algorithm, and may fail to produce a solution, due to its large resource needs. The biclusters also include too many false positives, the type I error, which hinders biological discovery. We propose an algorithm that can analyze datasets with a large attribute set at different densities, and can operate on a laptop, which makes it accessible to practitioners. EMFP produces biclusters that have a very low Root Mean Squared Error and false positive rate, with very few type II errors. Our binary biclustering algorithm is a hybrid, axis-parallel, pattern-based algorithm that finds multiple, non-overlapping, near-constant, deterministic, binary submatrices, with a variable confidence threshold, and the novel use of local density comparisons versus the standard global threshold. EMFP introduces a new and intuitive way to calculate internal measures for binary biclustering methods. We also introduce a framework to ease comparison with other algorithms, and compare to both binary and general biclustering algorithms using two real and 80 synthetic databases.
|
26
|
López-Fernández H, Santos HM, Capelo JL, Fdez-Riverola F, Glez-Peña D, Reboiro-Jato M. Mass-Up: an all-in-one open software application for MALDI-TOF mass spectrometry knowledge discovery. BMC Bioinformatics 2015; 16:318. [PMID: 26437641 PMCID: PMC4595311 DOI: 10.1186/s12859-015-0752-4] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2015] [Accepted: 09/28/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Mass spectrometry is one of the most important techniques in the field of proteomics. MALDI-TOF mass spectrometry has become popular during the last decade due to its high speed and sensitivity for detecting proteins and peptides. MALDI-TOF-MS can also be used in combination with Machine Learning techniques and statistical methods for knowledge discovery. Although there are many software libraries and tools that can be combined for this kind of analysis, there is still a need for all-in-one solutions with user-friendly graphical interfaces that avoid the need for programming skills. RESULTS Mass-Up, an open software multiplatform application for MALDI-TOF-MS knowledge discovery, is herein presented. Mass-Up software allows data preprocessing, as well as subsequent analysis including (i) biomarker discovery, (ii) clustering, (iii) biclustering, (iv) three-dimensional PCA visualization and (v) classification of large sets of spectra data. CONCLUSIONS Mass-Up brings knowledge discovery within reach of MALDI-TOF-MS researchers. Mass-Up is distributed under license GPLv3 and it is open and free to all users at http://sing.ei.uvigo.es/mass-up.
Affiliation(s)
- H López-Fernández
- Informatics Department, Universidad de Vigo, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain; Instituto de Investigación Biomédica de Vigo (IBIV), Vigo, Pontevedra, Spain.
| | - H M Santos
- BIOSCOPE Research Group, UCIBIO-REQUIMTE, Department of Chemistry, Faculty of Science and Technology, Universidade NOVA de Lisboa, Caparica, Setubal, Portugal.
| | - J L Capelo
- BIOSCOPE Research Group, UCIBIO-REQUIMTE, Department of Chemistry, Faculty of Science and Technology, Universidade NOVA de Lisboa, Caparica, Setubal, Portugal.
| | - F Fdez-Riverola
- Informatics Department, Universidad de Vigo, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain; Instituto de Investigación Biomédica de Vigo (IBIV), Vigo, Pontevedra, Spain.
| | - D Glez-Peña
- Informatics Department, Universidad de Vigo, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain; Instituto de Investigación Biomédica de Vigo (IBIV), Vigo, Pontevedra, Spain.
| | - M Reboiro-Jato
- Informatics Department, Universidad de Vigo, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain; Instituto de Investigación Biomédica de Vigo (IBIV), Vigo, Pontevedra, Spain.
| |
|
27
|
Nepomuceno JA, Troncoso A, Aguilar-Ruiz JS. Scatter search-based identification of local patterns with positive and negative correlations in gene expression data. Appl Soft Comput 2015. [DOI: 10.1016/j.asoc.2015.06.019] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023]
|
28
|
Horta D, Campello RJGB. Similarity Measures for Comparing Biclusterings. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:942-954. [PMID: 26356865 DOI: 10.1109/tcbb.2014.2325016] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
The comparison of ordinary partitions of a set of objects is well established in the clustering literature, which comprises several studies on the analysis of the properties of similarity measures for comparing partitions. However, similarity measures for clusterings are not readily applicable to biclusterings, since each bicluster is a tuple of two sets (of rows and columns), whereas a cluster is only a single set (of rows). Some biclustering similarity measures have been defined as minor contributions in papers which primarily report on proposals and evaluation of biclustering algorithms or comparative analyses of biclustering algorithms. The consequence is that some desirable properties of such measures have been overlooked in the literature. We review 14 biclustering similarity measures. We define eight desirable properties of a biclustering measure, discuss their importance, and prove which properties each of the reviewed measures has. We show examples drawn and inspired from important studies in which several biclustering measures convey misleading evaluations due to the absence of one or more of the discussed properties. We also advocate the use of a more general comparison approach that is based on the idea of transforming the original problem of comparing biclusterings into an equivalent problem of comparing clustering partitions with overlapping clusters.
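To make the "tuple of two sets" point concrete, the following Python sketch compares two biclusters by the matrix cells they cover. The cell-level Jaccard index used here is only an illustrative choice and is not claimed to be one of the 14 measures reviewed in the paper; the function names are hypothetical.

```python
def cells(bicluster):
    """Expand a (rows, cols) bicluster into the set of matrix cells it covers."""
    rows, cols = bicluster
    return {(r, c) for r in rows for c in cols}


def cell_jaccard(b1, b2):
    """Jaccard index over covered cells; two empty biclusters count as identical."""
    a, b = cells(b1), cells(b2)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two 2x2 biclusters that share one row and one column, hence one cell.
b1 = ({0, 1}, {0, 1})
b2 = ({1, 2}, {1, 2})
score = cell_jaccard(b1, b2)
```

Comparing only the row sets would rate these two biclusters as much more similar than the cell-level view does, which is why measures designed for ordinary clusters do not transfer directly.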
|
29
|
Muñoz-Mérida A, Viguera E, Claros MG, Trelles O, Pérez-Pulido AJ. Sma3s: a three-step modular annotator for large sequence datasets. DNA Res 2014; 21:341-53. [PMID: 24501397 PMCID: PMC4131829 DOI: 10.1093/dnares/dsu001] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023] Open
Abstract
Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes.
Affiliation(s)
- Antonio Muñoz-Mérida
- Integrated Bioinformatics, National Institute for Bioinformatics, University of Málaga, Campus de Teatinos, Spain
| | - Enrique Viguera
- Cellular Biology, Genetics and Physiology Department, University of Málaga, Campus de Teatinos, Spain
| | - M Gonzalo Claros
- Molecular Biology and Biochemistry Department, University of Málaga, Campus de Teatinos, Spain
| | - Oswaldo Trelles
- Integrated Bioinformatics, National Institute for Bioinformatics, University of Málaga, Campus de Teatinos, Spain; Computer Architecture Department, University of Málaga, Campus de Teatinos, Spain
| | - Antonio J Pérez-Pulido
- Centro Andaluz de Biología del Desarrollo (CABD, UPO-CSIC-JA), Facultad de Ciencias Experimentales (Área de Genética), Universidad Pablo de Olavide, Sevilla 41013, Spain
| |
|
30
|
Bit-table based biclustering and frequent closed itemset mining in high-dimensional binary data. ScientificWorldJournal 2014; 2014:870406. [PMID: 24616651 PMCID: PMC3925583 DOI: 10.1155/2014/870406] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2013] [Accepted: 12/04/2013] [Indexed: 11/18/2022] Open
Abstract
During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is their limited applicability to very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment and is freely available to researchers.
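One way a matrix-vector product can be used on binary data is sketched below in Python. This is an illustration of the general idea only, not the paper's MATLAB implementation: multiplying the binary matrix by an itemset indicator vector counts how many of the itemset's columns each row contains, and the rows whose count equals the itemset size support the itemset. The function names and toy data are hypothetical.

```python
def mat_vec(matrix, vector):
    """Plain matrix-vector product over lists of numbers."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def supporting_rows(matrix, itemset_vector):
    """Rows that contain every column flagged in the 0/1 itemset vector."""
    size = sum(itemset_vector)
    counts = mat_vec(matrix, itemset_vector)
    return [i for i, count in enumerate(counts) if count == size]


# Toy binary data: 4 rows (transactions or genes), 4 columns (items or conditions).
data = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
]
itemset = [1, 1, 0, 0]  # the itemset made of the first two columns
rows = supporting_rows(data, itemset)
```

The returned rows together with the flagged columns form an all-ones submatrix, which is the link between frequent itemsets and binary biclusters.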
|
31
|
Das C, Maji P. Possibilistic biclustering algorithm for discovering value-coherent overlapping δ-biclusters. INT J MACH LEARN CYB 2013. [DOI: 10.1007/s13042-013-0211-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
32
|
Chen HC, Zou W, Tien YJ, Chen JJ. Identification of bicluster regions in a binary matrix and its applications. PLoS One 2013; 8:e71680. [PMID: 23940779 PMCID: PMC3733970 DOI: 10.1371/journal.pone.0071680] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2012] [Accepted: 07/09/2013] [Indexed: 11/18/2022] Open
Abstract
Biclustering has emerged as an important approach to the analysis of large-scale datasets. A biclustering technique identifies a subset of rows that exhibit similar patterns on a subset of columns in a data matrix. Many biclustering methods have been proposed, and most, if not all, algorithms are developed to detect regions of "coherence" patterns. These methods perform unsatisfactorily if the purpose is to identify biclusters of a constant level. This paper presents a two-step biclustering method to identify constant level biclusters for binary or quantitative data. This algorithm identifies the maximal dimensional submatrix such that the proportion of non-signals is less than a pre-specified tolerance δ. The proposed method has much higher sensitivity and slightly lower specificity than several prominent biclustering methods in the analysis of two synthetic datasets. It was further compared with the Bimax method for two real datasets. The proposed method was shown to be the most robust in terms of sensitivity, number of biclusters and number of serotype-specific biclusters identified. However, dichotomization using different signal level thresholds usually leads to different sets of biclusters; this also occurs in the present analysis.
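The tolerance criterion in this abstract can be checked with a short Python sketch. It covers only the acceptance test for a candidate submatrix, assuming a binary matrix in which 0 marks a non-signal and a non-empty selection of rows and columns; it does not reproduce the paper's two-step search, and the function names are hypothetical.

```python
def non_signal_proportion(matrix, rows, cols):
    """Fraction of zero cells in the submatrix picked by rows and cols (non-empty)."""
    selected = [matrix[r][c] for r in rows for c in cols]
    return selected.count(0) / len(selected)


def within_tolerance(matrix, rows, cols, delta):
    """True when the proportion of non-signals is below the tolerance delta."""
    return non_signal_proportion(matrix, rows, cols) < delta


# Toy binary matrix: 1 is a signal, 0 is a non-signal.
matrix = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
clean = within_tolerance(matrix, [0, 2], [0, 1], 0.2)     # no zeros among 4 cells
noisy = within_tolerance(matrix, [0, 1, 2], [0, 1], 0.2)  # 1 zero among 6 cells
strict = within_tolerance(matrix, [0, 1, 2], [0, 1], 0.1)
```

Raising δ admits larger but noisier submatrices, while lowering it pushes the result toward strictly constant biclusters.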
Affiliation(s)
- Hung-Chia Chen
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America
  - Graduate Institute of Biostatistics and Biostatistics Center, China Medical University, Taichung, Taiwan
- Wen Zou
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America
- Yin-Jing Tien
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
- James J. Chen
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America
  - Graduate Institute of Biostatistics and Biostatistics Center, China Medical University, Taichung, Taiwan
|
33
|
On measures of cohesiveness under dichotomous opinions: Some characterizations of approval consensus measures. Inf Sci (N Y) 2013. [DOI: 10.1016/j.ins.2013.03.061] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 11/23/2022]
|
34
|
|
35
|
Gusenleitner D, Howe EA, Bentink S, Quackenbush J, Culhane AC. iBBiG: iterative binary bi-clustering of gene sets. Bioinformatics 2012; 28:2484-92. [PMID: 22789589 PMCID: PMC3463116 DOI: 10.1093/bioinformatics/bts438] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Indexed: 01/17/2023]
Abstract
Motivation: Meta-analysis of genomics data seeks to identify genes associated with a biological phenotype across multiple datasets; however, merging data from different platforms by their features (genes) is challenging. Meta-analysis using functionally or biologically characterized gene sets simplifies data integration, is biologically intuitive, and is seen as having great potential, but it is an emerging field with few established statistical methods. Results: We transform gene expression profiles into binary gene set profiles by discretizing results of gene set enrichment analyses and apply a new iterative bi-clustering algorithm (iBBiG) to identify groups of gene sets that are coordinately associated with groups of phenotypes across multiple studies. iBBiG is optimized for meta-analysis of large numbers of diverse genomics data that may have unmatched samples. It does not require prior knowledge of the number or size of clusters. When applied to simulated data, it outperforms commonly used clustering methods, discovers overlapping clusters of diverse sizes and is robust in the presence of noise. We apply it to meta-analysis of breast cancer studies, where iBBiG extracted novel gene set-phenotype associations that predicted tumor metastases within tumor subtypes. Availability: Implemented in the Bioconductor package iBBiG. Contact: aedin@jimmy.harvard.edu
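The discretization step described in this abstract, turning gene set enrichment results into a binary gene set profile, can be sketched as follows. This is an illustrative assumption rather than the iBBiG implementation: the function name `binarize_enrichment`, the use of p-values as the enrichment score, and the 0.05 cutoff are all hypothetical choices made for the example.

```python
from typing import List, Sequence


def binarize_enrichment(p_values: Sequence[Sequence[float]],
                        alpha: float = 0.05) -> List[List[int]]:
    """Map each enrichment p-value to 1 (p < alpha) or 0 (otherwise).

    Rows are gene sets and columns are phenotypes or studies; both the
    layout and the alpha cutoff are assumptions made for illustration.
    """
    return [[1 if p < alpha else 0 for p in row] for row in p_values]


# Toy enrichment results: 3 gene sets x 3 phenotypes (made-up numbers).
enrichment = [
    [0.001, 0.200, 0.030],
    [0.700, 0.010, 0.049],
    [0.050, 0.900, 0.400],
]

binary_profile = binarize_enrichment(enrichment)
```

The resulting 0/1 matrix is the kind of input a binary bi-clustering method operates on. As the preceding entry in this list notes for dichotomized data in general, a different cutoff will usually yield a different binary matrix and therefore different biclusters.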
Affiliation(s)
- Daniel Gusenleitner
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA
|