1
Yuan W, Li Y, Han Z, Chen Y, Xie J, Chen J, Bi Z, Xi J. Evolutionary Mechanism Based Conserved Gene Expression Biclustering Module Analysis for Breast Cancer Genomics. Biomedicines 2024; 12:2086. [PMID: 39335599 PMCID: PMC11428256 DOI: 10.3390/biomedicines12092086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2024] [Revised: 08/23/2024] [Accepted: 09/02/2024] [Indexed: 09/30/2024]
Abstract
The identification of significant gene biclusters with particular expression patterns and the elucidation of functionally related genes within gene expression data has become a critical concern due to the vast amount of gene expression data generated by RNA sequencing technology. In this paper, a Conserved Gene Expression Module based on Genetic Algorithm (CGEMGA) is proposed. Breast cancer data from the TCGA database is used as the subject of this study. The p-values from Fisher's exact test are used as evaluation metrics to demonstrate the significance of different algorithms, including the Cheng and Church algorithm, CGEM algorithm, etc. In addition, the F-test is used to investigate the difference between our method and the CGEM algorithm. The computational cost of the different algorithms is further investigated by calculating the running time of each algorithm. Finally, the established driver genes and cancer-related pathways are used to validate the process. The results of 10 independent runs demonstrate that CGEMGA has a superior average p-value of 1.54 × 10⁻⁴ ± 3.06 × 10⁻⁵ compared to all other algorithms. Furthermore, our approach exhibits consistent performance across all methods. The F-test yields a p-value of 0.039, indicating a significant difference between our approach and the CGEM. Computational cost statistics also demonstrate that our approach has a significantly shorter average runtime of 5.22 × 10⁰ ± 1.65 × 10⁻¹ s compared to the other algorithms. Enrichment analysis indicates that the genes in our approach are significantly enriched for driver genes. Our algorithm is fast and robust, efficiently extracting co-expressed genes and associated co-expression condition biclusters from RNA-seq data.
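As a rough illustration of the enrichment metric named in this abstract, the sketch below computes a one-sided Fisher's exact test p-value for the overlap between a bicluster's genes and a set of driver genes, using the hypergeometric tail. The function name `fisher_enrichment_p` and all counts are hypothetical and are not taken from the paper; the authors' actual pipeline may differ.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test p-value, P(X >= k), X ~ Hypergeometric(N, K, n).

    N: number of genes in the background
    K: number of background genes that are driver genes
    n: number of genes in the bicluster
    k: number of bicluster genes that are driver genes
    """
    upper = min(n, K)            # the overlap cannot exceed either set size
    total = comb(N, n)           # all ways to draw n genes from the background
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts, for illustration only (not data from the paper).
p_value = fisher_enrichment_p(k=3, n=5, K=5, N=20)
print(f"enrichment p-value: {p_value:.4f}")
```

A smaller p-value means the overlap with the driver-gene set is less likely to arise by chance, which is how such a value can serve as a significance score for a bicluster.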
Affiliation(s)
- Wei Yuan
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou 511436, China
- Yaming Li
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou 511436, China
- Zhengpan Han
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou 511436, China
- Yu Chen
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou 511436, China
- Jinnan Xie
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou 511436, China
- Jianguo Chen
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou 511436, China
- Zhisheng Bi
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou 511436, China
- Jianing Xi
  - School of Biomedical Engineering, Guangzhou Medical University, Guangzhou 511436, China
2
Rahaman MA, Fu Z, Iraji A, Calhoun V. SpaDE: Semantic Locality Preserving Biclustering for Neuroimaging Data. Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference 2024; 2024:1-5. [PMID: 40039923 DOI: 10.1109/embc53108.2024.10782417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2025]
Abstract
The most discriminative and revealing patterns in the neuroimaging population are often confined to smaller subdivisions of the samples and features. Especially in neuropsychiatric conditions, symptoms are expressed within micro subgroups of individuals and may only underlie a subset of neurological mechanisms. As such, running a whole-population analysis yields suboptimal outcomes leading to reduced specificity and interpretability. Biclustering is a potential solution since subject heterogeneity makes one-dimensional clustering less effective in this realm. Yet, high dimensional sparse input space and semantically incoherent grouping of attributes make post hoc analysis challenging. Therefore, we propose a deep neural network called semantic locality preserving auto decoder (SpaDE), for unsupervised feature learning and biclustering. SpaDE produces coherent subgroups of subjects and neural features preserving semantic locality and enhancing neurobiological interpretability. Also, it regularizes for sparsity to improve representation learning. We employ SpaDE on human brain connectome collected from schizophrenia (SZ) and healthy control (HC) subjects. The model outperforms several state-of-the-art biclustering methods. Our method extracts modular neural communities showing significant (HC/SZ) group differences in distinct brain networks including visual, sensorimotor, and subcortical. Moreover, these bi-clustered connectivity substructures exhibit substantial relations with various cognitive measures such as attention, working memory, and visual learning.
3
Castanho EN, Aidos H, Madeira SC. Biclustering data analysis: a comprehensive survey. Brief Bioinform 2024; 25:bbae342. [PMID: 39007596 PMCID: PMC11247412 DOI: 10.1093/bib/bbae342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 05/16/2024] [Accepted: 07/01/2024] [Indexed: 07/16/2024]
Abstract
Biclustering, the simultaneous clustering of rows and columns of a data matrix, has proved its effectiveness in bioinformatics due to its capacity to produce local instead of global models, evolving from a key technique used in gene expression data analysis into one of the most used approaches for pattern discovery and identification of biological modules, used in both descriptive and predictive learning tasks. This survey presents a comprehensive overview of biclustering. It proposes an updated taxonomy for its fundamental components (bicluster, biclustering solution, biclustering algorithms, and evaluation measures) and applications. We unify scattered concepts in the literature with new definitions to accommodate the diversity of data types (such as tabular, network, and time series data) and the specificities of biological and biomedical data domains. We further propose a pipeline for biclustering data analysis and discuss practical aspects of incorporating biclustering in real-world applications. We highlight prominent application domains, particularly in bioinformatics, and identify typical biclusters to illustrate the analysis output. Moreover, we discuss important aspects to consider when choosing, applying, and evaluating a biclustering algorithm. We also relate biclustering with other data mining tasks (clustering, pattern mining, classification, triclustering, N-way clustering, and graph mining). Thus, it provides theoretical and practical guidance on biclustering data analysis, demonstrating its potential to uncover actionable insights from complex datasets.
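To make the notion of a bicluster and an evaluation measure concrete, the following sketch treats a bicluster as a pair (row subset, column subset) of a data matrix and scores its coherence with the mean squared residue of Cheng and Church. This measure is chosen here only as a familiar example; the survey discusses many others, and the function name and toy matrix are assumptions made for illustration.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue (Cheng and Church) of the submatrix rows x cols.

    A value of 0 means the submatrix follows a perfect additive pattern
    (each row is a shifted copy of the others); larger values mean less coherence.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_r, n_c = len(rows), len(cols)
    row_means = [sum(r) / n_c for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_r)) / n_r for j in range(n_c)]
    overall = sum(row_means) / n_r
    residues = (
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_r)
        for j in range(n_c)
    )
    return sum(residues) / (n_r * n_c)


# Toy matrix (hypothetical): rows 0-2 on columns 0-2 follow an additive pattern.
data = [
    [1, 2, 3, 9],
    [2, 3, 4, 0],
    [5, 6, 7, 4],
    [7, 1, 8, 2],
]
print(mean_squared_residue(data, rows=[0, 1, 2], cols=[0, 1, 2]))
print(mean_squared_residue(data, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3]))
```

The local submatrix scores close to zero while the full matrix does not, which is the sense in which biclustering produces local rather than global models.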
Affiliation(s)
- Eduardo N Castanho
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Helena Aidos
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
- Sara C Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 16, P-1749-016 Lisbon, Portugal
4
Xu X, Zhang S, Guo J, Xin T. Biclustering of Log Data: Insights from a Computer-Based Complex Problem Solving Assessment. J Intell 2024; 12:10. [PMID: 38248908 PMCID: PMC10817361 DOI: 10.3390/jintelligence12010010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2023] [Revised: 12/17/2023] [Accepted: 01/12/2024] [Indexed: 01/23/2024]
Abstract
Computer-based assessments provide the opportunity to collect a new source of behavioral data related to the problem-solving process, known as log file data. To understand the behavioral patterns that can be uncovered from these process data, many studies have employed clustering methods. In contrast to one-mode clustering algorithms, this study utilized biclustering methods, enabling simultaneous classification of test takers and features extracted from log files. By applying the biclustering algorithms to the "Ticket" task in the PISA 2012 CPS assessment, we evaluated the potential of biclustering algorithms in identifying and interpreting homogeneous biclusters from the process data. Compared with one-mode clustering algorithms, the biclustering methods could uncover clusters of individuals who are homogeneous on a subset of feature variables, holding promise for gaining fine-grained insights into students' problem-solving behavior patterns. Empirical results revealed that specific subsets of features played a crucial role in identifying biclusters. Additionally, the study explored the utilization of biclustering on both the action sequence data and timing data, and the inclusion of time-based features enhanced the understanding of students' action sequences and scores in the context of the analysis.
Affiliation(s)
- Xin Xu
  - Collaborative Innovation Center of Assessment for Basic Education Quality, Beijing Normal University, Beijing 100875, China
- Susu Zhang
  - Departments of Psychology and Statistics, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA
- Jinxin Guo
  - College of Science, Minzu University of China, Beijing 100081, China
- Tao Xin
  - Collaborative Innovation Center of Assessment for Basic Education Quality, Beijing Normal University, Beijing 100875, China
  - School of Educational Science, Anhui Normal University, Wuhu 241000, China
5
Castanho EN, Lobo JP, Henriques R, Madeira SC. G-bic: generating synthetic benchmarks for biclustering. BMC Bioinformatics 2023; 24:457. [PMID: 38053078 PMCID: PMC10698934 DOI: 10.1186/s12859-023-05587-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/29/2023] [Accepted: 11/28/2023] [Indexed: 12/07/2023]
Abstract
BACKGROUND: Biclustering is increasingly used in biomedical data analysis, recommendation tasks, and text mining domains, with hundreds of biclustering algorithms proposed. When assessing the performance of these algorithms, more than real datasets are required as they do not offer a solid ground truth. Synthetic data surpass this limitation by producing reference solutions to be compared with the found patterns. However, generating synthetic datasets is challenging since the generated data must ensure reproducibility, pattern representativity, and real data resemblance. RESULTS: We propose G-Bic, a dataset generator conceived to produce synthetic benchmarks for the normative assessment of biclustering algorithms. Beyond expanding on aspects of pattern coherence, data quality, and positioning properties, it further handles specificities related to mixed-type datasets and time-series data. G-Bic has the flexibility to replicate real data regularities from diverse domains. We provide the default configurations to generate reproducible benchmarks to evaluate and compare diverse aspects of biclustering algorithms. Additionally, we discuss empirical strategies to simulate the properties of real data. CONCLUSION: G-Bic is a parametrizable generator for biclustering analysis, offering a solid means to assess biclustering solutions according to internal and external metrics robustly.
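The idea of a synthetic benchmark with a known reference solution can be sketched in a few lines. The code below is not G-Bic and does not use its API; it is a minimal stand-in that plants one constant bicluster in Gaussian noise with a fixed seed and scores a candidate solution against the planted one with a cell-level Jaccard index (one possible external metric). All names and parameters are hypothetical.

```python
import random


def plant_constant_bicluster(n_rows, n_cols, bic_rows, bic_cols, value=5.0, seed=0):
    """Return (data, truth): a noise matrix with one planted constant bicluster.

    The seed fixes both the noise and the bicluster position, so the
    benchmark is reproducible.
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    rows = sorted(rng.sample(range(n_rows), bic_rows))
    cols = sorted(rng.sample(range(n_cols), bic_cols))
    for i in rows:
        for j in cols:
            data[i][j] = value
    return data, (rows, cols)


def jaccard_cells(found, truth):
    """Jaccard index between the matrix cells covered by two biclusters."""
    a = {(i, j) for i in found[0] for j in found[1]}
    b = {(i, j) for i in truth[0] for j in truth[1]}
    union = a | b
    return len(a & b) / len(union) if union else 1.0


data, truth = plant_constant_bicluster(20, 10, bic_rows=4, bic_cols=3, seed=42)
# A hypothetical "found" bicluster that misses one of the planted rows.
found = (truth[0][:-1], truth[1])
print(jaccard_cells(found, truth))
```

Because the planted position is known, any algorithm's output can be compared with it directly, which is what real datasets without ground truth do not allow.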
Affiliation(s)
- Eduardo N Castanho
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
- João P Lobo
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
- Rui Henriques
  - INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1900-001, Lisbon, Portugal
- Sara C Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
6
Xiong J, Zhu H, Li X, Hao S, Zhang Y, Wang Z, Xi Q. Auto-Classification of Parkinson's Disease with Different Motor Subtypes Using Arterial Spin Labelling MRI Based on Machine Learning. Brain Sci 2023; 13:1524. [PMID: 38002484 PMCID: PMC10670033 DOI: 10.3390/brainsci13111524] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/24/2023] [Revised: 10/26/2023] [Accepted: 10/28/2023] [Indexed: 11/26/2023]
Abstract
The purpose of this study was to automatically classify different motor subtypes of Parkinson's disease (PD) on arterial spin labelling magnetic resonance imaging (ASL-MRI) data using support vector machine (SVM). This study included 38 subjects: 21 PD patients and 17 normal controls (NCs). Based on the Unified Parkinson's Disease Rating Scale (UPDRS) subscores, patients were divided into the tremor-dominant (TD) subtype and the postural instability gait difficulty (PIGD) subtype. The subjects were in a resting state during the acquisition of ASL-MRI data. The automated anatomical atlas 3 (AAL3) template was registered to obtain an ASL image of the same size and shape. We obtained the voxel values of 170 brain regions by considering the location coordinates of these regions and then normalized the data. The length of the feature vector depended on the number of voxel values in each brain region. Three binary classification models were utilized for classifying subjects' data, and we applied SVM to classify voxels in the brain regions. The left subgenual anterior cingulate cortex (ACC_sub_L) was clearly distinguished in both NCs and PD patients using SVM, and we obtained satisfactory diagnostic rates (accuracy = 92.31%, specificity = 96.97%, sensitivity = 84.21%, and AUCmax = 0.9585). For the right supramarginal gyrus (SupraMarginal_R), SVM distinguished the TD group from the other groups with satisfactory diagnostic rates (accuracy = 84.21%, sensitivity = 63.64%, specificity = 92.59%, and AUCmax = 0.9192). For the right intralaminar of thalamus (Thal_IL_R), SVM distinguished the PIGD group from the other groups with satisfactory diagnostic rates (accuracy = 89.47%, sensitivity = 70.00%, specificity = 6.43%, and AUCmax = 0.9464). These results are consistent with the changes in blood perfusion related to PD subtypes. 
In addition, the sensitive brain regions of the TD group and PIGD group involve the brain regions where the cerebellothalamocortical (CTC) and the striatal thalamocortical (STC) loops are located. Therefore, it is suggested that the blood perfusion patterns of the two loops may be different. These characteristic brain regions could become potential imaging markers of cerebral blood flow to distinguish TD from PIGD. Meanwhile, our findings provide an imaging basis for personalised treatment, thereby optimising clinical diagnostic and treatment approaches.
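The diagnostic rates quoted in this abstract (accuracy, sensitivity, specificity) are standard confusion-matrix quantities for a binary classifier. The sketch below shows how they are computed from true and predicted labels; the function name and the example labels are hypothetical and unrelated to the study's data.

```python
def diagnostic_rates(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity for binary labels.

    Assumes both classes occur in y_true; otherwise a rate is undefined
    and is returned as None.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else None,
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,  # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else None,  # true negative rate
    }


# Hypothetical labels: 1 = patient, 0 = control.
rates = diagnostic_rates([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
print(rates)
```

Sensitivity depends only on the positive class and specificity only on the negative class, so with small, unbalanced groups the two can differ widely even when accuracy is high.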
Affiliation(s)
- Jinhua Xiong
  - Department of Radiology, Shanghai East Hospital, Tongji University School of Medicine, No. 150 Jimo Road, Pudong New Area, Shanghai 200120, China
- Haiyan Zhu
  - Department of Radiology, Shanghai Tongji Hospital, Tongji University School of Medicine, No. 389 Xincun Road, Putuo District, Shanghai 200065, China
- Xuhang Li
  - School of Computer Science and Technology, Donghua University, No. 2999 North Renmin Road, Songjiang Area, Shanghai 200000, China
- Shangci Hao
  - Department of Radiology, Shanghai East Hospital, Tongji University School of Medicine, No. 150 Jimo Road, Pudong New Area, Shanghai 200120, China
- Yueyi Zhang
  - Department of Radiology, Shanghai East Hospital, Tongji University School of Medicine, No. 150 Jimo Road, Pudong New Area, Shanghai 200120, China
- Zijian Wang
  - School of Computer Science and Technology, Donghua University, No. 2999 North Renmin Road, Songjiang Area, Shanghai 200000, China
- Qian Xi
  - Department of Radiology, Shanghai East Hospital, Tongji University School of Medicine, No. 150 Jimo Road, Pudong New Area, Shanghai 200120, China
7
Aidi MN, Wulandari C, Oktarina SD, Aditra TR, Ernawati F, Efriwati E, Nurjanah N, Rachmawati R, Julianti ED, Sundari D, Retiaty F, Arifin AY, Dewi RM, Nazaruddin N, Salimar S, Fuada N, Widodo Y, Setyawati B, Nurhidayati N, Sudikno S, Irawan IR, Widoretno W. Province clustering based on the percentage of communicable disease using the BCBimax biclustering algorithm. Geospatial Health 2023; 18. [PMID: 37698368 DOI: 10.4081/gh.2023.1202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2023] [Accepted: 08/09/2023] [Indexed: 09/13/2023]
Abstract
Indonesia needs to lower its high infectious disease rate. This requires reliable data and following their temporal changes across provinces. We investigated the benefits of surveying the epidemiological situation with the BCBimax biclustering algorithm using secondary data from a recent national scale survey of main infectious diseases from the National Basic Health Research (Riskesdas) covering 34 provinces in Indonesia. Hierarchical and k-means clustering can only handle one data source, but BCBimax biclustering can cluster rows and columns in a data matrix. Several experiments determined the best row and column threshold values, which is crucial for a useful result. The percentages of Indonesia's seven most common infectious diseases (ARI, pneumonia, diarrhoea, tuberculosis (TB), hepatitis, malaria, and filariasis) were ordered by province to form groups without considering proximity because clusters are usually far apart. ARI, pneumonia, and diarrhoea were divided into toddler and adult infections, making 10 target diseases instead of seven. The set of biclusters formed based on the presence and level of these diseases included 7 diseases with moderate to high disease levels, 5 diseases (formed by 2 clusters), 3 diseases, 2 diseases, and a final order that only included adult diarrhoea. In 6 of 8 clusters, diarrhoea was the most prevalent infectious disease in Indonesia, making its eradication a priority. Direct person-to-person infections like ARI, pneumonia, TB, and diarrhoea were found in 4-6 of 8 clusters. These diseases are more common and spread faster than vector-borne diseases like malaria and filariasis, making them more important.
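Bimax-style biclustering works on a binary matrix and looks for inclusion-maximal submatrices that contain only ones. The sketch below illustrates that idea on a tiny, hypothetical province-by-disease percentage matrix: it binarizes with a threshold, then enumerates maximal all-ones biclusters by brute force over column subsets. It is not the BCBimax implementation used in the study, and the threshold, minimum sizes, and data are assumptions.

```python
from itertools import combinations


def binarize(matrix, threshold):
    """Map each value to 1 if it reaches the threshold, else 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in matrix]


def maximal_ones_biclusters(binary, min_rows=2, min_cols=2):
    """Enumerate inclusion-maximal all-ones biclusters (brute force).

    Exponential in the number of columns, so only suitable for small
    matrices such as a few dozen provinces by about ten diseases.
    """
    n_rows, n_cols = len(binary), len(binary[0])
    found = set()
    for size in range(min_cols, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            # All rows that are 1 on every chosen column.
            rows = tuple(i for i in range(n_rows) if all(binary[i][j] for j in cols))
            if len(rows) < min_rows:
                continue
            # Close the column set: every column that is 1 on all those rows.
            closed = tuple(j for j in range(n_cols) if all(binary[i][j] for i in rows))
            found.add((rows, closed))
    return sorted(found)


# Hypothetical percentages: rows are provinces, columns are diseases.
percentages = [
    [12, 3, 9],
    [11, 2, 8],
    [1, 7, 10],
    [2, 1, 1],
]
binary = binarize(percentages, threshold=5)
print(maximal_ones_biclusters(binary))
```

The result depends strongly on the binarization threshold and on the minimum row and column counts, which is consistent with the abstract's remark that choosing row and column threshold values is crucial for a useful result.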
Affiliation(s)
- Dian Sundari
  - National Research and Innovation Agency, Jakarta
- Fifi Retiaty
  - National Research and Innovation Agency, Jakarta
- Yekti Widodo
  - National Research and Innovation Agency, Jakarta
8
Abstract
Sensors deployed within water distribution systems collect consumption data that enable the application of data analysis techniques to extract essential information. Time series clustering has been traditionally applied for modeling end-user water consumption profiles to aid water management. However, its effectiveness is limited by the diversity and local nature of consumption patterns. In addition, existing techniques cannot adequately handle changes in household composition, disruptive events (e.g., vacations), and consumption dynamics at different time scales. In this context, biclustering approaches provide a natural alternative to detect groups of end-users with coherent consumption profiles during local time periods while addressing the aforementioned limitations. This work discusses when, why and how to apply biclustering techniques for water consumption data analysis, and further proposes a methodology to this end. To the best of our knowledge, this is the first work introducing biclustering to water consumption data analysis. Results on data from a real-world water distribution system—Quinta do Lago, Portugal—confirm the potentialities of the proposed approach for pattern discovery with guarantees of statistical significance and robustness that entities can rely on for strategic planning.