1. Ren CX, Xu GX, Dai DQ, Lin L, Sun Y, Liu QS. Cross-site prognosis prediction for nasopharyngeal carcinoma from incomplete multi-modal data. Med Image Anal 2024; 93:103103. [PMID: 38368752 DOI: 10.1016/j.media.2024.103103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Revised: 12/05/2023] [Accepted: 02/05/2024] [Indexed: 02/20/2024]
Abstract
Accurate prognosis prediction for nasopharyngeal carcinoma based on magnetic resonance (MR) images helps guide treatment intensity, thus reducing the risk of recurrence and death. To reduce repeated labor and fully exploit domain knowledge, aggregating labeled/annotated data from external sites enables us to train an intelligent model for a clinical site with unlabeled data. However, this task faces two challenges: fusing incomplete multi-modal examination data and handling image data heterogeneity among sites. This paper proposes a cross-site survival analysis method for prognosis prediction of nasopharyngeal carcinoma from a domain adaptation viewpoint. Using a Cox model as the basic framework, our method equips it with a cross-attention based multi-modal fusion regularization. This regularization model effectively fuses the multi-modal information from multi-parametric MR images and clinical features onto a domain-adaptive space, even when some modalities are absent. To enhance feature discrimination, we also extend the contrastive learning technique to censored data cases. Compared with conventional approaches that directly deploy a trained survival model in a new site, our method achieves superior prognosis prediction performance in cross-site validation experiments. These results highlight the key role of cross-site adaptability of our method and support its value in clinical practice.
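The abstract names a Cox model as the basic framework but does not give its loss. As a rough illustration only, the sketch below computes the standard negative Cox partial log-likelihood for a batch of risk scores; the function name `cox_neg_log_partial_likelihood` and its argument layout are assumptions made here, and the paper's fusion regularization and contrastive terms are not reproduced.

```python
import math


def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Negative Cox partial log-likelihood (Breslow-style risk sets).

    risk_scores: predicted log-risk per patient (higher = worse prognosis)
    times:       observed follow-up time per patient
    events:      1 if the event was observed, 0 if the patient was censored
    """
    n = len(times)
    total = 0.0
    for i in range(n):
        if not events[i]:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time times[i].
        at_risk = [risk_scores[j] for j in range(n) if times[j] >= times[i]]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(at_risk)
        log_sum = m + math.log(sum(math.exp(s - m) for s in at_risk))
        total += risk_scores[i] - log_sum
    return -total


# Hypothetical toy data: three patients, the second one censored.
loss = cox_neg_log_partial_likelihood([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [1, 0, 1])
```

Censored patients never add their own term to the sum, yet they still appear in the risk sets of earlier events, which is how the Cox loss uses incomplete follow-up.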
Affiliation(s)
- Chuan-Xian Ren: School of Mathematics, Sun Yat-sen University, Guangzhou 510275, China
- Geng-Xin Xu: School of Mathematics, Sun Yat-sen University, Guangzhou 510275, China
- Dao-Qing Dai: School of Mathematics, Sun Yat-sen University, Guangzhou 510275, China
- Li Lin: Department of Radiation Oncology, Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine; Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangzhou 510060, China
- Ying Sun: Department of Radiation Oncology, Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine; Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangzhou 510060, China
- Qing-Shan Liu: School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
2. Ren CX, Luo YW, Dai DQ. BuresNet: Conditional Bures Metric for Transferable Representation Learning. IEEE Trans Pattern Anal Mach Intell 2023; 45:4198-4213. [PMID: 35830411 DOI: 10.1109/tpami.2022.3190645] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Transfer learning, a fundamental mode of learning and cognition, has attracted widespread attention in recent years. Typical transfer learning tasks include unsupervised domain adaptation (UDA) and few-shot learning (FSL), both of which attempt to transfer discriminative knowledge from the training environment to the test environment to improve the model's generalization performance. Previous transfer learning methods usually ignore the potential conditional distribution shift between environments, which degrades discriminability in the test environment. It is therefore important to construct a learnable and interpretable metric that measures, and then reduces, the gap between conditional distributions. In this article, we design the Conditional Kernel Bures (CKB) metric for characterizing conditional distribution discrepancy, and derive an empirical estimation with a convergence guarantee. CKB provides a statistical and interpretable approach, under the optimal transportation framework, to understand the knowledge transfer mechanism. It is essentially an extension of optimal transportation from the marginal distributions to the conditional distributions. CKB can be used as a plug-and-play module placed on the loss layer of deep networks, where it plays the bottleneck role in representation learning. From this perspective, the new method with network architecture is abbreviated as BuresNet, and it can be used to extract conditional invariant features for both UDA and FSL tasks. BuresNet can be trained in an end-to-end manner. Extensive experimental results on several benchmark datasets validate the effectiveness of BuresNet.
3. Huang KK, Ren CX, Liu H, Lai ZR, Yu YF, Dai DQ. Hyperspectral Image Classification via Discriminant Gabor Ensemble Filter. IEEE Trans Cybern 2022; 52:8352-8365. [PMID: 33544687 DOI: 10.1109/tcyb.2021.3051141] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/12/2023]
Abstract
Hyperspectral image (HSI) classification is a hot topic in remote sensing with a broad range of applications, and convolutional neural network (CNN)-based methods are drawing increasing attention. However, training the millions of parameters in a CNN requires a large number of labeled training samples, which are difficult to collect. A conventional Gabor filter can effectively extract spatial information at different scales and orientations without training, but it may miss some important discriminative information. In this article, we propose the Gabor ensemble filter (GEF), a new convolutional filter to extract deep features for HSI with fewer trainable parameters. GEF filters each input channel with fixed Gabor filters and learnable filters simultaneously, then reduces the dimensions with learnable 1×1 filters to generate the output channels. The fixed Gabor filters extract common features at different scales and orientations, while the learnable filters learn complementary features that Gabor filters cannot extract. Based on GEF, we design a network architecture for HSI classification, which extracts deep features and can learn from limited training samples. To learn more discriminative features while keeping the system end-to-end, we introduce a local discriminant structure into the cross-entropy loss by combining it with the triplet hard loss. Results of experiments on three HSI datasets show that the proposed method has significantly higher classification accuracy than other state-of-the-art methods. Moreover, the proposed method is fast in both training and testing.
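To make the "fixed Gabor filters at different scales and orientations" concrete, here is a minimal sketch of the textbook real-valued Gabor kernel and a small orientation bank. The names `gabor_kernel` and `gabor_bank` and all parameter defaults are assumptions for illustration; the learnable filters and 1×1 reduction of GEF are not shown, and the paper's exact parameterization may differ.

```python
import math


def gabor_kernel(size, sigma, theta, wavelength, gamma=0.5, psi=0.0):
    """Real part of a 2-D Gabor filter as a size x size list of lists.

    sigma: width of the Gaussian envelope; theta: orientation in radians;
    wavelength: period of the cosine carrier; gamma: spatial aspect ratio;
    psi: phase offset.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center pixel")
    half = size // 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's orientation.
            xr = x * cos_t + y * sin_t
            yr = -x * sin_t + y * cos_t
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(size, sigma, wavelength, n_orientations):
    """Fixed (non-trainable) kernels at evenly spaced orientations."""
    return [
        gabor_kernel(size, sigma, math.pi * k / n_orientations, wavelength)
        for k in range(n_orientations)
    ]


bank = gabor_bank(size=5, sigma=2.0, wavelength=4.0, n_orientations=4)
```

Because every entry of the bank is computed from a closed-form expression, none of these kernels adds trainable parameters, which is the property the abstract relies on.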
4. Luo YW, Ren CX, Dai DQ, Yan H. Unsupervised Domain Adaptation via Discriminative Manifold Propagation. IEEE Trans Pattern Anal Mach Intell 2022; 44:1653-1669. [PMID: 32749963 DOI: 10.1109/tpami.2020.3014218] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Indexed: 06/11/2023]
Abstract
Unsupervised domain adaptation is effective in leveraging rich information from a labeled source domain to an unlabeled target domain. Though deep learning and adversarial strategy made a significant breakthrough in the adaptability of features, there are two issues to be further studied. First, hard-assigned pseudo labels on the target domain are arbitrary and error-prone, and direct application of them may destroy the intrinsic data structure. Second, batch-wise training of deep learning limits the characterization of the global structure. In this paper, a Riemannian manifold learning framework is proposed to achieve transferability and discriminability simultaneously. For the first issue, this framework establishes a probabilistic discriminant criterion on the target domain via soft labels. Based on pre-built prototypes, this criterion is extended to a global approximation scheme for the second issue. Manifold metric alignment is adopted to be compatible with the embedding space. The theoretical error bounds of different alignment metrics are derived for constructive guidance. The proposed method can be used to tackle a series of variants of domain adaptation problems, including both vanilla and partial settings. Extensive experiments have been conducted to investigate the method and a comparative study shows the superiority of the discriminative manifold learning framework.
5. Wang W, Zhang X, Dai DQ. springD2A: capturing uncertainty in disease-drug association prediction with model integration. Bioinformatics 2022; 38:1353-1360. [PMID: 34864881 DOI: 10.1093/bioinformatics/btab820] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2021] [Revised: 11/23/2021] [Accepted: 11/30/2021] [Indexed: 01/05/2023] Open
Abstract
MOTIVATION Drug repositioning that aims to find new indications for existing drugs has been an efficient strategy for drug discovery. In the scenario where we only have confirmed disease-drug associations as positive pairs, a negative set of disease-drug pairs is usually constructed from the unknown disease-drug pairs in previous studies, where we do not know whether drugs and diseases can be associated, to train a model for disease-drug association prediction (drug repositioning). Drugs and diseases in these negative pairs can potentially be associated, but most studies have ignored them. RESULTS We present a method, springD2A, to capture the uncertainty in the negative pairs, and to discriminate between positive and unknown pairs because the former are more reliable. In springD2A, we introduce a spring-like penalty for the loss of negative pairs, which is strong if they are too close in a unit sphere, but mild if they are at a moderate distance. We also design a sequential sampling in which the probability of an unknown disease-drug pair sampled as negative is proportional to its score predicted as positive. Multiple models are learned during sequential sampling, and we adopt parameter- and feature-based ensemble schemes to boost performance. Experiments show springD2A is an effective tool for drug-repositioning. AVAILABILITY AND IMPLEMENTATION A python implementation of springD2A and datasets used in this study are available at https://github.com/wangyuanhao/springD2A. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Weiwen Wang: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou 510000, China
- Xiwen Zhang: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou 510000, China
- Dao-Qing Dai: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou 510000, China
6. Wijayawardene NN, Hyde KD, Dai DQ, Sánchez-García M, Goto BT, Saxena RK, Erdoğdu M, Selçuk F, Rajeshkumar KC, Aptroot A, Błaszkowski J, Boonyuen N, da Silva GA, de Souza FA, Dong W, Ertz D, Haelewaters D, Jones EBG, Karunarathna SC, Kirk PM, Kukwa M, Kumla J, Leontyev DV, Lumbsch HT, Maharachchikumbura SSN, Marguno F, Martínez-Rodríguez P, Mešić A, Monteiro JS, Oehl F, Pawłowska J, Pem D, Pfliegler WP, Phillips AJL, Pošta A, He MQ, Li JX, Raza M, Sruthi OP, Suetrong S, Suwannarach N, Tedersoo L, Thiyagaraja V, Tibpromma S, Tkalčec Z, Tokarev YS, Wanasinghe DN, Wijesundara DSA, Wimalaseana SDMK, Madrid H, Zhang GQ, Gao Y, Sánchez-Castro I, Tang LZ, Stadler M, Yurkov A, Thines M. Outline of Fungi and fungus-like taxa – 2021. MYCOSPHERE 2022. [DOI: 10.5943/mycosphere/13/1/2] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Indexed: 11/06/2022] Open
7. Sun Y, Ou-Yang L, Dai DQ. WMLRR: A Weighted Multi-View Low Rank Representation to Identify Cancer Subtypes From Multiple Types of Omics Data. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:2891-2897. [PMID: 33656995 DOI: 10.1109/tcbb.2021.3063284] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 06/12/2023]
Abstract
The identification of cancer subtypes is of great importance for understanding the heterogeneity of tumors and providing patients with more accurate diagnoses and treatments. However, it is still a challenge to effectively integrate multiple omics data to establish cancer subtypes. In this paper, we propose an unsupervised integration method, named weighted multi-view low rank representation (WMLRR), to identify cancer subtypes from multiple types of omics data. Given a group of patients described by multiple omics data matrices, we first learn a unified affinity matrix which encodes the similarities among patients by exploring the sparsity-consistent low-rank representations from the joint decompositions of multiple omics data matrices. Unlike existing subtype identification methods that treat each omics data matrix equally, we assign a weight to each omics data matrix and learn these weights automatically through the optimization process. Finally, we apply spectral clustering on the learned affinity matrix to identify cancer subtypes. Experiment results show that the survival times between our identified cancer subtypes are significantly different, and our predicted survivals are more accurate than other state-of-the-art methods. In addition, some clinical analyses of the diseases also demonstrate the effectiveness of our method in identifying molecular subtypes with biological significance and clinical relevance.
8. Zhang X, Wang W, Ren CX, Dai DQ. Learning representation for multiple biological networks via a robust graph regularized integration approach. Brief Bioinform 2021; 23:6381251. [PMID: 34607360 DOI: 10.1093/bib/bbab409] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/23/2021] [Revised: 08/23/2021] [Accepted: 09/06/2021] [Indexed: 01/18/2023] Open
Abstract
Learning node representation is a fundamental problem in biological network analysis, as compact representation features reveal complicated network structures and carry useful information for downstream tasks such as link prediction and node classification. Recently, multiple networks that profile objects from different aspects have accumulated, providing the opportunity to learn about objects from multiple perspectives. However, the complex common and specific information across different networks poses challenges to node representation methods. Moreover, ubiquitous noise in networks calls for more robust representation. To deal with these problems, we present a representation learning method for multiple biological networks. First, we accommodate the noise and spurious edges in networks using denoised diffusion, providing robust connectivity structures for the subsequent representation learning. Then, we introduce a graph regularized integration model to combine refined networks and compute common representation features. By using the regularized decomposition technique, the proposed model can effectively preserve the common structural property of different networks and simultaneously accommodate their specific information, leading to a consistent representation. A simulation study shows the superiority of the proposed method at different network noise levels. Three network-based inference tasks, including drug-target interaction prediction, gene function identification and fine-grained species categorization, are conducted using representation features learned from our method. Biological networks at different scales and levels of sparsity are involved. Experimental results on real-world data show that the proposed method has robust performance compared with alternatives. Overall, by eliminating noise and integrating effectively, the proposed method is able to learn useful representations from multiple biological networks.
Affiliation(s)
- Xiwen Zhang: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
- Weiwen Wang: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
- Chuan-Xian Ren: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
- Dao-Qing Dai: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, 510275, Guangzhou, China
9. Song W, Wang W, Dai DQ. Subtype-WESLR: identifying cancer subtype with weighted ensemble sparse latent representation of multi-view data. Brief Bioinform 2021; 23:6381248. [PMID: 34607358 DOI: 10.1093/bib/bbab398] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/24/2021] [Revised: 08/30/2021] [Accepted: 09/01/2021] [Indexed: 12/13/2022] Open
Abstract
The discovery of cancer subtypes has become a much-researched topic in oncology. Dividing cancer patients into subtypes can enable personalized treatments for heterogeneous patients. High-throughput technologies provide multiple omics data for cancer subtyping. Many computational methods integrate multi-view data to identify cancer subtypes, yet they obtain different subtypes for the same cancer, even when using the same multi-omics data. To a certain extent, the subtypes produced by distinct methods are related, which may offer guidance for cancer subtyping. It is a challenge to effectively utilize the valuable information in these distinct subtypes to produce more accurate and reliable subtypes. A weighted ensemble sparse latent representation (subtype-WESLR) is proposed to detect cancer subtypes on heterogeneous omics data. Using a weighted ensemble strategy to fuse base clusterings obtained by distinct methods as prior knowledge, subtype-WESLR projects each sample feature profile from each data type to a common latent subspace while maintaining the local structure of the original sample feature space and consistency with the weighted ensemble, and it optimizes the common subspace by an iterative method to identify cancer subtypes. We conduct experiments on various synthetic datasets and eight public multi-view datasets from The Cancer Genome Atlas. The results demonstrate that subtype-WESLR outperforms competing methods by integrating the base clusterings of existing methods to obtain more precise subtypes.
Affiliation(s)
- Wenjing Song: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
- Weiwen Wang: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
- Dao-Qing Dai: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
10. Wang W, Zhang X, Dai DQ. DeFusion: a denoised network regularization framework for multi-omics integration. Brief Bioinform 2021; 22:6210063. [PMID: 33822879 DOI: 10.1093/bib/bbab057] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/31/2020] [Revised: 02/03/2021] [Accepted: 01/14/2020] [Indexed: 11/13/2022] Open
Abstract
With diverse types of omics data widely available, many computational methods have recently been developed to integrate these heterogeneous data, providing a comprehensive understanding of diseases and biological mechanisms. However, most of them barely take noise effects into account. Data-specific patterns unique to each data type also make it challenging to uncover consistent patterns and learn a compact representation of multi-omics data. Here we present a multi-omics integration method that addresses these issues. We explicitly model the error term in data reconstruction and simultaneously consider noise effects and data-specific patterns. We utilize a denoised network regularization in which we build a fused network using a denoising procedure to suppress noise effects and data-specific patterns. The error term collaborates with the denoised network regularization to capture data-specific patterns. We solve the optimization problem via an inexact alternating minimization algorithm. A comparative simulation study shows the method's superiority at discovering common patterns among data types at three noise levels. Transcriptomics-and-epigenomics integration, in seven cancer cohorts from The Cancer Genome Atlas, demonstrates that the learned integrative representation extracted in an unsupervised manner can depict survival information. In particular, in liver hepatocellular carcinoma, the learned integrative representation attains an average Harrell's C-index of 0.78 in 10 times 3-fold cross-validation for survival prediction, which far exceeds competing methods, and we discover an aggressive subtype in liver hepatocellular carcinoma with this latent representation, which is validated by an external dataset GSE14520. We also show that DeFusion is applicable to the integration of other omics types.
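Harrell's C-index, the metric reported above, measures how often a model ranks pairs of patients in the right order. The sketch below is a plain O(n²) version written for this listing; the function name `harrell_c_index` is an assumption, tied event times are simply treated as not comparable, and the data are made up.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when patient i had an observed event and
    a strictly shorter time than patient j. The pair is concordant when
    the model gives i the higher risk; tied risk scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs; C-index is undefined")
    return concordant / comparable


# Hypothetical toy cohort: the third patient is censored at t = 3.
c = harrell_c_index([1.0, 2.0, 3.0], [1, 1, 0], [3.0, 2.0, 1.0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.78 in context.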
Affiliation(s)
- Weiwen Wang: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
- Xiwen Zhang: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
- Dao-Qing Dai: Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
11. Ren CX, Ge P, Dai DQ, Yan H. Learning Kernel for Conditional Moment-Matching Discrepancy-Based Image Classification. IEEE Trans Cybern 2021; 51:2006-2018. [PMID: 31150354 DOI: 10.1109/tcyb.2019.2916198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/09/2023]
Abstract
Conditional maximum mean discrepancy (CMMD) can capture the discrepancy between conditional distributions by drawing support from nonlinear kernel functions; thus, it has been successfully used for pattern classification. However, CMMD does not work well on complex distributions, especially when the kernel function fails to correctly characterize the difference between intraclass similarity and interclass similarity. In this paper, a new kernel learning method is proposed to improve the discrimination performance of CMMD. It can be operated with deep network features iteratively and thus denoted as KLN for abbreviation. The CMMD loss and an autoencoder (AE) are used to learn an injective function. By considering the compound kernel, that is, the injective function with a characteristic kernel, the effectiveness of CMMD for data category description is enhanced. KLN can simultaneously learn a more expressive kernel and label prediction distribution; thus, it can be used to improve the classification performance in both supervised and semisupervised learning scenarios. In particular, the kernel-based similarities are iteratively learned on the deep network features, and the algorithm can be implemented in an end-to-end manner. Extensive experiments are conducted on four benchmark datasets, including MNIST, SVHN, CIFAR-10, and CIFAR-100. The results indicate that KLN achieves the state-of-the-art classification performance.
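For orientation, the sketch below computes the ordinary (marginal) maximum mean discrepancy with a fixed Gaussian kernel, which is the simpler quantity that CMMD extends to conditional distributions. It is not the paper's CMMD or its learned kernel; the function names `gaussian_kernel` and `mmd_squared` and the bandwidth value are assumptions for illustration.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)],
    with each expectation replaced by a mean over all sample pairs.
    """
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical toy samples: ys is xs shifted by 3 along each axis.
xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
ys = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]
gap = mmd_squared(xs, ys)
```

The estimate depends entirely on the kernel, so a bandwidth that is badly matched to the data can make different distributions look alike; this sensitivity is the motivation the abstract gives for learning the kernel.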
12. Ren CX, Feng J, Dai DQ, Yan S. Heterogeneous Domain Adaptation via Covariance Structured Feature Translators. IEEE Trans Cybern 2021; 51:2166-2177. [PMID: 31880576 DOI: 10.1109/tcyb.2019.2957033] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/10/2023]
Abstract
Domain adaptation (DA) and transfer learning with statistical property description are very important in image analysis and data classification. This article studies the domain adaptive feature representation problem for heterogeneous data, in which both the feature dimensions and the sample distributions across domains are so different that their features cannot be matched directly. To transfer the discriminant information efficiently from the source domain to the target domain, and thereby enhance the classification performance for the target data, we first introduce two projection matrices specified for different domains to transform the heterogeneous features into a shared space. We then propose a joint kernel regression model to learn the regression variable, which is called the feature translator in this article. The novelty lies in the exploration of optimal experimental design (OED) to deal with heterogeneous and nonlinear DA by seeking the covariance structured feature translators (CSFTs). An approximate and efficient method is proposed to compute the optimal data projections. Comprehensive experiments are conducted to validate the effectiveness and efficacy of the proposed model. The results show the state-of-the-art performance of our method in heterogeneous DA.
13. Yu YF, Xu G, Jiang M, Zhu H, Dai DQ, Yan H. Joint Transformation Learning via the L2,1-Norm Metric for Robust Graph Matching. IEEE Trans Cybern 2021; 51:521-533. [PMID: 31059466 DOI: 10.1109/tcyb.2019.2912718] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/09/2023]
Abstract
Establishing correspondence between two given geometrical graph structures is an important problem in computer vision and pattern recognition. In this paper, we propose a robust graph matching (RGM) model to improve the effectiveness and robustness of matching graphs with deformations, rotations, outliers, and noise. First, we embed the joint geometric transformation into the graph matching model, which performs unary matching over graph nodes and local structure matching over graph edges simultaneously. Then, the L2,1-norm is used as the similarity metric in the presented RGM to enhance the robustness. Finally, we derive an objective function which can be solved by an effective optimization algorithm, and theoretically prove the convergence of the proposed algorithm. Extensive experiments on various graph matching tasks, involving outliers, rotations, and deformations, show that the proposed RGM model achieves competitive performance compared to the existing methods.
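The robustness argument rests on how the L2,1-norm treats residual rows. The sketch below assumes the common row-wise convention (sum of the Euclidean norms of the rows; some papers use columns instead) and compares it with the squared Frobenius norm on a made-up residual matrix. The function names are assumptions for illustration and the RGM objective itself is not reproduced.

```python
import math


def l21_norm(matrix):
    """L2,1-norm: sum over rows of each row's Euclidean norm."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)


def squared_frobenius_norm(matrix):
    """Sum of squared entries, shown for comparison only."""
    return sum(v * v for row in matrix for v in row)


# Hypothetical residuals: two well-matched rows and one outlier row.
residuals = [[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]
l21 = l21_norm(residuals)                   # row norms 5, 0 and 13
frob2 = squared_frobenius_norm(residuals)   # row contributions 25, 0 and 169
```

Under the L2,1-norm an outlier row contributes in proportion to its length, whereas a squared loss lets it grow quadratically and dominate the objective, which is why the former is the usual choice when some matches are expected to be wrong.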
14.
Abstract
The development of single-cell RNA-sequencing (scRNA-seq) technologies brings tremendous opportunities for quantitative research and analyses at the cellular level. In particular, as a crucial task of scRNA-seq analysis, single cell clustering shines a light on natural groupings of cells to give new insights into the biological mechanisms and disease studies. However, it remains a challenge to identify cell clusters from lots of cell mixtures effectively and accurately. In this paper, we propose a novel adaptive joint clustering framework, named the low-rank self-representation K-means method (LRSK), to learn the data representation matrix and cluster indicator matrix jointly from scRNA-seq data. Specifically, instead of calculating the similarities among cells from the original data, we seek a low-rank representation of the original data to better reflect the underlying relationships among cells. Moreover, an Augmented Lagrangian Multiplier (ALM) based optimization algorithm is adopted to solve this problem. Experimental results on various scRNA-seq datasets and case studies demonstrate that our method performs better than other state-of-the-art single cell clustering algorithms. The analysis of unlabeled large single-cell liver cancer sequencing data further shows that our prediction results are more reasonable and interpretable.
Affiliation(s)
- Ye-Sen Sun: Intelligent Data Center, School of Mathematics, Sun Yat-sen University, Guangzhou, China
15.
Abstract
As a powerful approach for exploratory data analysis, unsupervised clustering is a fundamental task in computer vision and pattern recognition. Many clustering algorithms have been developed, but most of them perform unsatisfactorily on data with complex structures. Recently, the adversarial autoencoder (AAE) has proven effective on such data by combining an autoencoder (AE) with adversarial training, but it cannot effectively extract classification information from unlabeled data. In this brief, we propose the dual AAE (Dual-AAE), which simultaneously maximizes the likelihood function and the mutual information between observed examples and a subset of latent variables. By performing variational inference on the objective function of Dual-AAE, we derive a new reconstruction loss which can be optimized by training a pair of AEs. Moreover, to avoid mode collapse, we introduce a clustering regularization term for the category variable. Experiments on four benchmarks show that Dual-AAE achieves superior performance over state-of-the-art clustering methods. In addition, by adding a reject option, the clustering accuracy of Dual-AAE can reach that of supervised CNN algorithms. Dual-AAE can also be used for disentangling style and content of images without using supervised information.
16. Ren CX, Luo YW, Xu XL, Dai DQ, Yan H. Discriminative Residual Analysis for Image Set Classification with Posture and Age Variations. IEEE Trans Image Process 2019; 29:2875-2888. [PMID: 31765312 DOI: 10.1109/tip.2019.2954176] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 06/10/2023]
Abstract
Image set recognition has been widely applied in many practical problems such as real-time video retrieval and image captioning. Due to its superior performance, it has grown into a significant topic in recent years. However, images with complicated variations, e.g., postures and human ages, are difficult to handle, as these variations are continuous and gradual with respect to image appearance. Consequently, the crucial point of image set recognition is to mine the intrinsic connection or structural information from the image batches with variations. In this work, a Discriminant Residual Analysis (DRA) method is proposed to improve the classification performance by discovering discriminant features in related and unrelated groups. Specifically, DRA attempts to obtain a powerful projection which casts the residual representations into a discriminant subspace. Such a projection subspace is expected to magnify the useful information of the input space as much as possible, so that the relation between the training set and the test set described by the given metric or distance becomes more precise in the discriminant subspace. We also propose a nonfeasance strategy by defining another approach to construct the unrelated groups, which helps to further reduce the cost of sampling errors. Two regularization approaches are used to deal with the probable small sample size problem. Extensive experiments are conducted on benchmark databases, and the results show the superiority and efficiency of the new methods.
|
17
|
Lai ZR, Dai DQ, Ren CX, Huang KK. Radial Basis Functions With Adaptive Input and Composite Trend Representation for Portfolio Selection. IEEE Trans Neural Netw Learn Syst 2018; 29:6214-6226. [PMID: 29993753 DOI: 10.1109/tnnls.2018.2827952] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 06/08/2023]
Abstract
We propose a set of novel radial basis functions with adaptive input and composite trend representation (AICTR) for portfolio selection (PS). The trend representation of asset prices is one of the main sources of information to be exploited in PS. However, most state-of-the-art trend representation-based systems exploit only one kind of trend information and lack effective mechanisms to construct a composite trend representation. The proposed system exploits a set of RBFs with multiple trend representations, which improves the effectiveness and robustness in price prediction. Moreover, the input of the RBFs automatically switches to the best trend representation according to the recent investing performance of different price predictions. We also propose a novel objective to combine these RBFs and select the portfolio. Extensive experiments on six benchmark data sets (including a new challenging data set that we propose) from different real-world stock markets indicate that the proposed RBFs effectively combine different trend representations and AICTR achieves state-of-the-art investing performance and risk control. Besides, AICTR withstands reasonable transaction costs and runs fast; hence, it is applicable to real-world financial environments.
|
18
|
Lai ZR, Dai DQ, Ren CX, Huang KK. A Peak Price Tracking-Based Learning System for Portfolio Selection. IEEE Trans Neural Netw Learn Syst 2018; 29:2823-2832. [PMID: 28600267 DOI: 10.1109/tnnls.2017.2705658] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/07/2023]
Abstract
We propose a novel linear learning system based on the peak price tracking (PPT) strategy for portfolio selection (PS). Recently, the topic of tracking control has attracted intensive attention, and some novel models have been proposed based on backstepping methods, such that the system output tracks a desired trajectory. The proposed system has a similar evolution with a transform function that aggressively tracks the increasing power of different assets. As a result, the better performing assets will receive more investment. The proposed PPT objective can be formulated as a fast backpropagation algorithm, which is suitable for large-scale and time-limited applications, such as high-frequency trading. Extensive experiments on several benchmark data sets from diverse real financial markets show that PPT outperforms other state-of-the-art systems in computational time, cumulative wealth, and risk-adjusted metrics. This suggests that PPT is effective and even more robust than some defensive systems in PS.
|
19
|
Yu YF, Ren CX, Dai DQ, Huang KK. Kernel Embedding Multiorientation Local Pattern for Image Representation. IEEE Trans Cybern 2018; 48:1124-1135. [PMID: 28368841 DOI: 10.1109/tcyb.2017.2682272] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/07/2023]
Abstract
Local feature descriptors play a key role in different image classification applications. Some of these methods, such as local binary pattern and image gradient orientations, have been proven effective to some extent. However, such traditional descriptors, which utilize only single-type features, fail to capture the edge and orientation information and the intrinsic structure information of images. In this paper, we propose a kernel embedding multiorientation local pattern (MOLP) to address this problem. A given image is first transformed by gradient operators in local regions, which generate multiorientation gradient images containing edge and orientation information of different directions. Then the histogram feature, which takes into account the sign component and the magnitude component, is extracted to form the refined feature from each orientation gradient image. The refined feature captures more information of the intrinsic structure, and is effective for image representation and classification. Finally, the multiorientation refined features are automatically fused in the kernel embedding discriminant subspace learning model. Extensive experiments on various image classification tasks, such as face recognition, texture classification, object categorization, and palmprint recognition, show that MOLP achieves performance competitive with state-of-the-art methods.
|
20
|
Huang KK, Dai DQ, Ren CX, Lai ZR. Learning Kernel Extended Dictionary for Face Recognition. IEEE Trans Neural Netw Learn Syst 2017; 28:1082-1094. [PMID: 26890929 DOI: 10.1109/tnnls.2016.2522431] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 06/05/2023]
Abstract
A sparse representation classifier (SRC) and a kernel discriminant analysis (KDA) are two successful methods for face recognition. An SRC is good at dealing with occlusion, while a KDA does well in suppressing intraclass variations. In this paper, we propose kernel extended dictionary (KED) for face recognition, which provides an efficient way for combining KDA and SRC. We first learn several kernel principal components of occlusion variations as an occlusion model, which can represent the possible occlusion variations efficiently. Then, the occlusion model is projected by KDA to get the KED, which can be computed via the same kernel trick as new testing samples. Finally, we use structured SRC for classification, which is fast as only a small number of atoms are appended to the basic dictionary, and the feature dimension is low. We also extend KED to multikernel space to fuse different types of features at kernel level. Experiments are done on several large-scale data sets, demonstrating that not only does KED get impressive results for nonoccluded samples, but it also handles the occlusion well without overfitting, even with a single gallery sample per subject.
|
21
|
Ren CX, Lei Z, Dai DQ, Li SZ. Enhanced Local Gradient Order Features and Discriminant Analysis for Face Recognition. IEEE Trans Cybern 2016; 46:2656-2669. [PMID: 26513817 DOI: 10.1109/tcyb.2015.2484356] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 06/05/2023]
Abstract
Robust descriptor-based subspace learning with complex data is an active topic in pattern analysis and machine intelligence. A few studies concentrate on the optimal design of feature representation and metric learning. However, traditionally used single-type features, e.g., image gradient orientations (IGOs), are insufficient to characterize the complete variations in robust and discriminant subspace learning. Meanwhile, discontinuities in edge alignment and feature matching have not been carefully treated in the literature. In this paper, local order constrained IGOs are exploited to generate robust features. As the difference-based filters explicitly consider the local contrasts within neighboring pixel points, the proposed features enhance the local textures and the order-based coding ability, and thus further reveal the intrinsic structure of facial images. The multimodal features are automatically fused in the most discriminant subspace. The use of an adaptive interaction function suppresses outliers in each dimension for robust similarity measurement and discriminant analysis. The sparsity-driven regression model is modified to suit the classification of the compact feature representation. Extensive experiments are conducted on benchmark face data sets, from both controlled and uncontrolled environments, to evaluate our new algorithm.
|
22
|
Ou-Yang L, Zhang XF, Dai DQ, Wu MY, Zhu Y, Liu Z, Yan H. Protein complex detection based on partially shared multi-view clustering. BMC Bioinformatics 2016; 17:371. [PMID: 27623844 PMCID: PMC5022186 DOI: 10.1186/s12859-016-1164-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 12/11/2015] [Accepted: 07/23/2016] [Indexed: 01/05/2023] Open
Abstract
Background Protein complexes are the key molecular entities that perform many essential biological functions. In recent years, high-throughput experimental techniques have generated a large amount of protein interaction data. As a consequence, computational analysis of such data for protein complex detection has received increased attention in the literature. However, most existing works focus on predicting protein complexes from a single type of data, either physical interaction data or co-complex interaction data. These two types of data provide compatible and complementary information, so it is necessary to integrate them to discover the underlying structures and obtain better performance in complex detection. Results In this study, we propose a novel multi-view clustering algorithm, called the Partially Shared Multi-View Clustering model (PSMVC), to carry out such an integrated analysis. Unlike traditional multi-view learning algorithms that focus on mining either consistent or complementary information embedded in the multi-view data, PSMVC can jointly explore the shared and specific information inherent in different views. In our experiments, we compare the complexes detected by PSMVC from a single data source with those detected from multiple data sources. We observe that jointly analyzing multi-view data benefits the detection of protein complexes. Furthermore, extensive experimental results demonstrate that PSMVC performs much better than 16 state-of-the-art complex detection techniques, including ensemble clustering and data integration techniques. Conclusions In this work, we demonstrate that, when integrating multiple data sources, using a partially shared multi-view clustering model can help to identify protein complexes which are not readily identifiable by conventional single-view-based methods and other integrative analysis methods. All the results and source codes are available on https://github.com/Oyl-CityU/PSMVC.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1164-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Le Ou-Yang: College of Information Engineering, Shenzhen University, Nanhai Ave 3688, Shenzhen, 518060, China; Department of Electronic and Engineering, City University of Hong Kong, Tat Chee Avenue, Hong Kong, China
- Xiao-Fei Zhang: School of Mathematics and Statistics and Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, 430079, China
- Dao-Qing Dai: Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Xin Gang Road West, Guangzhou, 510275, China
- Meng-Yun Wu: School of Statistics and Management, Shanghai University of Finance and Economics, Guoding Road, Shanghai, 200433, China
- Yuan Zhu: School of Automation, China University of Geosciences, Wuhan, China
- Zhiyong Liu: Shenzhen Polytechnic, Shenzhen, 518055, China
- Hong Yan: Department of Electronic and Engineering, City University of Hong Kong, Tat Chee Avenue, Hong Kong, China
|
23
|
Zhang XF, Ou-Yang L, Dai DQ, Wu MY, Zhu Y, Yan H. Comparative analysis of housekeeping and tissue-specific driver nodes in human protein interaction networks. BMC Bioinformatics 2016; 17:358. [PMID: 27612563 PMCID: PMC5016887 DOI: 10.1186/s12859-016-1233-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 11/14/2015] [Accepted: 08/31/2016] [Indexed: 12/31/2022] Open
Abstract
Background Several recent studies have used the Minimum Dominating Set (MDS) model to identify driver nodes, which provide the control of the underlying networks, in protein interaction networks. There may exist multiple MDS configurations in a given network, thus it is difficult to determine which one represents the real set of driver nodes. Because these previous studies only focus on static networks and ignore the contextual information on particular tissues, their findings could be insufficient or even be misleading. Results In this study, we develop a Collective-Influence-corrected Minimum Dominating Set (CI-MDS) model which takes into account the collective influence of proteins. By integrating molecular expression profiles and static protein interactions, 16 tissue-specific networks are established as well. We then apply the CI-MDS model to each tissue-specific network to detect MDS proteins. It generates almost the same MDSs when it is solved using different optimization algorithms. In addition, we classify MDS proteins into Tissue-Specific MDS (TS-MDS) proteins and HouseKeeping MDS (HK-MDS) proteins based on the number of tissues in which they are expressed and identified as MDS proteins. Notably, we find that TS-MDS proteins and HK-MDS proteins have significantly different topological and functional properties. HK-MDS proteins are more central in protein interaction networks, associated with more functions, evolving more slowly and subjected to a greater number of post-translational modifications than TS-MDS proteins. Unlike TS-MDS proteins, HK-MDS proteins significantly correspond to essential genes, ageing genes, virus-targeted proteins, transcription factors and protein kinases. Moreover, we find that besides HK-MDS proteins, many TS-MDS proteins are also linked to disease related genes, suggesting the tissue specificity of human diseases. 
Furthermore, functional enrichment analysis reveals that HK-MDS proteins carry out universally necessary biological processes and TS-MDS proteins are usually involved in tissue-dependent functions. Conclusions Our study uncovers key features of TS-MDS proteins and HK-MDS proteins, and is a step forward towards a better understanding of the controllability of human interactomes. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1233-0) contains supplementary material, which is available to authorized users.
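To make the dominating-set idea in this abstract concrete, the following is a minimal Python sketch of a greedy heuristic for finding a dominating set, i.e. a set of nodes such that every node is either in the set or adjacent to a member of it. This is an illustration only: the greedy rule is a standard approximation swapped in for brevity, not the CI-MDS model or the optimization algorithms used in the paper, and the toy network and node names are hypothetical.

```python
def greedy_dominating_set(adjacency):
    """Return a dominating set (not necessarily minimum) of an undirected graph.

    adjacency: dict mapping each node to the set of its neighbours.
    """
    undominated = set(adjacency)
    chosen = set()
    while undominated:
        # Pick the node that newly dominates the most nodes;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(
            sorted(adjacency),
            key=lambda v: len(({v} | adjacency[v]) & undominated),
        )
        chosen.add(best)
        undominated -= {best} | adjacency[best]
    return chosen


def is_dominating(adjacency, nodes):
    """Check that every node is in `nodes` or adjacent to a member of it."""
    return all(v in nodes or adjacency[v] & nodes for v in adjacency)


# Hypothetical toy interaction network (node names are placeholders).
toy_network = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}
drivers = greedy_dominating_set(toy_network)
print(sorted(drivers))
```

Because several different dominating sets of the same size can exist in one network, a plain heuristic like this can return different answers under different tie-breaking rules, which is the non-uniqueness problem the abstract describes.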
Affiliation(s)
- Xiao-Fei Zhang: School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Luoyu Road, Wuhan, 430079, China
- Le Ou-Yang: College of Information Engineering, Shenzhen University, Nanhai Ave 3688, Shenzhen, 518060, China
- Dao-Qing Dai: Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Xingang West Road, Guangzhou, 510275, China
- Meng-Yun Wu: School of Statistics and Management, Shanghai University of Finance and Economics, Guoding Road, Shanghai, 200433, China
- Yuan Zhu: School of Automation, China University of Geosciences, Lumo Road, Wuhan, 430074, China
- Hong Yan: Department of Electronic and Engineering, City University of Hong Kong, Tat Chee Avenue, Hong Kong, China
|
24
|
Abstract
More than 100 recent collections of Valsaria sensu lato mostly from Europe were used to elucidate the species composition within the genus. Multigene phylogeny based on SSU, LSU, ITS, rpb2 and tef1 sequences revealed a monophyletic group of ten species within the Dothideomycetes, belonging to three morphologically similar genera. This group could not be accommodated in any known family and is thus classified in the new family Valsariaceae and the new order Valsariales. The genus Valsaria sensu stricto comprises V. insitiva, V. robiniae, V. rudis, V. spartii, V. lopadostomoides sp. nov. and V. neotropica sp. nov., which are phylogenetically well-defined, but morphologically nearly indistinguishable species. The new monotypic genus Bambusaria is introduced to accommodate Valsaria bambusae. Munkovalsaria rubra and Valsaria fulvopruinata are combined in Myrmaecium, a genus traditionally treated as a synonym of Valsaria, which comprises three species, with M. rubricosum as its generic type. This work is presented as a basis for additional species to be detected in the future.
Affiliation(s)
- W M Jaklitsch: Division of Systematic and Evolutionary Botany, Department of Botany and Biodiversity Research, University of Vienna, Rennweg 14, 1030 Wien, Austria; Institute of Forest Entomology, Forest Pathology and Forest Protection, Department of Forest and Soil Sciences, BOKU University of Natural Resources and Life Sciences, Hasenauerstraße 38, 1190 Vienna, Austria
- D Q Dai: Institute of Excellence in Fungal Research and School of Science, Mae Fah Luang University, Chiang Rai 57100, Thailand
- K D Hyde: Institute of Excellence in Fungal Research and School of Science, Mae Fah Luang University, Chiang Rai 57100, Thailand
- H Voglmayr: Division of Systematic and Evolutionary Botany, Department of Botany and Biodiversity Research, University of Vienna, Rennweg 14, 1030 Wien, Austria; Institute of Forest Entomology, Forest Pathology and Forest Protection, Department of Forest and Soil Sciences, BOKU University of Natural Resources and Life Sciences, Hasenauerstraße 38, 1190 Vienna, Austria
|
25
|
Wu MY, Zhang XF, Dai DQ, Ou-Yang L, Zhu Y, Yan H. Regularized logistic regression with network-based pairwise interaction for biomarker identification in breast cancer. BMC Bioinformatics 2016; 17:108. [PMID: 26921029 PMCID: PMC4769543 DOI: 10.1186/s12859-016-0951-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 08/06/2015] [Accepted: 01/28/2016] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND To facilitate advances in personalized medicine, it is important to detect predictive, stable and interpretable biomarkers related to different clinical characteristics. These clinical characteristics may be heterogeneous with respect to underlying interactions between genes. Usually, traditional methods just focus on detection of differentially expressed genes without taking the interactions between genes into account. Moreover, due to the typically low reproducibility of the selected biomarkers, it is difficult to give a clear biological interpretation for a specific disease. Therefore, it is necessary to design a robust biomarker identification method that can predict disease-associated interactions with high reproducibility. RESULTS In this article, we propose a regularized logistic regression model. Different from previous methods which focus on individual genes or modules, our model takes gene pairs, which are connected in a protein-protein interaction network, into account. A line graph is constructed to represent the adjacencies between pairwise interactions. Based on this line graph, we incorporate the degree information in the model via an adaptive elastic net, which makes our model less dependent on the expression data. Experimental results on six publicly available breast cancer datasets show that our method can not only achieve competitive performance in classification, but also retain great stability in variable selection. Therefore, our model is able to identify the diagnostic and prognostic biomarkers in a more robust way. Moreover, most of the biomarkers discovered by our model have been verified in biochemical or biomedical research. CONCLUSIONS The proposed method shows promise in the diagnosis of disease pathogenesis with different clinical characteristics. These advances lead to more accurate and stable biomarker discovery, which can monitor the functional changes that are perturbed by diseases. Based on these predictions, researchers may be able to provide suggestions for new therapeutic approaches.
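The line-graph construction mentioned in this abstract can be sketched in a few lines of Python. In a line graph, each interaction (edge) of the original network becomes a node, and two such nodes are adjacent when the original interactions share a gene. The gene names below are hypothetical, and how the resulting degrees enter the adaptive elastic net penalty is not specified in the abstract, so the sketch stops at computing the degrees.

```python
from itertools import combinations


def line_graph(edges):
    """Build the line graph of an undirected graph.

    Each input edge becomes a node (a frozenset of its two endpoints);
    two such nodes are adjacent when the original edges share an endpoint.
    """
    nodes = [frozenset(e) for e in edges]
    adjacency = {e: set() for e in nodes}
    for e1, e2 in combinations(nodes, 2):
        if e1 & e2:
            adjacency[e1].add(e2)
            adjacency[e2].add(e1)
    return adjacency


# Hypothetical gene-gene interactions (names are placeholders).
ppi_edges = [("G1", "G2"), ("G2", "G3"), ("G2", "G4"), ("G4", "G5")]
lg = line_graph(ppi_edges)

# Degree of each gene pair in the line graph: the number of other
# interactions that share a gene with it.
degree = {tuple(sorted(e)): len(nbrs) for e, nbrs in lg.items()}
print(degree)
```

A gene pair with a high line-graph degree sits among many overlapping interactions, which is the kind of topological information the abstract says is used to make variable selection less dependent on the expression data alone.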
Affiliation(s)
- Meng-Yun Wu: School of Statistics and Management, Shanghai University of Finance and Economics, Guoding Road, Shanghai, 200433, China; Key Laboratory of Mathematical Economics SUFE, Ministry of Education, Guoding Road, Shanghai, 200433, China
- Xiao-Fei Zhang: School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Luoyu Road, Wuhan, 430079, China
- Dao-Qing Dai: Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Xingang West Road, Guangzhou, 510275, China
- Le Ou-Yang: College of Information Engineering, Shenzhen University, Nanhai Avenue, Shenzhen, 518060, China
- Yuan Zhu: School of Automation, China University of Geosciences, Lumo Road, Wuhan, 430074, China
- Hong Yan: Department of Electronic and Engineering, City University of Hong Kong, Tat Chee Avenue, Hong Kong, 999077, China
|
26
|
Ou-Yang L, Wu M, Zhang XF, Dai DQ, Li XL, Yan H. A two-layer integration framework for protein complex detection. BMC Bioinformatics 2016; 17:100. [PMID: 26911324 PMCID: PMC4765032 DOI: 10.1186/s12859-016-0939-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 08/12/2015] [Accepted: 01/27/2016] [Indexed: 01/05/2023] Open
Abstract
Background Protein complexes carry out nearly all signaling and functional processes within cells. The study of protein complexes is an effective strategy to analyze cellular functions and biological processes. With the increasing availability of proteomics data, various computational methods have recently been developed to predict protein complexes. However, different computational methods are based on their own assumptions and designed to work on different data sources, and various biological screening methods have their unique experiment conditions, and are often different in scale and noise level. Therefore, a single computational method on a specific data source is generally not able to generate comprehensive and reliable prediction results. Results In this paper, we develop a novel Two-layer INtegrative Complex Detection (TINCD) model to detect protein complexes, leveraging the information from both clustering results and raw data sources. In particular, we first integrate various clustering results to construct consensus matrices for proteins to measure their overall co-complex propensity. Second, we combine these consensus matrices with the co-complex score matrix derived from Tandem Affinity Purification/Mass Spectrometry (TAP) data and obtain an integrated co-complex similarity network via an unsupervised metric fusion method. Finally, a novel graph regularized doubly stochastic matrix decomposition model is proposed to detect overlapping protein complexes from the integrated similarity network. Conclusions Extensive experimental results demonstrate that TINCD performs much better than 21 state-of-the-art complex detection techniques, including ensemble clustering and data integration techniques. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0939-3) contains supplementary material, which is available to authorized users.
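The first integration step described above, building a consensus matrix from several clustering results, can be illustrated with a short Python sketch. The definition used here (the fraction of clusterings in which two proteins share at least one cluster) is one common choice and is an assumption; the abstract does not give the exact formula used by TINCD. Protein names and clusterings are hypothetical.

```python
from itertools import combinations


def consensus_matrix(proteins, clusterings):
    """Fraction of clusterings in which each protein pair shares a cluster.

    proteins: list of protein identifiers.
    clusterings: list of clusterings; each clustering is a list of sets,
        and clusters within one clustering may overlap.
    Returns a dict mapping frozenset({p, q}) to a value in [0, 1].
    """
    counts = {frozenset(pair): 0 for pair in combinations(proteins, 2)}
    for clusters in clusterings:
        # Collect each co-clustered pair once per clustering,
        # even if the pair appears together in several overlapping clusters.
        together = set()
        for cluster in clusters:
            for pair in combinations(sorted(cluster), 2):
                together.add(frozenset(pair))
        for pair in together:
            if pair in counts:
                counts[pair] += 1
    return {pair: n / len(clusterings) for pair, n in counts.items()}


# Hypothetical outputs of three different complex-detection methods.
proteins = ["P1", "P2", "P3", "P4"]
clusterings = [
    [{"P1", "P2", "P3"}, {"P4"}],
    [{"P1", "P2"}, {"P3", "P4"}],
    [{"P1", "P2"}, {"P2", "P3"}, {"P4"}],
]
consensus = consensus_matrix(proteins, clusterings)
print(consensus[frozenset({"P1", "P2"})])
```

Pairs that every method places together receive a score of 1, while pairs grouped by only some methods receive intermediate scores, which gives a graded measure of co-complex propensity that can then be fused with scores from raw data.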
Affiliation(s)
- Le Ou-Yang: College of Information Engineering, Shenzhen University, Shenzhen, 518060, China; Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China; Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China
- Min Wu: Institute for Infocomm Research (I2R), A*STAR, 1 Fusionopolis Way, Singapore, Singapore
- Xiao-Fei Zhang: School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, 430079, China
- Dao-Qing Dai: Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
- Xiao-Li Li: Institute for Infocomm Research (I2R), A*STAR, 1 Fusionopolis Way, Singapore, Singapore
- Hong Yan: Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China
|
27
|
|
28
|
Ou-Yang L, Dai DQ, Zhang XF. Detecting Protein Complexes from Signed Protein-Protein Interaction Networks. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:1333-1344. [PMID: 26671805 DOI: 10.1109/tcbb.2015.2401014] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Indexed: 06/05/2023]
Abstract
Identification of protein complexes is fundamental for understanding the cellular functional organization. With the accumulation of physical protein-protein interaction (PPI) data, computational detection of protein complexes from available PPI networks has drawn much attention. While most of the existing protein complex detection algorithms focus on analyzing the physical protein-protein interaction network, none of them takes into account the "signs" (i.e., activation-inhibition relationships) of physical interactions. As the "signs" of interactions reflect the way proteins communicate, considering the "signs" of interactions can not only increase the accuracy of protein complex identification, but also deepen our understanding of the mechanisms of cell functions. In this study, we proposed a novel Signed Graph regularized Nonnegative Matrix Factorization (SGNMF) model to identify protein complexes from signed PPI networks. In our experiments, we compared the results collected by our model on signed PPI networks with those predicted by the state-of-the-art complex detection techniques on the original unsigned PPI networks. We observed that considering the "signs" of interactions significantly benefits the detection of protein complexes. Furthermore, based on the predicted complexes, we predicted a set of signed complex-complex interactions for each dataset, which provides novel insight into the higher level organization of the cell. All the experimental results and codes can be downloaded from http://mail.sysu.edu.cn/home/stsddq@mail.sysu.edu.cn/dai/others/SGNMF.zip.
|
29
|
Zhang XF, Ou-Yang L, Hu X, Dai DQ. Identifying binary protein-protein interactions from affinity purification mass spectrometry data. BMC Genomics 2015; 16:745. [PMID: 26438428 PMCID: PMC4595009 DOI: 10.1186/s12864-015-1944-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 11/06/2014] [Accepted: 09/22/2015] [Indexed: 02/04/2023] Open
Abstract
Background The identification of protein-protein interactions contributes greatly to the understanding of functional organization within cells. With the development of affinity purification-mass spectrometry (AP-MS) techniques, several computational scoring methods have been proposed to detect protein interactions from AP-MS data. However, most of the current methods focus on the detection of co-complex interactions and do not discriminate between direct physical interactions and indirect interactions. Consequently, less is known about the precise physical wiring diagram within cells. Results In this paper, we develop a Binary Interaction Network Model (BINM) to computationally identify direct physical interactions from co-complex interactions which can be inferred from purification data using previous scoring methods. This model provides a mathematical framework for capturing topological relationships between direct physical interactions and observed co-complex interactions. It reassigns a confidence score to each observed interaction to indicate its propensity to be a direct physical interaction. Then observed interactions with high confidence scores are predicted as direct physical interactions. We run our model on two yeast co-complex interaction networks which are constructed by two different scoring methods from the same combined AP-MS data set. The direct physical interactions identified by various methods are comprehensively benchmarked against different reference sets that provide both direct and indirect evidence for physical contacts. Experimental results show that our model performs competitively with the state-of-the-art methods. Conclusions According to the results obtained in this study, BINM is a powerful scoring method that can use network topology alone to predict direct physical interactions from AP-MS data. This study provides an alternative approach to explore the information inherent in AP-MS data. The software can be downloaded from https://github.com/Zhangxf-ccnu/BINM. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1944-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Xiao-Fei Zhang: School of Mathematics and Statistics, Central China Normal University, Luoyu Road, Wuhan, 430079, China
- Le Ou-Yang: Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Xingang West Road, Guangzhou, 510275, China
- Xiaohua Hu: School of Computer, Central China Normal University, 774 Luoyu Road, Wuhan, 430079, China; College of Information Science and Technology, Drexel University, Chestnut Street, Philadelphia, 19104, USA
- Dao-Qing Dai: Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Xingang West Road, Guangzhou, 510275, China
|
30
|
Abstract
In this paper, we propose a novel discriminative and compact coding (DCC) for robust face recognition. It introduces multiple error measurements into the regression model. They collaborate to tune regression codes of different properties (sparsity, compactness, high discriminating ability, etc.), to further improve the robustness and adaptivity of the regression model. We propose two types of coding models: 1) one that uses multiscale error measurements to produce sparse and highly discriminative codes and 2) one that encourages within-class collaborative representation to produce sparse and compact codes. The update of codes and the combination of different errors are automatically processed. DCC is also robust to the choice of parameters, producing stable regression residuals which are crucial to classification. Extensive experiments on benchmark datasets show that DCC has promising performance and outperforms other state-of-the-art regression models.
|
31
|
Lai ZR, Dai DQ, Ren CX, Huang KK. Multiscale logarithm difference edgemaps for face recognition against varying lighting conditions. IEEE Trans Image Process 2015; 24:1735-1747. [PMID: 25751866 DOI: 10.1109/tip.2015.2409988] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Indexed: 06/04/2023]
Abstract
The Lambertian model is a classical illumination model consisting of a surface albedo component and a light intensity component. Some previous studies assume that the light intensity component mainly lies in the large-scale features. They adopt holistic image decompositions to separate it out, but it is difficult to decide the separating point between large-scale and small-scale features. In this paper, we propose to take a logarithm transform, which changes the multiplication of surface albedo and light intensity into an additive model. Then, a difference (subtraction) between two pixels in a neighborhood can eliminate most of the light intensity component. By dividing a neighborhood into subregions, edgemaps of multiple scales can be obtained. Then, each edgemap is multiplied by a weight that can be determined by an independent training scheme. Finally, all the weighted edgemaps are combined to form a robust holistic feature map. Extensive experiments on four benchmark data sets in controlled and uncontrolled lighting conditions show that the proposed method has promising results, especially in uncontrolled lighting conditions, even mixed with other complicated variations.
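The cancellation described in this abstract can be illustrated with a minimal sketch. It assumes a toy Lambertian image (intensity = albedo × light) in which the light intensity is constant over the neighborhood; the function names `render` and `log_difference` are hypothetical and are not taken from the paper.

```python
import math

def render(albedo, light):
    """Toy Lambertian image: each pixel is albedo * light (light assumed locally constant)."""
    return [[a * light for a in row] for row in albedo]

def log_difference(image, p, q):
    """Difference of log-intensities between pixels p and q, each given as (row, col)."""
    (r1, c1), (r2, c2) = p, q
    return math.log(image[r1][c1]) - math.log(image[r2][c2])

# Hypothetical 2x2 albedo patch, rendered under dim and bright lighting.
albedo = [[0.2, 0.8], [0.5, 0.4]]
dim = render(albedo, 0.3)
bright = render(albedo, 3.0)

d_dim = log_difference(dim, (0, 0), (0, 1))
d_bright = log_difference(bright, (0, 0), (0, 1))
d_albedo = math.log(0.2) - math.log(0.8)  # the same difference computed on albedo alone
```

Because log(albedo × light) = log(albedo) + log(light), the light term drops out of the difference, so `d_dim`, `d_bright`, and `d_albedo` agree up to floating-point error. The multiscale subregions and trained weights of the paper are not reproduced here.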
32
Zhang XF, Ou-Yang L, Zhu Y, Wu MY, Dai DQ. Determining minimum set of driver nodes in protein-protein interaction networks. BMC Bioinformatics 2015; 16:146. [PMID: 25947063 PMCID: PMC4428234 DOI: 10.1186/s12859-015-0591-3] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Received: 12/17/2014] [Accepted: 04/22/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Recently, several studies have drawn attention to the determination of a minimum set of driver proteins that are important for the control of the underlying protein-protein interaction (PPI) networks. In general, the minimum dominating set (MDS) model is widely adopted. However, because the MDS model does not generate a unique MDS configuration, multiple different MDSs would be generated when using different optimization algorithms. Therefore, among these MDSs, it is difficult to find out the one that represents the true driver set of proteins. RESULTS To address this problem, we develop a centrality-corrected minimum dominating set (CC-MDS) model which includes heterogeneity in degree and betweenness centralities of proteins. Both the MDS model and the CC-MDS model are applied on three human PPI networks. Unlike the MDS model, the CC-MDS model generates almost the same sets of driver proteins when we implement it using different optimization algorithms. The CC-MDS model targets more high-degree and high-betweenness proteins than the uncorrected counterpart. The more central position allows CC-MDS proteins to be more important in maintaining the overall network connectivity than MDS proteins. To indicate the functional significance, we find that CC-MDS proteins are involved in, on average, more protein complexes and GO annotations than MDS proteins. We also find that more essential genes, aging genes, disease-associated genes and virus-targeted genes appear in CC-MDS proteins than in MDS proteins. As for the involvement in regulatory functions, the sets of CC-MDS proteins show much stronger enrichment of transcription factors and protein kinases. The results about topological and functional significance demonstrate that the CC-MDS model can capture more driver proteins than the MDS model. 
CONCLUSIONS Based on the results obtained, the CC-MDS model proves to be a powerful tool for the determination of driver proteins that can control the underlying PPI networks. The software described in this paper and the datasets used are available at https://github.com/Zhangxf-ccnu/CC-MDS.
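The non-uniqueness problem that motivates this paper can be seen on a very small graph. The sketch below enumerates all minimum dominating sets of a four-node path by brute force and then breaks the tie by total degree. The tie-break is only a hypothetical stand-in for the idea of a centrality correction; it is not the CC-MDS formulation, which also uses betweenness and is solved by optimization rather than enumeration.

```python
from itertools import combinations

def is_dominating(graph, nodes):
    """True if every node is in `nodes` or adjacent to a member of `nodes`."""
    covered = set(nodes)
    for n in nodes:
        covered |= graph[n]
    return covered == set(graph)

def minimum_dominating_sets(graph):
    """Brute-force list of all dominating sets of the smallest possible size."""
    for k in range(1, len(graph) + 1):
        found = [set(c) for c in combinations(sorted(graph), k)
                 if is_dominating(graph, c)]
        if found:
            return found
    return []

# Hypothetical toy network: the path a - b - c - d.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}

all_mds = minimum_dominating_sets(graph)
# Illustrative tie-break: prefer the set whose members have the largest total degree.
best = max(all_mds, key=lambda s: sum(len(graph[n]) for n in s))
```

Here four different sets of size two dominate the path, and the degree-based tie-break selects the two central nodes `b` and `c`. Brute-force enumeration is exponential in the number of nodes and is only meant for toy inputs.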
Affiliation(s)
- Xiao-Fei Zhang
- School of Mathematics and Statistics, Central China Normal University, Luoyu Road, Wuhan, 430079, China.
- Le Ou-Yang
- Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Xingang West Road, Guangzhou, 510275, China.
- Yuan Zhu
- School of Mathematics and Statistics, Guangdong University of Finance and Economics, ChiSha Road, Guangzhou, 510320, China.
- Meng-Yun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Guoding Road, Shanghai, 200433, China.
- Dao-Qing Dai
- Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Xingang West Road, Guangzhou, 510275, China.
33
Abstract
Face recognition under uncontrolled conditions, e.g., complex backgrounds and variable resolutions, is still challenging in image processing and computer vision. Although many methods have been shown to perform well in controlled settings, they usually generalize poorly across different data sets. Meanwhile, several properties of the source domain, such as the background and the size of subjects, play an important role in determining the final classification results. A transferable representation learning model is proposed in this paper to enhance the recognition performance. To deeply exploit the discriminant information from the source domain and the target domain, the bioinspired face representation is modeled as a structured and approximately stable characterization of the commonality between different domains. The method outputs a grouped boost of the features, and presents a reasonable manner for highlighting and sharing discriminant orientations and scales. Note that the method can be viewed as a framework, since other feature generation operators and classification metrics can be embedded therein; it can then be applied to more general problems, such as low-resolution face recognition, object detection and categorization, and so forth. Experiments on benchmark databases, including the uncontrolled Face Recognition Grand Challenge v2.0 and Labeled Faces in the Wild, show the efficacy of the proposed transfer learning algorithm.
34
Lai ZR, Dai DQ, Ren CX, Huang KK. Multilayer surface albedo for face recognition with reference images in bad lighting conditions. IEEE Trans Image Process 2014; 23:4709-4723. [PMID: 25216483 DOI: 10.1109/tip.2014.2356292] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/03/2023]
Abstract
In this paper, we propose a multilayer surface albedo (MLSA) model to tackle face recognition in bad lighting conditions, especially with reference images in bad lighting conditions. Some previous studies conclude that illumination variations mainly lie in the large-scale features of an image and extract small-scale features in the surface albedo (or surface texture). However, this surface albedo is not robust enough and still contains some detrimental sharp features. To improve the robustness of the surface albedo, MLSA further decomposes it as a linear sum of several detailed layers, to separate and represent features of different scales in a more specific way. Then, the layers are adjusted by separate weights, which are global parameters and are selected only once. A criterion function is developed to select these layer weights with an independent training set. Besides controlled illumination variations, MLSA is also effective under uncontrolled illumination variations, even mixed with other complicated variations (expression, pose, occlusion, and so on). Extensive experiments on four benchmark data sets show that MLSA has good receiver operating characteristic curves and statistical discriminating capability. The refined albedo improves recognition performance, especially with reference images in bad lighting conditions.
35
Ou-Yang L, Dai DQ, Li XL, Wu M, Zhang XF, Yang P. Detecting temporal protein complexes from dynamic protein-protein interaction networks. BMC Bioinformatics 2014; 15:335. [PMID: 25282536 PMCID: PMC4288635 DOI: 10.1186/1471-2105-15-335] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Received: 05/10/2014] [Accepted: 09/23/2014] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Proteins dynamically interact with each other to perform their biological functions. The dynamic operations of protein-protein interaction (PPI) networks are also reflected in the dynamic formations of protein complexes. Existing protein complex detection algorithms usually overlook the inherent temporal nature of protein interactions within PPI networks. Systematically analyzing the temporal protein complexes can not only improve the accuracy of protein complex detection, but also strengthen our biological knowledge of the dynamic protein assembly processes for cellular organization. RESULTS In this study, we propose a novel computational method to predict temporal protein complexes. In particular, we first construct a series of dynamic PPI networks by joint analysis of time-course gene expression data and protein interaction data. Then a Time Smooth Overlapping Complex Detection model (TS-OCD) is proposed to detect temporal protein complexes from these dynamic PPI networks. TS-OCD can naturally capture the smoothness of networks between consecutive time points and detect overlapping protein complexes at each time point. Finally, a nonnegative matrix factorization based algorithm is introduced to merge very similar temporal complexes across different time points. CONCLUSIONS Extensive experimental results demonstrate that the proposed method is more effective in detecting temporal protein complexes than state-of-the-art complex detection techniques. ELECTRONIC SUPPLEMENTARY MATERIAL The online version of this article (doi:10.1186/1471-2105-15-335) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Dao-Qing Dai
- Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China.
36
Yang Z, Li DM, Xie Q, Dai DQ. Protein expression and promoter methylation of the candidate biomarker TCF21 in gastric cancer. J Cancer Res Clin Oncol 2014; 141:211-20. [PMID: 25156819 DOI: 10.1007/s00432-014-1809-x] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 06/10/2014] [Accepted: 08/04/2014] [Indexed: 01/18/2023]
Abstract
PURPOSE Transcription factor 21 (TCF21) has been identified as a candidate tumor suppressor at 6q23-q24 that is epigenetically inactivated in many types of human cancers. This study aimed to determine the expression of TCF21 mRNA and protein in gastric cancer cell lines and tissue specimens, to investigate the prognostic impact of TCF21 expression in gastric cancer, and to analyze the relationship between TCF21 expression and methylation level. METHODS We used real-time PCR and immunohistochemical staining to detect the expression of TCF21 and used methylation-specific PCR to determine the methylation status of TCF21 in gastric cancer samples and gastric cancer cell lines. RESULTS The TCF21 expression level in gastric cancer samples was significantly lower than in normal adjacent tissue samples. The Kaplan-Meier survival analysis demonstrated that TCF21 was a significant prognosticator of cancer-specific survival (p = 0.001). Furthermore, the methylation level of TCF21 in gastric cancer samples was much higher than in normal adjacent tissue samples. Treatment with the DNA methyltransferase inhibitor 5-Aza-2'-deoxy-cytidine can upregulate the expression of TCF21 in gastric cancer cells. CONCLUSIONS These results suggest that low expression of TCF21 was an independent prognostic factor for poor survival in patients with gastric cancer. Aberrant methylation was an important reason for the downregulation of TCF21 and may be associated with tumorigenesis in gastric cancer.
Affiliation(s)
- Z Yang
- Cancer Center, The Fourth Hospital of China Medical University, Shenyang, China
37
Zhang XF, Dai DQ, Ou-Yang L, Yan H. Detecting overlapping protein complexes based on a generative model with functional and topological properties. BMC Bioinformatics 2014; 15:186. [PMID: 24928559 PMCID: PMC4073817 DOI: 10.1186/1471-2105-15-186] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Received: 03/07/2014] [Accepted: 06/09/2014] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND Identification of protein complexes can help us get a better understanding of cellular mechanisms. With the increasing availability of large-scale protein-protein interaction (PPI) data, numerous computational approaches have been proposed to detect complexes from the PPI networks. However, most of the current approaches do not consider overlaps among complexes or functional annotation information of individual proteins. Therefore, they might not be able to reflect the biological reality faithfully or make full use of the available domain-specific knowledge. RESULTS In this paper, we develop a Generative Model with Functional and Topological Properties (GMFTP) to describe the generative processes of the PPI network and the functional profile. The model provides a working mechanism for capturing the interaction structures and the functional patterns of proteins. By combining the functional and topological properties, we formulate the problem of identifying protein complexes as that of detecting a group of proteins which frequently interact with each other in the PPI network and have similar annotation patterns in the functional profile. Using the idea of link communities, our method naturally deals with overlaps among complexes. The benefits brought by the functional properties are demonstrated by real data analysis. The results evaluated using four criteria with respect to two gold standards show that GMFTP performs competitively with state-of-the-art approaches. The effectiveness of detecting overlapping complexes is also demonstrated by analyzing the topological and functional features of multi- and mono-group proteins. CONCLUSIONS Based on the results obtained in this study, GMFTP proves to be a powerful approach for the identification of overlapping protein complexes using both the PPI network and the functional profile. The software can be downloaded from http://mail.sysu.edu.cn/home/stsddq@mail.sysu.edu.cn/dai/others/GMFTP.zip.
Affiliation(s)
- Dao-Qing Dai
- Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Xingang Road West, 510275 Guangzhou, China.
38
Ren CX, Dai DQ, Li XX, Lai ZR. Band-Reweighed Gabor Kernel Embedding for Face Image Representation and Recognition. IEEE Trans Image Process 2014; 23:725-740. [PMID: 26270914 DOI: 10.1109/tip.2013.2292560] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 06/04/2023]
Abstract
Face recognition with illumination or pose variation is a challenging problem in image processing and pattern recognition. A novel algorithm using band-reweighed Gabor kernel embedding to deal with the problem is proposed in this paper. A given image is first transformed by a group of Gabor filters, which output Gabor features using different orientation and scale parameters. A Fisher scoring function is used to measure the importance of features in each band, and the features with the largest scores are then preserved to reduce memory requirements. The reduced bands are combined by a weight vector, which is determined by a weighted kernel discriminant criterion and solved by a constrained quadratic programming method, and the weighted sum of these nonlinear bands is then defined as the similarity between two images. Compared with existing concatenation-based Gabor feature representation and uniformly weighted similarity calculation approaches, our method provides a new way to use Gabor features for face recognition and presents a reasonable interpretation for highlighting discriminant orientations and scales. The minimum Mahalanobis distance considering the spatial correlations within the data is exploited for feature matching, and the graphical lasso is used therein for directly estimating the sparse inverse covariance matrix. Experiments using benchmark databases show that our new algorithm improves the recognition results and obtains competitive performance.
39
Zhu Y, Zhou W, Dai DQ, Yan H. Identification of DNA-binding and protein-binding proteins using enhanced graph wavelet features. IEEE/ACM Trans Comput Biol Bioinform 2013; 10:1017-1031. [PMID: 24334394 DOI: 10.1109/tcbb.2013.117] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/03/2023]
Abstract
Interactions between biomolecules play an essential role in various biological processes. For predicting DNA-binding or protein-binding proteins, many machine-learning-based techniques have used various types of features to represent the interface of the complexes, but they only deal with the properties of a single atom in the interface and do not directly take into account the information of neighboring atoms. This paper proposes a new feature representation method for biomolecular interfaces based on the theory of graph wavelets. The enhanced graph wavelet features (EGWF) provide an effective way to characterize interface features by adding physicochemical features and exploiting a graph wavelet formulation. In particular, the graph wavelet condenses the information around the center atom, and thus enhances the discrimination of features of biomolecule-binding proteins in the feature space. Experimental results show that EGWF performs effectively for predicting DNA-binding and protein-binding proteins in terms of the Matthews correlation coefficient (MCC) score and the area under the receiver operating characteristic curve (AUC).
Affiliation(s)
- Yuan Zhu
- Guangdong University of Finance and Economics, Guangzhou and Sun Yat-Sen University, Guangzhou
- Hong Yan
- City University of Hong Kong, Hong Kong and University of Sydney, Sydney
40
Wu MY, Dai DQ, Zhang XF, Zhu Y. Cancer subtype discovery and biomarker identification via a new robust network clustering algorithm. PLoS One 2013; 8:e66256. [PMID: 23799085 PMCID: PMC3684607 DOI: 10.1371/journal.pone.0066256] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 02/08/2013] [Accepted: 05/02/2013] [Indexed: 11/29/2022] Open
Abstract
In cancer biology, it is very important to understand the phenotypic changes of the patients and discover new cancer subtypes. Recently, microarray-based technologies have shed light on this problem based on gene expression profiles, which may contain outliers due to either chemical or electrical reasons. These undiscovered subtypes may be heterogeneous with respect to underlying networks or pathways, and are related to only a few interdependent biomarkers. This motivates a need for robust gene expression-based methods capable of discovering such subtypes, elucidating the corresponding network structures and identifying cancer-related biomarkers. This study proposes a penalized model-based Student’s t clustering with unconstrained covariance (PMT-UC) to discover cancer subtypes with cluster-specific networks, taking gene dependencies into account and having robustness against outliers. Meanwhile, biomarker identification and network reconstruction are achieved by imposing an adaptive penalty on the means and the inverse scale matrices. The model is fitted via the expectation maximization algorithm utilizing the graphical lasso. Here, a network-based gene selection criterion that identifies biomarkers not as individual genes but as subnetworks is applied. This allows us to implicate weakly discriminative biomarkers which play a central role in the subnetwork by interconnecting many differentially expressed genes, or which have cluster-specific underlying network structures. Experimental results on simulated datasets and one available cancer dataset attest to the effectiveness and robustness of PMT-UC in cancer subtype discovery. Moreover, PMT-UC is able to select cancer-related biomarkers that have been verified in biochemical or biomedical research and to learn biologically significant correlations among genes.
Affiliation(s)
- Meng-Yun Wu
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Dao-Qing Dai
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- * E-mail:
- Xiao-Fei Zhang
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Yuan Zhu
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Department of Mathematics, Guangdong University of Business Studies, Guangzhou, China
41
Ou-Yang L, Dai DQ, Zhang XF. Protein complex detection via weighted ensemble clustering based on Bayesian nonnegative matrix factorization. PLoS One 2013; 8:e62158. [PMID: 23658709 PMCID: PMC3642239 DOI: 10.1371/journal.pone.0062158] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 12/24/2012] [Accepted: 03/18/2013] [Indexed: 12/05/2022] Open
Abstract
Detecting protein complexes from protein-protein interaction (PPI) networks is a challenging task in computational biology. A vast number of computational methods have been proposed to undertake this task. However, each computational method is developed to capture one aspect of the network. The performance of different methods on the same network can differ substantially, and even the same method may perform differently on networks with different topological characteristics. The clustering result of each computational method can be regarded as a feature that describes the PPI network from one aspect. It is therefore desirable to utilize these features to produce a more accurate and reliable clustering. In this paper, a novel Bayesian Nonnegative Matrix Factorization (NMF)-based weighted Ensemble Clustering algorithm (EC-BNMF) is proposed to detect protein complexes from PPI networks. We first apply different computational algorithms to a PPI network to generate some base clustering results. Then we integrate these base clustering results into an ensemble PPI network, in the form of a weighted combination. Finally, we identify overlapping protein complexes from this network by employing a Bayesian NMF model. When generating an ensemble PPI network, EC-BNMF can automatically optimize the values of the weights such that the ensemble algorithm can deliver better results. Experimental results on four PPI networks of Saccharomyces cerevisiae verify the effectiveness of EC-BNMF in detecting protein complexes. EC-BNMF provides an effective way to integrate different clustering results for more accurate and reliable complex detection. Furthermore, EC-BNMF has a high degree of flexibility in the choice of base clustering results. It can be coupled with existing clustering methods to identify protein complexes.
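The weighted-combination step mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes each base clustering is turned into a co-membership matrix and that the weights are fixed by hand; in EC-BNMF the weights are optimized automatically, and the Bayesian NMF step is not shown. The function names and the toy clusterings are hypothetical.

```python
def co_membership(n, clusters):
    """n x n matrix with 1.0 where two distinct proteins share at least one cluster."""
    M = [[0.0] * n for _ in range(n)]
    for cluster in clusters:
        for i in cluster:
            for j in cluster:
                if i != j:
                    M[i][j] = 1.0
    return M

def weighted_ensemble(n, base_results, weights):
    """Weighted sum of the co-membership matrices of several base clusterings."""
    E = [[0.0] * n for _ in range(n)]
    for clusters, w in zip(base_results, weights):
        M = co_membership(n, clusters)
        for i in range(n):
            for j in range(n):
                E[i][j] += w * M[i][j]
    return E

# Two hypothetical base clusterings of four proteins, indexed 0..3.
base_results = [
    [{0, 1, 2}, {3}],
    [{0, 1}, {2, 3}],
]
# Hand-picked weights, for illustration only.
ensemble = weighted_ensemble(4, base_results, [0.75, 0.25])
```

Pairs that co-cluster in both base results (0 and 1) receive the full weight of 1.0, pairs supported by only one method receive that method's weight, and pairs never clustered together stay at zero.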
Affiliation(s)
- Le Ou-Yang
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Dao-Qing Dai
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Xiao-Fei Zhang
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
42
Abstract
Face recognition with occlusion is common in the real world. Inspired by work on structured sparse representation, we explore the structure of the error incurred by occlusion from two aspects: the error morphology and the error distribution. Since human beings recognize the occlusion mainly according to its region shape or profile without knowing exactly what the occlusion is, we argue that the shape of the occlusion is also an important feature. We propose a morphological graph model to describe the morphological structure of the error. Due to the uncertainty of the occlusion, the distribution of the error incurred by occlusion is also uncertain. However, we observe that the unoccluded part and the occluded part of the error, measured by the correntropy induced metric, each follow an exponential distribution. Incorporating the two aspects of the error structure, we propose structured sparse error coding for face recognition with occlusion. Our extensive experiments demonstrate that the proposed method is more stable and has a higher breakdown point in dealing with occlusion problems in face recognition than related state-of-the-art methods, especially in extreme situations such as high-level occlusion and low feature dimension.
Affiliation(s)
- Xiao-Xin Li
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China.
43
Zhu Y, Zhang XF, Dai DQ, Wu MY. Identifying spurious interactions and predicting missing interactions in the protein-protein interaction networks via a generative network model. IEEE/ACM Trans Comput Biol Bioinform 2013; 10:219-225. [PMID: 23702559 DOI: 10.1109/tcbb.2012.164] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 06/02/2023]
Abstract
With the rapid development of high-throughput experimental techniques for protein-protein interaction (PPI) detection, a large amount of PPI network data is becoming available. However, the data produced by these techniques have high levels of spurious and missing interactions. This study assigns a new reliability indication to each protein pair via a new generative network model (RIGNM), in which the scale-free property of the PPI network is considered, to reliably identify both spurious and missing interactions in the observed high-throughput PPI network. The experimental results show that RIGNM is more effective and interpretable than the compared methods, which demonstrates that this approach has the potential to better describe PPI networks and drive new discoveries.
Affiliation(s)
- Yuan Zhu
- Department of Mathematics, Guangdong University of Business Studies, Guangzhou, China.
44
Wu MY, Dai DQ, Shi Y, Yan H, Zhang XF. Biomarker identification and cancer classification based on microarray data using Laplace naive Bayes model with mean shrinkage. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:1649-1662. [PMID: 22868679 DOI: 10.1109/tcbb.2012.105] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 06/01/2023]
Abstract
Biomarker identification and cancer classification are two closely related problems. In gene expression data sets, the correlation between genes can be high when they share the same biological pathway. Moreover, the gene expression data sets may contain outliers due to either chemical or electrical reasons. A good gene selection method should take group effects into account and be robust to outliers. In this paper, we propose a Laplace naive Bayes model with mean shrinkage (LNB-MS). The Laplace distribution, instead of the normal distribution, is used as the conditional distribution of the samples because it is less sensitive to outliers and has been applied in many fields. The key technique is the L1 penalty imposed on the mean of each class to achieve automatic feature selection. The objective function of the proposed model is a piecewise linear function with respect to the mean of each class, so its optimal value can be found simply by evaluating it at the breakpoints. An efficient algorithm is designed to estimate the parameters in the model. A new strategy that uses the number of selected features to control the regularization parameter is introduced. Experimental results on simulated data sets and 17 publicly available cancer data sets attest to the accuracy, sparsity, efficiency, and robustness of the proposed algorithm. Many biomarkers identified with our method have been verified in biochemical or biomedical research. The analysis of biological and functional correlation of the genes based on Gene Ontology (GO) terms shows that the proposed method guarantees the selection of highly correlated genes simultaneously.
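The breakpoint argument in this abstract can be made concrete with a one-dimensional sketch. It assumes a simplified per-feature objective, the sum of absolute deviations |x_i - mu| plus an L1 penalty lam * |mu| that shrinks the class mean toward zero (for data assumed to be centered beforehand). The exact LNB-MS objective may differ, and the function name `shrunken_center` is hypothetical.

```python
def shrunken_center(xs, lam):
    """Minimize sum(|x - mu|) + lam * |mu| over mu by checking breakpoints only.

    The objective is convex and piecewise linear in mu, so a minimizer always
    lies at one of its breakpoints: the sample values and zero.
    """
    def objective(mu):
        return sum(abs(x - mu) for x in xs) + lam * abs(mu)

    candidates = list(xs) + [0.0]
    return min(candidates, key=objective)

# Hypothetical expression values of one gene in one class.
samples = [0.5, 1.0, 1.5]

no_penalty = shrunken_center(samples, 0.0)     # the median of the samples
mild_penalty = shrunken_center(samples, 2.0)   # pulled toward zero
strong_penalty = shrunken_center(samples, 5.0) # shrunk exactly to zero
```

With no penalty the minimizer is the sample median, which is what makes the Laplace model robust to outliers. As `lam` grows, the estimate moves to a breakpoint closer to zero and eventually to zero itself, which is how an L1 penalty can switch a feature off.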
Affiliation(s)
- Meng-Yun Wu
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China.
45
Zhang XF, Dai DQ, Ou-Yang L, Wu MY. Exploring overlapping functional units with various structure in protein interaction networks. PLoS One 2012; 7:e43092. [PMID: 22916212 PMCID: PMC3423443 DOI: 10.1371/journal.pone.0043092] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 06/19/2012] [Accepted: 07/16/2012] [Indexed: 11/18/2022] Open
Abstract
Revealing functional units in protein-protein interaction (PPI) networks is important for understanding cellular functional organization. Current algorithms for identifying functional units mainly focus on cohesive protein complexes, which have more internal interactions than external interactions. Most of these approaches do not handle overlaps among complexes since they usually allow a protein to belong to only one complex. Moreover, recent studies have shown that other non-cohesive structural functional units beyond complexes also exist in PPI networks. Thus, previous algorithms that focus only on non-overlapping cohesive complexes are not able to fully represent the biological reality. Here, we develop a new regularized sparse random graph model (RSRGM) to explore overlapping and various structural functional units in PPI networks. RSRGM is principally governed by two model parameters. One is used to define the functional units as groups of proteins that have similar patterns of connections to others, which allows RSRGM to detect non-cohesive structural functional units. The other is used to represent the degree to which proteins belong to the units, which allows a protein to belong to more than one revealed unit. We also propose a regularizer to control the smoothness between the estimators of these two parameters. Experimental results on four S. cerevisiae PPI networks show that the performance of RSRGM in detecting cohesive complexes and overlapping complexes is superior to that of previous competing algorithms. Moreover, RSRGM is able to discover biologically significant functional units besides complexes.
Affiliation(s)
- Xiao-Fei Zhang
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Dao-Qing Dai
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Le Ou-Yang
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Meng-Yun Wu
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
|
46
|
Abstract
Practical video scene and face recognition systems are sometimes confronted with low-resolution (LR) images. The faces may be very small even if the video is clear, so it is difficult to directly measure the similarity between the faces and the high-resolution (HR) training samples. Face recognition based on traditional super-resolution (SR) methods usually has limited performance because the objective of SR may not be consistent with that of classification, and time-consuming SR algorithms are not suitable for real-time applications. In this paper, a new feature extraction method called Coupled Kernel Embedding (CKE) is proposed for LR face recognition without any SR preprocessing. In this method, the final kernel matrix is constructed by placing two individual kernel matrices along the diagonal (a block-diagonal arrangement), so that positive (semi-)definiteness is preserved for optimization. CKE addresses the problem of comparing multi-modal data, which is difficult for conventional methods in practice because an efficient similarity measure is lacking. In particular, different kernel types (e.g., linear, Gaussian, polynomial) can be integrated into a unified optimization objective, which cannot be achieved by simple linear methods. CKE solves this problem by minimizing the dissimilarities captured by the kernel Gram matrices in the low- and high-resolution spaces. In the implementation, the nonlinear objective function is minimized by a generalized eigenvalue decomposition. Experiments on benchmark and real databases show that CKE improves recognition performance.
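The diagonal construction of the kernel matrix can be illustrated with a small numerical sketch. This is not the CKE implementation: the random feature matrices, the kernel choices, and the helper names (`linear_gram`, `gaussian_gram`, `block_diagonal`) are assumptions made purely for illustration. The sketch only checks that placing two positive semidefinite Gram matrices on the diagonal of a larger matrix gives a matrix that is again symmetric and positive semidefinite, even when the two kernels are of different types.

```python
import numpy as np

def linear_gram(X):
    """Gram matrix of the linear kernel k(x, y) = <x, y>."""
    return X @ X.T

def gaussian_gram(X, sigma=1.0):
    """Gram matrix of the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)  # clip tiny negative values caused by round-off
    return np.exp(-d2 / (2.0 * sigma ** 2))

def block_diagonal(K_a, K_b):
    """Place K_a and K_b on the diagonal; off-diagonal blocks are zero."""
    n_a, n_b = K_a.shape[0], K_b.shape[0]
    K = np.zeros((n_a + n_b, n_a + n_b))
    K[:n_a, :n_a] = K_a
    K[n_a:, n_a:] = K_b
    return K

# Hypothetical features: 5 low-resolution and 5 high-resolution samples.
rng = np.random.default_rng(0)
X_lr = rng.normal(size=(5, 16))
X_hr = rng.normal(size=(5, 64))

K_lr = gaussian_gram(X_lr, sigma=4.0)  # one kernel type for the LR space
K_hr = linear_gram(X_hr)               # a different type for the HR space
K = block_diagonal(K_lr, K_hr)

# The smallest eigenvalue is nonnegative up to numerical round-off.
min_eig = np.linalg.eigvalsh(K).min()
```

The eigenvalues of a block-diagonal matrix are the union of the eigenvalues of its blocks, which is why the combined matrix inherits positive semidefiniteness from the two Gram matrices.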
|
47
|
Wu MY, Dai DQ, Yan H. PRL-dock: Protein-ligand docking based on hydrogen bond matching and probabilistic relaxation labeling. Proteins 2012; 80:2137-53. [DOI: 10.1002/prot.24104] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 01/20/2012] [Revised: 04/14/2012] [Accepted: 04/17/2012] [Indexed: 11/08/2022]
|
48
|
Zhang XF, Dai DQ. A framework for incorporating functional interrelationships into protein function prediction algorithms. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:740-753. [PMID: 22084148 DOI: 10.1109/tcbb.2011.148] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 05/31/2023]
Abstract
The functional annotation of proteins is one of the most important tasks in the post-genomic era. Although many computational approaches have been developed in recent years to predict protein function, most of these traditional algorithms do not take interrelationships among functional terms into account; for example, different GO terms often co-annotate some common proteins. In this study, we propose a new functional similarity measure in the form of a Jaccard coefficient to quantify these interrelationships, and we develop a framework for incorporating GO term similarity into the protein function prediction process. Cross-validation results on S. cerevisiae and Homo sapiens data sets demonstrate that our method improves the performance of protein function prediction. In addition, we find that small terms, which are associated with only a few proteins, benefit more than large terms when functional interrelationships are considered. We also compare our similarity measure with two other widely used measures, and the results indicate that our proposed measure is more effective when incorporated into function prediction algorithms. Experimental results also show that our algorithms outperform two previous competing algorithms, which also take functional interrelationships into account, in prediction accuracy. Finally, we show that our method is robust to the current incompleteness of annotations in the database. These results give new insights into the importance of functional interrelationships in protein function prediction.
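As a rough illustration of a Jaccard-type similarity between functional terms, the sketch below scores two terms by the overlap of the protein sets they annotate. The exact definition used in the paper is not given in the abstract, so this is the plain Jaccard coefficient under that assumption; the term labels (`term_A`, `term_B`, `term_C`), the protein labels, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def term_similarity(annotations):
    """Pairwise Jaccard similarity between terms, keyed by (term, term)."""
    terms = sorted(annotations)
    return {
        (s, t): jaccard(annotations[s], annotations[t])
        for s in terms
        for t in terms
    }

# Hypothetical annotation table: functional term -> proteins annotated with it.
annotations = {
    "term_A": {"P1", "P2", "P3", "P4"},
    "term_B": {"P3", "P4", "P5"},
    "term_C": {"P6"},
}

sim = term_similarity(annotations)
# term_A and term_B share 2 of 5 distinct proteins, so their similarity is 0.4;
# term_A and term_C share none, so their similarity is 0.0.
```

A matrix of such scores could then be used to weight evidence from related terms when predicting a given term, which is the kind of use the framework in the abstract describes.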
Affiliation(s)
- Xiao-Fei Zhang
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China.
|
49
|
Zhang XF, Dai DQ, Li XX. Protein complexes discovery based on protein-protein interaction data via a regularized sparse generative network model. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:857-870. [PMID: 22291160 DOI: 10.1109/tcbb.2012.20] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 05/31/2023]
Abstract
Detecting protein complexes from protein interaction networks is a major task in the postgenome era. Previously developed computational algorithms for identifying complexes mainly focus on graph partitioning or dense region finding. Most of these traditional algorithms cannot discover the overlapping complexes that exist in protein-protein interaction (PPI) networks. Although some density-based methods have been developed to identify overlapping complexes, they cannot discover complexes that include peripheral proteins. In this study, motivated by the recent successful application of generative network models to describe the generation process of PPI networks and to detect communities in social networks, we develop a regularized sparse generative network model (RSGNM) for protein complex identification. RSGNM extends an existing generative network model by adding a process that generates propensities from an exponential distribution and by incorporating a Laplacian regularizer. Because the propensities are assumed to follow an exponential distribution, their estimators are sparse, which not only has a good biological interpretation but also helps to control the overlapping rate among detected complexes. The Laplacian regularizer makes the estimated propensities smoother over the interaction network. Experimental results on three yeast PPI networks show that RSGNM outperforms six previous competing algorithms in terms of the quality of detected complexes. In addition, RSGNM is able to detect overlapping complexes and complexes that include peripheral proteins simultaneously. These results give new insights into the importance of generative network models in protein complex identification.
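The two regularizing ingredients named in the abstract can be sketched generically. The abstract does not state the RSGNM objective, so nothing below should be read as the authors' formulation: the toy network, the propensity matrix `Theta`, and all function names are assumptions. The sketch shows (i) that the negative log-density of independent exponential priors is linear in the nonnegative propensities, which acts as an L1-type sparsity penalty, and (ii) that the graph Laplacian penalty equals a sum of squared differences over interacting protein pairs, so it is small when neighbors have similar propensities.

```python
import numpy as np

# Hypothetical 4-protein interaction network (symmetric adjacency matrix).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
L = np.diag(A.sum(axis=1)) - A  # unnormalized graph Laplacian

# Hypothetical nonnegative propensities: rows = proteins, columns = complexes.
Theta = np.array([
    [0.9, 0.0],
    [0.8, 0.1],
    [0.7, 0.3],
    [0.0, 1.0],
])

def laplacian_penalty(Theta, L):
    """Smoothness penalty tr(Theta^T L Theta)."""
    return float(np.trace(Theta.T @ L @ Theta))

def pairwise_penalty(Theta, A):
    """Equivalent form: 0.5 * sum_ij A_ij * ||Theta_i - Theta_j||^2."""
    diff = Theta[:, None, :] - Theta[None, :, :]
    return float(0.5 * np.sum(A * np.sum(diff ** 2, axis=2)))

def exponential_neg_log_prior(Theta, rate):
    """-log of prod(rate * exp(-rate * theta)) = rate * sum(theta) - n * log(rate)."""
    return float(rate * Theta.sum() - Theta.size * np.log(rate))

smooth = laplacian_penalty(Theta, L)
sparse = exponential_neg_log_prior(Theta, rate=2.0)
```

Because the exponential term grows linearly with every nonzero propensity, a maximum a posteriori estimate under such a prior tends to push small propensities to exactly zero, which is consistent with the sparsity the abstract describes.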
Affiliation(s)
- Xiao-Fei Zhang
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China.
|
50
|
Zhang WF, Tang G, Dai DQ, Nehorai A. Estimation of reflectance from camera responses by the regularized local linear model. Opt Lett 2011; 36:3933-5. [PMID: 21964146 DOI: 10.1364/ol.36.003933] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/23/2023]
Abstract
Because fixed basis functions have limited approximation capability, reflectance estimates obtained with traditional linear models are not optimal. We propose an approach based on the regularized local linear model. Our approach is efficient and requires no knowledge of the spectral power distribution of the illuminant or the spectral sensitivities of the camera. Experimental results show that the proposed method performs better than some well-known methods in terms of both reflectance error and colorimetric error.
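The abstract does not specify how the local model is built, so the following is only a generic sketch of one way a regularized local linear estimator can work: for each query camera response, fit a ridge regression on its nearest training responses and apply it to the query. The neighborhood rule, the ridge penalty, the function name `local_ridge_predict`, and the toy data are all assumptions, not the method of the paper. Note that the fit uses only training pairs of responses and reflectances, with no illuminant or sensor model.

```python
import numpy as np

def local_ridge_predict(C_train, R_train, c_query, k=10, lam=1e-3):
    """Estimate a reflectance spectrum for one camera response.

    C_train: (n, channels) training camera responses
    R_train: (n, bands) training reflectance spectra
    c_query: (channels,) camera response to invert
    k:       number of nearest training samples used for the local fit
    lam:     ridge regularization strength
    """
    dist = np.linalg.norm(C_train - c_query, axis=1)
    idx = np.argsort(dist)[:k]                       # k nearest neighbors
    C_loc = np.hstack([C_train[idx], np.ones((len(idx), 1))])  # add bias column
    R_loc = R_train[idx]
    gram = C_loc.T @ C_loc + lam * np.eye(C_loc.shape[1])
    W = np.linalg.solve(gram, C_loc.T @ R_loc)       # local ridge solution
    return np.append(c_query, 1.0) @ W

# Toy data in which reflectance is an exact linear function of the response,
# chosen only so that the fit can be checked; real data are not this simple.
rng = np.random.default_rng(0)
C_train = rng.uniform(size=(50, 3))    # 3-channel camera responses
M = rng.uniform(size=(3, 31))          # hypothetical response-to-spectrum map
R_train = C_train @ M                  # 31-band reflectance spectra
c_query = np.array([0.4, 0.5, 0.6])

r_hat = local_ridge_predict(C_train, R_train, c_query, k=10, lam=1e-8)
```

In practice `k` and `lam` would need to be tuned, for example by cross-validation on the training set; larger `lam` trades fidelity for stability when the local neighborhood is small or nearly degenerate.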
Affiliation(s)
- Wei-Feng Zhang
- Department of Applied Mathematics, South China Agricultural University, Guangzhou 510642, China.
|