1. Li Y, Qin Y, Sun Y, Peng D, Peng X, Hu P. RoMo: Robust Unsupervised Multimodal Learning With Noisy Pseudo Labels. IEEE Transactions on Image Processing 2024; 33:5086-5097. [PMID: 39190513] [DOI: 10.1109/tip.2024.3426482]
Abstract
The rise of the metaverse and the increasing volume of heterogeneous 2D and 3D data have created a growing demand for cross-modal retrieval, enabling users to query semantically relevant data across different modalities. Existing methods heavily rely on class labels to bridge semantic correlations; however, collecting large-scale, well-labeled data is expensive and often impractical, making unsupervised learning more attractive and feasible. Nonetheless, unsupervised cross-modal learning faces challenges in bridging semantic correlations due to the lack of label information, leading to unreliable discrimination. In this paper, we reveal and study a novel problem: unsupervised cross-modal learning with noisy pseudo labels. To address this issue, we propose a 2D-3D unsupervised multimodal learning framework that leverages multimodal data. Our framework consists of three key components: 1) a Self-matching Supervision Mechanism (SSM) warms up the model to encapsulate discrimination into the representations in a self-supervised manner; 2) Robust Discriminative Learning (RDL) further mines discrimination from the imperfect predictions learned after warming up. To tackle the noise in the predicted pseudo labels, RDL leverages a novel Robust Concentrating Learning Loss (RCLL) to alleviate the influence of uncertain samples, thus embracing robustness against noisy pseudo labels; 3) a Modality-invariance Learning Mechanism (MLM) minimizes the cross-modal discrepancy to force SSM and RDL to produce common representations. We conduct comprehensive experiments on four 2D-3D multimodal datasets, comparing our method against 14 state-of-the-art approaches, thereby demonstrating its effectiveness and superiority.
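The abstract does not give the form of the Robust Concentrating Learning Loss, so the snippet below is only a hedged illustration of the general idea of down-weighting uncertain pseudo-labelled samples; the confidence-based weighting and the confidence_weighted_ce helper are assumptions for illustration, not the authors' RCLL.

```python
# Illustrative only: a generic confidence-weighted cross-entropy for noisy
# pseudo labels, NOT the RCLL defined in the paper (whose form is not given here).
import torch
import torch.nn.functional as F

def confidence_weighted_ce(logits: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
    """Down-weight samples whose pseudo label the model itself finds uncertain."""
    probs = logits.softmax(dim=1)                                            # (N, C) predicted distribution
    conf = probs.gather(1, pseudo_labels.unsqueeze(1)).squeeze(1).detach()   # confidence of the assigned label
    per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")    # (N,)
    return (conf * per_sample).mean()                                        # uncertain samples contribute less

# Toy usage: 4 samples, 3 pseudo classes
logits = torch.randn(4, 3, requires_grad=True)
pseudo = torch.tensor([0, 2, 1, 1])
loss = confidence_weighted_ce(logits, pseudo)
loss.backward()
```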
2. Chen C, Ye M, Qi M, Du B. SketchTrans: Disentangled Prototype Learning With Transformer for Sketch-Photo Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024; 46:2950-2964. [PMID: 38010930] [DOI: 10.1109/tpami.2023.3337005]
Abstract
Matching hand-drawn sketches with photos (a.k.a. sketch-photo recognition or re-identification) faces the information asymmetry challenge due to the abstract nature of the sketch modality. Existing works tend to learn shared embedding spaces with CNN models, either discarding the appearance cues of photo images or introducing GANs for sketch-photo synthesis. The former unavoidably loses discriminability, while the latter contains ineffaceable generation noise. In this paper, we make the first attempt to design an information-aligned sketch transformer (SketchTrans+) via cross-modal disentangled prototype learning, as the transformer has shown great promise for discriminative visual modelling. Specifically, we design an asymmetric disentanglement scheme with a dynamically updatable auxiliary sketch (A-sketch) to align the modality representations without sacrificing information. The asymmetric disentanglement decomposes the photo representations into sketch-relevant and sketch-irrelevant cues, transferring sketch-irrelevant knowledge into the sketch modality to compensate for the missing information. Moreover, considering the feature discrepancy between the two modalities, we present a modality-aware prototype contrastive learning method that mines representative modality-sharing information using the modality-aware prototypes rather than the original feature representations. Extensive experiments on category- and instance-level sketch-based datasets validate the superiority of the proposed method under various metrics.
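As a rough, hedged sketch of what prototype-based cross-modal contrastive learning can look like (the paper's exact modality-aware formulation is not reproduced here), the snippet below averages per-class prototypes in each modality and contrasts features against the other modality's prototypes; the class_prototypes and proto_contrast helpers and the temperature value are illustrative assumptions.

```python
# Hedged sketch: cross-modal prototype contrast. Per-class prototypes are averaged
# per modality, and each feature is pulled toward the other modality's prototype of
# its class via an InfoNCE-style loss. This is an illustration, not the paper's loss.
import torch
import torch.nn.functional as F

def class_prototypes(feats: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    protos = torch.zeros(num_classes, feats.size(1))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = feats[mask].mean(dim=0)          # class mean in this modality
    return F.normalize(protos, dim=1)

def proto_contrast(feats: torch.Tensor, labels: torch.Tensor,
                   other_protos: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    feats = F.normalize(feats, dim=1)
    logits = feats @ other_protos.t() / tau              # similarity to every class prototype
    return F.cross_entropy(logits, labels)               # own-class prototype is the positive

# Toy usage with random photo/sketch features of 3 classes
photo, sketch = torch.randn(8, 16), torch.randn(8, 16)
labels = torch.randint(0, 3, (8,))
loss = proto_contrast(photo, labels, class_prototypes(sketch, labels, 3)) \
     + proto_contrast(sketch, labels, class_prototypes(photo, labels, 3))
```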
3. Zhao H, Liu M, Li M. Feature Fusion and Metric Learning Network for Zero-Shot Sketch-Based Image Retrieval. Entropy (Basel) 2023; 25:502. [PMID: 36981390] [PMCID: PMC10047869] [DOI: 10.3390/e25030502]
Abstract
Zero-shot sketch-based image retrieval (ZS-SBIR) is an important computer vision problem. The image categories in the test phase are new categories that were not seen in the training stage. Because sketches are extremely abstract, the commonly used backbone networks (such as VGG-16 and ResNet-50) cannot handle both sketches and photos. Semantic similarities between corresponding features of photos and sketches are difficult for deep models to capture without textual assistance. To solve this problem, we propose a novel and effective feature embedding model called Attention Map Feature Fusion (AMFF). The AMFF model combines the excellent feature extraction capability of the ResNet-50 network with the strong representation ability of the attention network. By processing the residuals of the ResNet-50 network, the attention map is obtained without introducing external semantic knowledge. Most previous approaches treat the ZS-SBIR problem as a classification problem, which ignores the huge domain gap between sketches and photos. This paper proposes an effective method to optimize the entire network, called domain-aware triplets (DAT). Domain feature discrimination and semantic feature embedding can be learned through DAT. We also use a classification loss function to stabilize the training process and avoid getting trapped in a local optimum. Compared with state-of-the-art methods, our method shows superior performance. For example, on the TU-Berlin dataset, we achieve 61.2 ± 1.2% Prec@200; on the Sketchy_c100 dataset, we achieve 62.3 ± 3.3% mAP@all and 75.5 ± 1.5% Prec@100.
Affiliation(s)
- Honggang Zhao: School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
- Mingyue Liu: School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
- Mingyong Li: School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China; Chongqing National Center for Applied Mathematics, Chongqing 401331, China
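The abstract above describes domain-aware triplets (DAT) without giving their construction; the hedged sketch below shows one plausible cross-domain triplet objective — sketch anchors, same-class photo positives, different-class photo negatives — purely as an illustration. The sampling rule, the margin, and the cross_domain_triplet helper are assumptions, not the paper's DAT.

```python
# Hedged sketch of a cross-domain triplet objective in the spirit of "domain-aware
# triplets": sketch anchors, same-class photo positives, different-class photo
# negatives. The sampling rule and margin are illustrative assumptions.
import torch
import torch.nn.functional as F

def cross_domain_triplet(sketch_feats, photo_feats, sketch_labels, photo_labels, margin=0.3):
    losses = []
    for i, y in enumerate(sketch_labels):
        pos = photo_feats[photo_labels == y]          # same class, photo domain
        neg = photo_feats[photo_labels != y]          # different class, photo domain
        if len(pos) == 0 or len(neg) == 0:
            continue
        anchor = sketch_feats[i].unsqueeze(0)
        d_pos = torch.cdist(anchor, pos).min()        # easiest positive
        d_neg = torch.cdist(anchor, neg).min()        # hardest negative
        losses.append(F.relu(d_pos - d_neg + margin))
    return torch.stack(losses).mean() if losses else sketch_feats.sum() * 0.0

sketches, photos = torch.randn(6, 32), torch.randn(10, 32)
ls, lp = torch.randint(0, 4, (6,)), torch.randint(0, 4, (10,))
loss = cross_domain_triplet(sketches, photos, ls, lp)
```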
4. Wang H, Deng C, Liu T, Tao D. Transferable Coupled Network for Zero-Shot Sketch-Based Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:9181-9194. [PMID: 34705637] [DOI: 10.1109/tpami.2021.3123315]
Abstract
Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) aims at searching for the natural images corresponding to given free-hand sketches, under the more realistic and challenging scenario of Zero-Shot Learning (ZSL). Prior works concentrate on aligning the sketch and image feature representations while neglecting to explicitly train the heterogeneous feature extractors to become capable of aligning multi-modal features, at the expense of deteriorating the transferability from seen categories to unseen ones. To address this issue, we propose a novel Transferable Coupled Network (TCN) to effectively improve network transferability, with the constraint of soft weight-sharing among heterogeneous convolutional layers to capture similar geometric patterns, e.g., contours of sketches and images. Based on this, we further introduce and validate a general criterion for multi-modal zero-shot learning, i.e., utilizing coupled modules to mine modality-common knowledge and independent modules to learn modality-specific information. Moreover, we elaborate a simple but effective semantic metric that integrates local metric learning and a global semantic constraint into a unified formula, significantly boosting performance. Extensive experiments on three popular large-scale datasets show that our proposed approach outperforms state-of-the-art methods to a remarkable extent: by more than 12% on Sketchy, 2% on TU-Berlin and 6% on QuickDraw in terms of retrieval accuracy. The project page is available at: https://haowang1992.github.io/publication/TCN.
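Soft weight-sharing between coupled convolutional layers can be written as a penalty on the difference between corresponding kernels of the sketch and photo branches; the sketch below illustrates that idea only. Which layers are coupled and the penalty weight are assumptions, not TCN's actual configuration.

```python
# Hedged sketch: a soft weight-sharing penalty that keeps corresponding conv
# kernels of two heterogeneous branches close, so both can capture similar
# geometric patterns (e.g. contours). Which layers are coupled is an assumption.
import torch
import torch.nn as nn

sketch_conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
photo_conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)

def soft_sharing_penalty(pairs, weight=1e-3):
    """Sum of squared differences between coupled kernels (soft, not hard, sharing)."""
    return weight * sum(((a.weight - b.weight) ** 2).sum() for a, b in pairs)

x_sketch, x_photo = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
task_loss = sketch_conv(x_sketch).mean() + photo_conv(x_photo).mean()   # placeholder task loss
total = task_loss + soft_sharing_penalty([(sketch_conv, photo_conv)])
total.backward()
```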
5. Gao C, Cai G, Jiang X, Zheng F, Zhang J, Gong Y, Lin F, Sun X, Bai X. Conditional Feature Learning Based Transformer for Text-Based Person Search. IEEE Transactions on Image Processing 2022; 31:6097-6108. [PMID: 36103442] [DOI: 10.1109/tip.2022.3205216]
Abstract
Text-based person search aims at retrieving the target person in an image gallery using a descriptive sentence of that person. The core of this task is to calculate a similarity score between the pedestrian image and the description, which requires inferring the complex latent correspondence between image sub-regions and textual phrases at different scales. The Transformer is an intuitive way to model this complex alignment through its self-attention mechanism. Most previous Transformer-based methods simply concatenate image region features and text features as input and learn a cross-modal representation in a brute-force manner. Such weakly supervised learning approaches fail to explicitly build alignment between image region features and text features, causing an inferior feature distribution. In this paper, we present the Conditional Feature Learning based Transformer (CFLT). It maps the sub-regions and phrases into a unified latent space and explicitly aligns them by constructing conditional embeddings, where the feature of data from one modality is dynamically adjusted based on the data from the other modality. The output of our CFLT is a set of similarity scores for each sub-region or phrase rather than a cross-modal representation. Furthermore, we propose a simple and effective multi-modal re-ranking method named Re-ranking scheme by Visual Conditional Feature (RVCF). Benefiting from the visual conditional feature and the better feature distribution in our CFLT, the proposed RVCF achieves a significant performance improvement. Experimental results show that our CFLT outperforms the state-of-the-art methods by 7.03% in top-1 accuracy and 5.01% in top-5 accuracy on the text-based person search dataset.
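One common way to realize "a feature from one modality dynamically adjusted by the other" is a gating/modulation layer; the snippet below is a hedged FiLM-style illustration of such conditional adjustment, not the paper's conditional embedding module. The ConditionalAdjust class and the final dot-product scoring are assumptions.

```python
# Hedged sketch of conditional feature adjustment: a text phrase feature produces a
# scale-and-shift that modulates an image sub-region feature (FiLM-style gating).
# This is one plausible realization, not the paper's exact conditional embedding.
import torch
import torch.nn as nn

class ConditionalAdjust(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)

    def forward(self, region_feat: torch.Tensor, phrase_feat: torch.Tensor) -> torch.Tensor:
        # region_feat: (N, D) image sub-region features, phrase_feat: (N, D) text condition
        return region_feat * torch.sigmoid(self.to_scale(phrase_feat)) + self.to_shift(phrase_feat)

adjust = ConditionalAdjust(dim=256)
regions, phrases = torch.randn(5, 256), torch.randn(5, 256)
conditioned = adjust(regions, phrases)                # (5, 256) conditionally adjusted features
scores = (conditioned * phrases).sum(dim=1)           # one similarity score per region-phrase pair
```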
6. Qin J, Fei L, Zhang Z, Wen J, Xu Y, Zhang D. Joint Specifics and Consistency Hash Learning for Large-Scale Cross-Modal Retrieval. IEEE Transactions on Image Processing 2022; 31:5343-5358. [PMID: 35925845] [DOI: 10.1109/tip.2022.3195059]
Abstract
With the dramatic increase in the amount of multimedia data, cross-modal similarity retrieval has become one of the most popular yet challenging problems. Hashing offers a promising solution for large-scale cross-modal data searching by embedding high-dimensional data into a low-dimensional, similarity-preserving Hamming space. However, most existing cross-modal hashing methods seek a semantic representation shared by multiple modalities, which cannot fully preserve and fuse the discriminative modal-specific features and heterogeneous similarity for cross-modal similarity searching. In this paper, we propose a joint specifics and consistency hash learning method for cross-modal retrieval. Specifically, we introduce an asymmetric learning framework to fully exploit the label information for discriminative hash code learning, where 1) each individual modality can be better converted into a meaningful subspace with specific information, 2) multiple subspaces are semantically connected to capture consistent information, and 3) the integration complexity of different subspaces is overcome so that the learned collaborative binary codes can merge the specifics with consistency. Then, we introduce an alternating iterative optimization to tackle the specifics and consistency hash learning problem, making it scalable for large-scale cross-modal retrieval. Extensive experiments on five widely used benchmark databases clearly demonstrate the effectiveness and efficiency of our proposed method on both one-cross-one and one-cross-two retrieval tasks.
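The optimization itself is not reproduced here, but the retrieval side that any such cross-modal hashing method ends with — binarizing projections and ranking by Hamming distance — can be sketched generically, assuming random stand-in projections instead of the learned ones:

```python
# Generic illustration of the retrieval side of cross-modal hashing: real-valued
# projections are binarized with sign(), and ranking uses Hamming distance.
# The projections here are random stand-ins, not the learned ones from the paper.
import torch

def to_binary_codes(feats: torch.Tensor, projection: torch.Tensor) -> torch.Tensor:
    return (feats @ projection).sign().clamp(min=0).to(torch.uint8)    # {0,1} codes

def hamming_rank(query_code: torch.Tensor, db_codes: torch.Tensor) -> torch.Tensor:
    dists = (query_code.unsqueeze(0) ^ db_codes).sum(dim=1)            # XOR popcount per database item
    return dists.argsort()                                             # nearest codes first

torch.manual_seed(0)
proj_img, proj_txt = torch.randn(512, 64), torch.randn(300, 64)        # 64-bit codes
image_db = to_binary_codes(torch.randn(1000, 512), proj_img)
text_query = to_binary_codes(torch.randn(1, 300), proj_txt).squeeze(0)
ranking = hamming_rank(text_query, image_db)                           # indices of retrieved images
```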
7. Zhang Z, Li Z, Wei K, Pan S, Deng C. A survey on multimodal-guided visual content synthesis. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.04.126]
8. Tao L, Mi L, Li N, Cheng X, Hu Y, Chen Z. Predicate Correlation Learning for Scene Graph Generation. IEEE Transactions on Image Processing 2022; 31:4173-4185. [PMID: 35700252] [DOI: 10.1109/tip.2022.3181511]
Abstract
For a typical Scene Graph Generation (SGG) method in image understanding, there usually exists a large performance gap between the predicates' head classes and tail classes. This phenomenon is mainly caused by the semantic overlap between different predicates as well as the long-tailed data distribution. In this paper, a Predicate Correlation Learning (PCL) method for SGG is proposed to address the above problems by taking the correlation between predicates into consideration. To measure the semantic overlap between highly correlated predicate classes, a Predicate Correlation Matrix (PCM) is defined to quantify the relationship between predicate pairs, and it is dynamically updated to remove the matrix's long-tailed bias. In addition, the PCM is integrated into a predicate correlation loss function (L_PC) to reduce the discouraging gradients of unannotated classes. The proposed method is evaluated on several benchmarks, where the performance of the tail classes is significantly improved when built upon existing methods.
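The abstract does not define the PCM or L_PC; as a loosely related illustration only, the snippet below shows one generic way a predicate correlation matrix could soften one-hot targets so that highly correlated predicates are not punished as hard negatives. The mixing rule, the alpha value, and the random stand-in matrix are assumptions, not the paper's formulation.

```python
# Hedged sketch: using a predicate correlation matrix to soften one-hot targets so
# that semantically overlapping predicates are not punished as hard negatives.
# The matrix values and mixing rule below are illustrative assumptions.
import torch
import torch.nn.functional as F

def correlation_softened_loss(logits, targets, pcm, alpha=0.2):
    # pcm: (C, C) row-normalized correlation between predicate classes
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    soft = (1 - alpha) * one_hot + alpha * pcm[targets]        # spread mass onto correlated predicates
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft * log_probs).sum(dim=1).mean()

C = 4
pcm = torch.softmax(torch.randn(C, C), dim=1)                  # stand-in correlation matrix
logits, targets = torch.randn(6, C), torch.randint(0, C, (6,))
loss = correlation_softened_loss(logits, targets, pcm)
```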
9. Ren H, Zheng Z, Lu H. Energy-Guided Feature Fusion for Zero-Shot Sketch-Based Image Retrieval. Neural Processing Letters 2022. [DOI: 10.1007/s11063-022-10881-y]
10. Jing T, Xia H, Hamm J, Ding Z. Augmented Multimodality Fusion for Generalized Zero-Shot Sketch-Based Visual Retrieval. IEEE Transactions on Image Processing 2022; 31:3657-3668. [PMID: 35576409] [DOI: 10.1109/tip.2022.3173815]
Abstract
Zero-shot sketch-based image retrieval (ZS-SBIR) has attracted great attention recently, due to the potential application of sketch-based retrieval under zero-shot scenarios, where the categories of query sketches and gallery photos are not observed in the training stage. However, the more general and practical scenario, in which the query sketches and gallery photos contain both seen and unseen categories, remains insufficiently explored. This problem is defined as generalized zero-shot sketch-based image retrieval (GZS-SBIR), which is the focus of this work. To this end, we propose a novel Augmented Multi-modality Fusion (AMF) framework to generalize seen concepts to unobserved ones efficiently. Specifically, a novel knowledge discovery module named cross-domain augmentation is designed in both visual and semantic space to mimic novel knowledge unseen during the training stage, which is the key to handling the GZS-SBIR challenge. Moreover, a triplet domain alignment module is proposed to couple the cross-domain distributions of photos and sketches in visual space. To enhance the robustness of our model, we explore embedding propagation to refine both visual and semantic features by removing undesired noise. Eventually, visual-semantic fusion representations are concatenated for further domain discrimination and task-specific recognition, which tends to trigger cross-domain alignment in both visual and semantic feature space. Experimental evaluations are conducted on popular ZS-SBIR benchmarks as well as a new evaluation protocol designed for GZS-SBIR from the DomainNet dataset with more diverse sub-domains, and the promising results demonstrate the superiority of the proposed solution over other baselines. The source code is available at https://github.com/scottjingtt/AMF_GZS_SBIR.git.
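Cross-domain augmentation is not detailed in the abstract; a hedged, generic illustration of "mimicking unseen knowledge" is to interpolate features (and their semantic embeddings) across the sketch and photo domains, mixup-style. Whether AMF uses interpolation of this form is an assumption; the snippet is illustrative only.

```python
# Hedged sketch: mixup-style interpolation between sketch and photo features (and
# their semantic embeddings) as one way to synthesize "unseen-like" samples.
import torch

def cross_domain_mixup(sketch_feats, photo_feats, sketch_sem, photo_sem, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample()                 # mixing coefficient
    perm = torch.randperm(photo_feats.size(0))
    mixed_visual = lam * sketch_feats + (1 - lam) * photo_feats[perm]     # augmented visual features
    mixed_semantic = lam * sketch_sem + (1 - lam) * photo_sem[perm]       # matching semantic mixture
    return mixed_visual, mixed_semantic

sk_v, ph_v = torch.randn(8, 128), torch.randn(8, 128)
sk_s, ph_s = torch.randn(8, 300), torch.randn(8, 300)
aug_v, aug_s = cross_domain_mixup(sk_v, ph_v, sk_s, ph_s)
```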
11. Liu F, Deng X, Zou C, Lai YK, Chen K, Zuo R, Ma C, Liu YJ, Wang H. SceneSketcher-v2: Fine-Grained Scene-Level Sketch-Based Image Retrieval Using Adaptive GCNs. IEEE Transactions on Image Processing 2022; 31:3737-3751. [PMID: 35594232] [DOI: 10.1109/tip.2022.3175403]
Abstract
Sketch-based image retrieval (SBIR) is a long-standing research topic in computer vision. Existing methods mainly focus on category-level or instance-level image retrieval. This paper investigates the fine-grained scene-level SBIR problem, where a free-hand sketch depicting a scene is used to retrieve desired images. This problem is useful yet challenging mainly because of two entangled facts: 1) achieving an effective representation of the input query data and scene-level images is difficult, as it requires modeling the information across multiple modalities such as object layout, relative size and visual appearance, and 2) there is a great domain gap between the query sketch input and the target images. We present SceneSketcher-v2, a Graph Convolutional Network (GCN) based architecture to address these challenges. SceneSketcher-v2 employs a carefully designed graph convolutional network to fuse the multi-modality information in the query sketch and target images, and uses a triplet training process in an end-to-end manner to alleviate the domain gap. Extensive experiments demonstrate that SceneSketcher-v2 outperforms state-of-the-art scene-level SBIR models by a significant margin.
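To make the GCN ingredient concrete, the snippet below shows a minimal, generic graph-convolution update over object nodes of a sketched scene; the node features, adjacency construction, and layer sizes are stand-in assumptions rather than SceneSketcher-v2's actual adaptive graph.

```python
# Minimal graph-convolution update over object nodes of a scene, as an illustration
# of fusing layout information with GCNs. Node features and adjacency are random
# stand-ins, not SceneSketcher-v2's graph construction.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Symmetrically normalized aggregation: D^-1/2 (A + I) D^-1/2 X W
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(norm_adj @ node_feats))

nodes = torch.randn(5, 64)                       # e.g. 5 sketched objects, 64-d cues each
adj = (torch.rand(5, 5) > 0.5).float()           # stand-in adjacency (e.g. spatial proximity)
adj = ((adj + adj.t()) > 0).float()              # make it symmetric
out = SimpleGCNLayer(64, 32)(nodes, adj)         # (5, 32) fused node embeddings
```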
12. Li H, Jiang X, Guan B, Wang R, Thalmann NM. Multistage Spatio-Temporal Networks for Robust Sketch Recognition. IEEE Transactions on Image Processing 2022; 31:2683-2694. [PMID: 35320102] [DOI: 10.1109/tip.2022.3160240]
Abstract
Sketch recognition relies on two types of information, namely, spatial contexts such as the local structures in images and temporal contexts such as the order of strokes. Existing methods usually adopt convolutional neural networks (CNNs) to model spatial contexts and recurrent neural networks (RNNs) for temporal contexts. However, most of them combine spatial and temporal features with late fusion or a single-stage transformation, which is prone to losing the informative details in sketches. To tackle this problem, we propose a novel framework that aims at multi-stage interactions and refinements of spatial and temporal features. Specifically, given a sketch represented by a stroke array, we first generate a temporal-enriched image (TEI), which is a pseudo-color image retaining the temporal order of strokes, to overcome the difficulty of CNNs in leveraging temporal information. We then construct a dual-branch network, in which an RNN branch and a CNN branch are adopted to process the stroke array and the TEI, respectively. In the early stages of our network, considering the limited ability of RNNs in capturing spatial structures, we utilize multiple enhancement modules to enhance the stroke features with the TEI features. In the last stage of our network, we propose a spatio-temporal enhancement module that refines stroke features and TEI features in a joint feature space. Furthermore, a bidirectional temporal-compatible unit that adaptively merges features in opposite temporal orders is proposed to help RNNs tackle abrupt strokes. Comprehensive experimental results on QuickDraw and TU-Berlin demonstrate that the proposed method is a robust and efficient solution for sketch recognition.
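The abstract does not specify how the temporal-enriched image encodes stroke order; the snippet below is a hedged toy realization in which stroke points are rasterized onto a canvas whose colour depends on the normalized drawing order, so a CNN can read temporal information from pixels. The colour mapping and canvas size are illustrative assumptions.

```python
# Hedged sketch of a temporal-enriched image: stroke points are rasterized onto a
# canvas whose channels encode the normalized drawing order, so a CNN can "see"
# temporal information. The exact colour mapping is an illustrative assumption.
import numpy as np

def temporal_enriched_image(strokes, size=64):
    """strokes: list of (N_i, 2) arrays of x,y points in [0, 1], in drawing order."""
    tei = np.zeros((size, size, 3), dtype=np.float32)
    total = max(len(strokes) - 1, 1)
    for idx, pts in enumerate(strokes):
        t = idx / total                                  # normalized temporal position of this stroke
        color = np.array([t, 1.0 - t, 0.5], dtype=np.float32)
        xs = np.clip((pts[:, 0] * (size - 1)).astype(int), 0, size - 1)
        ys = np.clip((pts[:, 1] * (size - 1)).astype(int), 0, size - 1)
        tei[ys, xs] = color                              # later strokes overwrite earlier ones
    return tei

strokes = [np.random.rand(20, 2) for _ in range(3)]      # three toy strokes
img = temporal_enriched_image(strokes)                   # (64, 64, 3) pseudo-color image
```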
13. Deep cross-modal discriminant adversarial learning for zero-shot sketch-based image retrieval. Neural Computing and Applications 2022. [DOI: 10.1007/s00521-022-07169-6]
14. Chen Y, Huang R, Chang H, Tan C, Xue T, Ma B. Cross-Modal Knowledge Adaptation for Language-Based Person Search. IEEE Transactions on Image Processing 2021; 30:4057-4069. [PMID: 33788687] [DOI: 10.1109/tip.2021.3068825]
Abstract
In this paper, we present a method named Cross-Modal Knowledge Adaptation (CMKA) for language-based person search. We argue that image and text information are not equally important in determining a person's identity. In other words, the image carries image-specific information such as lighting conditions and background, while the text contains more modality-agnostic information that is more beneficial to cross-modal matching. Based on this consideration, we propose CMKA to adapt the knowledge of the image to the knowledge of the text. Specifically, text-to-image guidance is obtained at different levels: individuals, lists, and classes. By combining these levels of knowledge adaptation, the image-specific information is suppressed and the common space of image and text is better constructed. We conduct experiments on the CUHK-PEDES dataset. The experimental results show that the proposed CMKA outperforms the state-of-the-art methods.
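As a hedged illustration of individual-level text-to-image guidance only (the list- and class-level terms of CMKA are not reproduced), the snippet below pulls each image embedding toward its paired, detached text embedding so that image-specific information is suppressed; the cosine-alignment form and the individual_guidance helper are assumptions.

```python
# Hedged sketch of individual-level text-to-image guidance: each image embedding is
# pulled toward its paired text embedding, with gradients flowing only into the
# image side. The list- and class-level terms of CMKA are not reproduced here.
import torch
import torch.nn.functional as F

def individual_guidance(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1).detach()        # text acts as the teacher
    return (1 - (img_emb * txt_emb).sum(dim=1)).mean()    # cosine-distance alignment

img = torch.randn(4, 256, requires_grad=True)
txt = torch.randn(4, 256)
loss = individual_guidance(img, txt)
loss.backward()
```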