1. Zhu A, Wang Z, Xue J, Wan X, Jin J, Wang T, Snoussi H. Improving Text-Based Person Retrieval by Excavating All-Round Information Beyond Color. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:5097-5111. DOI: 10.1109/tnnls.2024.3368217. PMID: 38416620.
Abstract
Text-based person retrieval is the process of searching a massive visual resource library for images of a particular pedestrian, based on a textual query. Existing approaches often suffer from color (CLR) over-reliance, which can lead to suboptimal retrieval performance by distracting the model from other important visual cues such as texture and structure. To handle this problem, we propose a novel framework that Excavates All-round Information Beyond Color for text-based person retrieval, termed EAIBC. The EAIBC architecture includes four branches, namely an RGB branch, a grayscale (GRS) branch, a high-frequency (HFQ) branch, and a CLR branch. Furthermore, we introduce a mutual learning (ML) mechanism to facilitate communication and learning among the branches, enabling them to exploit all-round information in an effective and balanced manner. We evaluate the proposed method on three benchmark datasets: CUHK-PEDES, ICFG-PEDES, and RSTPReid. The experimental results demonstrate that EAIBC significantly outperforms existing methods and achieves state-of-the-art (SOTA) performance in supervised, weakly supervised, and cross-domain settings.
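The mutual learning mechanism described in this abstract can be illustrated with a minimal, hypothetical sketch: each branch (RGB, grayscale, high-frequency, color) produces its own matching scores over the same candidates, and branches are regularized towards agreement with a symmetric KL term. All names below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of branch-level mutual learning (hypothetical; the abstract does not
# specify the exact objective). Each branch produces matching logits over the same
# image-text candidates, and branches are encouraged to agree via KL divergence.
import torch
import torch.nn.functional as F

def mutual_learning_loss(branch_logits):
    """branch_logits: list of [batch, num_candidates] tensors, one per branch
    (e.g. RGB, grayscale, high-frequency, color)."""
    probs = [F.log_softmax(z, dim=1) for z in branch_logits]
    loss, count = 0.0, 0
    for i in range(len(probs)):
        for j in range(len(probs)):
            if i == j:
                continue
            # KL(branch_i || branch_j); the teacher distribution is detached so each
            # branch learns from the others without collapsing them together.
            loss = loss + F.kl_div(probs[i], probs[j].detach().exp(),
                                   reduction="batchmean")
            count += 1
    return loss / max(count, 1)

# Example with four branches and random logits.
logits = [torch.randn(8, 16) for _ in range(4)]
print(mutual_learning_loss(logits))
```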
2. Choudhury S, Saini D, Banerjee B. Boosting cross-modal retrieval in remote sensing via a novel unified attention network. Neural Networks 2024; 180:106718. DOI: 10.1016/j.neunet.2024.106718. PMID: 39293179.
Abstract
With the rapid advent and abundance of remote sensing data in different modalities, cross-modal retrieval tasks have gained importance in the research community. Cross-modal retrieval refers to the paradigm in which the query is of one modality and the retrieved output is of another. In this paper, the remote sensing (RS) modalities considered are earth-observation optical data (aerial photos) and the corresponding hand-drawn sketches. The main challenge in cross-modal retrieval of optical RS images and their sketches is the distribution gap in the shared embedding space of the two modalities. Prior attempts to resolve this issue have not yielded satisfactory results for accurately retrieving cross-modal sketch-image RS data. State-of-the-art architectures rely on conventional convolutional networks, which focus on local pixel-wise information in each modality. This limits the interaction between the sketch texture and the corresponding image, making these models prone to overfitting on datasets with particular scenarios. To circumvent this limitation, we propose SPCA-Net, a novel architecture that combines self- and cross-attention to establish multi-modal correspondence and minimize the modality gap by attending across the query and the other modality. Efficient cross-modal retrieval is achieved through the proposed attention architecture, which emphasizes the global information of the relevant query modality and bridges the domain gap through a unique pairwise cross-attention network. In addition to the novel architecture, this paper introduces a label-specific supervised contrastive loss tailored to the intricacies of the task, which enhances the discriminative power of the learned embeddings. Extensive evaluations are conducted on two sketch-image remote sensing datasets, Earth-on-Canvas and RSketch. Under the same experimental conditions, our proposed model beats the state-of-the-art architectures by significant margins of 16.7%, 18.9%, 33.7%, and 40.9%, respectively.
Affiliation(s)
- Biplab Banerjee: Indian Institute of Technology, Bombay, India; Centre for Machine Intelligence and Data Science (C-MInDS), Bombay, India
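As a rough illustration of the label-specific supervised contrastive objective mentioned in the abstract above, the sketch below implements a standard supervised contrastive loss over a shared sketch-image embedding space; the paper's exact label-specific weighting is not described in the abstract, so this is only an assumed baseline form with illustrative names.

```python
# Minimal sketch of a supervised contrastive loss over a shared embedding space
# (hypothetical baseline; not the paper's exact "label-specific" variant).
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    """embeddings: [N, D] vectors pooled from both modalities; labels: [N] class ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                          # pairwise cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    # denominator excludes the anchor itself
    exp_sim = sim.exp().masked_fill(self_mask, 0.0)
    log_prob = sim - exp_sim.sum(dim=1, keepdim=True).log()
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_count
    return loss.mean()

emb = torch.randn(12, 64)                 # e.g. 6 sketch + 6 image embeddings
lab = torch.randint(0, 3, (12,))
print(supervised_contrastive_loss(emb, lab))
```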
3. Liang X, Yang E, Yang Y, Deng C. Multi-Relational Deep Hashing for Cross-Modal Search. IEEE Transactions on Image Processing 2024; 33:3009-3020. DOI: 10.1109/tip.2024.3385656. PMID: 38625760.
Abstract
Deep cross-modal hashing retrieval has recently made significant progress. However, existing methods generally learn hash functions with pairwise or triplet supervision, which captures similarity only between individual data pairs; this models the data similarity locally and incompletely, resulting in sub-optimal retrieval performance. In this paper, we propose a novel Multi-Relational Deep Hashing (MRDH) approach, which can fully bridge the modality gap by comprehensively modeling the similarity relationships between data in different modalities. In more detail, to investigate inter-modal relationships, we constrain the consistency of cross-modal pairwise similarities to maintain semantic similarity across modalities. Moreover, to capture complete similarity information, we design a new similarity metric, termed cross-modal global similarity, which encourages hash codes of similar data pairs from different modalities to approach a common center and hash codes of dissimilar pairs to converge to different centers. This enables our model to generate more discriminative hash codes. Extensive experiments on three benchmark datasets demonstrate the superiority of our method for cross-modal hashing retrieval.
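The cross-modal global similarity idea (similar pairs approach a common center, dissimilar pairs move toward different centers) can be sketched as a simple center-based loss. This is a hypothetical reading of the abstract, not the MRDH objective itself; the margin value and all names are assumptions.

```python
# Minimal sketch of a centre-based global similarity objective (hypothetical):
# relaxed hash codes from both modalities are pulled towards their class centre
# and pushed away from the other centres.
import torch
import torch.nn.functional as F

def global_similarity_loss(img_codes, txt_codes, labels, centers, margin=2.0):
    """img_codes, txt_codes: [N, K] relaxed hash codes; labels: [N]; centers: [C, K]."""
    codes = torch.cat([img_codes, txt_codes], dim=0)         # treat modalities jointly
    labs = torch.cat([labels, labels], dim=0)
    own = centers[labs]                                       # centre of each sample's class
    pull = (codes - own).pow(2).sum(dim=1).mean()             # approach the common centre
    dists = torch.cdist(codes, centers)                       # distances to all centres
    other = dists + 1e9 * F.one_hot(labs, centers.size(0)).float()  # mask own centre
    push = F.relu(margin - other.min(dim=1).values).mean()    # stay away from other centres
    return pull + push

img = torch.tanh(torch.randn(8, 32))
txt = torch.tanh(torch.randn(8, 32))
lab = torch.randint(0, 4, (8,))
centers = torch.sign(torch.randn(4, 32))                      # one binary centre per class
print(global_similarity_loss(img, txt, lab, centers))
```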
4. Zhang M, Li J, Zheng X. Semantic embedding based online cross-modal hashing method. Sci Rep 2024; 14:736. DOI: 10.1038/s41598-023-50242-w. PMID: 38184671. PMCID: PMC10771426. Open access.
Abstract
Hashing has been extensively utilized in cross-modal retrieval due to its high efficiency in handling large-scale, high-dimensional data. However, most existing cross-modal hashing methods are offline learning models that learn hash codes in a batch-based manner, which is inefficient for streaming data. Recently, several online cross-modal hashing methods have been proposed to address the streaming scenario. Nevertheless, these methods fail to fully leverage semantic information or to optimize the hash codes accurately in a discrete fashion, so neither their accuracy nor their efficiency is ideal. To address these issues, this paper introduces the Semantic Embedding-based Online Cross-modal Hashing (SEOCH) method, which integrates semantic information exploitation and online learning in a unified framework. To exploit semantic information, we map the semantic labels to a latent semantic space and construct a semantic similarity matrix to preserve the similarity between new data and existing data in the Hamming space. Moreover, we employ a discrete optimization strategy to enhance the efficiency of online hashing for cross-modal retrieval. Extensive experiments on two publicly available multi-label datasets demonstrate the superiority of SEOCH.
Affiliation(s)
- Meijia Zhang: School of Data Science and Computer Science, Shandong Women's University, Jinan 250300, China; Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan 250022, China
- Junzheng Li: Network Information Management Center, Shandong Management University, Jinan 250357, China
- Xiyuan Zheng: School of Data Science and Computer Science, Shandong Women's University, Jinan 250300, China
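A minimal sketch of the label-derived similarity matrix described in the SEOCH abstract above, assuming a standard online-hashing setup in which each round compares a new data chunk against previously seen data; the helper names and the inner-product fitting term are illustrative, not the paper's exact formulation.

```python
# Hypothetical sketch of an online round: new-chunk labels are compared against
# already-seen labels, and the inner product of hash codes is asked to match the
# scaled semantic similarity in the Hamming space.
import numpy as np

def semantic_similarity(new_labels, old_labels):
    """new_labels: [n, c] multi-label 0/1 matrix; old_labels: [m, c]. Returns S in {-1, 1}."""
    overlap = new_labels @ old_labels.T > 0        # share at least one label
    return np.where(overlap, 1.0, -1.0)

def code_fitting_error(B_new, B_old, S):
    """Squared error between code inner products and the scaled similarity matrix."""
    r = B_new.shape[1]                             # code length
    return np.linalg.norm(B_new @ B_old.T - r * S) ** 2

rng = np.random.default_rng(0)
L_new, L_old = rng.integers(0, 2, (5, 10)), rng.integers(0, 2, (20, 10))
B_new = np.sign(rng.standard_normal((5, 32)))
B_old = np.sign(rng.standard_normal((20, 32)))
S = semantic_similarity(L_new, L_old)
print(code_fitting_error(B_new, B_old, S))
```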
5. Sun Y, Wang X, Peng D, Ren Z, Shen X. Hierarchical Hashing Learning for Image Set Classification. IEEE Transactions on Image Processing 2023; 32:1732-1744. DOI: 10.1109/tip.2023.3251025. PMID: 37028051.
Abstract
With the development of video networks, image set classification (ISC) has received considerable attention and can be applied to various practical tasks, such as video-based recognition and action recognition. Although existing ISC methods achieve promising performance, they often have extremely high complexity. Owing to its advantages in storage space and computational cost, learning to hash has become a powerful solution. However, existing hashing methods often ignore the complex structural information and hierarchical semantics of the original features. They usually adopt a single-layer hashing strategy that transforms high-dimensional data into short binary codes in one step, and this sudden drop in dimensionality can discard valuable discriminative information. In addition, they do not take full advantage of the intrinsic semantic knowledge of whole gallery sets. To tackle these problems, we propose a novel Hierarchical Hashing Learning (HHL) method for ISC. Specifically, a coarse-to-fine hierarchical hashing scheme uses a two-layer hash function to gradually refine the beneficial discriminative information in a layer-wise fashion. Besides, to alleviate the effects of redundant and corrupted features, we impose the $\ell _{2,1}$ norm on the layer-wise hash function. Moreover, we adopt a bidirectional semantic representation with an orthogonal constraint to adequately preserve the intrinsic semantic information of all samples in whole image sets. Comprehensive experiments demonstrate that HHL yields significant improvements in accuracy and running time. We will release the demo code at https://github.com/sunyuan-cs.
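The two-layer, coarse-to-fine hashing scheme with an $\ell_{2,1}$ penalty can be sketched as follows; the projection sizes, the tanh relaxation, and the omission of the bidirectional semantic term are assumptions made for brevity, not the HHL formulation.

```python
# Minimal sketch of a coarse-to-fine two-layer hash mapping with an l2,1 penalty
# on the layer-wise projections (hypothetical sizes and relaxation).
import numpy as np

def l21_norm(W):
    """Sum of the l2 norms of the rows of W, encouraging row-sparse projections."""
    return np.sqrt((W ** 2).sum(axis=1)).sum()

def two_layer_codes(X, W1, W2):
    """X: [n, d] set-level features; W1: [d, k1]; W2: [k1, k2] with k2 < k1 < d."""
    H1 = np.tanh(X @ W1)          # first, coarse dimensionality reduction
    H2 = np.tanh(H1 @ W2)         # second, fine reduction to the final code length
    return np.sign(H2)            # binary codes

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 512))
W1 = rng.standard_normal((512, 128))
W2 = rng.standard_normal((128, 32))
codes = two_layer_codes(X, W1, W2)
penalty = l21_norm(W1) + l21_norm(W2)
print(codes.shape, round(penalty, 2))
```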
6. Joint and Individual Feature Fusion Hashing for Multi-modal Retrieval. Cognit Comput 2022. DOI: 10.1007/s12559-022-10086-4.
7. Chaudhuri U, Chavan R, Banerjee B, Dutta A, Akata Z. BDA-SketRet: Bi-level domain adaptation for zero-shot SBIR. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.09.104.
8. Fast unsupervised consistent and modality-specific hashing for multimedia retrieval. Neural Comput Appl 2022. DOI: 10.1007/s00521-022-08008-4.
9. Shu Z, Yong K, Yu J, Gao S, Mao C, Yu Z. Discrete asymmetric zero-shot hashing with application to cross-modal retrieval. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.09.037.
10. Yang F, Ding X, Liu Y, Ma F, Cao J. Scalable semantic-enhanced supervised hashing for cross-modal retrieval. Knowl Based Syst 2022. DOI: 10.1016/j.knosys.2022.109176.
11. Jiang W, Hu H. Hadamard Product Perceptron Attention for Image Captioning. Neural Process Lett 2022. DOI: 10.1007/s11063-022-10980-w.
12. Zhao Z, Huang Z, Chai X, Wang J. Depth Enhanced Cross-Modal Cascaded Network for RGB-D Salient Object Detection. Neural Process Lett 2022. DOI: 10.1007/s11063-022-10886-7.
13. Semi-Supervised Cross-Modal Hashing with Multi-view Graph Representation. Inf Sci (N Y) 2022. DOI: 10.1016/j.ins.2022.05.006.
14. Jin M, Zhang H, Zhu L, Sun J, Liu L. Coarse-to-fine dual-level attention for video-text cross modal retrieval. Knowl Based Syst 2022. DOI: 10.1016/j.knosys.2022.108354.
15. Shi Y, Nie X, Liu X, Zou L, Yin Y. Supervised Adaptive Similarity Matrix Hashing. IEEE Transactions on Image Processing 2022; 31:2755-2766. DOI: 10.1109/tip.2022.3158092. PMID: 35320101.
Abstract
Compact hash codes can facilitate large-scale multimedia retrieval, significantly reducing storage and computation. Most hashing methods learn hash functions based on a data similarity matrix, which is predefined by supervised labels or a chosen distance metric. However, such a predefined similarity matrix cannot accurately reflect the real similarity relationships among images, which degrades the retrieval performance of hashing methods, especially on multi-label and zero-shot datasets that depend heavily on similarity relationships. Toward this end, this study proposes a new supervised hashing method called supervised adaptive similarity matrix hashing (SASH), based on feature-label space consistency. SASH not only learns the similarity matrix adaptively but also extracts label correlations by maintaining consistency between the feature space and the label space; this correlation information is then used to optimize the similarity matrix. Experiments on three large standard benchmark datasets (including two multi-label datasets) and three large zero-shot benchmark datasets show that SASH performs excellently compared with several state-of-the-art techniques.
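A minimal sketch of blending feature-space and label-space affinities into a similarity matrix, in the spirit of the feature-label consistency described above; SASH learns the matrix adaptively during optimization, whereas the fixed mixing weight below is only an illustrative assumption.

```python
# Hypothetical sketch: mix a cosine feature affinity with a label-correlation affinity
# to obtain a refined similarity matrix (not SASH's joint optimization).
import numpy as np

def blended_similarity(features, labels, alpha=0.5):
    """features: [n, d]; labels: [n, c] multi-label 0/1 matrix; alpha mixes the two views."""
    F_ = features / np.linalg.norm(features, axis=1, keepdims=True)
    feat_sim = F_ @ F_.T                                          # cosine affinity
    L_ = labels / np.maximum(np.linalg.norm(labels, axis=1, keepdims=True), 1e-12)
    label_sim = L_ @ L_.T                                         # label correlation
    return alpha * feat_sim + (1.0 - alpha) * label_sim

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 100))
Y = rng.integers(0, 2, (6, 10)).astype(float)
S = blended_similarity(X, Y)
print(S.shape, round(float(S.min()), 3), round(float(S.max()), 3))
```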
16. Chu H, Zeng H, Lai H, Tang Y. Efficient modal-aware feature learning with application in multimodal hashing. Intell Data Anal 2022. DOI: 10.3233/ida-215780.
Abstract
Many retrieval applications can benefit from multiple modalities, for which how to represent multimodal data is the critical component. Most deep multimodal learning methods construct joint representations in two steps: 1) learning multiple intermediate features, one per modality, using separate and independent deep models; and 2) merging the intermediate features into a joint representation using a fusion strategy. In the first step, however, these intermediate features have no knowledge of each other and cannot fully exploit the information contained in the other modalities. In this paper, we present a modal-aware operation as a generic building block that captures the non-linear dependencies among heterogeneous intermediate features, allowing the underlying correlation structure of the other modalities to be learned at an early stage. The modal-aware operation consists of a kernel network and an attention network. The kernel network learns the non-linear relationships with the other modalities, and the attention network identifies the informative regions of the modal-aware features that are favorable for retrieval. We verify the proposed modal-aware feature learning on the multimodal hashing task. Experiments conducted on three public benchmark datasets demonstrate significant improvements of our method over state-of-the-art methods.
Affiliation(s)
- Hanlu Chu: School of Computer Science, South China Normal University, Guangdong, China
- Haien Zeng: School of Computer Science and Engineering, Sun Yat-Sen University, Guangdong, China; School of Computer Science, South China Normal University, Guangdong, China
- Hanjiang Lai: School of Computer Science and Engineering, Sun Yat-Sen University, Guangdong, China
- Yong Tang: School of Computer Science, South China Normal University, Guangdong, China
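The modal-aware operation summarized in the abstract above (a kernel network plus an attention network) might look roughly like the following PyTorch sketch; the layer sizes, residual fusion, and sigmoid gating are assumptions, not the published architecture.

```python
# Hypothetical modal-aware block: a small "kernel" network conditions one modality's
# feature on the other, and an attention network re-weights the result before fusion.
import torch
import torch.nn as nn

class ModalAwareBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.kernel = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))           # non-linear cross-modal map
        self.attention = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x_self, x_other):
        cross = self.kernel(torch.cat([x_self, x_other], dim=1))   # aware of the other modality
        gate = self.attention(cross)                                # informative-region weights
        return x_self + gate * cross                                # residual fusion

block = ModalAwareBlock(dim=256)
img_feat, txt_feat = torch.randn(4, 256), torch.randn(4, 256)
print(block(img_feat, txt_feat).shape)
```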
17. Zou X, Wu S, Zhang N, Bakker EM. Multi-label modality enhanced attention based self-supervised deep cross-modal hashing. Knowl Based Syst 2022. DOI: 10.1016/j.knosys.2021.107927.
18. Zhang F, Xu M, Xu C. Geometry Sensitive Cross-Modal Reasoning for Composed Query Based Image Retrieval. IEEE Transactions on Image Processing 2022; 31:1000-1011. DOI: 10.1109/tip.2021.3138302. PMID: 34971533.
Abstract
Composed Query Based Image Retrieval (CQBIR) aims at retrieving images relevant to a composed query, which consists of a reference image and a requested modification expressed in a textual sentence. Compared with conventional image retrieval, which takes one modality as the query to retrieve relevant data of another modality, CQBIR poses a great challenge because of the semantic gap between the reference image and the modification text in the composed query. To address this challenge, previous methods either resort to feature composition, which cannot model interactions within the query, or explore inter-modal attention while ignoring the spatial structure and visual-semantic relationship. In this paper, we propose a geometry sensitive cross-modal reasoning network for CQBIR that jointly models the geometric information of the image and the visual-semantic relationship between the reference image and the modification text. Specifically, it contains two key components: a geometry sensitive inter-modal attention module (GS-IMA) and a text-guided visual reasoning module (TG-VR). The GS-IMA introduces spatial structure into the inter-modal attention in both implicit and explicit manners. The TG-VR models the semantics requested by the text but not present in the reference image to guide further visual reasoning. As a result, our method can learn effective features for composed queries that do not exhibit literal alignment. Comprehensive experimental results on three standard benchmarks demonstrate that the proposed model performs favorably against state-of-the-art methods.
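The explicit flavour of geometry sensitive inter-modal attention can be illustrated with a small sketch in which word queries attend over image regions and an additive bias derived from region-box geometry modulates the scores; the particular bias used here is an assumption for illustration, not the GS-IMA formulation.

```python
# Hypothetical sketch: dot-product cross-attention from text tokens to image regions
# with an explicit geometric bias computed from region centres.
import torch
import torch.nn.functional as F

def geometry_biased_cross_attention(text_q, region_k, region_v, boxes, scale=8.0):
    """text_q: [T, D] word queries; region_k, region_v: [R, D]; boxes: [R, 4] (cx, cy, w, h)."""
    attn = text_q @ region_k.t() / text_q.size(1) ** 0.5       # standard dot-product scores
    centers = boxes[:, :2]
    geo = -scale * torch.cdist(centers, centers).mean(dim=1)   # bias favouring central regions
    attn = attn + geo.unsqueeze(0)                             # explicit geometric bias per region
    weights = F.softmax(attn, dim=1)
    return weights @ region_v                                  # text-conditioned visual feature

T, R, D = 6, 10, 128
out = geometry_biased_cross_attention(torch.randn(T, D), torch.randn(R, D),
                                      torch.randn(R, D), torch.rand(R, 4))
print(out.shape)   # [T, D]
```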