1. Shen X, Chen Y, Liu W, Zheng Y, Sun QS, Pan S. Graph Convolutional Multi-Label Hashing for Cross-Modal Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:7997-8009. [PMID: 39028597] [DOI: 10.1109/tnnls.2024.3421583]
Abstract
Cross-modal hashing encodes different modalities of multimodal data into a low-dimensional Hamming space for fast cross-modal retrieval. In multi-label cross-modal retrieval, multimodal data are often annotated with multiple labels, and some labels, e.g., "ocean" and "cloud," often co-occur. However, existing cross-modal hashing methods overlook label dependency, which is crucial for improving performance. To fill this gap, this article proposes graph convolutional multi-label hashing (GCMLH) for effective multi-label cross-modal retrieval. Specifically, GCMLH first generates a word embedding for each label and develops a label encoder to learn highly correlated label embeddings via a graph convolutional network (GCN). In addition, GCMLH develops a feature encoder for each modality and a feature fusion module to generate highly semantic features via GCN. GCMLH uses a teacher-student learning scheme to transfer knowledge from the teacher modules, i.e., the label encoder and feature fusion module, to the student module, i.e., the feature encoder, so that the learned hash codes can well exploit multi-label dependency and multimodal semantic structure. Extensive empirical results on several benchmarks demonstrate the superiority of the proposed method over existing state-of-the-art methods.
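As a rough illustration of the label-encoder idea described above (propagating label word embeddings over a co-occurrence graph with a GCN), the following sketch builds a two-layer GCN over a thresholded, row-normalised label adjacency; the layer sizes, threshold, and output code length are illustrative assumptions, not the GCMLH authors' exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelGCNEncoder(nn.Module):
    # Two-layer GCN that turns per-label word embeddings into correlated label embeddings.
    def __init__(self, emb_dim=300, hidden_dim=512, code_len=64):
        super().__init__()
        self.w1 = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, code_len, bias=False)

    def forward(self, label_embs, adj):
        # adj: (C, C) label co-occurrence graph, row-normalised with self-loops
        h = F.relu(adj @ self.w1(label_embs))
        return adj @ self.w2(h)  # (C, code_len) correlated label embeddings

def normalised_adjacency(co_occurrence, tau=0.3):
    # Binarise co-occurrence statistics and add self-loops (tau is an illustrative threshold)
    a = (co_occurrence > tau).float() + torch.eye(co_occurrence.size(0))
    return a / a.sum(dim=1, keepdim=True)

# Toy usage: 5 labels with 300-d word embeddings
enc = LabelGCNEncoder()
label_codes = enc(torch.randn(5, 300), normalised_adjacency(torch.rand(5, 5)))
print(label_codes.shape)  # torch.Size([5, 64])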
2. Zou Q, Cheng S, Du A, Chen J. Text-Enhanced Graph Attention Hashing for Cross-Modal Retrieval. Entropy (Basel) 2024; 26:911. [PMID: 39593856] [PMCID: PMC11592578] [DOI: 10.3390/e26110911]
Abstract
Deep hashing technology, known for its low-cost storage and rapid retrieval, has become a focal point in cross-modal retrieval research as multimodal data continue to grow. However, existing supervised methods often overlook noisy labels and multiscale features in different modal datasets, leading to higher information entropy in the generated hash codes and features, which reduces retrieval performance. The variation in text annotation information across datasets further increases the information entropy during text feature extraction, resulting in suboptimal outcomes. Consequently, reducing the information entropy in text feature extraction, supplementing text feature information, and enhancing the retrieval efficiency of large-scale media data are critical challenges in cross-modal retrieval research. To tackle these challenges, this paper introduces the Text-Enhanced Graph Attention Hashing for Cross-Modal Retrieval (TEGAH) framework. TEGAH incorporates a deep text feature extraction network and a multiscale label region fusion network to minimize information entropy and optimize feature extraction. Additionally, a graph-attention-based modal feature fusion network is designed to efficiently integrate multimodal information, enhance the affinity of the network for different modalities, and retain more semantic information. Extensive experiments on three multilabel datasets demonstrate that the TEGAH framework significantly outperforms state-of-the-art cross-modal hashing methods.
Affiliation(s)
- Shuli Cheng: College of Computer Science and Technology, Xinjiang University, Urumqi 830046, China; (Q.Z.); (A.D.); (J.C.)
3. Liang X, Yang E, Yang Y, Deng C. Multi-Relational Deep Hashing for Cross-Modal Search. IEEE Transactions on Image Processing 2024; 33:3009-3020. [PMID: 38625760] [DOI: 10.1109/tip.2024.3385656]
Abstract
Deep cross-modal hashing retrieval has recently made significant progress. However, existing methods generally learn hash functions with pairwise or triplet supervisions, which involves learning the relevant information by splicing partial similarity between data pairs; notably, this approach only captures the data similarity locally and incompletely, resulting in sub-optimal retrieval performance. In this paper, we propose a novel Multi-Relational Deep Hashing (MRDH) approach, which can fully bridge the modality gap by comprehensively modeling the similarity relationship between data in different modalities. In more detail, to investigate the inter-modal relationships, we constrain the consistency of cross-modal pairwise similarities to maintain the semantic similarity across modalities. Moreover, to further capture complete similarity information, we design a new similarity metric, which we term cross-modal global similarity, by encouraging hash codes of similar data pairs from different modalities to approach a common center and hash codes for dissimilar pairs to converge to different centers. Adopting this approach enables our model to generate more discriminative hash codes. Extensive experiments on three benchmark datasets demonstrate the superiority of our method on cross-modal hashing retrieval.
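The "common center" notion of cross-modal global similarity can be pictured, at a high level, as pulling image and text codes of the same class toward one shared learnable center while keeping different centers apart; the center parameterisation, margin, and loss form below are assumptions for illustration rather than the MRDH formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalCenterLoss(nn.Module):
    # Pull image and text codes of the same class toward one shared center,
    # and keep distinct class centers at least `margin` apart (illustrative choice).
    def __init__(self, num_classes, code_len, margin=2.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, code_len))
        self.margin = margin

    def forward(self, img_codes, txt_codes, labels):
        c = self.centers[labels]                               # (B, code_len)
        pull = F.mse_loss(img_codes, c) + F.mse_loss(txt_codes, c)
        d = torch.cdist(self.centers, self.centers)            # pairwise center distances
        off_diag = d[~torch.eye(len(self.centers), dtype=torch.bool)]
        push = F.relu(self.margin - off_diag).mean()
        return pull + push

loss_fn = CrossModalCenterLoss(num_classes=10, code_len=32)
img, txt = torch.tanh(torch.randn(8, 32)), torch.tanh(torch.randn(8, 32))
print(loss_fn(img, txt, torch.randint(0, 10, (8,))))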
4. Hoang T, Do TT, Nguyen TV, Cheung NM. Multimodal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:6289-6302. [PMID: 34982698] [DOI: 10.1109/tnnls.2021.3135420]
Abstract
In this article, we adopt the mutual information (MI) maximization approach to tackle the problem of unsupervised learning of binary hash codes for efficient cross-modal retrieval. We propose a novel method, dubbed cross-modal info-max hashing (CMIMH). First, to learn informative representations that can preserve both intramodal and intermodal similarities, we leverage recent advances in estimating variational lower bounds of MI to maximize the MI between the binary representations and input features, and between the binary representations of different modalities. By jointly maximizing these MIs under the assumption that the binary representations are modeled by multivariate Bernoulli distributions, we can learn binary representations that preserve both intramodal and intermodal similarities, effectively in a mini-batch manner with gradient descent. Furthermore, we find that trying to minimize the modality gap by learning similar binary representations for the same instance from different modalities could result in less informative representations. Hence, balancing between reducing the modality gap and losing modality-private information is important for cross-modal retrieval tasks. Quantitative evaluations on standard benchmark datasets demonstrate that the proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
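Training binary codes modelled as multivariate Bernoulli variables with plain gradient descent is commonly made possible by a straight-through estimator; the minimal head below shows that mechanism only (the variational MI objective itself is not reproduced), and the layer sizes are placeholders.

import torch
import torch.nn as nn

class StochasticBinaryHead(nn.Module):
    # Map features to {-1, +1} codes sampled from Bernoulli(p), with a
    # straight-through estimator so gradients still flow to the encoder.
    def __init__(self, in_dim, code_len):
        super().__init__()
        self.fc = nn.Linear(in_dim, code_len)

    def forward(self, x):
        p = torch.sigmoid(self.fc(x))            # Bernoulli parameters
        sample = torch.bernoulli(p)              # hard 0/1 sample (non-differentiable)
        b = sample.detach() + p - p.detach()     # forward = sample, backward = gradient of p
        return 2.0 * b - 1.0                     # map to {-1, +1}

head = StochasticBinaryHead(128, 64)
codes = head(torch.randn(4, 128))
print(codes.shape, codes.unique())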
5. Tian M, Wu X, Jia Y. Adaptive Latent Graph Representation Learning for Image-Text Matching. IEEE Transactions on Image Processing 2022; 32:471-482. [PMID: 37015388] [DOI: 10.1109/tip.2022.3229631]
Abstract
Image-text matching is a challenging task due to the modality gap. Many recent methods focus on modeling entity relationships to learn a common embedding space of image and text. However, these methods suffer from distractions of entity relationships such as irrelevant visual regions in an image and noisy textual words in a text. In this paper, we propose an adaptive latent graph representation learning method to reduce the distractions of entity relationships for image-text matching. Specifically, we use an improved graph variational autoencoder to separate the distracting factors and latent factor of relationships and jointly learn latent textual graph representations, latent visual graph representations, and a visual-textual graph embedding space. We also introduce an adaptive cross-attention mechanism to perform feature attending on the latent graph representations across images and texts, thus further narrowing the modality gap to boost the matching performance. Extensive experiments on two public datasets, Flickr30K and COCO, show the effectiveness of our method.
6. Wang H, Du Y, Zhang Y, Li S, Zhang L. One-Stage Visual Relationship Referring With Transformers and Adaptive Message Passing. IEEE Transactions on Image Processing 2022; 32:190-202. [PMID: 37015479] [DOI: 10.1109/tip.2022.3226624]
Abstract
There exist a variety of visual relationships among entities in an image. Given a relationship query $\langle subject, predicate, object \rangle$, the task of visual relationship referring (VRR) aims to disambiguate instances of the same entity category and simultaneously localize the subject and object entities in an image. Previous works on VRR can generally be categorized into one-stage and multi-stage methods. The former directly localize a pair of entities from the image but suffer from low prediction accuracy, while the latter perform better but localize only a couple of entities indirectly by pre-generating a large number of candidate proposals. In this paper, we formulate the task of VRR as an end-to-end bounding box regression problem and propose a novel one-stage approach, called VRR-TAMP, by effectively integrating Transformers and an adaptive message passing mechanism. First, visual relationship queries and images are respectively encoded to generate the basic modality-specific embeddings, which are then fed into a cross-modal Transformer encoder to produce the joint representation. Second, to obtain the specific representation of each entity, we introduce an adaptive message passing mechanism and design an entity-specific information distiller, SR-GMP, which refers to a gated message passing (GMP) module that works on the joint representation learned from a single learnable token. The GMP module adaptively distills the final representation of an entity by incorporating contextual cues regarding the predicate and the other entity. Experiments on the VRD and Visual Genome datasets demonstrate that our approach significantly outperforms its one-stage competitors and achieves competitive results with state-of-the-art multi-stage methods.
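The gated message passing idea, in which an entity representation adaptively absorbs contextual cues about the predicate and the other entity, can be pictured with a simple sigmoid gate; this is a schematic stand-in, not the SR-GMP module itself, and the feature dimension is arbitrary.

import torch
import torch.nn as nn

class GatedMessagePassing(nn.Module):
    # Distil an entity representation by gating how much contextual
    # information (e.g., predicate plus the other entity) flows into it.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.msg = nn.Linear(dim, dim)

    def forward(self, entity, context):
        g = torch.sigmoid(self.gate(torch.cat([entity, context], dim=-1)))
        return entity + g * self.msg(context)   # adaptively injected message

gmp = GatedMessagePassing(256)
subject = gmp(torch.randn(2, 256), torch.randn(2, 256))
print(subject.shape)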
7. Xie Y, Zeng X, Wang T, Yi Y, Xu L. Deep online cross-modal hashing by a co-training mechanism. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109888]
8. Han J, Zhang S, Men A, Chen Q. Cross-Modal Contrastive Hashing Retrieval for Infrared Video and EEG. Sensors (Basel) 2022; 22:8804. [PMID: 36433399] [PMCID: PMC9699584] [DOI: 10.3390/s22228804]
Abstract
It is essential to estimate sleep quality and diagnose clinical sleep stages in time and at home, because they are closely related to, and important causes of, chronic diseases and daily-life dysfunctions. However, the existing "gold-standard" sensing machine for diagnosis (Polysomnography (PSG) with Electroencephalogram (EEG) measurements) is almost infeasible to deploy at home in a "ubiquitous" manner. In addition, it is costly to train clinicians for the diagnosis of sleep conditions. In this paper, we propose a novel technical and systematic attempt to tackle these barriers: first, we propose to monitor and sense sleep conditions using infrared (IR) camera videos synchronized with the EEG signal; second, we propose a novel cross-modal retrieval system termed Cross-modal Contrastive Hashing Retrieval (CCHR) to build the relationship between EEG and IR videos, retrieving the most relevant EEG signal given an infrared video. Specifically, CCHR is novel in the following two perspectives. Firstly, to eliminate the large cross-modal semantic gap between EEG and IR data, we design a novel joint cross-modal representation learning strategy using a memory-enhanced hard-negative mining design under the framework of contrastive learning. Secondly, as the sleep monitoring data are large-scale (8 h long for each subject), a novel contrastive hashing module is proposed to transform the joint cross-modal features into discriminative binary hash codes, enabling efficient storage and inference. Extensive experiments on our collected cross-modal sleep condition dataset validate that the proposed CCHR achieves superior performance compared with existing cross-modal hashing methods.
Affiliation(s)
- Jianan Han: School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
- Aidong Men: School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
- Qingchao Chen: National Institute of Health Data Science, Peking University, Beijing 100191, China
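A contrastive objective over relaxed hash codes of paired infrared-video and EEG clips, with extra negatives drawn from a memory bank, might look roughly like the sketch below; the temperature, code length, and plain InfoNCE form are assumptions, and the paper's memory-enhanced hard-negative mining is not reproduced.

import torch
import torch.nn.functional as F

def contrastive_hash_loss(video_codes, eeg_codes, memory_neg, temperature=0.2):
    # InfoNCE-style loss between relaxed hash codes of paired IR-video / EEG clips;
    # `memory_neg` supplies extra negatives from a memory bank.
    v = F.normalize(torch.tanh(video_codes), dim=1)
    e = F.normalize(torch.tanh(eeg_codes), dim=1)
    m = F.normalize(torch.tanh(memory_neg), dim=1)
    logits = torch.cat([v @ e.t(), v @ m.t()], dim=1) / temperature  # (B, B + M)
    targets = torch.arange(v.size(0))                                # positives on the diagonal
    return F.cross_entropy(logits, targets)

loss = contrastive_hash_loss(torch.randn(8, 64), torch.randn(8, 64), torch.randn(32, 64))
print(loss)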
9. Yang HF, Tu CH, Chen CS. Learning Binary Hash Codes Based on Adaptable Label Representations. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:6961-6975. [PMID: 34288878] [DOI: 10.1109/tnnls.2021.3095399]
Abstract
The goal of supervised hashing is to construct hash mappings from collections of images and semantic annotations such that semantically relevant images are embedded nearby in the learned binary hash representations. Existing deep supervised hashing approaches that employ classification frameworks with a classification training objective for learning hash codes often encode class labels as one-hot or multi-hot vectors. We argue that such label encodings do not well reflect semantic relations among classes and instead, effective class label representations ought to be learned from data, which could provide more discriminative signals for hashing. In this article, we introduce Adaptive Labeling Deep Hashing (AdaLabelHash) that learns binary hash codes based on learnable class label representations. We treat the class labels as the vertices of a K-dimensional hypercube, which are trainable variables and adapted together with network weights during the backward network training procedure. The label representations, referred to as codewords, are the target outputs of hash mapping learning. In the label space, semantically relevant images are then expressed by the codewords that are nearby regarding Hamming distances, yielding compact and discriminative binary hash representations. Furthermore, we find that the learned label representations well reflect semantic relations. Our approach is easy to realize and can simultaneously construct both the label representations and the compact binary embeddings. Quantitative and qualitative evaluations on several popular benchmarks validate the superiority of AdaLabelHash in learning effective binary codes for image search.
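The core idea of adaptable label representations, class labels as trainable K-dimensional vectors that serve as hashing targets and are binarised by sign() at retrieval time, can be sketched as follows; the loss form and dimensions are illustrative assumptions rather than the AdaLabelHash recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptableCodewords(nn.Module):
    # Class labels as trainable K-dim vectors; image codes are trained to land
    # near their class codeword, and binary codewords come from sign().
    def __init__(self, num_classes, k):
        super().__init__()
        self.codewords = nn.Parameter(torch.randn(num_classes, k))

    def loss(self, image_codes, labels):
        target = torch.tanh(self.codewords[labels])   # keep codewords near hypercube vertices
        return F.mse_loss(torch.tanh(image_codes), target)

    def binary(self):
        return torch.sign(self.codewords)             # {-1, +1} codewords for retrieval

cw = AdaptableCodewords(num_classes=10, k=48)
print(cw.loss(torch.randn(16, 48), torch.randint(0, 10, (16,))), cw.binary().shape)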
10. Gao C, Cai G, Jiang X, Zheng F, Zhang J, Gong Y, Lin F, Sun X, Bai X. Conditional Feature Learning Based Transformer for Text-Based Person Search. IEEE Transactions on Image Processing 2022; 31:6097-6108. [PMID: 36103442] [DOI: 10.1109/tip.2022.3205216]
Abstract
Text-based person search aims at retrieving the target person in an image gallery using a descriptive sentence of that person. The core of this task is to calculate a similarity score between the pedestrian image and the description, which requires inferring the complex latent correspondence between image sub-regions and textual phrases at different scales. The Transformer is an intuitive way to model this complex alignment via its self-attention mechanism. Most previous Transformer-based methods simply concatenate image region features and text features as input and learn a cross-modal representation in a brute-force manner. Such weakly supervised learning approaches fail to explicitly build alignment between image region features and text features, causing an inferior feature distribution. In this paper, we present CFLT, a Conditional Feature Learning based Transformer. It maps the sub-regions and phrases into a unified latent space and explicitly aligns them by constructing conditional embeddings, where the feature of data from one modality is dynamically adjusted based on the data from the other modality. The output of our CFLT is a set of similarity scores for each sub-region or phrase rather than a cross-modal representation. Furthermore, we propose a simple and effective multi-modal re-ranking method named Re-ranking scheme by Visual Conditional Feature (RVCF). Benefiting from the visual conditional feature and the better feature distribution in our CFLT, the proposed RVCF achieves a significant performance improvement. Experimental results show that our CFLT outperforms the state-of-the-art methods by 7.03% in terms of top-1 accuracy and 5.01% in terms of top-5 accuracy on the text-based person search dataset.
11. Zhu L, Wang T, Li J, Zhang Z, Shen J, Wang X. Efficient Query-based Black-Box Attack against Cross-modal Hashing Retrieval. ACM Trans Inf Syst 2022. [DOI: 10.1145/3559758]
Abstract
Deep cross-modal hashing retrieval models inherit the vulnerability of deep neural networks. They are vulnerable to adversarial attacks, especially in the form of subtle perturbations to the inputs. Although many adversarial attack methods have been proposed to evaluate the robustness of hashing retrieval models, they still suffer from two problems: 1) most of them are based on white-box settings, which are usually unrealistic in practical applications, and 2) iterative optimization for the generation of adversarial examples results in heavy computation. To address these problems, we propose an Efficient Query-based Black-Box Attack (EQB²A) against deep cross-modal hashing retrieval, which can efficiently generate adversarial examples for the black-box attack. Specifically, by sending a few query requests to the attacked retrieval system, cross-modal retrieval model stealing is performed based on the neighbor relationship between the retrieved results and the query, thus obtaining knockoffs to substitute the attacked system. A multi-modal knockoffs-driven adversarial generation is proposed to achieve efficient adversarial example generation. Once the entire network training converges, EQB²A can efficiently generate adversarial examples by forward-propagation with only given benign images. Experiments show that EQB²A achieves superior attacking performance under the black-box setting.
Affiliation(s)
- Lei Zhu: Shandong Normal University and Peng Cheng Laboratory, China
- Jingjing Li: University of Electronic Science and Technology of China, China
- Zheng Zhang: Harbin Institute of Technology, Shenzhen, China
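The knockoff-training step, teaching a surrogate model that items returned by the black-box system should stay closer to the query than random items, can be expressed with a simple margin loss such as the one below; this is only a schematic of the model-stealing idea, and the multi-modal adversarial generator of EQB²A is not shown.

import torch
import torch.nn.functional as F

def knockoff_loss(query_code, neighbor_codes, random_codes, margin=0.5):
    # Train a surrogate ("knockoff") so that items the black-box system returned
    # as neighbors stay closer to the query than randomly drawn items.
    q = F.normalize(torch.tanh(query_code), dim=-1)
    pos = F.normalize(torch.tanh(neighbor_codes), dim=-1)
    neg = F.normalize(torch.tanh(random_codes), dim=-1)
    pos_sim = (q * pos).sum(-1).mean()
    neg_sim = (q * neg).sum(-1).mean()
    return F.relu(margin - pos_sim + neg_sim)

print(knockoff_loss(torch.randn(64), torch.randn(10, 64), torch.randn(10, 64)))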
12. Wei Z, Yang X, Wang N, Gao X. Flexible Body Partition-Based Adversarial Learning for Visible Infrared Person Re-Identification. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:4676-4687. [PMID: 33651699] [DOI: 10.1109/tnnls.2021.3059713]
Abstract
Person re-identification (Re-ID) aims to retrieve images of the same person across disjoint camera views. Most Re-ID studies focus on pedestrian images captured by visible cameras, without considering the infrared images obtained in dark scenarios. Person retrieval between the visible and infrared modalities is of great significance to public security. Current methods usually train a model to extract global feature descriptors and obtain discriminative representations for visible infrared person Re-ID (VI-REID). Nevertheless, they ignore the detailed information of heterogeneous pedestrian images, which affects the performance of Re-ID. In this article, we propose a flexible body partition (FBP) model-based adversarial learning method (FBP-AL) for VI-REID. To learn more fine-grained information, the FBP model is exploited to automatically distinguish part representations according to the feature maps of pedestrian images. Specifically, we design a modality classifier and introduce adversarial learning, which attempts to discriminate features between the visible and infrared modalities. Adaptive weighting-based representation learning and threefold triplet loss-based metric learning compete with modality classification to obtain more effective modality-sharable features, thus shrinking the cross-modality gap and enhancing feature discriminability. Extensive experimental results on two cross-modality person Re-ID datasets, i.e., SYSU-MM01 and RegDB, exhibit the superiority of the proposed method compared with state-of-the-art solutions.
13. Xie Y, Zeng X, Wang T, Yi Y. Online deep hashing for both uni-modal and cross-modal retrieval. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.07.039]
14. Feng F, Ming Y, Hu N. SSLNet: A network for cross-modal sound source localization in visual scenes. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.05.098]
15. Li J, Yu E, Ma J, Chang X, Zhang H, Sun J. Discrete Fusion Adversarial Hashing for cross-modal retrieval. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109503]
16. Zhang PF, Bai G, Yin H, Huang Z. Proactive Privacy-preserving Learning for Cross-modal Retrieval. ACM Trans Inf Syst 2022. [DOI: 10.1145/3545799]
Abstract
Deep cross-modal retrieval techniques have recently achieved remarkable performance, which also poses severe threats to data privacy potentially. Nowadays, enormous user-generated contents that convey personal information are released and shared on the Internet. One may abuse a retrieval system to pinpoint sensitive information of a particular Internet user, causing privacy leakage. In this paper, we propose a data-centric Proactive Privacy-preserving Cross-modal Learning (PPCL) algorithm, which fulfills the protection purpose by employing a generator to transform original data into adversarial data with quasi-imperceptible perturbations before releasing them. When the data source is infiltrated, the inside adversarial data can confuse retrieval models under the attacker's control to make erroneous predictions. We consider the protection under a realistic and challenging setting where the prior knowledge of malicious models is agnostic. To handle this, a surrogate retrieval model is instead introduced, acting as the target to fool. The whole network is trained under a game theoretical framework, where the generator and the retrieval model persistently evolve to fight against each other. To facilitate the optimization, a Gradient Reversal Layer (GRL) module is inserted between two models, enabling a one-step learning fashion. Extensive experiments on widely-used realistic datasets prove the effectiveness of the proposed method.
Affiliation(s)
- Zi Huang: The University of Queensland, Australia
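The Gradient Reversal Layer mentioned above is a standard construct: it is the identity in the forward pass and multiplies gradients by a negative factor in the backward pass, which is what lets the generator and the surrogate retrieval model optimise opposing objectives in one step. A minimal PyTorch version (the surrounding PPCL networks are not reproduced):

import torch

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; multiplies gradients by -lambda in the backward
    # pass, so the two networks around it optimise opposing objectives.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

x = torch.randn(4, 8, requires_grad=True)
grad_reverse(x, lam=0.5).sum().backward()
print(x.grad[0, 0])  # -0.5: the gradient of sum() (= 1) reversed and scaled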
17. Generative adversarial network based on semantic consistency for text-to-image generation. Appl Intell 2022. [DOI: 10.1007/s10489-022-03660-8]
18. Jing T, Xia H, Hamm J, Ding Z. Augmented Multimodality Fusion for Generalized Zero-Shot Sketch-Based Visual Retrieval. IEEE Transactions on Image Processing 2022; 31:3657-3668. [PMID: 35576409] [DOI: 10.1109/tip.2022.3173815]
Abstract
Zero-shot sketch-based image retrieval (ZS-SBIR) has attracted great attention recently, due to the potential application of sketch-based retrieval under zero-shot scenarios, where the categories of query sketches and gallery photos are not observed in the training stage. However, the more general and practical scenario, in which the query sketches and gallery photos contain both seen and unseen categories, remains insufficiently explored. Such a problem is defined as generalized zero-shot sketch-based image retrieval (GZS-SBIR), which is the focus of this work. To this end, we propose a novel Augmented Multi-modality Fusion (AMF) framework to generalize seen concepts to unobserved ones efficiently. Specifically, a novel knowledge discovery module named cross-domain augmentation is designed in both the visual and semantic spaces to mimic novel knowledge unseen in the training stage, which is the key to handling the GZS-SBIR challenge. Moreover, a triplet domain alignment module is proposed to couple the cross-domain distribution between photo and sketch in the visual space. To enhance the robustness of our model, we explore embedding propagation to refine both visual and semantic features by removing undesired noise. Eventually, visual-semantic fusion representations are concatenated for further domain discrimination and task-specific recognition, which tends to trigger cross-domain alignment in both the visual and semantic feature spaces. Experimental evaluations are conducted on popular ZS-SBIR benchmarks as well as a new evaluation protocol designed for GZS-SBIR from the DomainNet dataset with more diverse sub-domains, and the promising results demonstrate the superiority of the proposed solution over other baselines. The source code is available at https://github.com/scottjingtt/AMF_GZS_SBIR.git.
19. Xu L, Zeng X, Zheng B, Li W. Multi-Manifold Deep Discriminative Cross-Modal Hashing for Medical Image Retrieval. IEEE Transactions on Image Processing 2022; 31:3371-3385. [PMID: 35507618] [DOI: 10.1109/tip.2022.3171081]
Abstract
Benefiting from low storage cost and high retrieval efficiency, hash learning has become a widely used retrieval technique for approximate nearest neighbor search. Within it, cross-modal medical hashing has attracted increasing attention for facilitating efficient clinical decisions. However, there are still two main challenges: weak preservation of multi-manifold structure across multiple modalities, and weak discriminability of hash codes. Specifically, existing cross-modal hashing methods focus on pairwise relations within two modalities and ignore the underlying multi-manifold structures across more than two modalities. In addition, little consideration is given to discriminability, i.e., the requirement that any pair of hash codes be different. In this paper, we propose a novel hashing method named multi-manifold deep discriminative cross-modal hashing (MDDCH) for large-scale medical image retrieval. The key point is a multi-modal manifold similarity which integrates multiple sub-manifolds defined on heterogeneous data to preserve the correlation among instances, and it can be measured by a three-step connection on the corresponding hetero-manifold. We then propose a discriminative term to make the hash codes produced by the hash functions distinct from one another, which improves the discriminative performance of the hash codes. Besides, we introduce a Gaussian-binary Restricted Boltzmann Machine to directly output hash codes without using any continuous relaxation. Experiments on three benchmark datasets (AIBL, Brain and SPLP) show that our proposed MDDCH achieves comparable performance to recent state-of-the-art hashing methods. Additionally, diagnostic evaluation by professional physicians shows that all the retrieved medical images describe the same object and illness as the queried image.
20. Hou C, Li Z, Wu J. Unsupervised hash retrieval based on multiple similarity matrices and text self-attention mechanism. Appl Intell 2022. [DOI: 10.1007/s10489-021-02804-6]
21. Yu Z, Wu S, Dou Z, Bakker EM. Deep hashing with self-supervised asymmetric semantic excavation and margin-scalable constraint. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.01.082]
22. Khan A, Hayat S, Ahmad M, Wen J, Farooq MU, Fang M, Jiang W. Cross-modal retrieval based on deep regularized hashing constraints. Int J Intell Syst 2022. [DOI: 10.1002/int.22853]
Affiliation(s)
- Asad Khan: School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China
- Sakander Hayat: School of Mathematics and Information Sciences, Guangzhou University, Guangzhou, China
- Muhammad Ahmad: Department of Computer Science, National University of Computer and Emerging Sciences (NUCES-FAST), Faisalabad Campus, Chiniot, Pakistan
- Jinyu Wen: School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China
- Muhammad Umar Farooq: School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
- Meie Fang: School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China
- Wenchao Jiang: School of Computers, Guangdong University of Technology, Guangzhou, China
23. Semantic-guided autoencoder adversarial hashing for large-scale cross-modal retrieval. Complex Intell Syst 2022. [DOI: 10.1007/s40747-021-00615-3]
Abstract
With the vigorous development of mobile Internet technology and the popularization of smart devices, the amount of multimedia data has exploded and its forms have become more and more diversified. People's demand for information is no longer satisfied by single-modal data retrieval, and cross-modal retrieval has become a research hotspot in recent years. Due to the strong feature learning ability of deep learning, cross-modal deep hashing has been extensively studied. However, the similarity of different modalities is difficult to measure directly because of their different distributions and representations. Therefore, it is urgent to eliminate the modality gap and improve retrieval accuracy. Some previous research work has introduced GANs into cross-modal hashing to reduce semantic differences between different modalities. However, most of the existing GAN-based cross-modal hashing methods suffer from issues such as unstable network training and vanishing gradients, which hinder the elimination of modality differences. To solve this issue, this paper proposes a novel Semantic-guided Autoencoder Adversarial Hashing method for cross-modal retrieval (SAAH). First of all, two kinds of adversarial autoencoder networks, under the guidance of semantic multi-labels, maximize the semantic relevance of instances and maintain cross-modal invariance. Secondly, under the supervision of semantics, the adversarial module guides the feature learning process and maintains the modality relations. In addition, to maintain the inter-modal correlation of all similar pairs, this paper uses two types of loss functions to maintain the similarity. To verify the effectiveness of our proposed method, sufficient experiments were conducted on three widely used cross-modal datasets (MIRFLICKR, NUS-WIDE and MS COCO) and compared with several representative advanced cross-modal retrieval methods; SAAH achieved leading retrieval performance.
24. Multimodal graph inference network for scene graph generation. Appl Intell 2021. [DOI: 10.1007/s10489-021-02304-7]
25. Yu E, Ma J, Sun J, Chang X, Zhang H, Hauptmann AG. Deep Discrete Cross-Modal Hashing with Multiple Supervision. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.11.035]
26. A rough margin-based multi-task ν-twin support vector machine for pattern classification. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107769]
27. Wang C, Bai X, Wang X, Liu X, Zhou J, Wu X, Li H, Tao D. Self-Supervised Multiscale Adversarial Regression Network for Stereo Disparity Estimation. IEEE Transactions on Cybernetics 2021; 51:4770-4783. [PMID: 32649284] [DOI: 10.1109/tcyb.2020.2999492]
Abstract
Deep learning approaches have significantly contributed to recent progress in stereo matching. These deep stereo matching methods are usually based on supervised training, which requires a large amount of high-quality ground-truth depth map annotations that are expensive to collect. Furthermore, only a limited quantity of stereo vision training data is currently available, obtained either from active sensors (LiDAR and ToF cameras) or through computer graphics simulations, which does not meet the requirements of deep supervised training. Here, we propose a novel deep stereo approach called the "self-supervised multiscale adversarial regression network (SMAR-Net)," which relaxes the need for ground-truth depth maps for training. Specifically, we design a two-stage network. The first stage is a disparity regressor, in which a regression network estimates disparity values from stacked stereo image pairs. The stereo image stacking method is a novel contribution, as it not only contains the spatial appearances of stereo images but also implies matching correspondences with different disparity values. In the second stage, a synthetic left image is generated based on the left-right consistency assumption. Our network is trained by minimizing a hybrid loss function composed of a content loss and an adversarial loss. The content loss minimizes the average warping error between the synthetic images and the real ones. In contrast to the generative adversarial loss, our proposed adversarial loss penalizes mismatches using multiscale features. This constrains the synthetic image and real image to be pixelwise identical instead of just belonging to the same distribution. Furthermore, the combined utilization of multiscale feature extraction in both the content loss and the adversarial loss further improves the adaptability of SMAR-Net in ill-posed regions. Experiments on multiple benchmark datasets show that SMAR-Net outperforms the current state-of-the-art self-supervised methods and achieves comparable outcomes to supervised methods. The source code can be accessed at: https://github.com/Dawnstar8411/SMAR-Net.
28. Wang J, Xu S, Zheng F, Lu K, Song J, Shao L. Learning Efficient Hash Codes for Fast Graph-Based Data Similarity Retrieval. IEEE Transactions on Image Processing 2021; 30:6321-6334. [PMID: 34224353] [DOI: 10.1109/tip.2021.3093387]
Abstract
Traditional operations, e.g. graph edit distance (GED), are no longer suitable for processing the massive quantities of graph-structured data now available, due to their irregular structures and high computational complexities. With the advent of graph neural networks (GNNs), the problems of graph representation and graph similarity search have drawn particular attention in the field of computer vision. However, GNNs have been less studied for efficient and fast retrieval after graph representation. To represent graph-based data, and maintain fast retrieval while doing so, we introduce an efficient hash model with graph neural networks (HGNN) for a newly designed task (i.e. fast graph-based data retrieval). Due to its flexibility, HGNN can be implemented in both an unsupervised and supervised manner. Specifically, by adopting a graph neural network and hash learning algorithms, HGNN can effectively learn a similarity-preserving graph representation and compute pair-wise similarity or provide classification via low-dimensional compact hash codes. To the best of our knowledge, our model is the first to address graph hashing representation in the Hamming space. Our experimental results reach comparable prediction accuracy to full-precision methods and can even outperform traditional models in some cases. In real-world applications, using hash codes can greatly benefit systems with smaller memory capacities and accelerate the retrieval speed of graph-structured data. Hence, we believe the proposed HGNN has great potential in further research.
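A toy version of graph-level hashing with a GNN, one round of neighbour averaging, mean pooling, and a tanh-relaxed hash head whose sign() gives the binary code, is sketched below; it is a generic illustration with arbitrary dimensions, not the specific HGNN architecture.

import torch
import torch.nn as nn

class TinyGraphHasher(nn.Module):
    # One round of neighbour averaging, mean pooling over nodes, then a hash head;
    # sign() of the output yields the binary code used for Hamming-space retrieval.
    def __init__(self, feat_dim, code_len):
        super().__init__()
        self.msg = nn.Linear(feat_dim, feat_dim)
        self.hash = nn.Linear(feat_dim, code_len)

    def forward(self, node_feats, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = torch.relu(self.msg((adj @ node_feats) / deg))  # aggregate neighbours
        g = h.mean(dim=0)                                    # graph-level embedding
        return torch.tanh(self.hash(g))                      # relaxed code; apply sign() at index time

hasher = TinyGraphHasher(feat_dim=16, code_len=32)
adj = (torch.rand(6, 6) > 0.5).float()
code = torch.sign(hasher(torch.randn(6, 16), adj))
print(code.shape)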
29. Beltrán LVB, Caicedo JC, Journet N, Coustaty M, Lecellier F, Doucet A. Deep multimodal learning for cross-modal retrieval: One model for all tasks. Pattern Recognit Lett 2021. [DOI: 10.1016/j.patrec.2021.02.021]
30. Chen Y, Huang R, Chang H, Tan C, Xue T, Ma B. Cross-Modal Knowledge Adaptation for Language-Based Person Search. IEEE Transactions on Image Processing 2021; 30:4057-4069. [PMID: 33788687] [DOI: 10.1109/tip.2021.3068825]
Abstract
In this paper, we present a method named Cross-Modal Knowledge Adaptation (CMKA) for language-based person search. We argue that the image and text information are not equally important in determining a person's identity. In other words, the image carries image-specific information such as lighting conditions and background, while the text contains more modality-agnostic information that is more beneficial to cross-modal matching. Based on this consideration, we propose CMKA to adapt the knowledge of the image to the knowledge of the text. Specifically, text-to-image guidance is obtained at different levels: individuals, lists, and classes. By combining these levels of knowledge adaptation, the image-specific information is suppressed, and the common space of image and text is better constructed. We conduct experiments on the CUHK-PEDES dataset. The experimental results show that the proposed CMKA outperforms the state-of-the-art methods.
31. Zhao W, Guan Z, Luo H, Peng J, Fan J. Deep Multiple Instance Hashing for Fast Multi-Object Image Search. IEEE Transactions on Image Processing 2021; 30:7995-8007. [PMID: 34554911] [DOI: 10.1109/tip.2021.3112011]
Abstract
Multi-keyword query is widely supported in text search engines. However, an analogue in image retrieval systems, multi-object query, is rarely studied. Meanwhile, traditional object-based image retrieval methods often involve multiple separate steps. In this work, we propose a weakly-supervised Deep Multiple Instance Hashing (DMIH) approach for multi-object image retrieval. Our DMIH approach, which leverages a popular CNN model to build the end-to-end relation between a raw image and the binary hash codes of its multiple objects, can support multi-object queries effectively and integrate object detection with hashing learning seamlessly. We treat object detection as a binary multiple instance learning (MIL) problem, and such instances are automatically extracted from multi-scale convolutional feature maps. We also design a conditional random field (CRF) module to capture both the semantic and spatial relations among different class labels. For hash training, we sample image pairs to learn their semantic relationships in terms of the hash codes of the most probable proposals for their labels, as guided by the object predictors. The two objectives benefit each other in a multi-task learning scheme. Finally, a two-level inverted index method is proposed to further speed up the retrieval of multi-object queries. Our DMIH approach outperforms state-of-the-art methods on public benchmarks for object-based image retrieval and achieves promising results for multi-object queries.
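The inverted-index idea for multi-object queries can be approximated by keying a map on per-object hash codes and intersecting the posting lists of all queried objects; the exact-match toy below ignores Hamming-radius lookup and the paper's two-level design.

from collections import defaultdict

def build_index(image_object_codes):
    # Map each object hash code (as a tuple of bits) to the images containing it.
    index = defaultdict(set)
    for image_id, codes in image_object_codes.items():
        for code in codes:
            index[tuple(code)].add(image_id)
    return index

def multi_object_query(index, query_codes):
    # Return images that contain every queried object code (exact-match toy version).
    hits = [index.get(tuple(c), set()) for c in query_codes]
    return set.intersection(*hits) if hits else set()

db = {"img1": [(1, 0, 1, 1), (0, 0, 1, 0)], "img2": [(1, 0, 1, 1)]}
idx = build_index(db)
print(multi_object_query(idx, [(1, 0, 1, 1), (0, 0, 1, 0)]))  # {'img1'}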
32. Zhu S, Feng Y, Zhou M, Qiang B, Fang B, Wei R. Prototype-Based Discriminative Feature Representation for Class-incremental Cross-modal Retrieval. Int J Pattern Recogn 2020. [DOI: 10.1142/s021800142150018x]
Abstract
Cross-modal retrieval aims to retrieve related items from various modalities with respect to a query of any type. The key challenge of cross-modal retrieval is to learn more discriminative representations between different categories, as well as to extend to unseen-class retrieval in the open-world retrieval task. To tackle the above problems, in this paper, we propose a prototype learning-based discriminative feature learning (PLDFL) method to learn more discriminative representations in a common space. First, we utilize a prototype learning algorithm to cluster samples labeled with the same semantic class, jointly taking into consideration the intra-class compactness and inter-class sparsity without discriminative treatments. Second, we use a weight-sharing strategy to model the correlations of cross-modal samples to narrow the modality gap. Finally, we apply the prototypes to achieve class-incremental learning, demonstrating the robustness of our proposed approach. According to our experimental results, significant retrieval performance in terms of mAP can be achieved on average compared to several state-of-the-art approaches.
Affiliation(s)
- Shaoquan Zhu: College of Computer Science, Chongqing University, Chongqing 400030, P. R. China; Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education, Chongqing 400030, P. R. China
- Yong Feng: College of Computer Science, Chongqing University, Chongqing 400030, P. R. China; Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education, Chongqing 400030, P. R. China
- Mingliang Zhou: State Key Lab of IoT for Smart City, CIS, University of Macau, Macau SAR 999078, P. R. China
- Baohua Qiang: Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, P. R. China; Guangxi Key Laboratory of Optoelectronic Information Processing, Guilin University of Electronic Technology, Guilin 541004, P. R. China
- Bin Fang: College of Computer Science, Chongqing University, Chongqing 400030, P. R. China; Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education, Chongqing 400030, P. R. China
- Ran Wei: Chongqing Medical Data Information Technology Co., Ltd, Building 3, Block B, Administration Centre, Nanan District, Chongqing 401336, P. R. China
34. Deep Semantic-Preserving Reconstruction Hashing for Unsupervised Cross-Modal Retrieval. Entropy 2020; 22:e22111266. [PMID: 33287034] [PMCID: PMC7712897] [DOI: 10.3390/e22111266]
Abstract
Deep hashing is the mainstream algorithm for large-scale cross-modal retrieval due to its high retrieval speed and low storage cost, but the reconstruction of modal semantic information remains very challenging. In order to further address the problem of semantic reconstruction in unsupervised cross-modal retrieval, we propose a novel deep semantic-preserving reconstruction hashing (DSPRH). The algorithm combines spatial and channel semantic information, and mines modal semantic information based on adaptive self-encoding and a joint semantic reconstruction loss. The main contributions are as follows: (1) We introduce a new spatial pooling network module based on tensor regular-polymorphic decomposition theory to generate a rank-1 tensor that captures high-order context semantics, which can assist the backbone network in capturing important contextual modal semantic information. (2) From an optimization perspective, we use global covariance pooling to capture channel semantic information and accelerate network convergence. In the feature reconstruction layer, we use two bottleneck autoencoders to achieve visual-text modal interaction. (3) In metric learning, we design a new loss function to optimize the model parameters, which preserves the correlation between image modalities and text modalities. The DSPRH algorithm is tested on MIRFlickr-25K and NUS-WIDE. The experimental results show that DSPRH achieves better performance on retrieval tasks.
35. Fan C, Liu P, Xiao T, Zhao W, Tang X. Domain adaptation based on domain-invariant and class-distinguishable feature learning using multiple adversarial networks. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.06.044]