1
Wang Z, Gao Z, Yang Y, Wang G, Jiao C, Shen HT. Geometric Matching for Cross-Modal Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:5509-5521. [PMID: 38652629] [DOI: 10.1109/tnnls.2024.3381347]
Abstract
Despite its significant progress, cross-modal retrieval still suffers from one-to-many matching cases, where a given query may correspond to multiple semantic instances in the other modality. However, existing approaches usually map heterogeneous data into the learned space as deterministic point vectors. In spite of their remarkable performance in matching the most similar instance, such deterministic point embeddings cannot sufficiently represent the rich semantics of one-to-many correspondence. To address this limitation, we extend a deterministic point into a closed geometry and develop geometric representation learning methods for cross-modal retrieval. Thus, the set of points inside such a geometry can be semantically related to many candidates, and the semantic uncertainty can be captured effectively. We then introduce two types of geometric matching for one-to-many correspondence, i.e., point-to-rectangle matching (dubbed P2RM) and rectangle-to-rectangle matching (termed R2RM). The former treats all retrieved candidates as rectangles with zero volume (equivalent to points) and the query as a box, while the latter encodes all heterogeneous data into rectangles. Therefore, semantic similarity among heterogeneous data can be evaluated by the Euclidean distance from a point to a rectangle or by the volume of intersection between two rectangles. Additionally, both strategies can be easily applied to off-the-shelf approaches and further improve the retrieval performance of baselines. Under various evaluation metrics, extensive experiments and ablation studies on several commonly used datasets, two for image-text matching and two for video-text retrieval, demonstrate the effectiveness and superiority of our method.
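As a rough illustration of the two matching scores described above (not the authors' implementation), the sketch below computes the Euclidean distance from a point to an axis-aligned rectangle and the intersection volume of two rectangles; the [min corner, max corner] box format is an assumption.

import numpy as np

def point_to_box_distance(p, box_min, box_max):
    # Distance is zero inside the box; otherwise the norm of the
    # per-dimension overshoot beyond the nearest face.
    overshoot = np.maximum(box_min - p, 0) + np.maximum(p - box_max, 0)
    return np.linalg.norm(overshoot)

def box_intersection_volume(a_min, a_max, b_min, b_max):
    # Overlap length per dimension, clipped at zero, multiplied together.
    overlap = np.minimum(a_max, b_max) - np.maximum(a_min, b_min)
    return float(np.prod(np.clip(overlap, 0, None)))

# Toy 2-D example: a query box and two candidate points.
q_min, q_max = np.array([0.0, 0.0]), np.array([2.0, 1.0])
print(point_to_box_distance(np.array([1.0, 0.5]), q_min, q_max))  # 0.0 (inside the box)
print(point_to_box_distance(np.array([3.0, 2.0]), q_min, q_max))  # ~1.414
print(box_intersection_volume(q_min, q_max, np.array([1.0, 0.5]), np.array([3.0, 2.0])))  # 0.5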
2
Alizadeh SM, Helfroush MS, Celebi ME. An Innovative Attention-based Triplet Deep Hashing Approach to Retrieve Histopathology Images. Journal of Imaging Informatics in Medicine 2024. [PMID: 39528884] [DOI: 10.1007/s10278-024-01310-8]
Abstract
Content-based histopathology image retrieval (CBHIR) can assist in the diagnosis of different diseases. The retrieval procedure can be complex and time-consuming if high-dimensional features are required. Thus, hashing techniques are employed to address these issues by mapping the feature space into binary values of varying lengths. The performance of deep hashing approaches in image retrieval is often superior to that of traditional hashing methods. Among deep hashing approaches, triplet-based models are typically more effective than pairwise ones. Recent studies have demonstrated that incorporating the attention mechanism into a deep hashing approach can improve its effectiveness in retrieving images. This paper presents an innovative triplet deep hashing strategy based on the attention mechanism for retrieving histopathology images, called histopathology attention triplet deep hashing (HATDH). Three deep attention-based hashing models with identical architectures and weights are employed to produce binary values. The proposed attention module can aid the models in extracting features more efficiently. Moreover, we introduce an improved triplet loss function considering pair inputs separately in addition to triplet inputs for increasing efficiency during the training and retrieval steps. Based on experiments conducted on two public histopathology datasets, BreakHis and Kather, HATDH significantly outperforms state-of-the-art hashing algorithms.
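For readers unfamiliar with triplet-based hashing, a minimal sketch of a margin-based triplet loss over relaxed (tanh-activated) hash codes is shown below; it is generic and does not reproduce HATDH's improved loss or its attention module.

import torch
import torch.nn.functional as F

def triplet_hash_loss(anchor, positive, negative, margin=0.5):
    # anchor/positive/negative: real-valued network outputs, shape (B, n_bits).
    # tanh keeps the relaxed codes in [-1, 1] before sign() binarization at test time.
    a, p, n = torch.tanh(anchor), torch.tanh(positive), torch.tanh(negative)
    d_ap = F.pairwise_distance(a, p)   # anchor-positive distance
    d_an = F.pairwise_distance(a, n)   # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()

codes = torch.randn(8, 48, requires_grad=True)
loss = triplet_hash_loss(codes, codes + 0.1 * torch.randn(8, 48), torch.randn(8, 48))
loss.backward()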
Affiliation(s)
- M Emre Celebi
- Department of Computer Science and Engineering, University of Central Arkansas, Conway, AR, 72035, USA
3
Liang X, Yang E, Yang Y, Deng C. Multi-Relational Deep Hashing for Cross-Modal Search. IEEE Transactions on Image Processing 2024; 33:3009-3020. [PMID: 38625760] [DOI: 10.1109/tip.2024.3385656]
Abstract
Deep cross-modal hashing retrieval has recently made significant progress. However, existing methods generally learn hash functions with pairwise or triplet supervisions, which involves learning the relevant information by splicing partial similarity between data pairs; notably, this approach only captures the data similarity locally and incompletely, resulting in sub-optimal retrieval performance. In this paper, we propose a novel Multi-Relational Deep Hashing (MRDH) approach, which can fully bridge the modality gap by comprehensively modeling the similarity relationship between data in different modalities. In more detail, to investigate the inter-modal relationships, we constrain the consistency of cross-modal pairwise similarities to maintain the semantic similarity across modalities. Moreover, to further capture complete similarity information, we design a new similarity metric, which we term cross-modal global similarity, by encouraging hash codes of similar data pairs from different modalities to approach a common center and hash codes for dissimilar pairs to converge to different centers. Adopting this approach enables our model to generate more discriminative hash codes. Extensive experiments on three benchmark datasets demonstrate the superiority of our method on cross-modal hashing retrieval.
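A rough sketch of the shared-center idea (hash codes of similar cross-modal pairs pulled toward a common center so that dissimilar pairs end up near different centers) might look like the following; the assignment of centers by class label and the squared-distance form are assumptions, not the paper's exact formulation.

import torch

def center_similarity_loss(img_codes, txt_codes, labels, centers):
    # img_codes/txt_codes: (B, n_bits) relaxed codes; labels: (B,) class ids;
    # centers: (n_classes, n_bits) learnable class centers.
    c = torch.tanh(centers)[labels]            # center assigned to each sample
    pull = ((torch.tanh(img_codes) - c) ** 2).sum(1) \
         + ((torch.tanh(txt_codes) - c) ** 2).sum(1)
    return pull.mean()

B, n_bits, n_classes = 16, 32, 10
centers = torch.randn(n_classes, n_bits, requires_grad=True)
loss = center_similarity_loss(torch.randn(B, n_bits), torch.randn(B, n_bits),
                              torch.randint(0, n_classes, (B,)), centers)
loss.backward()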
4
Bai C, Zeng C, Ma Q, Zhang J. Graph Convolutional Network Discrete Hashing for Cross-Modal Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:4756-4767. [PMID: 35604998] [DOI: 10.1109/tnnls.2022.3174970]
Abstract
With the rapid development of deep neural networks, cross-modal hashing has made great progress. However, the information carried by different modalities is asymmetric: a sufficiently high-resolution image can reproduce a real-world scene almost completely, whereas text usually carries personal emotion and is less objective, so images are generally far richer in information than text. Although most existing methods unify the semantic feature extraction and hash function learning modules for end-to-end learning, they ignore this asymmetry and do not use the information-rich modality to support the information-poor one, leading to suboptimal results. Furthermore, previous methods learn hash functions in a relaxed way that causes nontrivial quantization losses. To address these issues, we propose a new method called graph convolutional network (GCN) discrete hashing. This method uses a GCN to bridge the information gap between different types of data. The GCN represents each label as a word embedding, and the embeddings are regarded as a set of interdependent object classifiers. From these classifiers, we obtain predicted labels to enhance feature representations across modalities. In addition, we use an efficient discrete optimization strategy to learn the discrete binary codes without relaxation. Extensive experiments conducted on three commonly used datasets demonstrate that our proposed graph convolutional network-based discrete hashing (GCDH) outperforms current state-of-the-art cross-modal hashing methods.
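As background, a single graph-convolution step over label embeddings (the mechanism GCDH builds on) can be written in a few lines; the symmetric normalization shown here is the standard Kipf-Welling form, and any resemblance to the paper's exact architecture is an assumption.

import numpy as np

def gcn_layer(A, X, W):
    # A: (L, L) label co-occurrence/adjacency, X: (L, d) label word embeddings,
    # W: (d, d_out) learnable weights. Returns propagated label classifiers.
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0)              # ReLU activation

L_labels, d, d_out = 24, 300, 64
classifiers = gcn_layer(np.random.rand(L_labels, L_labels),
                        np.random.randn(L_labels, d),
                        np.random.randn(d, d_out))
print(classifiers.shape)  # (24, 64)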
5
Zhang G, Li S, Wei S, Ge S, Cai N, Zhao Y. Multimodal Composition Example Mining for Composed Query Image Retrieval. IEEE Transactions on Image Processing 2024; 33:1149-1161. [PMID: 38300775] [DOI: 10.1109/tip.2024.3359062]
Abstract
The composed query image retrieval task aims to retrieve the target image from a database given a query that combines two modalities: a reference image and a sentence declaring that some details of the reference image should be modified or replaced by new elements. Tackling this task requires learning a multimodal embedding space that pulls semantically similar queries and targets close together while pushing dissimilar ones as far apart as possible. Most existing methods start from the perspective of model structure and design interactive modules to promote better fusion and embedding of the different modalities. However, their learning objectives use conventional query-level examples as negatives while neglecting the composed query's multimodal characteristics, leading to inadequate utilization of the training data and a suboptimal metric space. To this end, in this paper, we propose to improve the learning objective by constructing and mining hard negative examples from the perspective of multimodal fusion. Specifically, we compose the reference image with logically unpaired sentences rather than paired ones to create component-level negative examples, making better use of the data and enhancing the optimization of the metric space. In addition, we propose a new sentence augmentation method to generate more indistinguishable multimodal negative examples at the element level and help the model learn a better metric space. Extensive comparison experiments on four real-world datasets confirm the effectiveness of the proposed method.
6
Hoang T, Do TT, Nguyen TV, Cheung NM. Multimodal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:6289-6302. [PMID: 34982698] [DOI: 10.1109/tnnls.2021.3135420]
Abstract
In this article, we adopt the mutual information (MI) maximization approach to tackle the problem of unsupervised learning of binary hash codes for efficient cross-modal retrieval. We propose a novel method, dubbed cross-modal info-max hashing (CMIMH). First, to learn informative representations that can preserve both intramodal and intermodal similarities, we leverage recent advances in estimating the variational lower bound of MI to maximize the MI between the binary representations and the input features, and between the binary representations of different modalities. By jointly maximizing these MIs under the assumption that the binary representations are modeled by multivariate Bernoulli distributions, we can effectively learn binary representations that preserve both intramodal and intermodal similarities in a mini-batch manner with gradient descent. Furthermore, we find that trying to minimize the modality gap by learning similar binary representations for the same instance from different modalities could result in less informative representations. Hence, balancing between reducing the modality gap and losing modality-private information is important for cross-modal retrieval tasks. Quantitative evaluations on standard benchmark datasets demonstrate that the proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
7
Hu P, Zhu H, Lin J, Peng D, Zhao YP, Peng X. Unsupervised Contrastive Cross-Modal Hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:3877-3889. [PMID: 35617190] [DOI: 10.1109/tpami.2022.3177356]
Abstract
In this paper, we study how to make unsupervised cross-modal hashing (CMH) benefit from contrastive learning (CL) by overcoming two challenges. To be exact, i) to address the performance degradation issue caused by binary optimization for hashing, we propose a novel momentum optimizer that makes the hashing operation learnable within CL, thus making off-the-shelf deep cross-modal hashing possible. In other words, our method does not involve binary-continuous relaxation like most existing methods, thus enjoying better retrieval performance; ii) to alleviate the influence brought by false-negative pairs (FNPs), we propose a Cross-modal Ranking Learning loss (CRL) that utilizes the discrimination from all negative pairs instead of only the hard ones, where an FNP refers to a within-class pair that is wrongly treated as a negative pair. Thanks to such a global strategy, CRL endows our method with better performance because it does not overuse the FNPs while ignoring the true-negative pairs. To the best of our knowledge, the proposed method could be one of the first successful contrastive hashing methods. To demonstrate the effectiveness of the proposed method, we carry out experiments on five widely-used datasets and compare against 13 state-of-the-art methods. The code is available at https://github.com/penghu-cs/UCCH.
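A minimal sketch of a contrastive-style objective over relaxed hash codes that weighs all in-batch negatives rather than only the hardest one (the spirit of the ranking loss described above) could be written as follows; the InfoNCE-like form and the temperature are assumptions rather than the paper's exact CRL.

import torch
import torch.nn.functional as F

def cross_modal_contrastive(img_codes, txt_codes, temperature=0.3):
    # Relaxed hash codes for paired image/text batches, shape (B, n_bits).
    img = F.normalize(torch.tanh(img_codes), dim=1)
    txt = F.normalize(torch.tanh(txt_codes), dim=1)
    logits = img @ txt.t() / temperature          # (B, B) cross-modal similarities
    targets = torch.arange(img.size(0))           # diagonal entries are the true pairs
    # Every off-diagonal entry acts as a negative, not just the hardest one.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = cross_modal_contrastive(torch.randn(8, 64, requires_grad=True), torch.randn(8, 64))
loss.backward()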
8
CCAH: A CLIP-Based Cycle Alignment Hashing Method for Unsupervised Vision-Text Retrieval. Int J Intell Syst 2023. [DOI: 10.1155/2023/7992047]
Abstract
Due to the advantages of low storage cost and fast retrieval efficiency, deep hashing methods are widely used in cross-modal retrieval. Images are usually accompanied by corresponding text descriptions rather than labels, so unsupervised methods have attracted wide attention. However, due to the modality gap and semantic differences, existing unsupervised methods cannot adequately bridge the modal differences, leading to suboptimal retrieval results. In this paper, we propose CLIP-based cycle alignment hashing for unsupervised vision-text retrieval (CCAH), which aims to exploit the semantic link between the original features of modalities and the reconstructed features. Firstly, we design a modal cyclic interaction method that aligns semantics within each modality, where one modal feature reconstructs another modal feature, thus taking full account of the semantic similarity between intramodal and intermodal relationships. Secondly, we introduce a graph attention network (GAT) into the cross-modal retrieval task, considering the influence of neighbouring text nodes and adding attention mechanisms to capture the global features of the text modality. Thirdly, fine-grained image features are extracted using the CLIP visual encoder. Finally, hash encoding is learned through hash functions. Experiments on three widely used datasets demonstrate that our proposed CCAH achieves satisfactory results in total retrieval accuracy. Our code can be found at: https://github.com/CQYIO/CCAH.git.
9
Xia W, Wang T, Gao Q, Yang M, Gao X. Graph Embedding Contrastive Multi-Modal Representation Learning for Clustering. IEEE Transactions on Image Processing 2023; 32:1170-1183. [PMID: 37022431] [DOI: 10.1109/tip.2023.3240863]
Abstract
Multi-modal clustering (MMC) aims to explore complementary information from diverse modalities to improve clustering performance. This article studies challenging problems in MMC methods based on deep neural networks. On the one hand, most existing methods lack a unified objective to simultaneously learn inter- and intra-modality consistency, resulting in limited representation learning capacity. On the other hand, most existing processes are modeled for a finite sample set and cannot handle out-of-sample data. To handle the above two challenges, we propose a novel Graph Embedding Contrastive Multi-modal Clustering network (GECMC), which treats representation learning and multi-modal clustering as two sides of one coin rather than two separate problems. In brief, we design a contrastive loss that benefits from pseudo-labels to explore consistency across modalities. Thus, GECMC shows an effective way to maximize the similarities of intra-cluster representations while minimizing the similarities of inter-cluster representations at both the inter- and intra-modality levels. In this way, clustering and representation learning interact and jointly evolve in a co-training framework. After that, we build a clustering layer parameterized with cluster centroids, showing that GECMC can learn clustering labels from the given samples and handle out-of-sample data. GECMC yields superior results compared with 14 competitive methods on four challenging datasets. Codes and datasets are available: https://github.com/xdweixia/GECMC.
10
Li Z, Nie F, Wu D, Hu Z, Li X. Unsupervised Feature Selection With Weighted and Projected Adaptive Neighbors. IEEE Transactions on Cybernetics 2023; 53:1260-1271. [PMID: 34343100] [DOI: 10.1109/tcyb.2021.3087632]
Abstract
In the field of data mining, how to deal with high-dimensional data is a fundamental problem. If such data are used directly, it is not only computationally expensive but also difficult to obtain satisfactory results. Unsupervised feature selection is designed to reduce the dimension of data by finding a subset of features in the absence of labels. Many unsupervised methods perform feature selection by exploring spectral analysis and manifold learning, such that the intrinsic structure of the data can be preserved. However, most of these methods ignore one fact: due to the existence of noise features, the intrinsic structure built directly from the original data may be unreliable. To solve this problem, a new unsupervised feature selection model is proposed. The graph structure, feature weights, and projection matrix are learned simultaneously, such that the intrinsic structure is constructed from data that have been feature-weighted and projected. For each data point, its nearest neighbors are acquired in the process of graph construction; therefore, we call them adaptive neighbors. Besides, an additional constraint is added to the proposed model: it requires that the graph, corresponding to a similarity matrix, contain exactly c connected components. We then present an optimization algorithm to solve the proposed model, discuss how to determine the regularization parameter γ, and analyze the computational complexity of the optimization algorithm. Finally, experiments are conducted on both synthetic and real-world datasets to demonstrate the effectiveness of the proposed method.
11
Wang H, Deng C, Liu T, Tao D. Transferable Coupled Network for Zero-Shot Sketch-Based Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:9181-9194. [PMID: 34705637] [DOI: 10.1109/tpami.2021.3123315]
Abstract
Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) aims at searching for the natural images corresponding to given free-hand sketches under the more realistic and challenging scenario of Zero-Shot Learning (ZSL). Prior works concentrate on aligning the sketch and image feature representations while ignoring the explicit learning of heterogeneous feature extractors capable of aligning multi-modal features, at the expense of deteriorating the transferability from seen categories to unseen ones. To address this issue, we propose a novel Transferable Coupled Network (TCN) to effectively improve network transferability, with the constraint of soft weight-sharing among heterogeneous convolutional layers to capture similar geometric patterns, e.g., contours of sketches and images. Based on this, we further introduce and validate a general criterion for multi-modal zero-shot learning, i.e., utilizing coupled modules for mining modality-common knowledge and independent modules for learning modality-specific information. Moreover, we elaborate a simple but effective semantic metric that integrates local metric learning and a global semantic constraint into a unified formula to significantly boost performance. Extensive experiments on three popular large-scale datasets show that our proposed approach outperforms state-of-the-art methods to a remarkable extent: by more than 12% on Sketchy, 2% on TU-Berlin and 6% on QuickDraw in terms of retrieval accuracy. The project page is available at: https://haowang1992.github.io/publication/TCN.
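The soft weight-sharing constraint between heterogeneous (sketch and image) convolutional layers can be pictured as a simple penalty on the distance between corresponding kernels; the plain squared-Frobenius form below is an illustrative assumption, not TCN's exact regularizer.

import torch

def soft_weight_sharing_penalty(sketch_convs, image_convs):
    # Each argument is a list of conv weight tensors with matching shapes.
    # Instead of hard sharing (one tensor), corresponding kernels are only
    # encouraged to stay close, so each branch keeps modality-specific freedom.
    return sum(((ws - wi) ** 2).sum() for ws, wi in zip(sketch_convs, image_convs))

w_sketch = [torch.randn(64, 3, 3, 3, requires_grad=True) for _ in range(2)]
w_image = [torch.randn(64, 3, 3, 3, requires_grad=True) for _ in range(2)]
penalty = soft_weight_sharing_penalty(w_sketch, w_image)
penalty.backward()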
12
DHAN: Encrypted JPEG image retrieval via DCT histograms-based attention networks. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109935]
13
Hu P, Peng X, Zhu H, Zhen L, Lin J, Yan H, Peng D. Deep Semisupervised Multiview Learning With Increasing Views. IEEE Transactions on Cybernetics 2022; 52:12954-12965. [PMID: 34499609] [DOI: 10.1109/tcyb.2021.3093626]
Abstract
In this article, we study two challenging problems in semisupervised cross-view learning. On the one hand, most existing methods assume that the samples in all views have a pairwise relationship, that is, it is necessary to capture or establish the correspondence of different views at the sample level. Such an assumption is easily violated even in the semisupervised setting, wherein only a few samples have labels that could be used to establish the correspondence. On the other hand, almost all existing multiview methods, including semisupervised ones, usually train a model using a fixed dataset, which cannot handle the data of increasing views. In practice, the view number will increase when new sensors are deployed. To address the above two challenges, we propose a novel method that employs multiple independent semisupervised view-specific networks (ISVNs) to learn representations for multiple views in a view-decoupling fashion. The advantages of our method are two-fold. Thanks to our specifically designed autoencoder and pseudolabel learning paradigm, our method shows an effective way to utilize both the labeled and unlabeled data while relaxing the data assumption of the pairwise relationship, that is, correspondence. Furthermore, with our view-decoupling strategy, the proposed ISVNs could be separately trained, thus efficiently handling the data of increasing views without retraining the entire model. To the best of our knowledge, our ISVN could be one of the first attempts to make handling increasing views in the semisupervised setting possible, as well as an effective solution to the noncorresponding problem. To verify the effectiveness and efficiency of our method, we conduct comprehensive experiments comparing 13 state-of-the-art approaches on four multiview datasets in terms of retrieval and classification.
14
A deep hashing method of likelihood function adaptive mapping. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07962-3]
15
Zhang G, Wei S, Pang H, Qiu S, Zhao Y. Composed Image Retrieval via Explicit Erasure and Replenishment With Semantic Alignment. IEEE Transactions on Image Processing 2022; 31:5976-5988. [PMID: 36094980] [DOI: 10.1109/tip.2022.3204213]
Abstract
Composed image retrieval aims at retrieving the desired images, given a reference image and a text piece. To handle this task, two important subprocesses should be modeled reasonably. One is to erase irrelevant details of the reference image against the text piece, and the other is to replenish the desired details in the image against the text piece. However, existing methods neglect to distinguish between the two subprocesses and implicitly put them together to solve the composed image retrieval task. To explicitly and orderly model the two subprocesses of the task, we propose a novel composed image retrieval method which contains three key components, i.e., a Multi-semantic Dynamic Suppression module (MDS), a Text-semantic Complementary Selection module (TCS), and Semantic Space Alignment constraints (SSA). Concretely, MDS erases irrelevant details of the reference image by suppressing its semantic features. TCS aims to select and enhance the semantic features of the text piece and then replenish them to the reference image. In the end, to facilitate the erasure and replenishment subprocesses, SSA aligns the semantics of the two modality features in the final space. Extensive experiments on three benchmark datasets (Shoes, FashionIQ, and Fashion200K) show the superior performance of our approach against state-of-the-art methods.
16
Zhu L, Wang T, Li J, Zhang Z, Shen J, Wang X. Efficient Query-based Black-Box Attack against Cross-modal Hashing Retrieval. ACM Trans Inf Syst 2022. [DOI: 10.1145/3559758]
Abstract
Deep cross-modal hashing retrieval models inherit the vulnerability of deep neural networks. They are vulnerable to adversarial attacks, especially in the form of subtle perturbations to the inputs. Although many adversarial attack methods have been proposed to handle the robustness of hashing retrieval models, they still suffer from two problems: 1) most of them are based on white-box settings, which is usually unrealistic in practical applications; 2) iterative optimization for the generation of adversarial examples results in heavy computation. To address these problems, we propose an Efficient Query-based Black-Box Attack (EQB2A) against deep cross-modal hashing retrieval, which can efficiently generate adversarial examples for the black-box attack. Specifically, by sending a few query requests to the attacked retrieval system, cross-modal retrieval model stealing is performed based on the neighbor relationship between the retrieved results and the query, thus obtaining knockoffs to substitute the attacked system. A multi-modal knockoffs-driven adversarial generation method is proposed to achieve efficient adversarial example generation. Once the network training converges, EQB2A can efficiently generate adversarial examples by forward propagation given only benign images. Experiments show that EQB2A achieves superior attacking performance under the black-box setting.
Affiliation(s)
- Lei Zhu
- Shandong Normal University; Peng Cheng Laboratory, China
- Jingjing Li
- University of Electronic Science and Technology of China, China
- Zheng Zhang
- Harbin Institute of Technology, Shenzhen, China
17
Qin J, Fei L, Zhang Z, Wen J, Xu Y, Zhang D. Joint Specifics and Consistency Hash Learning for Large-Scale Cross-Modal Retrieval. IEEE Transactions on Image Processing 2022; 31:5343-5358. [PMID: 35925845] [DOI: 10.1109/tip.2022.3195059]
Abstract
With the dramatic increase in the amount of multimedia data, cross-modal similarity retrieval has become one of the most popular yet challenging problems. Hashing offers a promising solution for large-scale cross-modal data searching by embedding the high-dimensional data into the low-dimensional similarity-preserving Hamming space. However, most existing cross-modal hashing methods seek a semantic representation shared by multiple modalities, which cannot fully preserve and fuse the discriminative modality-specific features and heterogeneous similarity for cross-modal similarity searching. In this paper, we propose a joint specifics and consistency hash learning method for cross-modal retrieval. Specifically, we introduce an asymmetric learning framework to fully exploit the label information for discriminative hash code learning, where 1) each individual modality can be better converted into a meaningful subspace with specific information, 2) multiple subspaces are semantically connected to capture consistent information, and 3) the integration complexity of different subspaces is overcome so that the learned collaborative binary codes can merge the specifics with consistency. Then, we introduce an alternating iterative optimization to tackle the specifics and consistency hashing learning problem, making it scalable for large-scale cross-modal retrieval. Extensive experiments on five widely used benchmark databases clearly demonstrate the effectiveness and efficiency of our proposed method on both one-cross-one and one-cross-two retrieval tasks.
18
Xie Y, Zeng X, Wang T, Yi Y. Online deep hashing for both uni-modal and cross-modal retrieval. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.07.039]
19
Zhang Z, Li Z, Wei K, Pan S, Deng C. A survey on multimodal-guided visual content synthesis. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.04.126]
20
Xu L, Zeng X, Zheng B, Li W. Multi-Manifold Deep Discriminative Cross-Modal Hashing for Medical Image Retrieval. IEEE Transactions on Image Processing 2022; 31:3371-3385. [PMID: 35507618] [DOI: 10.1109/tip.2022.3171081]
Abstract
Benefitting from the low storage cost and high retrieval efficiency, hash learning has become a widely used retrieval technology for approximate nearest neighbor search. Within it, cross-modal medical hashing has attracted increasing attention for facilitating efficient clinical decisions. However, two main challenges remain: weak preservation of the multi-manifold structure across multiple modalities and weak discriminability of the hash codes. Specifically, existing cross-modal hashing methods focus on pairwise relations within two modalities and ignore the underlying multi-manifold structures across more than two modalities. Moreover, discriminability, i.e., that any pair of hash codes should be different, receives little consideration. In this paper, we propose a novel hashing method named multi-manifold deep discriminative cross-modal hashing (MDDCH) for large-scale medical image retrieval. The key point is a multi-modal manifold similarity which integrates multiple sub-manifolds defined on heterogeneous data to preserve the correlation among instances, and it can be measured by a three-step connection on the corresponding hetero-manifold. We then propose a discriminative term to make the hash codes produced by the hash functions distinct from one another, which improves the discriminative performance of the hash codes. Besides, we introduce a Gaussian-binary Restricted Boltzmann Machine to directly output hash codes without using any continuous relaxation. Experiments on three benchmark datasets (AIBL, Brain and SPLP) show that our proposed MDDCH achieves performance comparable to recent state-of-the-art hashing methods. Additionally, diagnostic evaluation from professional physicians shows that all the retrieved medical images describe the same object and illness as the queried image.
21
Xu X, Lin K, Gao L, Lu H, Shen HT, Li X. Learning Cross-Modal Common Representations by Private-Shared Subspaces Separation. IEEE Transactions on Cybernetics 2022; 52:3261-3275. [PMID: 32780706] [DOI: 10.1109/tcyb.2020.3009004]
Abstract
Due to the inconsistent distributions and representations of different modalities (e.g., images and texts), it is very challenging to correlate such heterogeneous data. A standard solution is to construct one common subspace, where the common representations of different modalities are generated to bridge the heterogeneity gap. Existing methods based on common representation learning mostly adopt a less effective two-stage paradigm: first, generating separate representations for each modality by exploiting the modality-specific properties as the complementary information, and then capturing the cross-modal correlation in the separate representations for common representation learning. Moreover, these methods usually neglect that there may exist interference in the modality-specific properties, that is, the unrelated objects and background regions in images or the noisy words and incorrect sentences in the text. In this article, we hypothesize that explicitly modeling the interference within each modality can improve the quality of common representation learning. To this end, we propose a novel model private-shared subspaces separation (P3S) to explicitly learn different representations that are partitioned into two kinds of subspaces: 1) the common representations that capture the cross-modal correlation in a shared subspace and 2) the private representations that model the interference within each modality in two private subspaces. By employing the orthogonality constraints between the shared subspace and the private subspaces during the one-stage joint learning procedure, our model is able to learn more effective common representations for different modalities in the shared subspace by fully excluding the interference within each modality. Extensive experiments conducted on cross-modal retrieval verify the advantages of our P3S method compared with 15 state-of-the-art methods on four widely used cross-modal datasets.
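The orthogonality constraint between the shared and private subspaces described above can be sketched as a squared-Frobenius penalty on the cross-covariance of the two sets of representations; the exact batch-wise form below is an assumption for illustration, not P3S's precise objective.

import torch

def orthogonality_penalty(shared, private):
    # shared/private: (B, d) representations of the same batch from the two subspaces.
    # Driving shared^T @ private toward zero discourages the shared representation
    # from encoding the modality-specific interference captured by the private one.
    return (shared.t() @ private).pow(2).sum()

shared = torch.randn(32, 128, requires_grad=True)
private = torch.randn(32, 128, requires_grad=True)
orthogonality_penalty(shared, private).backward()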
22
Learning ordinal constraint binary codes for fast similarity search. Inf Process Manag 2022. [DOI: 10.1016/j.ipm.2022.102919]
23
Object-Level Visual-Text Correlation Graph Hashing for Unsupervised Cross-Modal Retrieval. Sensors 2022; 22:2921. [PMID: 35458906] [PMCID: PMC9029824] [DOI: 10.3390/s22082921]
Abstract
The core of cross-modal hashing methods is to map high-dimensional features into binary hash codes, which can then efficiently exploit the Hamming distance metric to enhance retrieval efficiency. Recent work emphasizes the advantages of the unsupervised cross-modal hashing technique, since it only relies on the relevant information of paired data, making it more applicable to real-world applications. However, two problems, namely intra-modality correlation and inter-modality correlation, have still not been fully considered. Intra-modality correlation describes the complex overall concept of a single modality and provides semantic relevance for retrieval tasks, while inter-modality correlation refers to the relationship between different modalities. From our observation and hypothesis, the dependency relationships within a modality and between different modalities can be constructed at the object level, which can further improve cross-modal hashing retrieval accuracy. To this end, we propose an Object-level Visual-text Correlation Graph Hashing (OVCGH) approach to mine the fine-grained object-level similarity in cross-modal data while suppressing noise interference. Specifically, a novel intra-modality correlation graph is designed to learn graph-level representations of different modalities, obtaining region-to-region and tag-to-tag dependency relationships in an unsupervised manner. Then, we design a visual-text dependency building module that can capture correlated semantic information between different modalities by modeling the dependency relationship between image object regions and text tags. Extensive experiments on two widely used datasets verify the effectiveness of our proposed approach.
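The efficiency argument above rests on the Hamming distance between binary codes, which reduces to an XOR followed by a popcount on packed bits; a minimal sketch, independent of the proposed OVCGH model:

import numpy as np

def pack_codes(codes):
    # codes: (N, n_bits) array of 0/1 hash bits -> packed uint8 words.
    return np.packbits(codes.astype(np.uint8), axis=1)

def hamming_distances(query, database):
    # query: (n_bits,) 0/1 code, database: (N, n_bits) 0/1 codes.
    q = pack_codes(query[None, :])
    db = pack_codes(database)
    xor = np.bitwise_xor(db, q)                      # differing bit positions
    return np.unpackbits(xor, axis=1).sum(axis=1)    # popcount per database item

db = np.random.randint(0, 2, size=(1000, 64))
q = np.random.randint(0, 2, size=64)
ranking = np.argsort(hamming_distances(q, db))       # nearest codes first
print(ranking[:5])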
24
Wu J, Weng W, Fu J, Liu L, Hu B. Deep semantic hashing with dual attention for cross-modal retrieval. Neural Comput Appl 2022. [DOI: 10.1007/s00521-021-06696-y]
25
Xie L, Guo W, Wei H, Tang Y, Tao D. Efficient Unsupervised Dimension Reduction for Streaming Multiview Data. IEEE Transactions on Cybernetics 2022; 52:1772-1784. [PMID: 32525809] [DOI: 10.1109/tcyb.2020.2996684]
Abstract
Multiview learning has received substantial attention over the past decade due to its powerful capacity for integrating various types of information. Conventional unsupervised multiview dimension reduction (UMDR) methods are usually conducted in an offline manner and may fail in many real-world applications, where data arrive sequentially and the data distribution changes periodically. Moreover, their high memory consumption and expensive retraining time cost are difficult to afford in large-scale scenarios. To remedy these drawbacks, we propose an online UMDR (OUMDR) framework. OUMDR aims to seek a low-dimensional and informative consensus representation for streaming multiview data. View-specific weights are also learned in this article to reflect the contributions of different views to the final consensus representation. A specific model called OUMDR-E is developed by introducing the exclusive group LASSO (EG-LASSO) to explore the intraview and interview correlations. Then, we develop an efficient iterative algorithm with limited memory and time cost requirements for optimization, where the convergence of each update is theoretically guaranteed. We evaluate the proposed approach in video-based expression recognition applications. The experimental results demonstrate the superiority of our approach in terms of both effectiveness and efficiency.
26
Khan A, Hayat S, Ahmad M, Wen J, Farooq MU, Fang M, Jiang W. Cross-modal retrieval based on deep regularized hashing constraints. Int J Intell Syst 2022. [DOI: 10.1002/int.22853]
Affiliation(s)
- Asad Khan
- School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China
- Sakander Hayat
- School of Mathematics and Information Sciences, Guangzhou University, Guangzhou, China
- Muhammad Ahmad
- Department of Computer Science, National University of Computer and Emerging Sciences (NUCES-FAST), Faisalabad Campus, Chiniot, Pakistan
- Jinyu Wen
- School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China
- Muhammad Umar Farooq
- School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
- Meie Fang
- School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China
- Wenchao Jiang
- School of Computers, Guangdong University of Technology, Guangzhou, China
27
Ji Z, Wang H, Han J, Pang Y. SMAN: Stacked Multimodal Attention Network for Cross-Modal Image-Text Retrieval. IEEE Transactions on Cybernetics 2022; 52:1086-1097. [PMID: 32386178] [DOI: 10.1109/tcyb.2020.2985716]
Abstract
This article focuses on the task of cross-modal image-text retrieval, which has been an interdisciplinary topic in both the computer vision and natural language processing communities. Existing global representation alignment-based methods fail to pinpoint the semantically meaningful portions of images and texts, while local representation alignment schemes suffer from the huge computational burden of exhaustively aggregating the similarity of visual fragments and textual words. In this article, we propose a stacked multimodal attention network (SMAN) that makes use of the stacked multimodal attention mechanism to exploit the fine-grained interdependencies between image and text, thereby mapping the aggregation of attentive fragments into a common space for measuring cross-modal similarity. Specifically, we sequentially employ intramodal information and multimodal information as guidance to perform multiple-step attention reasoning so that the fine-grained correlation between image and text can be modeled. As a consequence, we are capable of discovering the semantically meaningful visual regions or words in a sentence which contribute to measuring the cross-modal similarity in a more precise manner. Moreover, we present a novel bidirectional ranking loss that enforces the distance among pairwise multimodal instances to be closer. Doing so allows us to make full use of pairwise supervised information to preserve the manifold structure of heterogeneous pairwise data. Extensive experiments on two benchmark datasets demonstrate that our SMAN consistently yields competitive performance compared to state-of-the-art methods.
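A generic bidirectional (image-to-text and text-to-image) hinge ranking loss of the kind referred to above can be sketched as follows; the use of in-batch negatives and the margin value are assumptions rather than SMAN's exact formulation.

import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    # img_emb/txt_emb: (B, d) embeddings of matched image-text pairs.
    sim = F.normalize(img_emb, dim=1) @ F.normalize(txt_emb, dim=1).t()  # (B, B)
    pos = sim.diag().unsqueeze(1)                 # similarity of true pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    # Image-to-text: each row's off-diagonal entries are negatives, and vice versa.
    i2t = F.relu(margin + sim - pos).masked_fill(mask, 0).mean()
    t2i = F.relu(margin + sim.t() - pos).masked_fill(mask, 0).mean()
    return i2t + t2i

loss = bidirectional_ranking_loss(torch.randn(8, 256, requires_grad=True), torch.randn(8, 256))
loss.backward()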
28
Zou X, Wu S, Bakker EM, Wang X. Multi-label enhancement based self-supervised deep cross-modal hashing. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.09.053]
29
Zhu J, Shu Y, Zhang J, Wang X, Wu S. Triplet-object loss for large scale deep image retrieval. Int J Mach Learn Cyb 2022. [DOI: 10.1007/s13042-021-01330-8]
30
Xiang X, Zhang Y, Jin L, Li Z, Tang J. Sub-Region Localized Hashing for Fine-Grained Image Retrieval. IEEE Transactions on Image Processing 2021; 31:314-326. [PMID: 34871171] [DOI: 10.1109/tip.2021.3131042]
Abstract
Fine-grained image hashing is challenging due to the difficulties of capturing discriminative local information to generate hash codes. On the one hand, existing methods usually extract local features with the dense attention mechanism by focusing on dense local regions, which cannot contain diverse local information for fine-grained hashing. On the other hand, hash codes of the same class suffer from large intra-class variation of fine-grained images. To address the above problems, this work proposes a novel sub-Region Localized Hashing (sRLH) to learn intra-class compact and inter-class separable hash codes that also contain diverse subtle local information for efficient fine-grained image retrieval. Specifically, to localize diverse local regions, a sub-region localization module is developed to learn discriminative local features by locating the peaks of non-overlapping sub-regions in the feature map. Different from localizing dense local regions, these peaks can guide the sub-region localization module to capture multifarious local discriminative information by paying close attention to dispersive local regions. To mitigate intra-class variations, hash codes of the same class are enforced to approach one common binary center. Meanwhile, Gram-Schmidt orthogonalization is performed on the binary centers to make the hash codes inter-class separable. Extensive experimental results on four widely used fine-grained image retrieval datasets demonstrate the superiority of sRLH over several state-of-the-art methods. The source code of sRLH will be released at https://github.com/ZhangYajie-NJUST/sRLH.git.
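The Gram-Schmidt step mentioned above simply orthogonalizes the per-class center vectors so that hash codes assigned to different classes stay separable; a plain numpy sketch follows (the final sign() binarization is an assumption).

import numpy as np

def gram_schmidt(centers):
    # centers: (C, n_bits) real-valued class centers with C <= n_bits.
    ortho = []
    for c in centers.astype(float):
        v = c.copy()
        for u in ortho:
            v -= np.dot(u, c) * u          # remove components along earlier centers
        ortho.append(v / np.linalg.norm(v))
    return np.stack(ortho)

centers = np.random.randn(10, 48)
ortho_centers = gram_schmidt(centers)
print(np.round(ortho_centers @ ortho_centers.T, 3))   # approximately the identity
binary_centers = np.sign(ortho_centers)               # assumed final binarization step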
31
Chen Y, Lu X. Deep Category-Level and Regularized Hashing With Global Semantic Similarity Learning. IEEE Transactions on Cybernetics 2021; 51:6240-6252. [PMID: 32112686] [DOI: 10.1109/tcyb.2020.2964993]
Abstract
The hashing technique has been extensively used in large-scale image retrieval applications due to its low storage and fast computing speed. Most existing deep hashing approaches cannot fully consider the global semantic similarity and category-level semantic information, which results in insufficient utilization of the global semantic similarity for hash code learning and loss of semantic information in the hash codes. To tackle these issues, we propose a novel deep hashing approach with triplet labels, namely, deep category-level and regularized hashing (DCRH), to leverage the global semantic similarity of deep features and category-level semantic information to enhance the semantic similarity of hash codes. There are four contributions in this article. First, we design a novel global semantic similarity constraint on the deep features to make the anchor deep feature more similar to the positive deep feature than to the negative deep feature. Second, we leverage label information to enhance category-level semantics of hash codes for hash code learning. Third, we develop a new triplet construction module to select good image triplets for effective hash function learning. Finally, we propose a new triplet regularized loss (Reg-L) term, which can force binary-like codes to approximate binary codes and eventually minimize the information loss between binary-like codes and binary codes. Extensive experimental results on three image retrieval benchmark datasets show that the proposed DCRH approach achieves superior performance over other state-of-the-art hashing approaches.
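The "binary-like codes approximate binary codes" idea boils down to a quantization regularizer that penalizes how far each relaxed code entry is from ±1; the L2 form below is one common choice and is only assumed to resemble the paper's Reg-L term.

import torch

def quantization_regularizer(relaxed_codes):
    # relaxed_codes: (B, n_bits) tanh-activated outputs in (-1, 1).
    # Each entry is pushed toward -1 or +1, so the test-time sign() step
    # loses as little information as possible.
    return ((relaxed_codes.abs() - 1.0) ** 2).mean()

codes = torch.tanh(torch.randn(16, 64, requires_grad=True))
quantization_regularizer(codes).backward()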
32
Feng H, Wang N, Tang J. Deep Weibull hashing with maximum mean discrepancy quantization for image retrieval. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.08.090]
33
Paul S, Mithun NC, Roy-Chowdhury AK. Text-Based Localization of Moments in a Video Corpus. IEEE Transactions on Image Processing 2021; 30:8886-8899. [PMID: 34665727] [DOI: 10.1109/tip.2021.3120038]
Abstract
Prior works on text-based video moment localization focus on temporally grounding the textual query in an untrimmed video. These works assume that the relevant video is already known and attempt to localize the moment on that relevant video only. Different from such works, we relax this assumption and address the task of localizing moments in a corpus of videos for a given sentence query. This task poses a unique challenge as the system is required to perform: 1) retrieval of the relevant video, where only a segment of the video corresponds with the queried sentence, and 2) temporal localization of the moment in the relevant video based on the sentence query. Towards overcoming this challenge, we propose the Hierarchical Moment Alignment Network (HMAN), which learns an effective joint embedding space for moments and sentences. In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries. Qualitative and quantitative results on three benchmark text-based video moment retrieval datasets - Charades-STA, DiDeMo, and ActivityNet Captions - demonstrate that our method achieves promising performance on the proposed task of temporal localization of moments in a corpus of videos.
34
Tian X, Ng WWY, Wang H. Concept Preserving Hashing for Semantic Image Retrieval With Concept Drift. IEEE Transactions on Cybernetics 2021; 51:5184-5197. [PMID: 31841431] [DOI: 10.1109/tcyb.2019.2955130]
Abstract
Current hashing-based image retrieval methods mostly assume that the database of images is static. However, this assumption is not true in cases where the databases are constantly updated (e.g., on the Internet) and there exists the problem of concept drift. The online (also known as incremental) hashing methods have been proposed recently for image retrieval where the database is not static. However, they have not considered the concept drift problem. Moreover, they update hash functions dynamically by generating new hash codes for all accumulated data over time which is clearly uneconomical. In order to solve these two problems, concept preserving hashing (CPH) is proposed. In contrast to the existing methods, CPH preserves the original concept, that is, the set of hash codes representing a concept is preserved over time, by learning a new set of hash functions to yield the same set of hash codes for images (old and new) of a concept. The objective function of CPH learning consists of three components: 1) isomorphic similarity; 2) hash codes partition balancing; and 3) heterogeneous similarity fitness. The experimental results on 11 concept drift scenarios show that CPH yields better retrieval precisions than the existing methods and does not need to update hash codes of previously stored images.
35
Hu P, Peng X, Zhu H, Lin J, Zhen L, Peng D. Joint Versus Independent Multiview Hashing for Cross-View Retrieval. IEEE Transactions on Cybernetics 2021; 51:4982-4993. [PMID: 33119532] [DOI: 10.1109/tcyb.2020.3027614]
Abstract
Thanks to the low storage cost and high query speed, cross-view hashing (CVH) has been successfully used for similarity search in multimedia retrieval. However, most existing CVH methods use all views to learn a common Hamming space, thus making it difficult to handle the data with increasing views or a large number of views. To overcome these difficulties, we propose a decoupled CVH network (DCHN) approach which consists of a semantic hashing autoencoder module (SHAM) and multiple multiview hashing networks (MHNs). To be specific, SHAM adopts a hashing encoder and decoder to learn a discriminative Hamming space using either a few labels or the number of classes, that is, the so-called flexible inputs. After that, MHN independently projects all samples into the discriminative Hamming space that is treated as an alternative ground truth. In brief, the Hamming space is learned from the semantic space induced from the flexible inputs, which is further used to guide view-specific hashing in an independent fashion. Thanks to such an independent/decoupled paradigm, our method could enjoy high computational efficiency and the capacity of handling the increasing number of views by only using a few labels or the number of classes. For a newly coming view, we only need to add a view-specific network into our model and avoid retraining the entire model using the new and previous views. Extensive experiments are carried out on five widely used multiview databases compared with 15 state-of-the-art approaches. The results show that the proposed independent hashing paradigm is superior to the common joint ones while enjoying high efficiency and the capacity of handling newly coming views.
36
37
Yang Z, Yang L, Huang W, Sun L, Long J. Enhanced Deep Discrete Hashing with semantic-visual similarity for image retrieval. Inf Process Manag 2021. [DOI: 10.1016/j.ipm.2021.102648]
38
Qin Q, Huang L, Wei Z, Nie J, Xie K, Hou J. Unsupervised Deep Quadruplet Hashing with Isometric Quantization for image retrieval. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.03.006]
39
40
Li M, Li Q, Tang L, Peng S, Ma Y, Yang D. Deep Unsupervised Hashing for Large-Scale Cross-Modal Retrieval Using Knowledge Distillation Model. Computational Intelligence and Neuroscience 2021; 2021:5107034. [PMID: 34326867] [PMCID: PMC8310450] [DOI: 10.1155/2021/5107034]
Abstract
Cross-modal hashing encodes heterogeneous multimedia data into compact binary code to achieve fast and flexible retrieval across different modalities. Due to its low storage cost and high retrieval efficiency, it has received widespread attention. Supervised deep hashing significantly improves search performance and usually yields more accurate results, but requires a lot of manual annotation of the data. In contrast, unsupervised deep hashing is difficult to achieve satisfactory performance due to the lack of reliable supervisory information. To solve this problem, inspired by knowledge distillation, we propose a novel unsupervised knowledge distillation cross-modal hashing method based on semantic alignment (SAKDH), which can reconstruct the similarity matrix using the hidden correlation information of the pretrained unsupervised teacher model, and the reconstructed similarity matrix can be used to guide the supervised student model. Specifically, firstly, the teacher model adopted an unsupervised semantic alignment hashing method, which can construct a modal fusion similarity matrix. Secondly, under the supervision of teacher model distillation information, the student model can generate more discriminative hash codes. Experimental results on two extensive benchmark datasets (MIRFLICKR-25K and NUS-WIDE) show that compared to several representative unsupervised cross-modal hashing methods, the mean average precision (MAP) of our proposed method has achieved a significant improvement. It fully reflects its effectiveness in large-scale cross-modal data retrieval.
Collapse
Affiliation(s)
- Mingyong Li
- College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
- Qiqi Li
- College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
- Lirong Tang
- College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
- Shuang Peng
- College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
- Yan Ma
- College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
- Degang Yang
- College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
Collapse
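As a rough illustration of the distillation pipeline summarized in the SAKDH entry above, the following Python sketch fuses teacher-side modality similarities into one matrix and uses it as the target for cross-modal student hash codes. The feature dimensions, the tanh relaxation, and the 0.5/0.5 fusion weights are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_bits = 100, 32
img_feat = rng.normal(size=(n, 512))     # hypothetical pretrained teacher image features
txt_feat = rng.normal(size=(n, 300))     # hypothetical pretrained teacher text features

def cosine_sim(x):
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

# Teacher side: fuse the modality-specific similarities into one distillation target.
s_fused = 0.5 * cosine_sim(img_feat) + 0.5 * cosine_sim(txt_feat)

# Student side: relaxed (tanh) hash codes standing in for the student networks' outputs.
b_img = np.tanh(rng.normal(size=(n, n_bits)))
b_txt = np.tanh(rng.normal(size=(n, n_bits)))

# Distillation-style objective: cross-modal code agreement should follow the fused matrix.
code_sim = (b_img @ b_txt.T) / n_bits
distillation_loss = np.mean((code_sim - s_fused) ** 2)
print(f"distillation loss: {distillation_loss:.4f}")
```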
|
41
|
Quadruplet-Based Deep Cross-Modal Hashing. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2021; 2021:9968716. [PMID: 34306059 PMCID: PMC8270718 DOI: 10.1155/2021/9968716] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/18/2021] [Revised: 05/24/2021] [Accepted: 06/14/2021] [Indexed: 12/02/2022]
Abstract
Recently, benefiting from the storage and retrieval efficiency of hashing and the powerful discriminative feature extraction capability of deep neural networks, deep cross-modal hashing retrieval has drawn increasing attention. To preserve the semantic similarities of cross-modal instances during the hash mapping procedure, most existing deep cross-modal hashing methods learn deep hashing networks with a pairwise loss or a triplet loss. However, these methods may not fully explore the similarity relations across modalities. To address this problem, in this paper we introduce a quadruplet loss into deep cross-modal hashing and propose a quadruplet-based deep cross-modal hashing (QDCMH) method. Extensive experiments on two benchmark cross-modal retrieval datasets show that our proposed method achieves state-of-the-art performance and demonstrate the effectiveness of the quadruplet loss in cross-modal hashing.
Collapse
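For readers unfamiliar with quadruplet losses, the PyTorch-style sketch below shows one common quadruplet formulation applied to relaxed cross-modal hash codes: the usual triplet term plus a second margin term that pushes apart two negatives that do not share the anchor. The margins and the way negatives are sampled here are assumed values; the exact loss used in QDCMH may differ.

```python
import torch

def quadruplet_loss(anchor, positive, negative1, negative2, m1=2.0, m2=1.0):
    """anchor from one modality; positive and both negatives from the other modality."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)
    d_an = (anchor - negative1).pow(2).sum(dim=1)
    d_nn = (negative1 - negative2).pow(2).sum(dim=1)
    # First term: the classic triplet constraint; second term: also separate the two
    # negatives from each other, which is what the extra (fourth) sample buys.
    return (torch.clamp(d_ap - d_an + m1, min=0.0) +
            torch.clamp(d_ap - d_nn + m2, min=0.0)).mean()

img_codes = torch.tanh(torch.randn(8, 32))   # relaxed image hash codes (toy data)
txt_codes = torch.tanh(torch.randn(8, 32))   # relaxed text hash codes (toy data)
loss = quadruplet_loss(img_codes, txt_codes, txt_codes.roll(1, 0), txt_codes.roll(2, 0))
print(loss.item())
```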
|
43
|
Visible-infrared cross-modality person re-identification based on whole-individual training. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.01.073] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
44
|
Chen S, Wu S, Wang L. Hierarchical semantic interaction-based deep hashing network for cross-modal retrieval. PeerJ Comput Sci 2021; 7:e552. [PMID: 34141884 PMCID: PMC8176532 DOI: 10.7717/peerj-cs.552] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2021] [Accepted: 04/28/2021] [Indexed: 06/12/2023]
Abstract
Due to the high efficiency of hashing technology and the high abstraction capability of deep networks, deep hashing has achieved appealing effectiveness and efficiency for large-scale cross-modal retrieval. However, efficiently measuring the similarity of fine-grained multi-label annotations for multi-modal data and thoroughly exploiting the layer-specific information of intermediate network layers remain two challenges for high-performance cross-modal hashing retrieval. Thus, in this paper, we propose a novel Hierarchical Semantic Interaction-based Deep Hashing Network (HSIDHN) for large-scale cross-modal retrieval. In the proposed HSIDHN, multi-scale and fusion operations are first applied to each layer of the network. A Bidirectional Bi-linear Interaction (BBI) policy is then designed to achieve hierarchical semantic interaction among different layers, enhancing the capability of the hash representations. Moreover, a dual-similarity measurement ("hard" similarity and "soft" similarity) is designed to calculate the semantic similarity of data from different modalities, aiming to better preserve the semantic correlation of multi-labels. Extensive experimental results on two large-scale public datasets show that the performance of our HSIDHN is competitive with state-of-the-art deep cross-modal hashing methods.
Collapse
Affiliation(s)
- Shubai Chen
- College of Computer and Information Science, Southwest University, Chongqing, People’s Republic of China
- Song Wu
- College of Computer and Information Science, Southwest University, Chongqing, People’s Republic of China
- Li Wang
- College of Electronic and Information Engineering, Southwest University, Chongqing, People’s Republic of China
Collapse
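The dual-similarity idea mentioned in the HSIDHN abstract above can be illustrated with a small Python sketch: a binary "hard" similarity that only asks whether two multi-label instances share any label, and a graded "soft" similarity over the label vectors. The cosine form used for the soft similarity is an assumption for illustration; the paper's exact definition may differ.

```python
import numpy as np

def hard_similarity(l1, l2):
    # 1.0 if the two instances share at least one label, otherwise 0.0.
    return float(np.any(np.logical_and(l1 > 0, l2 > 0)))

def soft_similarity(l1, l2):
    # Graded similarity: cosine similarity between the binary multi-label vectors.
    denom = np.linalg.norm(l1) * np.linalg.norm(l2)
    return float(l1 @ l2 / denom) if denom > 0 else 0.0

a = np.array([1, 0, 1, 1, 0], dtype=float)   # hypothetical multi-label annotations
b = np.array([0, 0, 1, 1, 1], dtype=float)
print(hard_similarity(a, b), round(soft_similarity(a, b), 3))
```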
|
45
|
Yang Z, Yang L, Raymond OI, Zhu L, Huang W, Liao Z, Long J. NSDH: A Nonlinear Supervised Discrete Hashing framework for large-scale cross-modal retrieval. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.106818] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
46
|
Fang Y, Li B, Li X, Ren Y. Unsupervised cross-modal similarity via Latent Structure Discrete Hashing Factorization. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.106857] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023]
|
47
|
Zheng D, Fan J, Han M. Hybrid Regularization of Diffusion Process for Visual Re-Ranking. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:3705-3719. [PMID: 33705317 DOI: 10.1109/tip.2021.3064265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
To improve retrieval results obtained from a pairwise dissimilarity, many variants of the diffusion process have been applied to visual re-ranking. In the diffusion-process framework, various contextual similarities can be obtained by solving an optimization problem whose objective function consists of a smoothness constraint and a fitting constraint. Many improvements to the smoothness constraint have been made to reveal the underlying manifold structure. However, little attention has been paid to the fitting constraint, and how to build an effective fitting constraint remains unclear. In this article, by deeply analyzing the role of the fitting constraint, we first propose a novel variant of the diffusion process named Hybrid Regularization of Diffusion Process (HyRDP). In HyRDP, we introduce a hybrid regularization framework containing a two-part fitting constraint, and the contextual dissimilarities can be learned from either a closed-form solution or an iterative solution. Furthermore, this article shows that the basic idea of HyRDP is closely related to the mechanism behind Generalized Mean First-passage Time (GMFPT). GMFPT denotes the mean number of time steps for the transition from one state to any state in a given state set, and it is introduced here as a contextual dissimilarity for the first time. Finally, based on the semi-supervised learning framework, an iterative re-ranking process is developed. With this approach, the relevant objects on the manifold can be iteratively retrieved and labeled within a finite number of iterations. The proposed algorithms are validated on various challenging databases, and the experimental results demonstrate that retrieval results obtained from different types of measures can be effectively improved by our methods.
Collapse
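As background for the entry above, the sketch below implements the standard diffusion-process re-ranking update f <- alpha*S*f + (1-alpha)*y on a symmetrically normalized affinity matrix, i.e., the generic smoothness-plus-fitting formulation. HyRDP's hybrid two-part fitting constraint and the GMFPT analysis are not reproduced here; the Gaussian affinity and k-NN sparsification are common choices assumed for illustration.

```python
import numpy as np

def diffusion_rerank(dist, query_idx, alpha=0.9, iters=50, k=10):
    n = dist.shape[0]
    sigma = np.median(dist) + 1e-12
    affinity = np.exp(-dist ** 2 / (2 * sigma ** 2))           # Gaussian affinity
    # Keep only the k nearest neighbours to respect the local manifold structure.
    mask = np.zeros_like(affinity)
    for i in range(n):
        mask[i, np.argsort(dist[i])[:k]] = 1.0
    w = affinity * np.maximum(mask, mask.T)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(w.sum(axis=1) + 1e-12))
    s = d_inv_sqrt @ w @ d_inv_sqrt                             # normalized affinity S
    y = np.zeros(n)
    y[query_idx] = 1.0                                          # fitting-constraint target
    f = y.copy()
    for _ in range(iters):                                      # f <- alpha*S*f + (1-alpha)*y
        f = alpha * (s @ f) + (1 - alpha) * y
    return np.argsort(-f)                                       # re-ranked indices

rng = np.random.default_rng(0)
dist = np.abs(rng.normal(size=(40, 40)))
dist = (dist + dist.T) / 2.0
np.fill_diagonal(dist, 0.0)
print(diffusion_rerank(dist, query_idx=0)[:5])
```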
|
48
|
Zheng Y, Jiang Z, Xie F, Shi J, Zhang H, Huai J, Cao M, Yang X. Diagnostic Regions Attention Network (DRA-Net) for Histopathology WSI Recommendation and Retrieval. IEEE TRANSACTIONS ON MEDICAL IMAGING 2021; 40:1090-1103. [PMID: 33351756 DOI: 10.1109/tmi.2020.3046636] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
The development of whole-slide imaging techniques and online digital pathology platforms has accelerated the popularization of telepathology for remote tumor diagnosis. During a diagnosis, the pathologist's behavior can be recorded by the platform and then archived with the digital case. The pathologist's browsing path on the whole-slide image (WSI) is one of the most valuable pieces of information in the digital database, because the image content along the path is expected to be highly correlated with the pathologist's diagnosis report. In this article, we propose a novel approach for computer-assisted cancer diagnosis named session-based histopathology image recommendation (SHIR), based on the browsing paths on WSIs. To achieve SHIR, we develop a novel diagnostic regions attention network (DRA-Net) to learn pathology knowledge from the image content associated with the browsing paths. The DRA-Net does not rely on pixel-level or region-level annotations from pathologists. All training data can be collected automatically by the digital pathology platform without interrupting the pathologists' diagnoses. The proposed approach was evaluated on a gastric dataset containing 983 cases across 5 categories of gastric lesions. Quantitative and qualitative assessments on the dataset demonstrate that the proposed SHIR framework with the novel DRA-Net is effective in recommending diagnostically relevant cases for auxiliary diagnosis. The mean reciprocal rank (MRR) and mean average precision (MAP) for the recommendation are 0.816 and 0.836, respectively, on the gastric dataset. The source code of the DRA-Net is available at https://github.com/zhengyushan/dpathnet.
Collapse
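For readers who want to see how numbers like the MRR of 0.816 and MAP of 0.836 quoted above are typically computed for ranked recommendation lists, here is a sketch of the standard metric definitions. This is generic evaluation code, not code from the DRA-Net repository.

```python
import numpy as np

def mean_reciprocal_rank(ranked_relevance):
    """ranked_relevance: one 0/1 array per query, ordered by the system's ranking."""
    rr = []
    for rel in ranked_relevance:
        hits = np.flatnonzero(rel)
        rr.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(rr))

def mean_average_precision(ranked_relevance):
    aps = []
    for rel in ranked_relevance:
        hits = np.flatnonzero(rel)
        if hits.size == 0:
            aps.append(0.0)
            continue
        # Precision evaluated at each relevant position, then averaged per query.
        aps.append(float(np.mean([(i + 1) / (rank + 1) for i, rank in enumerate(hits)])))
    return float(np.mean(aps))

toy = [np.array([0, 1, 1, 0]), np.array([1, 0, 0, 0])]   # toy relevance judgments
print(mean_reciprocal_rank(toy), mean_average_precision(toy))
```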
|
49
|
Qi M, Qin J, Yang Y, Wang Y, Luo J. Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:2989-3004. [PMID: 33560984 DOI: 10.1109/tip.2020.3048680] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
With the current exponential growth of video-based social networks, video retrieval using natural language is receiving ever-increasing attention. Most existing approaches tackle this task by extracting individual frame-level spatial features to represent the whole video, while ignoring visual pattern consistencies and intrinsic temporal relationships across different frames. Furthermore, the semantic correspondence between natural language queries and person-centric actions in videos has not been fully explored. To address these problems, we propose a novel binary representation learning framework, named Semantics-aware Spatial-temporal Binaries (S²Bin), which simultaneously considers spatial-temporal context and semantic relationships for cross-modal video retrieval. By exploiting the semantic relationships between the two modalities, S²Bin can efficiently and effectively generate binary codes for both videos and texts. In addition, we adopt an iterative optimization scheme to learn deep encoding functions with attribute-guided stochastic training. We evaluate our model on three video datasets, and the experimental results demonstrate that S²Bin outperforms state-of-the-art methods on various cross-modal video retrieval tasks.
Collapse
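To illustrate why binary codes make cross-modal video retrieval efficient, here is a small Python sketch of Hamming-distance ranking between text-query codes and video codes. The shared random projection standing in for the learned encoding functions is an assumption; none of this models S²Bin's spatial-temporal or semantic components.

```python
import numpy as np

rng = np.random.default_rng(0)
n_videos, n_queries, dim, n_bits = 50, 3, 256, 64
video_emb = rng.normal(size=(n_videos, dim))    # hypothetical video embeddings
text_emb = rng.normal(size=(n_queries, dim))    # hypothetical text-query embeddings

# Stand-in encoding function: a shared random projection followed by thresholding.
proj = rng.normal(size=(dim, n_bits))
video_codes = (video_emb @ proj > 0).astype(np.uint8)
text_codes = (text_emb @ proj > 0).astype(np.uint8)

# Hamming distance via XOR, then rank all videos for each text query.
hamming = (text_codes[:, None, :] ^ video_codes[None, :, :]).sum(axis=2)
rankings = np.argsort(hamming, axis=1)
print(rankings[:, :5])    # top-5 video indices per query
```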
|
50
|
Chen W, Wang W, Liu L, Lew MS. New Ideas and Trends in Deep Multimodal Content Understanding: A Review. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.10.042] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|