1
Shen X, Chen Y, Liu W, Zheng Y, Sun QS, Pan S. Graph Convolutional Multi-Label Hashing for Cross-Modal Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:7997-8009. [PMID: 39028597 DOI: 10.1109/tnnls.2024.3421583]
Abstract
Cross-modal hashing encodes different modalities of multimodal data into a low-dimensional Hamming space for fast cross-modal retrieval. In multi-label cross-modal retrieval, multimodal data are often annotated with multiple labels, and some labels, e.g., "ocean" and "cloud," often co-occur. However, existing cross-modal hashing methods overlook this label dependency, which is crucial for improving performance. To fill this gap, this article proposes graph convolutional multi-label hashing (GCMLH) for effective multi-label cross-modal retrieval. Specifically, GCMLH first generates a word embedding for each label and develops a label encoder that learns highly correlated label embeddings via a graph convolutional network (GCN). In addition, GCMLH develops a feature encoder for each modality and a feature fusion module that generates highly semantic features via a GCN. GCMLH uses a teacher-student learning scheme to transfer knowledge from the teacher modules, i.e., the label encoder and feature fusion module, to the student module, i.e., the feature encoder, so that the learned hash codes exploit both multi-label dependency and multimodal semantic structure. Extensive empirical results on several benchmarks demonstrate the superiority of the proposed method over existing state-of-the-art methods.
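A minimal sketch of the label-encoder idea described above: label word embeddings are propagated over a normalized label co-occurrence graph by a two-layer GCN to produce correlated, hash-like label embeddings. The dimensions, the toy co-occurrence matrix, and the tanh relaxation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelGCN(nn.Module):
    def __init__(self, in_dim=300, hid_dim=512, code_len=64):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, code_len, bias=False)

    def forward(self, x, adj):
        # x: (num_labels, in_dim) word embeddings; adj: normalized co-occurrence graph
        x = F.relu(adj @ self.w1(x))
        return torch.tanh(adj @ self.w2(x))   # relaxed, hash-like label embeddings

def normalize_adj(cooc):
    # symmetric normalization D^{-1/2} (A + I) D^{-1/2}
    a = cooc + torch.eye(cooc.size(0))
    d = a.sum(1).pow(-0.5)
    return d.unsqueeze(1) * a * d.unsqueeze(0)

# toy co-occurrence counts for 3 labels, e.g., ("ocean", "cloud") co-occurring often
cooc = torch.tensor([[0., 5., 1.], [5., 0., 0.], [1., 0., 0.]])
emb = LabelGCN()(torch.randn(3, 300), normalize_adj(cooc))
print(emb.shape)  # torch.Size([3, 64])
```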
2
Ge J, Liu Z, Li P, Xie L, Zhang Y, Tian Q, Xie H. Denoised and Dynamic Alignment Enhancement for Zero-Shot Learning. IEEE Transactions on Image Processing 2025; 34:1501-1515. [PMID: 40031700 DOI: 10.1109/tip.2025.3544481]
Abstract
Zero-shot learning (ZSL) focuses on recognizing unseen categories by aligning visual features with semantic information. Recent advancements have shown that aligning each attribute with its corresponding visual region significantly improves zero-shot learning performance. However, the crude semantic proxies used in these methods fail to capture the varied appearances of each attribute, and are also easily confused by the presence of semantically redundant backgrounds, leading to suboptimal alignment. To combat these issues, we introduce a novel Alignment-Enhanced Network (AENet), designed to denoise the visual features and dynamically perceive semantic information, thus enhancing visual-semantic alignment. Our approach comprises two key innovations. (1) A visual denoising encoder, employing a class-agnostic mask to filter out semantically redundant visual information, thus producing refined visual features adaptable to unseen classes. (2) A dynamic semantic generator that crafts content-aware semantic proxies adaptively, steered by visual features, enabling AENet to discriminate fine-grained variations in visual contents. Additionally, we integrate a cross-fusion module to ensure comprehensive interaction between the denoised visual features and the generated dynamic semantic proxies, further facilitating visual-semantic alignment. Extensive experiments on three datasets demonstrate that the proposed method narrows the visual-semantic gap and sets a new benchmark in this setting.
3
Jin L, Li Z, Pan Y, Tang J. Relational Consistency Induced Self-Supervised Hashing for Image Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:1482-1494. [PMID: 37995167 DOI: 10.1109/tnnls.2023.3333294]
Abstract
This article proposes a new hashing framework named relational consistency induced self-supervised hashing (RCSH) for large-scale image retrieval. To capture the potential semantic structure of data, RCSH explores the relational consistency between data samples in different spaces, which learns reliable data relationships in the latent feature space and then preserves the learned relationships in the Hamming space. The data relationships are uncovered by learning a set of prototypes that group similar data samples in the latent feature space. By uncovering the semantic structure of the data, meaningful data-to-prototype and data-to-data relationships are jointly constructed. The data-to-prototype relationships are captured by constraining the prototype assignments generated from different augmented views of an image to be the same. Meanwhile, these data-to-prototype relationships are preserved to learn informative compact hash codes by matching them with these reliable prototypes. To accomplish this, a novel dual prototype contrastive loss is proposed to maximize the agreement of prototype assignments in the latent feature space and Hamming space. The data-to-data relationships are captured by enforcing the distribution of pairwise similarities in the latent feature space and Hamming space to be consistent, which makes the learned hash codes preserve meaningful similarity relationships. Extensive experimental results on four widely used image retrieval datasets demonstrate that the proposed method significantly outperforms the state-of-the-art methods. In addition, the proposed method achieves promising performance in out-of-domain retrieval tasks, demonstrating its good generalization ability. The source code and models are available at https://github.com/IMAG-LuJin/RCSH.
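A rough sketch of the dual prototype contrastive idea as described: soft prototype assignments computed from one augmented view serve as stopped-gradient targets for the other view's assignments, in both the latent space and the tanh-relaxed Hamming space. Using a single prototype set with a shared dimensionality for both spaces is a simplification; all names and sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def proto_assign(z, protos, tau=0.1):
    # soft assignment of L2-normalized features to L2-normalized prototypes
    z = F.normalize(z, dim=1)
    p = F.normalize(protos, dim=1)
    return F.softmax(z @ p.t() / tau, dim=1)

def soft_ce(pred, target):
    # cross-entropy against a soft target distribution
    return -(target * pred.clamp_min(1e-8).log()).sum(1).mean()

def dual_proto_loss(z1, z2, b1, prototypes, tau=0.1):
    # the assignment from the second view is the (detached) target
    target = proto_assign(z2, prototypes, tau).detach()
    loss_feat = soft_ce(proto_assign(z1, prototypes, tau), target)               # latent space
    loss_hash = soft_ce(proto_assign(torch.tanh(b1), prototypes, tau), target)   # relaxed Hamming space
    return loss_feat + loss_hash

protos = torch.randn(50, 64, requires_grad=True)   # 50 learnable prototypes
z1, z2 = torch.randn(8, 64), torch.randn(8, 64)    # latent features of two augmented views
b1 = torch.randn(8, 64, requires_grad=True)        # pre-binarization hash logits
dual_proto_loss(z1, z2, b1, protos).backward()
```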
4
Liu Y, Liu H, Wang H, Meng F, Liu M. BCAN: Bidirectional Correct Attention Network for Cross-Modal Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:14247-14258. [PMID: 37256811 DOI: 10.1109/tnnls.2023.3276796]
Abstract
As a fundamental topic in bridging the gap between vision and language, cross-modal retrieval aims to establish the correspondence between fragments, i.e., subregions in images and words in texts. Whereas earlier methods learn a visual-semantic embedding that maps whole images and sentences into a shared space, more recent methods learn the correspondences between words and regions via cross-modal attention. However, such attention-based approaches invariably suffer from semantic misalignment between subfragments, for two reasons: 1) without modeling the relationship between subfragments and the semantics of the entire image or sentence, it is hard to distinguish images or sentences that contain several identical semantic fragments; and 2) these approaches spread attention evenly over all subfragments, including nonvisual words and many redundant regions, which again causes semantic misalignment. To solve these problems, this article proposes a bidirectional correct attention network (BCAN), which introduces the notion of relevance between subfragments and the semantics of the entire image or sentence and designs a correct attention mechanism that models the local and global similarity between images and sentences to correct attention weights focused on the wrong fragments. Specifically, two independent units correct the attention weights. A global correct unit (GCU) incorporates the global image-sentence similarity into the attention mechanism to resolve the misalignment caused by focusing attention on relevant subfragments in irrelevant pairs (RI), while a local correct unit (LCU) considers the difference in attention weights between fragments across two steps to resolve the misalignment caused by focusing attention on irrelevant subfragments in relevant pairs (IR). Extensive experiments on large-scale MS-COCO and Flickr30K show that our proposed method outperforms all the attention-based methods and is competitive with the state-of-the-art. Our code and pretrained model are publicly available at: https://github.com/liuyyy111/BCAN.
5
Liu H, Zhou W, Zhang H, Li G, Zhang S, Li X. Bit Reduction for Locality-Sensitive Hashing. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:12470-12481. [PMID: 37037245 DOI: 10.1109/tnnls.2023.3263195]
Abstract
Locality-sensitive hashing (LSH) has gained ever-increasing popularity in similarity search for large-scale data. Its search performance is competitive when the number of generated hash bits is large, which in turn hampers its wide application. The first purpose of this work is to introduce a novel hash bit reduction schema for hashing techniques to derive shorter binary codes, a problem that has not yet received sufficient attention. To show how the reduction schema works, the second purpose is to present an effective bit reduction method for LSH under this schema. Specifically, after the hash bits are generated by LSH, they are placed into a bit pool as candidates. Mutual information and data labels are then exploited to measure the correlation and structural properties of the hash bits, respectively. Highly correlated and redundant hash bits can thus be identified and removed without greatly deteriorating performance. The advantages of our reduction method are that it not only reduces the number of hash bits effectively but also boosts the retrieval performance of LSH, making it more appealing and practical in real-world applications. Comprehensive experiments were conducted on three public real-world datasets. The results, compared with representative bit selection methods and state-of-the-art hashing algorithms, demonstrate that the proposed method achieves encouraging and competitive performance.
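The following is an illustrative sketch of the reduction schema, not the paper's implementation: candidate LSH bits go into a bit pool, and a greedy pass keeps the bits that are least redundant under mutual information. The paper additionally uses data labels to measure structural properties, which this sketch omits; seeding with the first bit and the greedy min-max-MI rule are assumptions.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))         # data points
W = rng.normal(size=(128, 64))           # random LSH hyperplanes
bits = (X @ W > 0).astype(int)           # 64 candidate hash bits in the pool

def select_bits(bits, keep=32):
    n = bits.shape[1]
    kept = [0]                           # seed with an arbitrary first bit
    while len(kept) < keep:
        best, best_mi = None, np.inf
        for j in range(n):
            if j in kept:
                continue
            # redundancy of candidate j = its highest MI with any kept bit
            mi = max(mutual_info_score(bits[:, j], bits[:, k]) for k in kept)
            if mi < best_mi:
                best, best_mi = j, mi
        kept.append(best)                # keep the least redundant candidate
    return sorted(kept)

compact = bits[:, select_bits(bits, keep=32)]   # shorter, low-redundancy codes
```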
6
Ji Z, An P, Liu X, Gao C, Pang Y, Shao L. Semantic-Aware Dynamic Generation Networks for Few-Shot Human-Object Interaction Recognition. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:12564-12575. [PMID: 37037250 DOI: 10.1109/tnnls.2023.3263660]
Abstract
Recognizing human-object interaction (HOI) aims at inferring various relationships between actions and objects. Although great progress in HOI has been made, the long-tail problem and combinatorial explosion problem are still practical challenges. To this end, we formulate HOI as a few-shot task to tackle both challenges and design a novel dynamic generation method to address this task. The proposed approach is called semantic-aware dynamic generation networks (SADG-Nets). Specifically, SADG-Net first assigns semantic-aware task representations for different batches of data, which further generate dynamic parameters. It adaptively obtains features that highlight intercategory discriminability and intracategory commonality. In addition, we also design a dual semantic-aware encoder module (DSAE-Module), that is, verb-aware and noun-aware branches, to yield both action and object prototypes of HOI for each task space, which generalizes to novel combinations by transferring similarities among interactions. Extensive experimental results on two benchmark datasets, that is, humans interacting with common objects (HICO)-FS and trento universal HOI (TUHOI)-FS, illustrate that our SADG-Net achieves superior performance over state-of-the-art approaches, demonstrating its effectiveness on few-shot HOI recognition.
7
Chen B, Deng W, Wang B, Zhang L. Confusion-Based Metric Learning for Regularizing Zero-Shot Image Retrieval and Clustering. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:1884-1897. [PMID: 35834457 DOI: 10.1109/tnnls.2022.3185668]
Abstract
Deep metric learning has become attractive for the zero-shot image retrieval and clustering (ZSRC) task, where a good embedding/metric is required so that unseen classes can be distinguished well. Most existing works deem this "good" embedding to be simply the discriminative one and race to devise powerful metric objectives or hard-sample mining strategies for learning discriminative deep metrics. In this article, however, we first emphasize that generalization ability is also a core ingredient of this "good" metric and that it largely determines metric performance in zero-shot settings. We then propose the confusion-based metric learning (CML) framework to explicitly optimize a robust metric. This is mainly achieved by introducing two regularization terms, the energy confusion (EC) and diversity confusion (DC) terms. These terms break away from the traditional deep metric learning idea of designing discriminative objectives and instead seek to "confuse" the learned model; they address local and global feature distribution confusion, respectively. We train these confusion terms together with a conventional deep metric objective in an adversarial manner. Although it may seem counterintuitive to "confuse" the model, we show that CML serves as an efficient regularization framework for deep metric learning and is applicable to various conventional metric methods. This article empirically demonstrates the importance of learning an embedding/metric with good generalization, achieving state-of-the-art performance on the popular CUB, CARS, Stanford Online Products, and In-Shop datasets for ZSRC tasks.
8
Peng SJ, He Y, Liu X, Cheung YM, Xu X, Cui Z. Relation-Aggregated Cross-Graph Correlation Learning for Fine-Grained Image-Text Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:2194-2207. [PMID: 35830398 DOI: 10.1109/tnnls.2022.3188569]
Abstract
Fine-grained image-text retrieval has been a hot research topic bridging vision and language, and its main challenge is learning the semantic correspondence across different modalities. Existing methods mainly focus on learning global semantic correspondence or intramodal relation correspondence in separate data representations, but rarely consider the intermodal relations that interactively provide complementary hints for fine-grained semantic correlation learning. To address this issue, we propose a relation-aggregated cross-graph (RACG) model that explicitly learns fine-grained semantic correspondence by aggregating both intramodal and intermodal relations, which can be utilized to guide the feature correspondence learning process. More specifically, we first build a semantic-embedded graph to explore both fine-grained objects and their relations in each media type, aiming not only to characterize object appearance in each modality but also to capture the intrinsic relation information that differentiates intramodal discrepancies. Then, a cross-graph relation encoder is designed to explore the intermodal relations across different modalities, which mutually boosts the cross-modal correlations to learn more precise intermodal dependencies. Besides, a feature reconstruction module and multihead similarity alignment are efficiently leveraged to optimize the node-level semantic correspondence, whereby relation-aggregated cross-modal embeddings between image and text are discriminatively obtained to benefit various image-text retrieval tasks with high retrieval performance. Extensive quantitative and qualitative experiments on benchmark datasets verify the advantages of the proposed framework for fine-grained image-text retrieval and show its competitive performance with the state of the art.
9
Zeng Y, Wang Y, Liao D, Li G, Huang W, Xu J, Cao D, Man H. Keyword-Based Diverse Image Retrieval With Variational Multiple Instance Graph. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:10528-10537. [PMID: 35482693 DOI: 10.1109/tnnls.2022.3168431]
Abstract
The task of cross-modal image retrieval has recently attracted considerable research attention. In real-world scenarios, keyword-based queries issued by users are usually short and have broad semantics. Therefore, semantic diversity is as important as retrieval accuracy in such user-oriented services, which improves user experience. However, most typical cross-modal image retrieval methods based on single-point query embedding inevitably result in low semantic diversity, while existing diverse retrieval approaches frequently lead to low accuracy due to a lack of cross-modal understanding. To address this challenge, we introduce an end-to-end solution termed variational multiple instance graph (VMIG), in which a continuous semantic space is learned to capture diverse query semantics, and the retrieval task is formulated as a multiple instance learning problem to connect diverse features across modalities. Specifically, a query-guided variational autoencoder is employed to model the continuous semantic space instead of learning a single-point embedding. Afterward, multiple instances of the image and query are obtained by sampling in the continuous semantic space and applying multihead attention, respectively. Thereafter, an instance graph is constructed to remove noisy instances and align cross-modal semantics. Finally, heterogeneous modalities are robustly fused under multiple losses. Extensive experiments on two real-world datasets verify the effectiveness of our proposed solution in both retrieval accuracy and semantic diversity.
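A condensed sketch of the query-guided variational autoencoder at the core of this design: the keyword query is encoded as a Gaussian over a continuous semantic space, and several instances are drawn by reparameterized sampling instead of a single point embedding. The dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class QueryVAE(nn.Module):
    def __init__(self, q_dim=300, z_dim=128):
        super().__init__()
        self.enc = nn.Linear(q_dim, 2 * z_dim)   # -> (mu, log_var)
        self.dec = nn.Linear(z_dim, q_dim)

    def forward(self, q, n_instances=5):
        mu, log_var = self.enc(q).chunk(2, dim=-1)
        std = (0.5 * log_var).exp()
        # draw several query instances instead of a single point embedding
        eps = torch.randn(n_instances, *mu.shape)
        z = mu + std * eps                        # (n_instances, batch, z_dim)
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1 - log_var).sum(-1).mean()
        return z, self.dec(z), kl                 # instances, reconstructions, KL term

z, recon, kl = QueryVAE()(torch.randn(4, 300))
print(z.shape)  # torch.Size([5, 4, 128]): 5 sampled instances per query
```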
10
Xie GS, Zhang XY, Xiang TZ, Zhao F, Zhang Z, Shao L, Li X. Leveraging Balanced Semantic Embedding for Generative Zero-Shot Learning. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:9575-9582. [PMID: 36269927 DOI: 10.1109/tnnls.2022.3208525]
Abstract
Generative (generalized) zero-shot learning [(G)ZSL] models aim to synthesize unseen class features by using only seen class feature and attribute pairs as training data. However, the generated fake unseen features tend to be dominated by the seen class features and are thus classified as seen classes, which can lead to inferior performance under zero-shot learning (ZSL) and unbalanced results under generalized ZSL (GZSL). To address this challenge, we tailor a novel balanced semantic embedding generative network (BSeGN), which incorporates balanced semantic embedding learning into generative learning scenarios in pursuit of unbiased GZSL. Specifically, we first design a feature-to-semantic embedding module (FEM) to distinguish real seen and fake unseen features collaboratively with the generator in an online manner. We introduce bidirectional contrastive and balance losses for FEM learning, which guarantee a balanced prediction for the interdomain features. In turn, the updated FEM can boost the learning of the generator. Next, we propose a multilevel feature integration module (mFIM) from the cycle-consistency branch of BSeGN, which can mitigate the domain bias through feature enhancement. To the best of our knowledge, this is the first work to explore embedding and generative learning jointly within the field of ZSL. Extensive evaluations on four benchmarks demonstrate the superiority of BSeGN over its state-of-the-art counterparts.
11
Hoang T, Do TT, Nguyen TV, Cheung NM. Multimodal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:6289-6302. [PMID: 34982698 DOI: 10.1109/tnnls.2021.3135420]
Abstract
In this article, we adopt the mutual information (MI) maximization approach to tackle the problem of unsupervised learning of binary hash codes for efficient cross-modal retrieval. We propose a novel method, dubbed cross-modal info-max hashing (CMIMH). First, to learn informative representations that preserve both intramodal and intermodal similarities, we leverage recent advances in estimating variational lower bounds of MI to maximize the MI between the binary representations and the input features, and between the binary representations of different modalities. By jointly maximizing these MIs under the assumption that the binary representations follow multivariate Bernoulli distributions, we can learn binary representations that preserve both intramodal and intermodal similarities effectively, in a mini-batch manner with gradient descent. Furthermore, we find that trying to minimize the modality gap by learning similar binary representations for the same instance from different modalities can result in less informative representations. Hence, balancing the reduction of the modality gap against the loss of modality-private information is important for cross-modal retrieval tasks. Quantitative evaluations on standard benchmark datasets demonstrate that the proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
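A minimal sketch of the Bernoulli modeling described above (an assumption-laden reading, not the released code): each modality encoder outputs Bernoulli bit probabilities, codes are sampled with a straight-through estimator so mini-batch gradient descent applies, and a simple cross-modal agreement term stands in for the variational MI objectives.

```python
import torch
import torch.nn as nn

class BernoulliHash(nn.Module):
    def __init__(self, in_dim, code_len=32):
        super().__init__()
        self.fc = nn.Linear(in_dim, code_len)

    def forward(self, x):
        p = torch.sigmoid(self.fc(x))            # Bernoulli parameters per bit
        b = (torch.rand_like(p) < p).float()     # sample binary bits
        b = p + (b - p).detach()                 # straight-through gradient
        return 2 * b - 1, p                      # {-1,+1} codes and probabilities

img_enc, txt_enc = BernoulliHash(2048), BernoulliHash(300)
bi, _ = img_enc(torch.randn(8, 2048))            # image-modality codes
bt, _ = txt_enc(torch.randn(8, 300))             # text-modality codes
# a cross-modal agreement term; per the abstract, driving this modality gap
# all the way to zero can make the representations less informative
gap = (bi - bt).pow(2).sum(1).mean()
gap.backward()
```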
12
Peng L, Qian J, Xu Z, Xin Y, Guo L. Multi-Label Hashing for Dependency Relations Among Multiple Objectives. IEEE Transactions on Image Processing 2023; 32:1759-1773. [PMID: 37028054 DOI: 10.1109/tip.2023.3251028]
Abstract
Learned hash functions have been widely applied to large-scale image retrieval. Existing methods usually use CNNs to process an entire image at once, which is effective for single-label images but not for multi-label images. First, these methods cannot fully exploit the independent features of different objects in one image, so small objects carrying important information may be ignored. Second, they cannot capture the different semantic information conveyed by dependency relations among objects. Third, they ignore the impact of the imbalance between hard and easy training pairs, resulting in suboptimal hash codes. To address these issues, we propose a novel deep hashing method, termed multi-label hashing for dependency relations among multiple objectives (DRMH). We first utilize an object detection network to extract object feature representations so that small object features are not ignored, then fuse object visual features with position features, and further capture dependency relations among objects using a self-attention mechanism. In addition, we design a weighted pairwise hash loss to address the imbalance between hard and easy training pairs. Extensive experiments on multi-label and zero-shot datasets show that DRMH outperforms many state-of-the-art hashing methods with respect to different evaluation metrics.
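An illustrative sketch of a weighted pairwise hash loss of the kind described; the focal-style weighting below is an assumption, as the paper's exact weighting may differ. Hard pairs, whose predicted code similarity disagrees most with the label, receive larger weights.

```python
import torch

def weighted_pairwise_loss(codes, sim, gamma=2.0):
    # codes: (n, k) tanh-relaxed hash codes; sim: (n, n) 1 if labels overlap else 0
    k = codes.size(1)
    s_hat = codes @ codes.t() / k          # predicted similarity in [-1, 1]
    target = 2 * sim - 1                   # map {0,1} -> {-1,+1}
    err = (s_hat - target).abs() / 2       # in [0, 1]; large for hard pairs
    weight = err.detach().pow(gamma)       # focal-style emphasis on hard pairs
    return (weight * err.pow(2)).mean()

codes = torch.tanh(torch.randn(6, 32, requires_grad=True))
sim = (torch.rand(6, 6) > 0.5).float()     # toy pairwise label-overlap matrix
weighted_pairwise_loss(codes, sim).backward()
```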
13
Shu Z, Yong K, Yu J, Gao S, Mao C, Yu Z. Discrete asymmetric zero-shot hashing with application to cross-modal retrieval. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.09.037]
14
Fan W, Liang C, Wang T. Contrastive semantic disentanglement in latent space for generalized zero-shot learning. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109949]
15
Liu X, Ji Z, Pang Y, Han J, Li X. DGIG-Net: Dynamic Graph-in-Graph Networks for Few-Shot Human-Object Interaction. IEEE Transactions on Cybernetics 2022; 52:7852-7864. [PMID: 33566778 DOI: 10.1109/tcyb.2021.3049537]
Abstract
Few-shot learning (FSL) for human-object interaction (HOI) aims at recognizing various relationships between human actions and surrounding objects from only a few samples. It is a challenging vision task, in which the diversity and interactivity of human actions make it difficult to learn an adaptive classifier that captures ambiguous interclass information. Therefore, traditional FSL methods usually perform unsatisfactorily in complex HOI scenes. To this end, we propose dynamic graph-in-graph networks (DGIG-Net), a novel graph-prototype framework that learns a dynamic metric space by embedding a visual subgraph into a task-oriented cross-modal graph for few-shot HOI. Specifically, we first build a knowledge reconstruction graph to learn latent representations for HOI categories by reconstructing the relationships among visual features, which generates visual representations under the category distribution of every task. Then, a dynamic relation graph integrates both reconstructible visual nodes and dynamic task-oriented semantic information to explore a graph metric space for HOI class prototypes, exploiting the discriminative information in the similarities among actions or objects. We validate DGIG-Net on multiple benchmark datasets, on which it largely outperforms existing FSL approaches and achieves state-of-the-art results.
16
Xie GS, Zhang Z, Liu G, Zhu F, Liu L, Shao L, Li X. Generalized Zero-Shot Learning With Multiple Graph Adaptive Generative Networks. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:2903-2915. [PMID: 33493121 DOI: 10.1109/tnnls.2020.3046924]
Abstract
Generative adversarial networks (GANs) for (generalized) zero-shot learning (ZSL) aim to generate unseen image features when conditioned on unseen class embeddings, each of which corresponds to one unique category. Most existing works on GANs for ZSL generate features by merely feeding the seen image feature/class embedding (combined with random Gaussian noise) pairs into the generator/discriminator for a two-player minimax game. However, the structure consistency of the distributions among the real/fake image features, which may shift the generated features away from their real distribution to some extent, is seldom considered. In this paper, to align the weights of the generator for better structure consistency between real/fake features, we propose a novel multigraph adaptive GAN (MGA-GAN). Specifically, a Wasserstein GAN equipped with a classification loss is trained to generate discriminative features with structure consistency. MGA-GAN leverages the multigraph similarity structures between sliced seen real/fake feature samples to assist in updating the generator weights in the local feature manifold. Moreover, correlation graphs for the whole real/fake features are adopted to guarantee structure correlation in the global feature manifold. Extensive evaluations on four benchmarks clearly demonstrate the superiority of MGA-GAN over its state-of-the-art counterparts.
17
Xu Y, Mu L, Ji Z, Liu X, Han J. Meta hyperbolic networks for zero-shot learning. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.03.040]
18
Feng L, Zhao C, Li X. Bias-Eliminated Semantic Refinement for Any-Shot Learning. IEEE Transactions on Image Processing 2022; 31:2229-2244. [PMID: 35213308 DOI: 10.1109/tip.2022.3152631]
Abstract
When training samples are scarce, the semantic embedding technique, i.e., describing class labels with attributes, provides a condition to generate visual features for unseen objects by transferring the knowledge from seen objects. However, semantic descriptions are usually obtained in an external paradigm, such as manual annotation, resulting in weak consistency between descriptions and visual features. In this paper, we refine the coarse-grained semantic description for any-shot learning tasks, i.e., zero-shot learning (ZSL), generalized zero-shot learning (GZSL), and few-shot learning (FSL). A new model, namely, the semantic refinement Wasserstein generative adversarial network (SRWGAN) model, is designed with the proposed multihead representation and hierarchical alignment techniques. Unlike conventional methods, semantic refinement is performed with the aim of identifying a bias-eliminated condition for disjoint-class feature generation and is applicable in both inductive and transductive settings. We extensively evaluate model performance on six benchmark datasets and observe state-of-the-art results for any-shot learning; e.g., we obtain 70.2% harmonic accuracy for the Caltech UCSD Birds (CUB) dataset and 82.2% harmonic accuracy for the Oxford Flowers (FLO) dataset in the standard GZSL setting. Various visualizations are also provided to show the bias-eliminated generation of SRWGAN. Our code is available.
19
Ji Z, Hou Z, Liu X, Pang Y, Han J. Information Symmetry Matters: A Modal-Alternating Propagation Network for Few-Shot Learning. IEEE Transactions on Image Processing 2022; 31:1520-1531. [PMID: 35050856 DOI: 10.1109/tip.2022.3143005]
Abstract
Semantic information provides intra-class consistency and inter-class discriminability beyond visual concepts, which has been employed in Few-Shot Learning (FSL) to achieve further gains. However, semantic information is only available for labeled samples but absent for unlabeled samples, in which the embeddings are rectified unilaterally by guiding the few labeled samples with semantics. Therefore, it is inevitable to bring a cross-modal bias between semantic-guided samples and nonsemantic-guided samples, which results in an information asymmetry problem. To address this problem, we propose a Modal-Alternating Propagation Network (MAP-Net) to supplement the absent semantic information of unlabeled samples, which builds information symmetry among all samples in both visual and semantic modalities. Specifically, the MAP-Net transfers the neighbor information by the graph propagation to generate the pseudo-semantics for unlabeled samples guided by the completed visual relationships and rectify the feature embeddings. In addition, due to the large discrepancy between visual and semantic modalities, we design a Relation Guidance (RG) strategy to guide the visual relation vectors via semantics so that the propagated information is more beneficial. Extensive experimental results on three semantic-labeled datasets, i.e., Caltech-UCSD-Birds 200-2011, SUN Attribute Database and Oxford 102 Flower, have demonstrated that our proposed method achieves promising performance and outperforms the state-of-the-art approaches, which indicates the necessity of information symmetry.
20
Coordinating Experience Replay: A Harmonious Experience Retention approach for Continual Learning. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107589]
21
22
Lai N, Kan M, Han C, Song X, Shan S. Learning to Learn Adaptive Classifier-Predictor for Few-Shot Learning. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:3458-3470. [PMID: 32755872 DOI: 10.1109/tnnls.2020.3011526]
Abstract
Few-shot learning aims to learn a well-performing model from a few labeled examples. Recently, quite a few works propose to learn a predictor to directly generate model parameter weights with the episodic training strategy of meta-learning and achieve fairly promising performance. However, the predictor in these works is task-agnostic, which means that the predictor cannot adjust to novel tasks in the testing phase. In this article, we propose a novel meta-learning method to learn how to learn a task-adaptive classifier-predictor that generates classifier weights for few-shot classification. Specifically, a meta classifier-predictor module (MPM) is introduced to learn how to adaptively update a task-agnostic classifier-predictor to a task-specialized one on a novel task with a newly proposed center-uniqueness loss function. Compared with previous works, our task-adaptive classifier-predictor can better capture the characteristics of each category in a novel task and thus generate a more accurate and effective classifier. Our method is evaluated on two commonly used benchmarks for few-shot classification, i.e., miniImageNet and tieredImageNet. An ablation study verifies the necessity of learning a task-adaptive classifier-predictor and the effectiveness of our newly proposed center-uniqueness loss. Moreover, our method achieves the state-of-the-art performance on both benchmarks, thus demonstrating its superiority.
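A schematic sketch of the classifier-predictor idea: classifier weights are predicted from support-set class prototypes, then specialized to the task with one gradient step on the support loss. The prototype-based predictor and the single-step update rule are simplifying assumptions; the center-uniqueness loss is not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

predictor = nn.Linear(64, 64)    # task-agnostic classifier-predictor

def predict_classifier(support, labels, n_way=5):
    # class prototypes from the support set -> predicted classifier weights
    protos = torch.stack([support[labels == c].mean(0) for c in range(n_way)])
    return predictor(protos)     # (n_way, 64)

support = torch.randn(25, 64)                      # 5-way 5-shot support embeddings
labels = torch.arange(5).repeat_interleave(5)
w = predict_classifier(support, labels)
loss = F.cross_entropy(support @ w.t(), labels)    # support loss on this task
grad, = torch.autograd.grad(loss, w, create_graph=True)
w_task = w - 0.1 * grad                            # task-specialized classifier
logits = torch.randn(10, 64) @ w_task.t()          # classify query embeddings
```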
23
Xu Y, Han C, Qin J, Xu X, Han G, He S. Transductive Zero-Shot Action Recognition via Visually Connected Graph Convolutional Networks. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:3761-3769. [PMID: 32822308 DOI: 10.1109/tnnls.2020.3015848]
Abstract
With the explosive growth of action categories, zero-shot action recognition aims to extend a well-trained model to novel/unseen classes. To bridge the large knowledge gap between seen and unseen classes, in this brief, we visually associate unseen actions with seen categories in a visually connected graph, and the knowledge is then transferred from the visual feature space to the semantic space via grouped attention graph convolutional networks (GAGCNs). In particular, we extract visual features for all the actions, and a visually connected graph is built to attach seen actions to visually similar unseen categories. Moreover, the proposed grouped attention mechanism exploits the hierarchical knowledge in the graph so that the GAGCN can propagate the visual-semantic connections from seen actions to unseen ones. We extensively evaluate the proposed method on three datasets: HMDB51, UCF101, and NTU RGB+D. Experimental results show that the GAGCN outperforms state-of-the-art methods.
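A condensed sketch of the visually connected graph idea, with construction details assumed: actions are linked to their visually nearest neighbors, and semantic embeddings are propagated through normalized graph convolutions so unseen actions inherit classifier knowledge from visually similar seen ones. The grouped attention mechanism is omitted, and the split and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def knn_graph(feats, k=3):
    # cosine-similarity kNN graph, symmetrized, with self-loops and D^{-1/2} A D^{-1/2}
    sim = F.normalize(feats, dim=1) @ F.normalize(feats, dim=1).t()
    idx = sim.topk(k + 1, dim=1).indices[:, 1:]        # skip self-match
    adj = torch.zeros_like(sim)
    adj.scatter_(1, idx, 1.0)
    adj = ((adj + adj.t()) > 0).float() + torch.eye(feats.size(0))
    deg = adj.sum(1).pow(-0.5)
    return deg.unsqueeze(1) * adj * deg.unsqueeze(0)

feats = torch.randn(60, 512)       # visual features: e.g., 51 seen + 9 unseen actions
sem = torch.randn(60, 300)         # semantic embeddings of all action categories
adj = knn_graph(feats)
w1, w2 = torch.randn(300, 256) * 0.02, torch.randn(256, 512) * 0.02
h = torch.relu(adj @ sem @ w1)     # first propagation layer
classifiers = adj @ h @ w2         # per-action visual classifiers, incl. unseen ones
```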
24
Lee J, Kim H, Byun H. Sequence feature generation with temporal unrolling network for zero-shot action recognition. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.03.070]
25
Wan S, Hou Y, Bao F, Ren Z, Dong Y, Dai Q, Deng Y. Human-in-the-Loop Low-Shot Learning. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:3287-3292. [PMID: 32813663 DOI: 10.1109/tnnls.2020.3011559]
Abstract
We consider a human-in-the-loop scenario in the context of low-shot learning. Our approach is inspired by the fact that the variability of samples in novel categories cannot be sufficiently reflected by such limited observations. Some heterogeneous samples that are quite different from the existing labeled novel data can inevitably emerge in the testing phase. To this end, we consider augmenting an uncertainty assessment module into the low-shot learning system to account for the disturbance of those out-of-distribution (OOD) samples. Once detected, these OOD samples are passed to human beings for active labeling. Due to the discrete nature of this uncertainty assessment process, the whole Human-In-the-Loop Low-shot (HILL) learning framework is not end-to-end trainable. We hence revisit the learning system from the perspective of reinforcement learning and introduce the REINFORCE algorithm to optimize model parameters via policy gradient. The whole system gains noticeable improvements over existing low-shot learning approaches.
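A toy sketch of the REINFORCE step for the discrete pass-to-human decision: a policy outputs the probability of flagging a sample as OOD, a Bernoulli action is sampled, and its log-probability is weighted by a reward. The reward below is a placeholder; the paper's reward design is not reproduced here.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
feats = torch.randn(16, 64)                      # query-sample embeddings
p = policy(feats).squeeze(1)                     # P(flag as OOD)
dist = torch.distributions.Bernoulli(p)
action = dist.sample()                           # 1 = ask a human for the label
reward = torch.randn(16)                         # placeholder task reward
loss = -(dist.log_prob(action) * reward).mean()  # policy-gradient surrogate loss
loss.backward()                                  # gradients flow despite discrete actions
```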
26
Ji Z, Wang Q, Cui B, Pang Y, Cao X, Li X. A semi-supervised zero-shot image classification method based on soft-target. Neural Netw 2021; 143:88-96. [PMID: 34102379 DOI: 10.1016/j.neunet.2021.05.019]
Abstract
Zero-shot learning (ZSL) aims at training a classification model with data only from seen categories to recognize data from disjoint unseen categories. Domain shift and generalization capability are two fundamental challenges in ZSL. In this paper, we address them with a novel Soft-Target Semi-supervised Classification (STSC) model. Specifically, an autoencoder network is leveraged, in which labeled data from the seen categories and unlabeled ancillary data collected from the Internet or other datasets are employed as two branches. For the branch of labeled seen data, side information is employed as the latent vectors that connect the input of the encoder and the output of the decoder; in this way, visual and side information are implicitly aligned. For the branch of unlabeled ancillary data, the reconstruction ability of the network is explicitly strengthened. Meanwhile, these ancillary data can be viewed as a smoothing of the domain distribution, which contributes to alleviating the domain shift problem. To further guarantee the generalization ability, a Softmax-T loss function is proposed that makes full use of the soft target. Extensive experiments on three benchmark datasets show the superiority of the proposed approach on both traditional zero-shot learning and generalized zero-shot learning tasks.
Affiliation(s)
- Zhong Ji, School of Electrical and Information Engineering, Tianjin University, Tianjin, China.
- Qiang Wang, School of Electrical and Information Engineering, Tianjin University, Tianjin, China.
- Biying Cui, School of Electrical and Information Engineering, Tianjin University, Tianjin, China.
- Yanwei Pang, School of Electrical and Information Engineering, Tianjin University, Tianjin, China.
- Xianbin Cao, School of Electronic and Information Engineering, Beihang University, Beijing, China.
- Xuelong Li, Center for OPTical IMagery Analysis and Learning, Northwestern Polytechnical University, Xi'an, China.
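A sketch of a Softmax-T style soft-target loss for the abstract above, under the assumption that it follows the standard temperature-softened distillation form; the paper's exact formulation may differ. A temperature T > 1 softens the target predictions so the learner also picks up inter-class structure.

```python
import torch
import torch.nn.functional as F

def softmax_t_loss(student_logits, teacher_logits, T=4.0):
    # temperature-softened soft targets, scaled by T^2 as in standard distillation
    soft_target = F.softmax(teacher_logits / T, dim=1)
    log_pred = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_pred, soft_target, reduction="batchmean") * T * T

s = torch.randn(8, 10, requires_grad=True)   # student logits
t = torch.randn(8, 10)                       # soft-target source logits
softmax_t_loss(s, t).backward()
```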
27
Feng L, Zhao C. Transfer Increment for Generalized Zero-Shot Learning. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:2506-2520. [PMID: 32663133 DOI: 10.1109/tnnls.2020.3006322]
Abstract
Zero-shot learning (ZSL) is a successful paradigm for categorizing objects from previously unseen classes. However, it suffers from severe performance degradation in the generalized ZSL (GZSL) setting, i.e., when recognizing test images from both seen and unseen classes. In this article, we present a simple but effective mechanism for GZSL and more open scenarios based on a transfer-increment strategy. On the one hand, a dual-knowledge-source-based generative model is constructed to tackle the missing-data problem. Specifically, the local relational knowledge extracted from the label-embedding space and the global relational knowledge, i.e., the estimated data center in the feature-embedding space, are jointly considered to synthesize virtual exemplars. On the other hand, we further explore the training issue for generative models under the GZSL setting. Two incremental training modes are designed to learn the unseen classes directly from the synthesized exemplars instead of training classifiers with the seen and synthesized unseen exemplars together. This not only enables effective learning of unseen classes but also requires fewer computing and storage resources in practical applications. Comprehensive experiments are conducted on five benchmark datasets. By considering both the generation and training processes for virtual exemplars, the proposed transfer-increment strategy yields significant improvements over state-of-the-art methods in both the conventional ZSL and GZSL tasks.
28
Zero-Shot Image Classification Based on a Learnable Deep Metric. Sensors 2021; 21:s21093241. [PMID: 34067100 PMCID: PMC8124744 DOI: 10.3390/s21093241]
Abstract
Supervised deep learning models have achieved great success in image classification after training with large numbers of labeled samples. In practice, however, many categories have only a few labeled training samples, and some categories have no training samples at all. Zero-shot learning greatly reduces the dependence of image classification models on labeled training samples. Nevertheless, learning the similarity between visual features and semantic features with a predefined fixed metric (e.g., Euclidean distance) has limitations, and the mapping process suffers from the semantic gap problem. To address these problems, a new zero-shot image classification method based on an end-to-end learnable deep metric is proposed in this paper. First, common space embedding maps the visual features and semantic features into a common space. Second, an end-to-end learnable deep metric, namely a relation network, learns the similarity between visual features and semantic features. Finally, unseen images are classified according to the similarity score. Extensive experiments on four datasets indicate the effectiveness of the proposed method.
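A minimal relation-network sketch matching the description: visual and semantic features are embedded into a common space, concatenated, and scored by a small learned network rather than a fixed metric. The dimensions (e.g., 312-dimensional class attributes) are assumptions.

```python
import torch
import torch.nn as nn

class RelationNet(nn.Module):
    def __init__(self, v_dim=2048, s_dim=312, c_dim=512):
        super().__init__()
        self.v = nn.Linear(v_dim, c_dim)       # visual -> common space
        self.s = nn.Linear(s_dim, c_dim)       # semantic -> common space
        self.rel = nn.Sequential(nn.Linear(2 * c_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, vis, sem):
        # score every (image, class) pair with a learned, not fixed, metric
        v = self.v(vis).unsqueeze(1).expand(-1, sem.size(0), -1)
        s = self.s(sem).unsqueeze(0).expand(vis.size(0), -1, -1)
        return self.rel(torch.cat([v, s], dim=-1)).squeeze(-1)  # (n_img, n_cls)

scores = RelationNet()(torch.randn(4, 2048), torch.randn(10, 312))
pred = scores.argmax(1)   # each image -> class with the highest relation score
```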
29
30
Shao H, Zhong D. One-shot cross-dataset palmprint recognition via adversarial domain adaptation. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.12.072]
31
Xie C, Xiang H, Zeng T, Yang Y, Yu B, Liu Q. Cross Knowledge-based Generative Zero-Shot Learning approach with Taxonomy Regularization. Neural Netw 2021; 139:168-178. [PMID: 33721699 DOI: 10.1016/j.neunet.2021.02.009]
Abstract
Although zero-shot learning (ZSL) has the inferential capability to recognize new classes that have never been seen before, it faces two fundamental challenges: the cross-modality and cross-domain gaps. To alleviate these problems, we develop a generative network-based ZSL approach equipped with the proposed Cross Knowledge Learning (CKL) scheme and Taxonomy Regularization (TR). In our approach, semantic features are taken as inputs, and the outputs are the synthesized visual features generated from the corresponding semantic features. CKL enables more relevant semantic features to be trained for semantic-to-visual feature embedding in ZSL, while TR significantly improves generalization to unseen images by encouraging the generative network to produce more generalized visual features. Extensive experiments on several benchmark datasets (i.e., AwA1, AwA2, CUB, NAB and aPY) show that our approach is superior to state-of-the-art methods in terms of ZSL image classification and retrieval.
Affiliation(s)
- Cheng Xie, National Pilot School of Software, Yunnan University, Kunming 650091, China
- Hongxin Xiang, National Pilot School of Software, Yunnan University, Kunming 650091, China
- Ting Zeng, National Pilot School of Software, Yunnan University, Kunming 650091, China
- Yun Yang, National Pilot School of Software, Yunnan University, Kunming 650091, China; Kunming Key Laboratory of Data Science and Intelligent Computing, Kunming 650500, China.
- Beibei Yu, National Pilot School of Software, Yunnan University, Kunming 650091, China
- Qing Liu, National Pilot School of Software, Yunnan University, Kunming 650091, China
32
33
Passalis N, Iosifidis A, Gabbouj M, Tefas A. Hypersphere-Based Weight Imprinting for Few-Shot Learning on Embedded Devices. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:925-930. [PMID: 32287012 DOI: 10.1109/tnnls.2020.2979745]
Abstract
Weight imprinting (WI) was recently introduced as a way to perform gradient descent-free few-shot learning. Due to this, WI was almost immediately adapted for performing few-shot learning on embedded neural network accelerators that do not support back-propagation, e.g., edge tensor processing units. However, WI suffers from many limitations, e.g., it cannot handle novel categories with multimodal distributions and special care should be given to avoid overfitting the learned embeddings on the training classes since this can have a devastating effect on classification accuracy (for the novel categories). In this article, we propose a novel hypersphere-based WI approach that is capable of training neural networks in a regularized, imprinting-aware way effectively overcoming the aforementioned limitations. The effectiveness of the proposed method is demonstrated using extensive experiments on three image data sets.
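For context, a minimal sketch of plain weight imprinting (the paper's hypersphere-based regularization is not reproduced here): a new class's classifier weight is imprinted as the normalized mean of its normalized few-shot embeddings, so enrolling a novel class needs no back-propagation.

```python
import torch
import torch.nn.functional as F

def imprint_weight(embeddings):
    # embeddings: (n_shots, d) from a frozen feature extractor
    w = F.normalize(embeddings, dim=1).mean(0)
    return F.normalize(w, dim=0)

base_w = F.normalize(torch.randn(10, 64), dim=1)        # 10 base-class weights
novel_w = imprint_weight(torch.randn(5, 64))            # 5-shot novel class
classifier = torch.cat([base_w, novel_w.unsqueeze(0)])  # now 11 classes
# cosine scores between normalized query embeddings and imprinted weights
logits = F.normalize(torch.randn(2, 64), dim=1) @ classifier.t()
```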
34
Zhao Y, Lai H, Yin J, Zhang Y, Yang S, Jia Z, Ma J. Zero-Shot Medical Image Retrieval for Emerging Infectious Diseases Based on Meta-Transfer Learning - Worldwide, 2020. China CDC Wkly 2020; 2:1004-1008. [PMID: 34594825 PMCID: PMC8422228 DOI: 10.46234/ccdcw2020.268]
Affiliation(s)
- Yuying Zhao, School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, Guangdong, China
- Hanjiang Lai, School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, Guangdong, China
- Jian Yin, School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, Guangdong, China
- Yewu Zhang, Center for Public Health Surveillance and Information Service, Chinese Center for Disease Control and Prevention, Beijing, China
- Shigui Yang, State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
- Zhongwei Jia, School of Public Health, Peking University, Beijing, China
- Jiaqi Ma, Center for Public Health Surveillance and Information Service, Chinese Center for Disease Control and Prevention, Beijing, China
35
Deep Semantic-Preserving Reconstruction Hashing for Unsupervised Cross-Modal Retrieval. Entropy 2020; 22:e22111266. [PMID: 33287034 PMCID: PMC7712897 DOI: 10.3390/e22111266]
Abstract
Deep hashing is the mainstream approach for large-scale cross-modal retrieval due to its high retrieval speed and low storage cost, but reconstructing modal semantic information remains very challenging. To further address unsupervised cross-modal semantic reconstruction, we propose a novel deep semantic-preserving reconstruction hashing (DSPRH) method. The algorithm combines spatial and channel semantic information and mines modal semantic information based on adaptive self-encoding and a joint semantic reconstruction loss. The main contributions are as follows. (1) We introduce a new spatial pooling network module based on tensor regular-polymorphic decomposition theory to generate rank-1 tensors that capture high-order context semantics, which assists the backbone network in capturing important contextual modal semantic information. (2) From an optimization perspective, we use global covariance pooling to capture channel semantic information and accelerate network convergence. In the feature reconstruction layer, we use two bottleneck autoencoders to achieve visual-text modal interaction. (3) In metric learning, we design a new loss function to optimize model parameters, which preserves the correlation between the image and text modalities. DSPRH is evaluated on MIRFlickr-25K and NUS-WIDE. The experimental results show that it achieves better performance on retrieval tasks.
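A sketch of the global covariance pooling step mentioned in contribution (2), in its standard formulation; the paper's exact variant may differ. Channel-wise second-order statistics replace first-order average pooling as the image descriptor.

```python
import torch

def covariance_pool(fmap):
    # fmap: (batch, C, H, W) backbone feature map
    b, c, h, w = fmap.shape
    x = fmap.reshape(b, c, h * w)
    x = x - x.mean(dim=2, keepdim=True)           # center each channel
    cov = x @ x.transpose(1, 2) / (h * w - 1)     # (batch, C, C) channel covariance
    return cov.flatten(1)                         # second-order image descriptor

desc = covariance_pool(torch.randn(2, 64, 7, 7))
print(desc.shape)  # torch.Size([2, 4096])
```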
36
Zhang H, Liu J, Yao Y, Long Y. Pseudo distribution on unseen classes for generalized zero shot learning. Pattern Recognit Lett 2020. [DOI: 10.1016/j.patrec.2020.05.021]
37
Ji Z, Chen K, Wang J, Yu Y, Zhang Z. Multi-modal generative adversarial network for zero-shot learning. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.105847]