1. Tian J, Saddik AE, Xu X, Li D, Cao Z, Shen HT. Intrinsic Consistency Preservation With Adaptively Reliable Samples for Source-Free Domain Adaptation. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:4738-4749. PMID: 38379234. DOI: 10.1109/tnnls.2024.3362948.
Abstract
Unsupervised domain adaptation (UDA) aims to alleviate domain shift by transferring knowledge learned from a labeled source dataset to an unlabeled target domain. Although UDA has seen promising progress recently, it requires access to data from both domains, making it problematic in scenarios where the source data are unavailable. In this article, we investigate a practical task, source-free domain adaptation (SFDA), which removes the requirement of the widely studied UDA to access source and target data simultaneously. In addition, we further study the imbalanced SFDA (ISFDA) problem, which addresses intra-domain class imbalance and inter-domain label shift in SFDA. We make two key observations about SFDA: 1) target data form clusters in the representation space regardless of whether the target data points are aligned with the source classifier and 2) target samples with higher classification confidence are more reliable and show less variation in their confidence during adaptation. Motivated by these observations, we propose a unified method, named intrinsic consistency preservation with adaptively reliable samples (ICPR), to jointly cope with SFDA and ISFDA. Specifically, ICPR first encourages intrinsic consistency in the predictions of neighbors for unlabeled samples under weak augmentation (standard flip-and-shift), regardless of their reliability. ICPR then generates strongly augmented views specifically for adaptively selected reliable samples and is trained to enforce intrinsic consistency between the weakly and strongly augmented views of the same image, with respect to both their own predictions and those of their neighbors. Additionally, we propose a prototype-like classifier to avoid the classification confusion caused by severe intra-domain class imbalance and inter-domain label shift. We demonstrate the effectiveness and general applicability of ICPR on six benchmarks covering both SFDA and ISFDA tasks. The reproducible code of our proposed ICPR method is available at https://github.com/CFM-MSG/Code_ICPR.
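To make the weak/strong consistency idea concrete, the following is a minimal sketch of confidence-gated consistency training between two augmented views; the threshold value, model interface, and loss form are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of confidence-based consistency between weakly and strongly
# augmented views, in the spirit of the reliable-sample selection described in
# the abstract. Threshold and model interface are assumptions, not ICPR's code.
import torch
import torch.nn.functional as F

def consistency_loss(model, weak_batch, strong_batch, conf_threshold=0.9):
    """Cross-entropy between weak-view pseudo-labels and strong-view predictions,
    computed only on samples whose weak-view confidence exceeds the threshold."""
    with torch.no_grad():
        probs_weak = F.softmax(model(weak_batch), dim=1)   # pseudo-label source
        conf, pseudo_labels = probs_weak.max(dim=1)
        reliable = conf > conf_threshold                   # stand-in for adaptive selection

    logits_strong = model(strong_batch)
    if reliable.sum() == 0:
        return logits_strong.new_zeros(())                 # no reliable samples this batch
    return F.cross_entropy(logits_strong[reliable], pseudo_labels[reliable])
```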
2. Lai L, Chen J, Zhang Z, Lin G, Wu Q. CMFAN: Cross-Modal Feature Alignment Network for Few-Shot Single-View 3D Reconstruction. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:5522-5534. PMID: 38593016. DOI: 10.1109/tnnls.2024.3383039.
Abstract
Few-shot single-view 3D reconstruction learns to reconstruct objects from novel categories based on a query image and a few support shapes. However, since the query image and the support shapes come from different modalities, there is an inherent feature misalignment problem that degrades reconstruction. Previous works in the literature do not consider this problem. To this end, we propose the cross-modal feature alignment network (CMFAN) with two novel techniques. One is a model pretraining strategy, cross-modal contrastive learning (CMCL), in which the 2D images and 3D shapes of the same objects compose the positives, while those from different objects form the negatives. With CMCL, the model learns to embed the 2D and 3D modalities of the same object into a tight region of the feature space and push away those from different objects, thus effectively aligning the global cross-modal features. The other is cross-modal feature fusion (CMFF), which further aligns and fuses the local features. Specifically, it first re-represents the local features with a cross-attention operation, making the local features share more information. Then, CMFF generates a descriptor for the support features and attaches it to each local feature vector of the query image via dense concatenation. Moreover, CMFF can be applied to multilevel local features and brings further gains. We conduct extensive experiments to evaluate the effectiveness of our designs, and CMFAN sets new state-of-the-art performance on all of the 1-/10-/25-shot tasks of the ShapeNet and ModelNet datasets.
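The CMCL objective described above is a contrastive loss over paired image and shape embeddings; a compact InfoNCE-style sketch is given below, with the symmetric formulation and temperature value assumed for illustration rather than taken from the paper.

```python
# Sketch of the cross-modal contrastive idea (CMCL): image and shape embeddings
# of the same object are pulled together, pairs from different objects pushed
# apart. InfoNCE form and temperature are assumptions for illustration.
import torch
import torch.nn.functional as F

def cmcl_loss(img_emb, shape_emb, temperature=0.07):
    """img_emb, shape_emb: (N, D) embeddings of the same N objects, row-aligned."""
    img_emb = F.normalize(img_emb, dim=1)
    shape_emb = F.normalize(shape_emb, dim=1)
    logits = img_emb @ shape_emb.t() / temperature     # (N, N) cross-modal similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric loss: image-to-shape and shape-to-image matching.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```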
3. Zhang B, Zhang Y, Li J, Chen J, Akutsu T, Cheung YM, Cai H. Unsupervised Dual Deep Hashing With Semantic-Index and Content-Code for Cross-Modal Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 2025; 47:387-399. PMID: 39316491. DOI: 10.1109/tpami.2024.3467130.
Abstract
Hashing technology has exhibited great potential for cross-modal retrieval owing to its appealing retrieval efficiency and storage effectiveness. Most current supervised cross-modal retrieval methods rely heavily on accurate semantic supervision, which becomes intractable to annotate as sample sizes keep growing. By comparison, existing unsupervised methods rely on accurate sample-similarity preservation strategies with intensive computational costs to compensate for the lack of semantic guidance, which weakens their ability to bridge the semantic gap. Furthermore, both kinds of approaches must search for the nearest samples among all samples in a large search space, which is laborious. To address these issues, this paper proposes an unsupervised dual deep hashing (UDDH) method with semantic-index and content-code for cross-modal retrieval. Deep hashing networks are utilized to extract deep features and jointly encode the dual hashing codes in a collaborative manner, with a common semantic index and modality content codes, to simultaneously bridge the semantic and heterogeneous gaps for cross-modal retrieval. The dual deep hashing architecture, comprising a head code on the semantic index and tail codes on modality content, enhances retrieval efficiency: a query only needs to be compared against samples sharing the same semantic index, which greatly shrinks the search space. UDDH integrates deep feature extraction, binary optimization, the common semantic index, and the modality content codes within a unified model, allowing collaborative optimization to enhance overall performance. Extensive experiments demonstrate the retrieval superiority of the proposed approach over state-of-the-art baselines.
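The efficiency argument, i.e., that a query only needs to be compared against samples sharing the same semantic index, can be illustrated with a short retrieval sketch; the bit widths, code layout, and exact-match filtering rule below are assumptions for illustration, not the published algorithm.

```python
# Illustrative sketch of retrieval with a dual code (semantic-index head plus
# content-code tail): candidates are first filtered by an exact match on the
# head, then ranked by Hamming distance on the tail. Bit widths are assumed.
import numpy as np

def retrieve(query_head, query_tail, db_heads, db_tails, top_k=10):
    """query_head: (H,) bits; query_tail: (T,) bits; db_heads: (N, H); db_tails: (N, T)."""
    same_index = np.where((db_heads == query_head).all(axis=1))[0]   # shrink the search space
    if same_index.size == 0:
        return np.array([], dtype=int)
    hamming = (db_tails[same_index] != query_tail).sum(axis=1)       # rank within the bucket
    order = np.argsort(hamming)[:top_k]
    return same_index[order]

# Example with random binary codes: a 4-bit semantic head and a 32-bit content tail.
rng = np.random.default_rng(0)
db_heads = rng.integers(0, 2, size=(1000, 4))
db_tails = rng.integers(0, 2, size=(1000, 32))
print(retrieve(db_heads[0], db_tails[0], db_heads, db_tails))
```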
4. Yang Y, Xi W, Zhou L, Tang J. Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation. IEEE Transactions on Image Processing 2024; PP:6881-6892. PMID: 40030596. DOI: 10.1109/tip.2024.3518759.
Abstract
Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. The assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others. However, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon in practice. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. The structure of instances in the common space is inherently affected by imbalanced modalities, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly, we propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching scheme that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constrains instance-level matching, the structure-aware distillation further regularizes the geometric consistency between the learned matching representations and the intra-modal representations through the proposed relational matching. Extensive experiments on different datasets confirm the superior cross-modal retrieval performance of our approach, which simultaneously enhances single-modal retrieval compared with the baseline models.
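One common way to realize the relational matching described above is to align the pairwise similarity structure of the learned matching representations with that of the intra-modal representations; the KL-based sketch below is such a stand-in, with the temperature and loss form assumed rather than drawn from the paper.

```python
# Hedged sketch of structure-aware (relational) distillation: the pairwise
# similarity structure of the matching representations is pushed toward that of
# the frozen intra-modal representations. KL-on-softened-similarities is one
# common choice, assumed here for illustration.
import torch
import torch.nn.functional as F

def relational_distillation(matching_emb, intra_modal_emb, tau=0.1):
    """Both inputs: (N, D) embeddings of the same N instances."""
    def relation(x):
        x = F.normalize(x, dim=1)
        sim = x @ x.t() / tau
        sim.fill_diagonal_(float('-inf'))          # ignore self-similarity
        return F.log_softmax(sim, dim=1)

    student = relation(matching_emb)
    with torch.no_grad():
        teacher = relation(intra_modal_emb).exp()  # target relational structure as probabilities
    return F.kl_div(student, teacher, reduction='batchmean')
```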
5. Text-based person search via local-relational-global fine grained alignment. Knowledge-Based Systems 2023. DOI: 10.1016/j.knosys.2023.110253.
6. Dai Y, Liu A, Chen M, Liu Y, Yao Y. Enhanced Soft Sensor with Qualified Augmented Samples for Quality Prediction of the Polyethylene Process. Polymers (Basel) 2022; 14:4769. PMID: 36365761. PMCID: PMC9656800. DOI: 10.3390/polym14214769.
Abstract
Data-driven soft sensors have increasingly been applied to quality measurement in industrial polymerization processes in recent years. However, because the assay process is costly, the limited labeled data available still pose significant obstacles to building accurate models. In this study, a novel soft sensor, named selective Wasserstein generative adversarial network with gradient penalty-based support vector regression (SWGAN-SVR), is proposed to enhance quality prediction with limited training samples. Specifically, the Wasserstein generative adversarial network with gradient penalty (WGAN-GP) is employed to capture the distribution of the available limited labeled data and to generate virtual candidates. Subsequently, an effective data-selection strategy is developed to alleviate the problem of varied-quality samples caused by the unstable training of the WGAN-GP. The selection strategy includes two parts: a centroid metric criterion and a statistical characteristic criterion. An SVR model is then constructed on the qualified augmented training data to evaluate prediction performance. The superiority of SWGAN-SVR is demonstrated using a numerical example and an industrial polyethylene process.
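As a rough illustration of the selection-then-regression stage, the sketch below screens virtual candidates (assumed to come from an already trained WGAN-GP, which is not reproduced here) with a simple centroid-distance rule and trains an SVR on the augmented set; the selection rule and hyperparameters are placeholders, not the paper's criteria.

```python
# Illustrative augmentation-selection-regression stage: candidates generated by
# an external WGAN-GP are screened by a centroid-distance stand-in criterion
# and merged with the real labeled data to fit an SVR. Not the paper's rules.
import numpy as np
from sklearn.svm import SVR

def select_candidates(real_X, candidate_X, keep_ratio=0.5):
    """Keep the candidates closest to the centroid of the real training inputs."""
    centroid = real_X.mean(axis=0)
    dist = np.linalg.norm(candidate_X - centroid, axis=1)
    keep = np.argsort(dist)[: int(len(candidate_X) * keep_ratio)]
    return candidate_X[keep], keep

def train_augmented_svr(real_X, real_y, candidate_X, candidate_y):
    sel_X, keep = select_candidates(real_X, candidate_X)
    X = np.vstack([real_X, sel_X])
    y = np.concatenate([real_y, candidate_y[keep]])
    return SVR(kernel="rbf", C=10.0).fit(X, y)
```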
Affiliation(s)
- Yun Dai
- Institute of Process Equipment and Control Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Angpeng Liu
- Institute of Process Equipment and Control Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Meng Chen
- Guangdong Basic and Applied Basic Research Foundation, Guangzhou 510640, China
- Yi Liu
- Institute of Process Equipment and Control Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Correspondence: (Y.L.); (Y.Y.); Tel.: +886-3-5713690 (Y.Y.)
- Yuan Yao
- Department of Chemical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan
- Correspondence: (Y.L.); (Y.Y.); Tel.: +886-3-5713690 (Y.Y.)
7. Wei J, Yang Y, Xu X, Zhu X, Shen HT. Universal Weighting Metric Learning for Cross-Modal Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:6534-6545. PMID: 34125668. DOI: 10.1109/tpami.2021.3088863.
Abstract
Cross-modal retrieval, which aims to match instances captured from different modalities, has recently attracted growing attention. The performance of cross-modal retrieval methods relies heavily on the capability of metric learning to mine and weight informative pairs. While various metric learning methods have been developed for unimodal retrieval tasks, metric learning for cross-modal retrieval has not been explored to the same extent. In this paper, we develop a universal weighting metric learning framework for cross-modal retrieval, which can effectively sample informative pairs and assign proper weight values to them based on their similarity scores, so that different pairs receive different penalty strengths. Based on this framework, we introduce two types of polynomial loss for cross-modal retrieval: self-similarity polynomial loss and relative-similarity polynomial loss. The former provides a polynomial function to associate weight values with self-similarity scores, and the latter defines a polynomial function to associate weight values with relative-similarity scores. Both self- and relative-similarity polynomial losses can be freely applied to off-the-shelf methods to further improve their retrieval performance. Extensive experiments on two image-text retrieval datasets, three video-text retrieval datasets, and one fine-grained image retrieval dataset demonstrate that the proposed method achieves a noticeable boost in retrieval performance.
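The weighting idea, i.e., pair weights given by a polynomial of similarity scores so that harder pairs are penalized more, can be sketched as follows; the coefficients, margin, and hinge-style sampling are illustrative assumptions rather than the published loss functions.

```python
# Sketch of polynomial pair weighting: weights are a polynomial of similarity
# scores, so high-similarity negatives and low-similarity positives receive
# larger penalties. Coefficients and hinge form are assumptions, not the paper's.
import torch

def polynomial_weights(sim, coeffs):
    """Evaluate w(s) = sum_k coeffs[k] * s**k elementwise."""
    return sum(c * sim.pow(k) for k, c in enumerate(coeffs))

def weighted_pair_loss(sim_matrix, margin=0.2, coeffs=(0.5, 1.0, 2.0)):
    """sim_matrix: (N, N) image-text similarities; diagonal entries are the positives."""
    n = sim_matrix.size(0)
    pos = sim_matrix.diag()                                   # matched pairs
    mask = ~torch.eye(n, dtype=torch.bool, device=sim_matrix.device)
    neg = sim_matrix[mask].view(n, n - 1)                     # mismatched pairs

    # Informative negatives: those violating the margin against their positive.
    violation = (neg - pos.unsqueeze(1) + margin).clamp(min=0)
    w_neg = polynomial_weights(neg.detach(), coeffs)          # weight grows with similarity
    w_pos = polynomial_weights((1 - pos).detach(), coeffs)    # weight grows as positives weaken

    return (w_neg * violation).mean() + (w_pos * (1 - pos).clamp(min=0)).mean()
```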
8. Shu Z, Yong K, Yu J, Gao S, Mao C, Yu Z. Discrete asymmetric zero-shot hashing with application to cross-modal retrieval. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.09.037.
9. Yang X, Wang S, Dong J, Dong J, Wang M, Chua TS. Video Moment Retrieval With Cross-Modal Neural Architecture Search. IEEE Transactions on Image Processing 2022; 31:1204-1216. PMID: 35015640. DOI: 10.1109/tip.2022.3140611.
Abstract
The task of video moment retrieval (VMR) is to retrieve a specific moment from an untrimmed video according to a textual query. It is a challenging task that requires effective modeling of complex cross-modal matching relationships. Recent efforts primarily model the cross-modal interactions with hand-crafted network architectures. Despite their effectiveness, they rely heavily on expert experience to select architectures and have numerous hyperparameters that need to be carefully tuned, which significantly limits their application in real-world scenarios. How to design flexible architectures for modeling cross-modal interactions with less manual effort is crucial for VMR but has received limited attention so far. To address this issue, we present a novel VMR approach that automatically searches for an optimal architecture to learn the cross-modal matching relationship. Specifically, we develop a cross-modal architecture search method. It first searches for repeatable cell architectures based on a directed acyclic graph, performing operation sampling over a customized task-specific operation set. Then, we adaptively modulate edge importance in the graph with a query-aware attention network, which performs soft edge sampling in the searched cell. Different from existing neural architecture search methods, our approach can effectively exploit the query information to obtain query-conditioned architectures for modeling cross-modal matching. Extensive experiments on three benchmark datasets show that our approach not only significantly outperforms state-of-the-art methods but also runs more efficiently and robustly than manually crafted network architectures.
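A rough sketch of query-conditioned edge weighting in a differentiable cell is given below: each edge mixes several candidate operations, and the mixing weights come from the query representation instead of free architecture parameters; the operation set and dimensions are placeholder assumptions, not the searched architecture from the paper.

```python
# Sketch of a query-aware mixed edge in a differentiable-NAS style cell: the
# query representation produces per-operation mixing weights (soft edge
# sampling). Operation set and sizes are placeholder assumptions.
import torch
import torch.nn as nn

class QueryAwareMixedEdge(nn.Module):
    def __init__(self, dim, ops=None):
        super().__init__()
        # Candidate operations on this edge (a tiny stand-in operation set).
        self.ops = nn.ModuleList(ops or [
            nn.Identity(),
            nn.Linear(dim, dim),
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
        ])
        # Query-aware attention: maps the query vector to one weight per operation.
        self.edge_attn = nn.Linear(dim, len(self.ops))

    def forward(self, x, query):
        """x: (B, dim) node features; query: (B, dim) pooled query representation."""
        weights = torch.softmax(self.edge_attn(query), dim=-1)      # (B, n_ops)
        outputs = torch.stack([op(x) for op in self.ops], dim=1)    # (B, n_ops, dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)         # soft edge sampling
```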