1. Wang Z, Gao Z, Yang Y, Wang G, Jiao C, Shen HT. Geometric Matching for Cross-Modal Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:5509-5521. PMID: 38652629. DOI: 10.1109/tnnls.2024.3381347.
Abstract
Despite significant progress, cross-modal retrieval still struggles with one-to-many matching, where a single query may correspond to multiple semantically valid instances in the other modality. Existing approaches usually map heterogeneous data into the learned space as deterministic point vectors. Although such deterministic point embeddings perform well at matching the single most similar instance, they cannot adequately represent the rich semantics of one-to-many correspondence. To address this limitation, we extend a deterministic point into a closed geometry and develop geometric representation learning methods for cross-modal retrieval: a set of points inside such a geometry can be semantically related to many candidates, which lets us capture semantic uncertainty effectively. We then introduce two types of geometric matching for one-to-many correspondence, i.e., point-to-rectangle matching (dubbed P2RM) and rectangle-to-rectangle matching (termed R2RM). The former treats all retrieved candidates as rectangles with zero volume (equivalent to points) and the query as a box, while the latter encodes all heterogeneous data as rectangles. Semantic similarity between heterogeneous data can then be evaluated by the Euclidean distance from a point to a rectangle or by the volume of intersection between two rectangles. Both strategies can be easily plugged into off-the-shelf approaches and further improve the retrieval performance of baselines. Under various evaluation metrics, extensive experiments and ablation studies on several commonly used datasets, two for image-text matching and two for video-text retrieval, demonstrate the effectiveness and superiority of our method.
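The two matching scores described here reduce to elementary geometry. Below is a minimal sketch of both scores for axis-aligned boxes, assuming each embedding is parameterized by its minimum and maximum corners (box_min, box_max); the function names are illustrative assumptions, and a trained system would likely use a smoothed variant of these raw quantities rather than this bare geometry.

    import numpy as np

    def point_to_rectangle_distance(point, box_min, box_max):
        # P2RM-style score: zero if the point lies inside the box,
        # otherwise the Euclidean distance to the nearest face or corner.
        nearest = np.clip(point, box_min, box_max)
        return float(np.linalg.norm(point - nearest))

    def rectangle_intersection_volume(a_min, a_max, b_min, b_max):
        # R2RM-style score: volume of the axis-aligned overlap,
        # zero as soon as the boxes are disjoint along any axis.
        overlap = np.minimum(a_max, b_max) - np.maximum(a_min, b_min)
        return float(np.prod(np.clip(overlap, 0.0, None)))

    # A query box [0, 2]^2 against a candidate point and a candidate box:
    q_min, q_max = np.zeros(2), np.full(2, 2.0)
    print(point_to_rectangle_distance(np.array([3.0, 1.0]), q_min, q_max))  # 1.0
    print(rectangle_intersection_volume(q_min, q_max,
                                        np.array([1.0, 1.0]),
                                        np.array([4.0, 4.0])))              # 1.0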
2. Chen W, Wang Y, Tang X, Yan P, Liu X, Lin L, Shi G, Robert E, Huang F. A specific fine-grained identification model for plasma-treated rice growth using multiscale shortcut convolutional neural network. Mathematical Biosciences and Engineering 2023; 20:10223-10243. PMID: 37322930. DOI: 10.3934/mbe.2023448.
Abstract
As an agricultural innovation, low-temperature plasma treatment is an environmentally friendly green technology that increases crop quality and productivity. However, the identification of plasma-treated rice growth has received little research attention. Traditional convolutional neural networks (CNNs) can automatically share convolution kernels and extract features, but their outputs are suitable only for coarse, category-level classification. Shortcuts from the bottom layers to the fully connected layers can be added to exploit the spatial and local information carried by those layers, which encodes the small distinctions necessary for fine-grained identification. In this work, 5000 original images capturing the basic growth information of rice at the tillering stage (both plasma-treated rice and untreated controls) were collected, and an efficient multiscale shortcut CNN (MSCNN) exploiting key information and cross-layer features was proposed. The results show that MSCNN outperforms mainstream models in accuracy, recall, precision and F1 score, reaching 92.64%, 90.87%, 92.88% and 92.69%, respectively. Finally, an ablation experiment comparing the average precision of MSCNN with and without shortcuts showed that the variant with three shortcuts achieved the best performance with the highest precision.
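The shortcut idea translates directly into a few lines of PyTorch. The following is a toy sketch, not the published architecture: the three stages, their channel widths, and the use of global average pooling for the shortcuts are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class MultiscaleShortcutCNN(nn.Module):
        # Toy MSCNN-style network: pooled features from every stage,
        # not just the last one, feed the fully connected classifier.
        def __init__(self, num_classes=2):
            super().__init__()
            self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.pool = nn.AdaptiveAvgPool2d(1)  # each shortcut becomes a fixed-size vector
            self.fc = nn.Linear(32 + 64 + 128, num_classes)

        def forward(self, x):
            f1 = self.stage1(x)  # bottom-layer features: fine texture and edges
            f2 = self.stage2(f1)
            f3 = self.stage3(f2)
            # Shortcuts: low- and mid-level features bypass the deeper stages,
            # so the small distinctions needed for fine-grained identification
            # reach the classifier directly instead of being averaged away.
            v = torch.cat([self.pool(f).flatten(1) for f in (f1, f2, f3)], dim=1)
            return self.fc(v)

    logits = MultiscaleShortcutCNN()(torch.randn(4, 3, 224, 224))  # -> shape (4, 2)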
Affiliation(s)
- Wenzhuo Chen, Yuan Wang, Xiaojiang Tang, Pengfei Yan, Xin Liu, Lianfeng Lin, Guannan Shi: College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
- Eric Robert: GREMI, UMR 7344, CNRS/Université d'Orléans, 45067 Orléans Cedex, France
- Feng Huang: College of Science, China Agricultural University, Beijing 100083, China; GREMI, UMR 7344, CNRS/Université d'Orléans, 45067 Orléans Cedex, France; LE STUDIUM Loire Valley Institute for Advanced Studies, Centre-Val de Loire region, France
3. Jing T, Xia H, Hamm J, Ding Z. Augmented Multimodality Fusion for Generalized Zero-Shot Sketch-Based Visual Retrieval. IEEE Transactions on Image Processing 2022; 31:3657-3668. PMID: 35576409. DOI: 10.1109/tip.2022.3173815.
Abstract
Zero-shot sketch-based image retrieval (ZS-SBIR) has attracted great attention recently due to the potential of sketch-based retrieval in zero-shot scenarios, where the categories of query sketches and gallery photos are not observed in the training stage. However, the more general and practical scenario, in which query sketches and gallery photos contain both seen and unseen categories, remains insufficiently explored. This problem is defined as generalized zero-shot sketch-based image retrieval (GZS-SBIR) and is the focus of this work. To this end, we propose a novel Augmented Multi-modality Fusion (AMF) framework to generalize seen concepts to unobserved ones efficiently. Specifically, a novel knowledge discovery module named cross-domain augmentation is designed in both visual and semantic space to mimic novel knowledge unseen during training, which is the key to handling the GZS-SBIR challenge. Moreover, a triplet domain alignment module is proposed to align the cross-domain distributions of photos and sketches in visual space. To enhance the robustness of our model, we explore embedding propagation to refine both visual and semantic features by removing undesired noise. Finally, visual-semantic fusion representations are concatenated for domain discrimination and task-specific recognition, which encourages cross-domain alignment in both visual and semantic feature space. Experimental evaluations are conducted on popular ZS-SBIR benchmarks as well as a new evaluation protocol designed for GZS-SBIR on the DomainNet dataset, which has more diverse sub-domains, and the promising results demonstrate the superiority of the proposed solution over other baselines. The source code is available at https://github.com/scottjingtt/AMF_GZS_SBIR.git.
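Of the components listed above, embedding propagation is the simplest to isolate: each feature is smoothed toward its nearest neighbors on a similarity graph, pulling noisy outliers back toward their neighborhood. A minimal sketch follows, assuming a cosine-similarity kNN graph and a single interpolation step; the graph construction and the mixing weight alpha are illustrative choices, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def propagate_embeddings(feats, k=5, alpha=0.5):
        # One smoothing step: x_i <- alpha * x_i + (1 - alpha) * mean of its kNN.
        x = F.normalize(feats, dim=1)
        sim = x @ x.t()                          # cosine similarities, shape (N, N)
        sim.fill_diagonal_(float('-inf'))        # exclude self-matches
        nbrs = sim.topk(k, dim=1).indices        # k nearest neighbors per row
        neighbor_mean = feats[nbrs].mean(dim=1)  # (N, k, D) -> (N, D)
        return alpha * feats + (1 - alpha) * neighbor_mean

    refined = propagate_embeddings(torch.randn(32, 512))  # visual or semantic features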
4. Wang C, Luo Z, Lin Y, Li S. Improving embedding learning by virtual attribute decoupling for text-based person search. Neural Computing and Applications 2022. DOI: 10.1007/s00521-021-06734-9.