1
Lai L, Chen J, Zhang Z, Lin G, Wu Q. CMFAN: Cross-Modal Feature Alignment Network for Few-Shot Single-View 3D Reconstruction. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:5522-5534. [PMID: 38593016] [DOI: 10.1109/tnnls.2024.3383039]
Abstract
Few-shot single-view 3D reconstruction learns to reconstruct novel-category objects from a query image and a few support shapes. However, since the query image and the support shapes are of different modalities, there is an inherent feature misalignment problem that damages the reconstruction; previous works in the literature do not consider this problem. To this end, we propose the cross-modal feature alignment network (CMFAN) with two novel techniques. One is a strategy for model pretraining, namely cross-modal contrastive learning (CMCL), in which the 2D images and 3D shapes of the same objects compose the positives, and those from different objects form the negatives. With CMCL, the model learns to embed the 2D and 3D modalities of the same object into a tight area in the feature space and push away those from different objects, thus effectively aligning the global cross-modal features. The other is cross-modal feature fusion (CMFF), which further aligns and fuses the local features. Specifically, it first re-represents the local features with the cross-attention operation, making the local features share more information. Then, CMFF generates a descriptor for the support features and attaches it to each local feature vector of the query image with dense concatenation. Moreover, CMFF can be applied to multilevel local features and brings further advantages. We conduct extensive experiments to evaluate the effectiveness of our designs, and CMFAN sets new state-of-the-art performance on all of the 1-/10-/25-shot tasks of the ShapeNet and ModelNet datasets.
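The CMCL pretraining objective described above is, at its core, a symmetric contrastive loss over paired 2D image and 3D shape embeddings. The snippet below is a minimal sketch of such an objective, assuming generic image and shape encoders and a fixed temperature; it illustrates the idea of pulling matched 2D/3D embeddings together and pushing mismatched ones apart, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def cmcl_loss(img_feat: torch.Tensor, shape_feat: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE-style loss over a batch of paired 2D/3D embeddings.

    img_feat:   (B, D) features from a 2D image encoder (an assumed stand-in)
    shape_feat: (B, D) features from a 3D shape encoder
    Row i of each tensor is assumed to describe the same object (a positive pair).
    """
    img = F.normalize(img_feat, dim=-1)
    shp = F.normalize(shape_feat, dim=-1)
    logits = img @ shp.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Match each image to its own shape and vice versa; off-diagonal entries act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with random stand-in features:
loss = cmcl_loss(torch.randn(8, 256), torch.randn(8, 256))
```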
2
Wang Z, Gao Z, Yang Y, Wang G, Jiao C, Shen HT. Geometric Matching for Cross-Modal Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:5509-5521. [PMID: 38652629] [DOI: 10.1109/tnnls.2024.3381347]
Abstract
Despite its significant progress, cross-modal retrieval still suffers from one-to-many matching cases, where a given query may correspond to multiple semantic instances in the other modality. However, existing approaches usually map heterogeneous data into the learned space as deterministic point vectors. In spite of their remarkable performance in matching the most similar instance, such deterministic point embeddings cannot sufficiently represent the rich semantics of one-to-many correspondence. To address these limitations, we extend a deterministic point into a closed geometry and develop geometric representation learning methods for cross-modal retrieval. A set of points inside such a geometry can thus be semantically related to many candidates, allowing us to effectively capture the semantic uncertainty. We then introduce two types of geometric matching for one-to-many correspondence, i.e., point-to-rectangle matching (dubbed P2RM) and rectangle-to-rectangle matching (termed R2RM). The former treats all retrieved candidates as rectangles with zero volume (equivalent to points) and the query as a box, while the latter encodes all heterogeneous data into rectangles. Therefore, we can evaluate semantic similarity among heterogeneous data by the Euclidean distance from a point to a rectangle or the volume of intersection between two rectangles. Additionally, both strategies can easily be applied to off-the-shelf approaches and further improve the retrieval performance of baselines. Under various evaluation metrics, extensive experiments and ablation studies on several commonly used datasets, two for image-text matching and two for video-text retrieval, demonstrate our effectiveness and superiority.
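The two similarity measures named in this abstract (point-to-rectangle distance and rectangle-intersection volume) can be illustrated with axis-aligned boxes. The sketch below is a simplified illustration under that assumption; parameterizing boxes by min/max corners is my choice, not necessarily the paper's.

```python
import torch

def point_to_box_distance(p, box_min, box_max):
    """Euclidean distance from points p (B, D) to axis-aligned boxes [box_min, box_max] (B, D).
    Zero when the point lies inside the box (an assumed convention)."""
    delta = torch.clamp(box_min - p, min=0) + torch.clamp(p - box_max, min=0)
    return delta.norm(dim=-1)

def box_intersection_volume(a_min, a_max, b_min, b_max):
    """Volume of the intersection of two axis-aligned boxes; zero if they do not overlap."""
    inter_min = torch.maximum(a_min, b_min)
    inter_max = torch.minimum(a_max, b_max)
    edges = torch.clamp(inter_max - inter_min, min=0)
    return edges.prod(dim=-1)
```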
3
Shao Z, Han J, Marnerides D, Debattista K. Region-Object Relation-Aware Dense Captioning via Transformer. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:4184-4195. [PMID: 35275824] [DOI: 10.1109/tnnls.2022.3152990]
Abstract
Dense captioning provides detailed captions of complex visual scenes. While a number of successes have been achieved in recent years, there are still two broad limitations: 1) most existing methods adopt an encoder-decoder framework, where the contextual information is sequentially encoded using long short-term memory (LSTM); however, the forget gate mechanism of LSTM makes it vulnerable when dealing with long sequences; and 2) the vast majority of prior arts consider regions of interest (RoIs) equally important, thus failing to focus on more informative regions. The consequence is that the generated captions cannot highlight the important contents of the image, which does not seem natural. To overcome these limitations, in this article, we propose a novel end-to-end transformer-based dense image captioning architecture, termed the transformer-based dense captioner (TDC). TDC learns the mapping between images and their dense captions via a transformer, prioritizing more informative regions. To this end, we present a novel unit, named the region-object correlation score unit (ROCSU), to measure the importance of each region, taking into account the relationships between detected objects and the region, alongside the confidence scores of detected objects within the region. Extensive experimental results and ablation studies on the standard dense-captioning datasets demonstrate the superiority of the proposed method over the state-of-the-art methods.
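The abstract says region importance is scored from the detected objects' relationships with the region and their confidence scores, without giving the formula. Purely as an illustration of that kind of scoring, the sketch below weights each detection's confidence by its overlap with the region; the overlap-times-confidence rule and all names are my own assumptions, not the ROCSU definition.

```python
import torch

def region_score(region_box, obj_boxes, obj_scores):
    """Toy region-importance score: sum of detection confidences weighted by
    how much of each detected object falls inside the region.

    region_box: (4,) tensor [x1, y1, x2, y2]
    obj_boxes:  (N, 4) detected object boxes
    obj_scores: (N,) detection confidences
    """
    ix1 = torch.maximum(obj_boxes[:, 0], region_box[0])
    iy1 = torch.maximum(obj_boxes[:, 1], region_box[1])
    ix2 = torch.minimum(obj_boxes[:, 2], region_box[2])
    iy2 = torch.minimum(obj_boxes[:, 3], region_box[3])
    inter = torch.clamp(ix2 - ix1, min=0) * torch.clamp(iy2 - iy1, min=0)
    obj_area = (obj_boxes[:, 2] - obj_boxes[:, 0]) * (obj_boxes[:, 3] - obj_boxes[:, 1])
    frac_inside = inter / obj_area.clamp(min=1e-6)
    return (frac_inside * obj_scores).sum()
```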
4
Liu Y, Liu H, Wang H, Meng F, Liu M. BCAN: Bidirectional Correct Attention Network for Cross-Modal Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:14247-14258. [PMID: 37256811] [DOI: 10.1109/tnnls.2023.3276796]
Abstract
As a fundamental topic in bridging the gap between vision and language, cross-modal retrieval aims to obtain the correspondence relationships between fragments, i.e., subregions in images and words in texts. Compared with earlier methods that learn a visual-semantic embedding from images and sentences into a shared embedding space, existing methods tend to learn the correspondences between words and regions via cross-modal attention. However, such attention-based approaches invariably result in semantic misalignment between subfragments for two reasons: 1) without modeling the relationship between subfragments and the semantics of the entire image or sentence, it is hard for such approaches to distinguish images or sentences with multiple identical semantic fragments; and 2) such approaches spread attention evenly over all subfragments, including nonvisual words and many redundant regions, which also leads to semantic misalignment. To solve these problems, this article proposes a bidirectional correct attention network (BCAN), which introduces a novel concept of the relevance between subfragments and the semantics of the entire image or sentence, and designs a novel correct attention mechanism that models the local and global similarity between images and sentences to correct attention weights focused on the wrong fragments. Specifically, we use this relevance concept to address semantic misalignment from two aspects, designing two independent units to correct the attention weights. The global correct unit (GCU) incorporates the global similarity between images and sentences into the attention mechanism to solve the semantic misalignment caused by focusing attention on relevant subfragments in irrelevant pairs (RI), while the local correct unit (LCU) considers the difference in attention weights between fragments across two steps to solve the semantic misalignment caused by focusing attention on irrelevant subfragments in relevant pairs (IR). Extensive experiments on large-scale MS-COCO and Flickr30K show that our proposed method outperforms all the attention-based methods and is competitive with the state of the art. Our code and pretrained model are publicly available at: https://github.com/liuyyy111/BCAN.
5
Yan X, Mao Y, Ye Y, Yu H. Cross-Modal Clustering With Deep Correlated Information Bottleneck Method. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:13508-13522. [PMID: 37220062] [DOI: 10.1109/tnnls.2023.3269789]
Abstract
Cross-modal clustering (CMC) intends to improve the clustering accuracy (ACC) by exploiting the correlations across modalities. Although recent research has made impressive advances, it remains a challenge to sufficiently capture the correlations across modalities due to the high-dimensional nonlinear characteristics of individual modalities and the conflicts in heterogeneous modalities. In addition, the meaningless modality-private information in each modality might become dominant in the process of correlation mining, which also interferes with the clustering performance. To tackle these challenges, we devise a novel deep correlated information bottleneck (DCIB) method, which aims at exploring the correlation information between multiple modalities while eliminating the modality-private information in each modality in an end-to-end manner. Specifically, DCIB treats the CMC task as a two-stage data compression procedure, in which the modality-private information in each modality is eliminated under the guidance of the shared representation of multiple modalities. Meanwhile, the correlations between multiple modalities are preserved from the aspects of feature distributions and clustering assignments simultaneously. Finally, the objective of DCIB is formulated as an objective function based on a mutual information measurement, in which a variational optimization approach is proposed to ensure its convergence. Experimental results on four cross-modal datasets validate the superiority of the DCIB. Code is released at https://github.com/Xiaoqiang-Yan/DCIB.
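The abstract formulates DCIB as a mutual-information-based objective but does not state it explicitly. As a rough point of reference only, a generic information-bottleneck-style objective for two modalities can be written as below; the exact DCIB formulation may differ.

```latex
% Illustrative two-modality information-bottleneck objective (not the paper's exact loss):
% X^1, X^2 are the raw modalities, Z^1, Z^2 their compressed representations,
% and beta trades off cross-modal correlation against modality-private information.
\max_{Z^1, Z^2}\; I(Z^1; Z^2) \;-\; \beta \left[\, I(X^1; Z^1) + I(X^2; Z^2) \,\right]
```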
6
Tao R, Zhu M, Cao H, Ren H. Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective. Sensors (Basel) 2024; 24:3130. [PMID: 38793984] [PMCID: PMC11125332] [DOI: 10.3390/s24103130]
Abstract
Fine-grained representation is fundamental to species classification based on deep learning, and in this context, cross-modal contrastive learning is an effective method. The diversity of species coupled with the inherent contextual ambiguity of natural language poses a primary challenge in the cross-modal representation alignment of conservation area image data. Integrating cross-modal retrieval tasks with generation tasks contributes to cross-modal representation alignment based on contextual understanding. However, during the contrastive learning process, apart from learning the differences in the data itself, a pair of encoders inevitably learns the differences caused by encoder fluctuations. The latter leads to convergence shortcuts, resulting in poor representation quality and an inaccurate reflection of the similarity relationships between samples in the original dataset within the shared space of features. To achieve fine-grained cross-modal representation alignment, we first propose a residual attention network to enhance consistency during momentum updates in cross-modal encoders. Building upon this, we propose momentum encoding from a multi-task perspective as a bridge for cross-modal information, effectively improving cross-modal mutual information, representation quality, and optimizing the distribution of feature points within the cross-modal shared semantic space. By acquiring momentum encoding queues for cross-modal semantic understanding through multi-tasking, we align ambiguous natural language representations around the invariant image features of factual information, alleviating contextual ambiguity and enhancing model robustness. Experimental validation shows that our proposed multi-task perspective of cross-modal momentum encoders outperforms similar models on standardized image classification tasks and image-text cross-modal retrieval tasks on public datasets by up to 8% on the leaderboard, demonstrating the effectiveness of the proposed method. Qualitative experiments on our self-built conservation area image-text paired dataset show that our proposed method accurately performs cross-modal retrieval and generation tasks among 8142 species, proving its effectiveness on fine-grained cross-modal image-text conservation area image datasets.
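The momentum-updated cross-modal encoders referred to above follow the general exponential-moving-average scheme popularized by momentum-contrast methods; the few lines below sketch that generic update (the coefficient value and parameter pairing are illustrative, not the paper's exact configuration).

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q: torch.nn.Module, encoder_k: torch.nn.Module, m: float = 0.999):
    """EMA update of the momentum (key) encoder toward the query encoder.
    Assumes the two encoders share the same architecture and parameter order."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```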
Affiliation(s)
- Rui Tao: College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China; College of Artificial Intelligence and Big Data, Hulunbuir University, Hulunbuir 021008, China
- Meng Zhu: College of Information Engineering, Harbin University, Harbin 150076, China
- Haiyan Cao: College of Artificial Intelligence and Big Data, Hulunbuir University, Hulunbuir 021008, China
- Honge Ren: College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China; Heilongjiang Forestry Intelligent Equipment Engineering Research Center, Harbin 150040, China
7
Jiang X, Xu X, Zhang J, Shen F, Cao Z, Shen HT. SDN: Semantic Decoupling Network for Temporal Language Grounding. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:6598-6612. [PMID: 36264718] [DOI: 10.1109/tnnls.2022.3211850]
Abstract
Temporal language grounding (TLG) is one of the most challenging cross-modal video understanding tasks, which aims at retrieving the most relevant video segment from an untrimmed video according to a natural language sentence. The existing methods can be separated into two dominant types: 1) proposal-based and 2) proposal-free methods, where the former conducts contextual interactions and the latter localizes timestamps flexibly. However, the constant-scale candidates in proposal-based methods limit the localization precision and bring extra computational costs. In contrast, the proposal-free methods perform well on high-precision metrics based on fine-grained features but suffer from a lack of coarse-grained interactions, which causes degradation when the video becomes complex. In this article, we propose a novel framework termed the semantic decoupling network (SDN) that combines the advantages of proposal-based and proposal-free methods and overcomes their defects. It contains three key components: 1) a semantic decoupling module (SDM); 2) a context modeling block (CMB); and 3) a semantic cross-level aggregation module (SCAM). By capturing the video-text contexts in multilevel semantics, the SDM and CMB effectively utilize the benefits of proposal-based methods. Meanwhile, the SCAM maintains the merit of proposal-free methods in that it localizes timestamps precisely. The experiments on three challenging datasets, i.e., Charades-STA, TACoS, and ActivityNet-Caption, show that our proposed SDN method significantly outperforms recent state-of-the-art methods, especially the proposal-free methods. Extensive analyses, as well as the implementation code of the proposed SDN method, are provided at https://github.com/CFM-MSG/Code_SDN.
8
Ma X, Yang M, Li Y, Hu P, Lv J, Peng X. Cross-Modal Retrieval With Noisy Correspondence via Consistency Refining and Mining. IEEE Transactions on Image Processing 2024; 33:2587-2598. [PMID: 38507381] [DOI: 10.1109/tip.2024.3374221]
Abstract
The success of existing cross-modal retrieval (CMR) methods heavily relies on the assumption that the annotated cross-modal correspondence is faultless. In practice, however, the correspondence of some pairs is inevitably contaminated during data collection or annotation, leading to the so-called noisy correspondence (NC) problem. To alleviate the influence of NC, we propose a novel method termed Consistency REfining And Mining (CREAM) by revealing and exploiting the difference between correspondence and consistency. Specifically, the correspondence and the consistency coincide only for true positive and true negative pairs, while being distinct for false positive and false negative pairs. Based on this observation, CREAM employs a collaborative learning paradigm to detect and rectify the correspondence of positives, and a negative mining approach to explore and utilize the consistency. Thanks to the consistency refining and mining strategy of CREAM, overfitting on the false positives can be prevented and the consistency rooted in the false negatives can be exploited, leading to a robust CMR method. Extensive experiments verify the effectiveness of our method on three image-text benchmarks, including Flickr30K, MS-COCO, and Conceptual Captions. Furthermore, we apply our method to the graph matching task, and the results demonstrate its robustness against the fine-grained NC problem. The code is available at https://github.com/XLearning-SCU/2024-TIP-CREAM.
9
Peng SJ, He Y, Liu X, Cheung YM, Xu X, Cui Z. Relation-Aggregated Cross-Graph Correlation Learning for Fine-Grained Image-Text Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:2194-2207. [PMID: 35830398] [DOI: 10.1109/tnnls.2022.3188569]
Abstract
Fine-grained image-text retrieval has been a hot research topic for bridging vision and language, and its main challenge is how to learn the semantic correspondence across different modalities. The existing methods mainly focus on learning global semantic correspondence or intramodal relation correspondence in separate data representations, but rarely consider the intermodal relations that interactively provide complementary hints for fine-grained semantic correlation learning. To address this issue, we propose a relation-aggregated cross-graph (RACG) model to explicitly learn the fine-grained semantic correspondence by aggregating both intramodal and intermodal relations, which can be well utilized to guide the feature correspondence learning process. More specifically, we first build a semantic-embedded graph to explore both fine-grained objects and their relations for each media type, aiming not only to characterize the object appearance in each modality but also to capture the intrinsic relation information that differentiates intramodal discrepancies. Then, a cross-graph relation encoder is newly designed to explore the intermodal relations across different modalities, which can mutually boost the cross-modal correlations to learn more precise intermodal dependencies. Besides, the feature reconstruction module and multihead similarity alignment are efficiently leveraged to optimize the node-level semantic correspondence, whereby relation-aggregated cross-modal embeddings between image and text are discriminatively obtained to benefit various image-text retrieval tasks with high retrieval performance. Extensive experiments evaluated on benchmark datasets quantitatively and qualitatively verify the advantages of the proposed framework for fine-grained image-text retrieval and show its competitive performance with the state of the art.
10
Liu H, Guo Y, Yin J, Gao Z, Nie L. Review Polarity-Wise Recommender. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:10039-10050. [PMID: 35427224] [DOI: 10.1109/tnnls.2022.3163789]
Abstract
Review-involved recommender systems, which use review information to enhance recommendation, have received increasing interest over the past years. Among them, one advanced branch extracts salient aspects from textual reviews (i.e., the item attributes that users express) and combines them with the matrix factorization (MF) technique. However, the existing approaches all ignore the fact that semantically different reviews often include opposite aspect information. In particular, positive reviews usually express aspects that users prefer, while negative ones describe aspects that users dislike. As a result, this may mislead the recommender systems into making incorrect decisions pertaining to user preference modeling. Toward this end, in this article, we present a review polarity-wise recommender model, dubbed RPR, to discriminately treat reviews with different polarities. To be specific, in this model, positive and negative reviews are separately gathered and used to model the user-preferred and user-rejected aspects, respectively. Besides, to overcome the imbalance of semantically different reviews, we further develop an aspect-aware importance weighting strategy to align the aspect importance for these two kinds of reviews. Extensive experiments conducted on eight benchmark datasets have demonstrated the superiority of our model when compared with several state-of-the-art review-involved baselines. Moreover, our method can provide certain explanations in real-world rating prediction scenarios.
11
Wu J, Wang L, Chen C, Lu J, Wu C. Multi-View Inter-Modality Representation with Progressive Fusion for Image-Text Matching. Neurocomputing 2023. [DOI: 10.1016/j.neucom.2023.02.043]
12
Unifying knowledge iterative dissemination and relational reconstruction network for image–text matching. Inf Process Manag 2023. [DOI: 10.1016/j.ipm.2022.103154]
13
Wang X, Du Y, Verberne S, Verbeek FJ. Improving weakly supervised phrase grounding via visual representation contextualization with contrastive learning. Appl Intell 2022. [DOI: 10.1007/s10489-022-04259-9]
14
Qi Q, Li K, Zheng H, Gao X, Hou G, Sun K. SGUIE-Net: Semantic Attention Guided Underwater Image Enhancement with Multi-Scale Perception. IEEE Transactions on Image Processing 2022; PP:6816-6830. [PMID: 36288230] [DOI: 10.1109/tip.2022.3216208]
Abstract
Due to wavelength-dependent light attenuation, refraction, and scattering, underwater images usually suffer from color distortion and blurred details. However, because of the limited number of underwater images paired with undistorted reference images, training deep enhancement models for diverse degradation types is quite difficult. To boost the performance of data-driven approaches, it is essential to establish more effective learning mechanisms that mine richer supervised information from limited training sample resources. In this paper, we propose a novel underwater image enhancement network, called SGUIE-Net, in which we introduce semantic information as high-level guidance via region-wise enhancement feature learning. Accordingly, we propose a semantic region-wise enhancement module to better learn local enhancement features for semantic regions with multi-scale perception. After using them as complementary features and feeding them to the main branch, which extracts the global enhancement features on the original image scale, the fused features bring semantically consistent and visually superior enhancements. Extensive experiments on the publicly available datasets and our proposed dataset demonstrate the impressive performance of SGUIE-Net. The code and proposed dataset are available at https://trentqq.github.io/SGUIE-Net.html.
15
Wei J, Yang Y, Xu X, Zhu X, Shen HT. Universal Weighting Metric Learning for Cross-Modal Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:6534-6545. [PMID: 34125668] [DOI: 10.1109/tpami.2021.3088863]
Abstract
Cross-modal retrieval has recently attracted growing attention; it aims to match instances captured from different modalities. The performance of cross-modal retrieval methods heavily relies on the capability of metric learning to mine and weight informative pairs. While various metric learning methods have been developed for unimodal retrieval tasks, cross-modal retrieval tasks have not been explored to their fullest extent. In this paper, we develop a universal weighting metric learning framework for cross-modal retrieval, which can effectively sample informative pairs and assign proper weight values to them based on their similarity scores, so that different pairs favor different penalty strengths. Based on this framework, we introduce two types of polynomial loss for cross-modal retrieval: self-similarity polynomial loss and relative-similarity polynomial loss. The former provides a polynomial function to associate the weight values with self-similarity scores, and the latter defines a polynomial function to associate the weight values with relative-similarity scores. Both the self- and relative-similarity polynomial losses can be freely applied to off-the-shelf methods and further improve their retrieval performance. Extensive experiments on two image-text retrieval datasets, three video-text retrieval datasets, and one fine-grained image retrieval dataset demonstrate that our proposed method can achieve a noticeable boost in retrieval performance.
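To make the idea of similarity-dependent pair weighting concrete, the fragment below sketches a polynomial weighting function applied to positive and negative pair similarities inside a margin-style loss; the specific polynomial form, coefficients, and margin are illustrative assumptions rather than the paper's exact self-/relative-similarity losses.

```python
import torch

def polynomial_weight(sim: torch.Tensor, coeffs=(1.0, 2.0, 4.0)) -> torch.Tensor:
    """Weight a pair by a polynomial of its similarity score: w(s) = sum_k a_k * s**k.
    The coefficients are illustrative; harder pairs can be emphasized by choosing them suitably."""
    return sum(a * sim.pow(k) for k, a in enumerate(coeffs))

def weighted_pair_loss(pos_sim: torch.Tensor, neg_sim: torch.Tensor, margin: float = 0.2):
    """Toy weighted hinge-style loss over positive and negative cosine similarities.
    Hard positives (low similarity) and hard negatives (high similarity) get larger weights."""
    pos_term = polynomial_weight(1.0 - pos_sim) * torch.clamp(margin - pos_sim, min=0)
    neg_term = polynomial_weight(neg_sim) * torch.clamp(neg_sim - margin, min=0)
    return pos_term.mean() + neg_term.mean()

loss = weighted_pair_loss(torch.rand(16), torch.rand(16))
```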
16
Fan W, Liang C, Wang T. Contrastive semantic disentanglement in latent space for generalized zero-shot learning. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109949]
17
Yang F, Ding X, Liu Y, Ma F, Cao J. Scalable semantic-enhanced supervised hashing for cross-modal retrieval. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109176]
18
Visual context learning based on textual knowledge for image–text retrieval. Neural Netw 2022; 152:434-449. [DOI: 10.1016/j.neunet.2022.05.008]
19
Zeng L, Li H, Xiao T, Shen F, Zhong Z. Graph convolutional network with sample and feature weights for Alzheimer’s disease diagnosis. Inf Process Manag 2022. [DOI: 10.1016/j.ipm.2022.102952]
20
Guo Y, Gao L, Song J, Wang P, Sebe N, Shen HT, Li X. Relation Regularized Scene Graph Generation. IEEE Transactions on Cybernetics 2022; 52:5961-5972. [PMID: 33710964] [DOI: 10.1109/tcyb.2021.3052522]
Abstract
Scene graph generation (SGG) is built on top of detected objects to predict object pairwise visual relations for describing the image content abstraction. Existing works have revealed that if the links between objects are given as prior knowledge, the performance of SGG is significantly improved. Inspired by this observation, in this article, we propose a relation regularized network (R2-Net), which can predict whether there is a relationship between two objects and encode this relation into object feature refinement and better SGG. Specifically, we first construct an affinity matrix among detected objects to represent the probability of a relationship between two objects. Graph convolution networks (GCNs) over this relation affinity matrix are then used as object encoders, producing relation-regularized representations of objects. With these relation-regularized features, our R2-Net can effectively refine object labels and generate scene graphs. Extensive experiments are conducted on the visual genome dataset for three SGG tasks (i.e., predicate classification, scene graph classification, and scene graph detection), demonstrating the effectiveness of our proposed method. Ablation studies also verify the key roles of our proposed components in performance improvement.
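The relation-regularized encoding described above amounts to running graph convolutions over object features using a predicted relation affinity matrix. A minimal sketch of that step is given below for a single image and a single GCN layer; the normalization choice and layer count are my assumptions, not the R2-Net specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGCNLayer(nn.Module):
    """One graph-convolution layer over detected-object features, weighted by a
    relation affinity matrix A (N x N, entries ~ probability of a relation)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, obj_feats: torch.Tensor, affinity: torch.Tensor) -> torch.Tensor:
        # Symmetric normalization with self-loops (a common GCN convention).
        a_hat = affinity + torch.eye(affinity.size(0), device=affinity.device)
        deg = a_hat.sum(dim=-1)
        d_inv_sqrt = deg.clamp(min=1e-6).pow(-0.5)
        norm_a = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return F.relu(self.linear(norm_a @ obj_feats))

# Usage: refine 10 object features of dim 256 with a predicted 10x10 affinity matrix.
layer = RelationGCNLayer(256, 256)
refined = layer(torch.randn(10, 256), torch.rand(10, 10))
```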
21
Convolutional Neural Network-Based Cross-Media Semantic Matching and User Adaptive Satisfaction Analysis Model. Computational Intelligence and Neuroscience 2022; 2022:4244675. [PMID: 35535181] [PMCID: PMC9078763] [DOI: 10.1155/2022/4244675]
Abstract
This paper presents an in-depth study of a convolutional neural network-based cross-media semantic matching and user-adaptive satisfaction analysis model. Building on existing convolutional neural networks, it exploits the rich spatial-correlation information in cross-media semantic matching to further improve the classification accuracy of hyperspectral images while reducing classification time under user-adaptive satisfaction complexity. Current convolutional neural network-based hyperspectral image classification methods struggle to capture the spatial pose characteristics of objects, and principal component analysis discards vital information when only a few components are retained. To address these issues, the paper proposes a stereo capsule network model for hyperspectral image classification based on extended attribute profile (EMAP) features. To ensure good generalization performance, a new convolutional neural network-based pan-sharpening algorithm for remote sensing images is also proposed, which widens the model to extract image feature information and replaces traditional convolution with dilated convolution. The experimental results show that the algorithm generalizes well while ensuring adaptive satisfaction.
22
Liang J, Yang C, Zeng M, Wang X. TransConver: transformer and convolution parallel network for developing automatic brain tumor segmentation in MRI images. Quant Imaging Med Surg 2022; 12:2397-2415. [PMID: 35371952] [PMCID: PMC8923874] [DOI: 10.21037/qims-21-919]
Abstract
Background: Medical image segmentation plays a vital role in computer-aided diagnosis (CAD) systems. Both convolutional neural networks (CNNs) with strong local information extraction capacities and transformers with excellent global representation capacities have achieved remarkable performance in medical image segmentation. However, because of the semantic differences between local and global features, how to combine convolution and transformers effectively is an important challenge in medical image segmentation.
Methods: In this paper, we proposed TransConver, a U-shaped segmentation network based on convolution and transformer for automatic and accurate brain tumor segmentation in MRI images. Unlike the recently proposed transformer and convolution based models, we proposed a parallel module named transformer-convolution inception (TC-inception), which extracts local and global information via convolution blocks and transformer blocks, respectively, and integrates them by a cross-attention fusion with global and local feature (CAFGL) mechanism. Meanwhile, the improved skip connection structure named skip connection with cross-attention fusion (SCCAF) mechanism can alleviate the semantic differences between encoder features and decoder features for better feature fusion. In addition, we designed 2D-TransConver and 3D-TransConver for 2D and 3D brain tumor segmentation tasks, respectively, and verified the performance and advantage of our model through brain tumor datasets.
Results: We trained our model on 335 cases from the training dataset of MICCAI BraTS2019 and evaluated the model's performance based on 66 cases from MICCAI BraTS2018 and 125 cases from MICCAI BraTS2019. Our TransConver achieved the best average Dice score of 83.72% and 86.32% on BraTS2019 and BraTS2018, respectively.
Conclusions: We proposed a transformer and convolution parallel network named TransConver for brain tumor segmentation. The TC-Inception module effectively extracts global information while retaining local details. The experimental results demonstrated that good segmentation requires the model to extract local fine-grained details and global semantic information simultaneously, and our TransConver effectively improves the accuracy of brain tumor segmentation.
23
Guan X, Yang Y, Li J, Xu X, Shen HT. Mind the Remainder: Taylor's Theorem View on Recurrent Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:1507-1519. [PMID: 33444144] [DOI: 10.1109/tnnls.2020.3042537]
Abstract
Recurrent neural networks (RNNs) have gained tremendous popularity in almost every sequence modeling task. Despite this effort, discrete unstructured data such as texts, audio, and videos are still difficult to embed in the feature space. Studies on improving neural networks have accelerated since the introduction of more complex or deeper architectures, but the improvements of previous methods depend heavily on the model at the expense of huge computational resources, and few of them pay attention to the algorithm. In this article, we bridge the Taylor series with the construction of RNNs: training an RNN can be considered a parameter estimate for the Taylor series. However, we found that there is a discrete term in the finite Taylor series, called the remainder, that cannot be optimized using gradient descent, which is part of the reason for the truncation error and for the model falling into a local optimal solution. To address this, we propose a training algorithm that estimates the range of the remainder and introduces a remainder obtained by sampling in this continuous space into the RNN to assist in optimizing the parameters. Notably, the performance of the RNN can be improved without changing the RNN architecture in the testing phase. We demonstrate that our approach is able to achieve state-of-the-art performance in action recognition and cross-modal retrieval tasks.
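For reference, the finite Taylor expansion with its remainder term, which the abstract's argument revolves around, has the standard Lagrange form written below; how the paper maps RNN parameters onto these terms is not specified in the abstract.

```latex
f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}\,(x-a)^{k} + R_{n}(x),
\qquad
R_{n}(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,(x-a)^{n+1}
\quad \text{for some } \xi \text{ between } a \text{ and } x.
```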
24
A Quantum Language-Inspired Tree Structural Text Representation for Semantic Analysis. Mathematics 2022. [DOI: 10.3390/math10060914]
Abstract
Text representation is an important topic in the field of natural language processing, which can effectively transfer knowledge to downstream tasks. To extract effective semantic information from text with unsupervised methods, this paper proposes a quantum language-inspired tree structural text representation model to study the correlations between words at variable distances for semantic analysis. Combining the different semantic contributions of associated words in different syntax trees, a syntax tree-based attention mechanism is established to highlight the semantic contributions of non-adjacent associated words and weaken the semantic weight of adjacent non-associated words. Moreover, the tree-based attention mechanism includes not only the overall information of entangled words in the dictionary but also the local grammatical structure of word combinations in different sentences. Experimental results on semantic textual similarity tasks show that the proposed method achieves significant improvements over state-of-the-art sentence embeddings.
25
Zhen L, Hu P, Peng X, Goh RSM, Zhou JT. Deep Multimodal Transfer Learning for Cross-Modal Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:798-810. [PMID: 33090960] [DOI: 10.1109/tnnls.2020.3029181]
Abstract
Cross-modal retrieval (CMR) enables a flexible retrieval experience across different modalities (e.g., texts versus images), which maximally benefits us from the abundance of multimedia data. Existing deep CMR approaches commonly require a large amount of labeled data for training to achieve high performance. However, it is time-consuming and expensive to annotate multimedia data manually. Thus, how to transfer valuable knowledge from existing annotated data to new data, especially from known categories to new categories, becomes attractive for real-world applications. To this end, we propose a deep multimodal transfer learning (DMTL) approach to transfer knowledge from previously labeled categories (source domain) to improve the retrieval performance on unlabeled new categories (target domain). Specifically, we employ a joint learning paradigm to transfer knowledge by assigning a pseudolabel to each target sample. During training, the pseudolabel is iteratively updated and passed through our model in a self-supervised manner. At the same time, to reduce the domain discrepancy of different modalities, we construct multiple modality-specific neural networks to learn a shared semantic space for different modalities by enforcing the compactness of homoinstance samples and the scatter of heteroinstance samples. Our method is remarkably different from most of the existing transfer learning approaches. To be specific, previous works usually assume that the source domain and the target domain have the same label set. In contrast, our method considers a more challenging multimodal learning situation where the label sets of the two domains are different or even disjoint. Experimental studies on four widely used benchmarks validate the effectiveness of the proposed method in multimodal transfer learning and demonstrate its superior performance in CMR compared with 11 state-of-the-art methods.
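One common way to realize the pseudolabeling step described above is to assign each unlabeled target sample the label of its nearest class prototype in the shared semantic space and refresh these assignments periodically. The sketch below shows only that generic step; the prototype-based rule is an assumption for illustration, not necessarily DMTL's exact mechanism.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def assign_pseudolabels(target_feats: torch.Tensor, class_prototypes: torch.Tensor):
    """Assign each target-domain feature (N, D) the index of its most similar
    class prototype (C, D) by cosine similarity. Re-run every few epochs during training."""
    t = F.normalize(target_feats, dim=-1)
    p = F.normalize(class_prototypes, dim=-1)
    sims = t @ p.t()                # (N, C) cosine similarities
    return sims.argmax(dim=-1)      # (N,) pseudolabels

pseudo = assign_pseudolabels(torch.randn(32, 128), torch.randn(10, 128))
```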
26
Abstract
Cross-modal retrieval aims to search samples of one modality via queries from other modalities, which is a hot issue in the multimedia community. However, two main challenges, i.e., the heterogeneity gap and semantic interaction across different modalities, have not been solved effectively. Reducing the heterogeneity gap can improve cross-modal similarity measurement, while modeling cross-modal semantic interaction can capture the semantic correlations more accurately. To this end, this paper presents a novel end-to-end framework, called the Dual Attention Generative Adversarial Network (DA-GAN). This technique is an adversarial semantic representation model with a dual attention mechanism, i.e., intra-modal attention and inter-modal attention. Intra-modal attention is used to focus on the important semantic features within a modality, while inter-modal attention explores the semantic interaction between different modalities and then represents the high-level semantic correlation more precisely. A dual adversarial learning strategy is designed to generate modality-invariant representations, which can reduce the cross-modal heterogeneity efficiently. Experiments on three commonly used benchmarks show that DA-GAN performs better than its competitors.
27
Peng L, Yang Y, Wang Z, Huang Z, Shen HT. MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:318-329. [PMID: 32750794] [DOI: 10.1109/tpami.2020.3004830]
Abstract
Visual question answering (VQA) is the task of answering natural language questions tied to the content of visual images. Most recent VQA approaches apply an attention mechanism to focus on the relevant visual objects and/or consider the relations between objects via off-the-shelf methods for visual relation reasoning. However, they still suffer from several drawbacks. First, they mostly model only simple relations between objects, so many complicated questions cannot be answered correctly because sufficient knowledge is not provided. Second, they seldom leverage the harmonious cooperation of visual appearance features and relation features. To solve these problems, we propose a novel end-to-end VQA model, termed the Multi-modal Relation Attention Network (MRA-Net). The proposed model explores both textual and visual relations to improve performance and interpretability. Specifically, we devise 1) a self-guided word relation attention scheme, which explores the latent semantic relations between words, and 2) two question-adaptive visual relation attention modules that can extract not only fine-grained and precise binary relations between objects but also more sophisticated ternary relations. Both kinds of question-related visual relations provide more and deeper visual semantics, thereby improving the visual reasoning ability of question answering. Furthermore, the proposed model combines appearance features with relation features to reconcile the two types of features effectively. Extensive experiments on five large benchmark datasets, VQA-1.0, VQA-2.0, COCO-QA, VQA-CP v2, and TDIUC, demonstrate that our proposed model outperforms state-of-the-art approaches.
29
Cai S, Li P, Su E, Xie L. Auditory Attention Detection via Cross-Modal Attention. Front Neurosci 2021; 15:652058. [PMID: 34366770] [PMCID: PMC8333999] [DOI: 10.3389/fnins.2021.652058]
Abstract
Humans show a remarkable perceptual ability to select the speech stream of interest among multiple competing speakers. Previous studies demonstrated that auditory attention detection (AAD) can infer which speaker is attended by analyzing a listener's electroencephalography (EEG) activity. However, previous AAD approaches perform poorly on short signal segments, so more advanced decoding strategies are needed to realize robust real-time AAD. In this study, we propose a novel approach, i.e., cross-modal attention-based AAD (CMAA), to exploit the discriminative features and the correlation between audio and EEG signals. With this mechanism, we hope to dynamically adapt the interactions and fuse cross-modal information by directly attending to audio and EEG features, thereby detecting the auditory attention activities manifested in brain signals. We also validate the CMAA model through data visualization and comprehensive experiments on a publicly available database. Experiments show that the CMAA achieves accuracies of 82.8%, 86.4%, and 87.6% for 1-, 2-, and 5-s decision windows under anechoic conditions, respectively; for a 2-s decision window, it achieves an average of 84.1% under real-world reverberant conditions. The proposed CMAA network not only achieves better performance than the conventional linear model but also outperforms the state-of-the-art non-linear approaches. These results and the data visualization suggest that the CMAA model can dynamically adapt the interactions and fuse cross-modal information by directly attending to audio and EEG features in order to improve AAD performance.
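Cross-modal attention of the kind described here is typically realized as scaled dot-product attention where queries come from one modality and keys/values from the other. The sketch below shows that generic operation with audio features querying EEG features; the dimensions and single-head form are illustrative assumptions, not the CMAA architecture.

```python
import math
import torch

def cross_modal_attention(audio_feats: torch.Tensor, eeg_feats: torch.Tensor):
    """Single-head scaled dot-product attention: audio frames (Ta, D) attend to
    EEG frames (Te, D). Returns audio-aligned EEG context vectors of shape (Ta, D)."""
    d = audio_feats.size(-1)
    scores = audio_feats @ eeg_feats.t() / math.sqrt(d)   # (Ta, Te) attention scores
    weights = torch.softmax(scores, dim=-1)
    return weights @ eeg_feats

context = cross_modal_attention(torch.randn(50, 64), torch.randn(128, 64))
```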
Affiliation(s)
- Longhan Xie: Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou, China
31
Zhang J, Shen F, Xu X, Shen HT. Temporal Reasoning Graph for Activity Recognition. IEEE Transactions on Image Processing 2020; 29:5491-5506. [PMID: 32286981] [DOI: 10.1109/tip.2020.2985219]
Abstract
Despite the great success achieved in activity analysis, many challenges remain. Most existing works in activity recognition pay more attention to designing efficient architectures or video sampling strategies. However, due to the fine-grained nature of actions and the long-term structure of video, activity recognition is expected to reason about temporal relations between video sequences. In this paper, we propose an efficient temporal reasoning graph (TRG) to simultaneously capture the appearance features and the temporal relations between video sequences at multiple time scales. Specifically, we construct learnable temporal relation graphs to explore temporal relations over multi-scale ranges. Additionally, to facilitate multi-scale temporal relation extraction, we design a multi-head temporal adjacency matrix to represent multiple kinds of temporal relations. Eventually, a multi-head temporal relation aggregator is proposed to extract the semantic meaning of the features convolved through the graphs. Extensive experiments are performed on widely used large-scale datasets, such as Something-Something, Charades, and Jester, and the results show that our model can achieve state-of-the-art performance. Further analysis shows that temporal relation reasoning with our TRG can extract discriminative features for activity recognition.
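The multi-head temporal adjacency idea can be sketched as a set of learnable frame-to-frame adjacency matrices, one per head, each used to propagate segment features before aggregation. The code below is a minimal illustration under those assumptions; the number of heads, the softmax normalization, and the mean aggregation are my choices, not TRG's exact design.

```python
import torch
import torch.nn as nn

class MultiHeadTemporalGraph(nn.Module):
    """Propagate per-segment features through H learnable temporal adjacency
    matrices (one per head) and average the per-head results."""

    def __init__(self, num_segments: int, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.adjacency = nn.Parameter(torch.randn(num_heads, num_segments, num_segments))
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, seg_feats: torch.Tensor) -> torch.Tensor:
        # seg_feats: (T, D) features for T video segments.
        adj = torch.softmax(self.adjacency, dim=-1)          # normalize each head's relations
        per_head = torch.einsum('hts,sd->htd', adj, seg_feats)
        return torch.relu(self.proj(per_head.mean(dim=0)))   # (T, D) relation-aware features

trg_like = MultiHeadTemporalGraph(num_segments=8, feat_dim=256)
out = trg_like(torch.randn(8, 256))
```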