1. Ma Y, Ji J, Sun X, Zhou Y, Hong X, Wu Y, Ji R. Image Captioning via Dynamic Path Customization. IEEE Transactions on Neural Networks and Learning Systems 2025;36:6203-6217. PMID: 39083387. DOI: 10.1109/tnnls.2024.3409354.
Abstract
This article explores a novel dynamic network for vision and language (V&L) tasks, where the inference structure is customized on the fly for different inputs. Most previous state-of-the-art (SOTA) approaches are static, handcrafted networks, which not only rely heavily on expert knowledge but also ignore the semantic diversity of input samples, resulting in suboptimal performance. To address these issues, we propose a novel Dynamic Transformer Network (DTNet) for image captioning, which dynamically assigns customized paths to different samples, leading to discriminative and accurate captions. Specifically, to build a rich routing space and improve routing efficiency, we introduce five types of basic cells and group them into two separate routing spaces according to their operating domains, i.e., spatial and channel. Then, we design a Spatial-Channel Joint Router (SCJR), which endows the model with the capability of path customization based on both the spatial and channel information of the input sample. To validate the effectiveness of the proposed DTNet, we conduct extensive experiments on the MS-COCO dataset and achieve new SOTA performance on both the Karpathy split and the online test server. The source code is publicly available at https://github.com/xmu-xiaoma666/DTNet.
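The routing idea can be made concrete with a toy per-sample router. The sketch below assumes hypothetical module names, shapes, and a simple soft mixture over candidate cells rather than the authors' actual SCJR implementation; it only illustrates how pooled spatial and channel statistics can jointly produce per-sample path weights.

```python
# Minimal sketch of input-conditioned path routing in the spirit of a
# spatial-channel joint router; all names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class ToyJointRouter(nn.Module):
    def __init__(self, dim, num_cells):
        super().__init__()
        # routing weights are predicted from pooled spatial and channel statistics
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, num_cells))

    def forward(self, x):                      # x: (B, N, C) patch tokens
        spatial_stat = x.mean(dim=1)           # (B, C) average over positions
        channel_stat = x.max(dim=1).values     # (B, C) a second, channel-wise statistic
        logits = self.gate(torch.cat([spatial_stat, channel_stat], dim=-1))
        return logits.softmax(dim=-1)          # (B, num_cells) per-sample path weights

class ToyDynamicLayer(nn.Module):
    def __init__(self, dim, cells):
        super().__init__()
        self.cells = nn.ModuleList(cells)      # candidate basic cells
        self.router = ToyJointRouter(dim, len(cells))

    def forward(self, x):
        w = self.router(x)                                            # (B, K)
        outs = torch.stack([cell(x) for cell in self.cells], dim=1)   # (B, K, N, C)
        return (w[:, :, None, None] * outs).sum(dim=1)                # weighted path mixture

# Example: two candidate cells, a channel MLP and an identity path.
layer = ToyDynamicLayer(64, [nn.Sequential(nn.Linear(64, 64), nn.GELU()), nn.Identity()])
tokens = torch.randn(2, 49, 64)
print(layer(tokens).shape)                     # torch.Size([2, 49, 64])
```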
2. Huang Z, Li W, Xia XG, Wang H, Tao R. Task-Wise Sampling Convolutions for Arbitrary-Oriented Object Detection in Aerial Images. IEEE Transactions on Neural Networks and Learning Systems 2025;36:5204-5218. PMID: 38412084. DOI: 10.1109/tnnls.2024.3367331.
Abstract
Arbitrary-oriented object detection (AOOD) has been widely applied to locate and classify objects with diverse orientations in remote sensing images. However, inconsistent features for the localization and classification tasks in AOOD models may lead to ambiguity and low-quality object predictions, which constrains detection performance. In this article, an AOOD method called task-wise sampling convolutions (TS-Conv) is proposed. TS-Conv adaptively samples task-wise features from their respective sensitive regions and maps these features together in alignment to guide a dynamic label assignment for better predictions. Specifically, the sampling positions of the localization convolution in TS-Conv are supervised by the oriented bounding box (OBB) prediction associated with spatial coordinates, while the sampling positions and convolutional kernels of the classification convolution are adaptively adjusted according to different orientations to improve the orientation robustness of the features. Furthermore, a dynamic task-consistent-aware label assignment (DTLA) strategy is developed to select optimal candidate positions and assign labels dynamically according to ranked task-aware scores obtained from TS-Conv. Extensive experiments on several public datasets covering multiple scenes, multimodal images, and multiple categories of objects demonstrate the effectiveness, scalability, and superior performance of the proposed TS-Conv.
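As a rough illustration of ranked, task-aware label assignment in the spirit of DTLA, the sketch below combines per-candidate classification and localization quality into a single score and keeps the top-ranked positions as positives; the scoring formula, the top-k rule, and all names are assumptions rather than the paper's exact strategy.

```python
# Hedged sketch of ranked, task-aware label assignment; illustrative only.
import torch

def toy_task_aware_assign(cls_scores, ious, topk=9, alpha=0.5):
    """cls_scores, ious: (num_candidates,) scores for one ground-truth box.
    Returns a boolean mask of candidates selected as positives."""
    task_score = cls_scores ** alpha * ious ** (1 - alpha)   # joint cls/loc quality
    k = min(topk, task_score.numel())
    keep = torch.topk(task_score, k=k).indices               # ranked selection
    mask = torch.zeros_like(task_score, dtype=torch.bool)
    mask[keep] = True
    return mask

mask = toy_task_aware_assign(torch.rand(100), torch.rand(100))
print(mask.sum())   # number of positions assigned to this object
```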
3. Fan J, Huang L, Gong C, You Y, Gan M, Wang Z. KMT-PLL: K-Means Cross-Attention Transformer for Partial Label Learning. IEEE Transactions on Neural Networks and Learning Systems 2025;36:2789-2800. PMID: 38194387. DOI: 10.1109/tnnls.2023.3347792.
Abstract
Partial label learning (PLL) studies the problem of instance classification when each training instance is annotated with a set of candidate labels of which only one is correct. While recent works have demonstrated that the Vision Transformer (ViT) achieves good results when trained on clean data, its application to PLL remains limited and challenging. To address this issue, we rethink the relationship between instances and object queries and propose the K-means cross-attention transformer for PLL (KMT-PLL), which continuously learns cluster centers that can be used for downstream disambiguation. More specifically, K-means cross-attention, viewed as a clustering process, can effectively learn cluster centers that represent the label classes. The purpose of this operation is to make the similarity between instances and labels measurable, which helps detect noisy labels. Furthermore, we propose a new corrected cross-entropy formulation, which assigns weights to candidate labels according to instance-to-label relevance to guide the training of the instance classifier. As training proceeds, the ground-truth label is progressively identified, and the refined labels and cluster centers in turn help to improve the classifier. Simulation results demonstrate the advantage of KMT-PLL and its suitability for PLL.
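A minimal sketch of a candidate-weighted cross entropy in this spirit follows; the similarity-based weighting against learned cluster centers and every name in the snippet are assumptions for illustration, not the paper's exact corrected formulation.

```python
# Hedged sketch: weight candidate labels by instance-to-center relevance.
import torch
import torch.nn.functional as F

def toy_corrected_ce(logits, embeddings, centers, candidate_mask):
    """logits: (B, C) classifier outputs; embeddings: (B, D) instance features;
    centers: (C, D) learned cluster centers; candidate_mask: (B, C), 1 for candidates."""
    sim = embeddings @ centers.t()                        # (B, C) instance-to-label relevance
    sim = sim.masked_fill(candidate_mask == 0, float('-inf'))
    weights = sim.softmax(dim=-1)                         # weights concentrated on candidate labels
    log_probs = F.log_softmax(logits, dim=-1)
    return -(weights * log_probs).sum(dim=-1).mean()

B, C, D = 4, 10, 32
mask = (torch.rand(B, C) > 0.7).float()
mask[torch.arange(B), torch.randint(0, C, (B,))] = 1.0    # ensure each row has a candidate
loss = toy_corrected_ce(torch.randn(B, C), torch.randn(B, D), torch.randn(C, D), mask)
print(loss)
```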
4. Xu Z, Xu W, Wang R, Chen J, Qi C, Lukasiewicz T. Hybrid Reinforced Medical Report Generation With M-Linear Attention and Repetition Penalty. IEEE Transactions on Neural Networks and Learning Systems 2025;36:2206-2220. PMID: 38145508. DOI: 10.1109/tnnls.2023.3343391.
Abstract
To reduce doctors' workload, deep-learning-based automatic medical report generation has recently attracted growing research effort, where deep convolutional neural networks (CNNs) encode the input images and recurrent neural networks (RNNs) decode the visual features into medical reports. However, state-of-the-art methods mainly suffer from three shortcomings: 1) incomprehensive optimization; 2) low-order and unidimensional attention; and 3) repeated generation. In this article, we propose a hybrid reinforced medical report generation method with m-linear attention and a repetition penalty mechanism (HReMRG-MR) to overcome these problems. Specifically, a hybrid reward with different weights remedies the limitations of single-metric-based rewards, and a local optimal weight search algorithm is proposed to reduce the complexity of searching the reward weights from exponential to linear. Furthermore, we use m-linear attention modules to learn multidimensional high-order feature interactions and achieve multimodal reasoning, while a new repetition penalty adaptively penalizes repeated terms during training. Extensive experiments on two public benchmark datasets show that HReMRG-MR greatly outperforms state-of-the-art baselines on all metrics. The effectiveness and necessity of all components of HReMRG-MR are also verified by ablation studies. Additional experiments further demonstrate that the proposed local optimal weight search algorithm significantly reduces the search time while maintaining superior medical report generation performance.
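To make the claimed exponential-to-linear reduction concrete, here is a hedged sketch of a greedy, coordinate-wise weight search over reward metrics: sweeping one metric's weight at a time keeps the cost linear in the grid size instead of exponential over the joint grid. The metric names, grid, and stand-in evaluation function are assumptions, and the paper's actual search procedure may differ.

```python
# Greedy, coordinate-wise weight search for a hybrid reward (illustrative).
def greedy_weight_search(metrics, candidate_weights, evaluate):
    """metrics: list of metric names; candidate_weights: values tried per metric;
    evaluate(weights_dict) -> validation score (assumed to be supplied by the caller)."""
    weights = {m: candidate_weights[0] for m in metrics}
    for m in metrics:                             # fix the others, sweep one metric at a time
        best_w, best_score = weights[m], float('-inf')
        for w in candidate_weights:
            trial = dict(weights, **{m: w})
            score = evaluate(trial)
            if score > best_score:
                best_w, best_score = w, score
        weights[m] = best_w
    return weights

# Toy evaluation: pretend the best mixture is BLEU-heavy.
target = {'BLEU': 0.6, 'CIDEr': 0.3, 'METEOR': 0.1}
evaluate = lambda w: -sum((w[m] - target[m]) ** 2 for m in target)
print(greedy_weight_search(list(target), [0.0, 0.1, 0.3, 0.6, 1.0], evaluate))
```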
5. Liu J, Tan H, Hu Y, Sun Y, Wang H, Yin B. Global and Local Interactive Perception Network for Referring Image Segmentation. IEEE Transactions on Neural Networks and Learning Systems 2024;35:17754-17767. PMID: 37695953. DOI: 10.1109/tnnls.2023.3308550.
Abstract
Effective cross-modal fusion and perception between language and image are necessary for inferring the referred instance in the referring image segmentation (RIS) task. In this article, we propose a novel RIS network, the global and local interactive perception network (GLIPN), to enhance the quality of modal fusion between language and image from both local and global perspectives. The core of GLIPN is the global and local interactive perception (GLIP) scheme, which contains a local perception module (LPM) and a global perception module (GPM). The LPM enhances local modal fusion through the correspondence between words and local image semantics. The GPM injects the global structured semantics of the image into the fusion process, which better guides the word embeddings to perceive the image's global structure. Benefiting from this fusion of local and global context semantics, GLIPN outperforms most state-of-the-art approaches in extensive experiments on several benchmark datasets.
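The word-to-region correspondence underlying such a local perception module can be illustrated with a plain cross-attention layer; the sketch below uses assumed shapes and an off-the-shelf attention module, and is not GLIPN's actual LPM.

```python
# Illustrative word-to-region cross attention; shapes and module choices are assumptions.
import torch
import torch.nn as nn

class ToyWordRegionAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, words, regions):
        # words: (B, L, C) language tokens; regions: (B, HW, C) flattened image features
        fused, weights = self.attn(query=words, key=regions, value=regions)
        return fused, weights        # each word aggregates the image regions it refers to

attn = ToyWordRegionAttention(256)
fused, w = attn(torch.randn(2, 12, 256), torch.randn(2, 196, 256))
print(fused.shape, w.shape)          # (2, 12, 256) (2, 12, 196)
```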
6. Chen G, Wang M, Zhang Q, Yuan L, Yue Y. Full Transformer Framework for Robust Point Cloud Registration With Deep Information Interaction. IEEE Transactions on Neural Networks and Learning Systems 2024;35:13368-13382. PMID: 37163402. DOI: 10.1109/tnnls.2023.3267333.
Abstract
Point cloud registration is an essential technology in computer vision and robotics. Recently, transformer-based methods have achieved advanced performance in point cloud registration by exploiting the transformer's order-invariance and ability to model dependencies when aggregating information. However, they still suffer from indistinct feature extraction and sensitivity to noise and outliers, owing to three major limitations: 1) the adoption of CNNs, whose local receptive fields cannot model global relations, yields extracted features that are susceptible to noise; 2) the shallow-wide architecture of the transformers and the lack of positional information lead to indistinct features due to inefficient information interaction; and 3) insufficient consideration of geometric compatibility leads to ambiguous identification of incorrect correspondences. To address these limitations, a novel full transformer network for point cloud registration is proposed, named the deep interaction transformer (DIT), which incorporates: 1) a point cloud structure extractor (PSE) that retrieves structural information and models global relations with a local feature integrator (LFI) and transformer encoders; 2) a deep-narrow point feature transformer (PFT) that facilitates deep information interaction across a pair of point clouds with positional information, so that the transformers establish comprehensive associations and directly learn the relative positions between points; and 3) a geometric matching-based correspondence confidence evaluation (GMCCE) method that measures spatial consistency and estimates correspondence confidence with the designed triangulated descriptor. Extensive experiments on the ModelNet40, ScanObjectNN, and 3DMatch datasets demonstrate that our method precisely aligns point clouds and consequently achieves superior performance compared with state-of-the-art methods. The code is publicly available at https://github.com/CGuangyan-BIT/DIT.
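The intuition behind spatial-consistency-based correspondence confidence can be shown with a much simpler pairwise-distance check than the paper's triangulated descriptor; the snippet below is therefore an assumption-laden stand-in rather than GMCCE itself.

```python
# Simplified spatial-consistency scoring for putative correspondences.
import torch

def toy_correspondence_confidence(src, tgt, sigma=0.1):
    """src, tgt: (N, 3) matched points in the source and target clouds.
    Returns (N,) confidences: a rigid motion preserves pairwise distances."""
    d_src = torch.cdist(src, src)                    # (N, N) distances inside the source
    d_tgt = torch.cdist(tgt, tgt)                    # (N, N) distances inside the target
    agree = torch.exp(-(d_src - d_tgt) ** 2 / (2 * sigma ** 2))
    return agree.mean(dim=1)                         # matches consistent with many others score high

src = torch.rand(64, 3)
tgt = src @ torch.linalg.qr(torch.randn(3, 3))[0] + 0.5    # rotated + translated copy
print(toy_correspondence_confidence(src, tgt).mean())      # close to 1 for a clean rigid motion
```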
7. Ma J, Bai Y, Zhong B, Zhang W, Yao T, Mei T. Visualizing and Understanding Patch Interactions in Vision Transformer. IEEE Transactions on Neural Networks and Learning Systems 2024;35:13671-13680. PMID: 37224360. DOI: 10.1109/tnnls.2023.3270479.
Abstract
The Vision Transformer (ViT) has become a leading tool in various computer vision tasks, owing to its self-attention mechanism, which learns visual representations explicitly through cross-patch information interactions. Despite this success, the literature seldom explores the explainability of ViT, and there is no clear picture of how the attention mechanism, in terms of correlations across patches, impacts performance, or what further potential it holds. In this work, we propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in ViT. Specifically, we first introduce a quantification indicator to measure the impact of patch interaction and verify this quantification on attention-window design and the removal of indiscriminative patches. We then exploit the effective responsive field of each patch in ViT and devise a window-free transformer (WinfT) architecture accordingly. Extensive experiments on ImageNet demonstrate that the quantitative method facilitates ViT model learning, improving top-1 accuracy by up to 4.28%. More remarkably, results on downstream fine-grained recognition tasks further validate the generalization of our proposal.
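One simple way to quantify patch interactions from attention maps is averaged attention influence, as sketched below; the averaging rule, the bottom-k pruning step, and all names are illustrative assumptions rather than the paper's indicator.

```python
# Hedged sketch of quantifying patch-to-patch interaction from ViT attention maps.
import torch

def toy_patch_interaction(attn_maps):
    """attn_maps: (layers, heads, N, N) softmaxed attention of a ViT for one image.
    Returns an (N, N) interaction matrix and an (N,) per-patch importance score."""
    interaction = attn_maps.mean(dim=(0, 1))          # average influence of patch j on patch i
    importance = interaction.sum(dim=0)               # how much each patch feeds the others
    return interaction, importance

attn = torch.rand(12, 6, 197, 197).softmax(dim=-1)    # fake ViT-B attention (CLS + 14x14 patches)
inter, imp = toy_patch_interaction(attn)
weak = imp.topk(20, largest=False).indices            # candidates for indiscriminative-patch removal
print(inter.shape, weak.shape)
```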
8. Wang H, Wang W, Li W, Liu H. Dense captioning and multidimensional evaluations for indoor robotic scenes. Front Neurorobot 2023;17:1280501. PMID: 38034836. PMCID: PMC10682356. DOI: 10.3389/fnbot.2023.1280501.
Abstract
The field of human-computer interaction is expanding, especially within the domain of intelligent technologies. Scene understanding, which entails generating high-level semantic descriptions from scene content, is crucial for effective interaction, yet it remains a significant challenge. This study introduces RGBD2Cap, an innovative method that uses RGBD images for scene semantic description. We employ a multimodal fusion module to integrate RGB and depth information and extract multi-level features, and the method further incorporates a target detection and region proposal network and a top-down attention LSTM network to generate the semantic descriptions. The experimental data are derived from the ScanRefer indoor scene dataset, with RGB and depth images rendered from ScanNet's 3D scenes serving as the model's input. The method outperforms the DenseCap network on several metrics, including BLEU, CIDEr, and METEOR. Ablation studies confirm the essential role of the RGBD fusion module in the method's success. Furthermore, the practical applicability of the method was verified within the AI2-THOR embodied-intelligence experimental environment, showcasing its reliability.
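A minimal sketch of the kind of RGB-depth feature fusion such a module performs is shown below; the gated design, channel sizes, and names are assumptions, not RGBD2Cap's actual fusion block.

```python
# Minimal gated RGB-depth feature fusion block (illustrative assumption).
import torch
import torch.nn as nn

class ToyRGBDFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat, depth_feat):          # both: (B, C, H, W)
        x = torch.cat([rgb_feat, depth_feat], dim=1)
        g = self.gate(x)                              # per-pixel balance between the two modalities
        return g * rgb_feat + (1 - g) * depth_feat + self.proj(x)

fuse = ToyRGBDFusion(256)
out = fuse(torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28))
print(out.shape)                                      # torch.Size([1, 256, 28, 28])
```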
Affiliation(s)
- Hua Wang: Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, Shenzhen, China; School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
- Wenshuai Wang: Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, Shenzhen, China
- Wenhao Li: Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, Shenzhen, China
- Hong Liu: Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, Shenzhen, China
9. Li L, Zhi T, Shi G, Yang Y, Xu L, Li Y, Guo Y. Anchor-based Knowledge Embedding for Image Aesthetics Assessment. Neurocomputing 2023. DOI: 10.1016/j.neucom.2023.03.058.
10. Si T, He F, Li P, Gao X. Tri-modality consistency optimization with heterogeneous augmented images for visible-infrared person re-identification. Neurocomputing 2023. DOI: 10.1016/j.neucom.2022.12.042.
11. Cao X, Li C, Feng J, Jiao L. Semi-supervised feature learning for disjoint hyperspectral imagery classification. Neurocomputing 2023. DOI: 10.1016/j.neucom.2023.01.054.