1. Krishnan C, Onuoha E, Hung A, Sung KH, Kim H. Multi-attention Mechanism for Enhanced Pseudo-3D Prostate Zonal Segmentation. Journal of Imaging Informatics in Medicine 2025. [PMID: 40021566] [DOI: 10.1007/s10278-025-01401-0]
Abstract
This study presents a novel pseudo-3D Global-Local Channel Spatial Attention (GLCSA) mechanism designed to enhance prostate zonal segmentation in high-resolution T2-weighted MRI images. GLCSA captures complex, multi-dimensional features while maintaining computational efficiency by integrating global and local attention in channel and spatial domains, complemented by a slice interaction module simulating 3D processing. Applied across various U-Net architectures, GLCSA was evaluated on two datasets: a proprietary set of 44 patients and the public ProstateX dataset of 204 patients. Performance, measured using the Dice Similarity Coefficient (DSC) and Mean Surface Distance (MSD) metrics, demonstrated significant improvements in segmentation accuracy for both the transition zone (TZ) and peripheral zone (PZ), with minimal parameter increase (1.27%). GLCSA achieved DSC increases of 0.74% and 11.75% for TZ and PZ, respectively, in the proprietary dataset. In the ProstateX dataset, improvements were even more pronounced, with DSC increases of 7.34% for TZ and 24.80% for PZ. Comparative analysis showed GLCSA-UNet performing competitively against other 2D, 2.5D, and 3D models, with DSC values of 0.85 (TZ) and 0.65 (PZ) on the proprietary dataset and 0.80 (TZ) and 0.76 (PZ) on the ProstateX dataset. Similarly, MSD values were 1.14 (TZ) and 1.21 (PZ) on the proprietary dataset and 1.48 (TZ) and 0.98 (PZ) on the ProstateX dataset. Ablation studies highlighted the effectiveness of combining channel and spatial attention and the advantages of global embedding over patch-based methods. In conclusion, GLCSA offers a robust balance between the detailed feature capture of 3D models and the efficiency of 2D models, presenting a promising tool for improving prostate MRI image segmentation.
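
The module below is not the authors' GLCSA implementation; it is a minimal PyTorch sketch of the general pattern the abstract describes, namely a channel gate computed from a global (whole-feature-map) embedding followed by a spatial gate, applied to one 2D slice. The class name, reduction ratio, and tensor shapes are illustrative assumptions.

```python
# Minimal sketch (not the authors' GLCSA code): channel attention from a global
# embedding plus spatial attention, applied to a 2D slice feature map.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel branch: global average pooling -> bottleneck MLP -> per-channel gate.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial branch: channel-pooled maps -> conv -> per-pixel gate.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, _, _ = x.shape
        gap = x.mean(dim=(2, 3))               # global embedding, (B, C)
        channel_gate = torch.sigmoid(self.channel_mlp(gap)).view(b, c, 1, 1)
        x = x * channel_gate
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
        spatial_gate = torch.sigmoid(self.spatial_conv(pooled))
        return x * spatial_gate

feats = torch.randn(2, 64, 96, 96)             # e.g. U-Net encoder features for one slice
print(ChannelSpatialAttention(64)(feats).shape)
```
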
Affiliation(s)
- Chetana Krishnan: Department of Biomedical Engineering, The University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Ezinwanne Onuoha: Department of Biomedical Engineering, The University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Alex Hung: Department of Radiology, The University of California Los Angeles, Los Angeles, CA 90404, USA
- Kyung Hyun Sung: Department of Radiology, The University of California Los Angeles, Los Angeles, CA 90404, USA
- Harrison Kim: Department of Radiology, The University of Alabama at Birmingham, 1720 2nd Avenue South, VH G082, Birmingham, AL 35294, USA

2. Li Y, Qi T, Ma Z, Quan D, Miao Q. Seeking a Hierarchical Prototype for Multimodal Gesture Recognition. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:198-209. [PMID: 37494175] [DOI: 10.1109/tnnls.2023.3295811]
Abstract
Gesture recognition has drawn considerable attention from many researchers owing to its wide range of applications. Although significant progress has been made in this field, previous works typically focus on how to distinguish between different gesture classes, ignoring the influence of inner-class divergence caused by gesture-irrelevant factors. Meanwhile, for multimodal gesture recognition, feature or score fusion in the final stage is a general choice to combine the information of different modalities. Consequently, the gesture-relevant features in different modalities may be redundant, whereas the complementarity of modalities is not exploited sufficiently. To handle these problems, in this article we propose a hierarchical gesture prototype framework to highlight gesture-relevant features such as poses and motions. This framework consists of a sample-level prototype and a modal-level prototype. The sample-level gesture prototype is established with the structure of a memory bank, which avoids the distraction of gesture-irrelevant factors in each sample, such as the illumination, background, and the performers' appearances. The modal-level prototype is then obtained via a generative adversarial network (GAN)-based subnetwork, in which the modal-invariant features are extracted and pulled together. Meanwhile, the modal-specific attribute features are used to synthesize the features of other modalities, and the circulation of modality information helps to leverage their complementarity. Extensive experiments on three widely used gesture datasets demonstrate that our method is effective in highlighting gesture-relevant features and can outperform the state-of-the-art methods.
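
As a rough illustration of the sample-level prototype idea (a memory bank that averages away gesture-irrelevant factors), the sketch below keeps one exponential-moving-average prototype per class and penalizes the distance of each sample's feature to its class prototype. It is a hypothetical simplification, not the paper's code; the class name, momentum value, and loss form are assumptions.

```python
# Minimal sketch (not the paper's code): a sample-level prototype memory bank that
# keeps an EMA prototype per gesture class and penalizes distance to it, so that
# gesture-irrelevant variation (illumination, background, performer) is averaged out.
import torch
import torch.nn.functional as F

class PrototypeBank:
    def __init__(self, num_classes, dim, momentum=0.9):
        self.protos = torch.zeros(num_classes, dim)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, feats, labels):           # feats: (B, D), labels: (B,)
        for c in labels.unique():
            mean_c = feats[labels == c].mean(dim=0)
            self.protos[c] = self.momentum * self.protos[c] + (1 - self.momentum) * mean_c

    def loss(self, feats, labels):
        # Pull each normalized feature toward its class prototype.
        return F.mse_loss(F.normalize(feats, dim=1),
                          F.normalize(self.protos[labels], dim=1))

bank = PrototypeBank(num_classes=20, dim=128)
f, y = torch.randn(8, 128), torch.randint(0, 20, (8,))
bank.update(f, y)
print(bank.loss(f, y))
```
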

3. Wang M, Xing J, Mei J, Liu Y, Jiang Y. ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:625-637. [PMID: 37988204] [DOI: 10.1109/tnnls.2023.3331841]
Abstract
The canonical approach to video action recognition requires a neural network model to perform a classic and standard 1-of-N majority-vote task. Such models are trained to predict a fixed set of predefined categories, limiting their transferability to new datasets with unseen concepts. In this article, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameter requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, adapt and fine-tune." This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then, it makes the action recognition task act more like the pre-training problem via adaptation engineering. Finally, it is fine-tuned end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on general action recognition tasks, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git.
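
The snippet below sketches the video-text matching formulation with toy stand-in encoders (not the released ActionCLIP model or the CLIP backbones): a clip is classified zero-shot by cosine similarity between its embedding and the embeddings of prompted label texts. All module names and dimensions are illustrative.

```python
# Minimal sketch of the video-text matching idea behind ActionCLIP (placeholder
# encoders, not the released model): classify a clip by cosine similarity between
# its embedding and the embeddings of prompted label texts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVideoEncoder(nn.Module):              # stands in for a CLIP-style visual backbone
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(3 * 8 * 32 * 32, dim)
    def forward(self, clip):                   # clip: (B, 3, T, H, W)
        return F.normalize(self.proj(clip.flatten(1)), dim=-1)

class ToyTextEncoder(nn.Module):               # stands in for a CLIP-style text encoder
    def __init__(self, vocab=1000, dim=512):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, dim)
    def forward(self, token_ids):              # token_ids: (num_labels, L)
        return F.normalize(self.emb(token_ids), dim=-1)

video_enc, text_enc = ToyVideoEncoder(), ToyTextEncoder()
clips = torch.randn(4, 3, 8, 32, 32)
prompts = torch.randint(0, 1000, (400, 12))    # e.g. "a video of a person <label>"
logits = 100.0 * video_enc(clips) @ text_enc(prompts).T   # (B, num_labels)
print(logits.argmax(dim=1))                    # zero-shot predictions over label texts
```
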

4. Wang Q, Hu Q, Gao Z, Li P, Hu Q. AMS-Net: Modeling Adaptive Multi-Granularity Spatio-Temporal Cues for Video Action Recognition. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:18731-18745. [PMID: 37824318] [DOI: 10.1109/tnnls.2023.3321141]
Abstract
Effective spatio-temporal modeling as a core of video representation learning is challenged by complex scale variations in spatio-temporal cues in videos, especially the different visual tempos of actions and the varying spatial sizes of moving objects. Most existing works handle complex spatio-temporal scale variations with input-level or feature-level pyramid mechanisms, which, however, rely on expensive multistream architectures or explore multiscale spatio-temporal features in a fixed manner. To effectively capture complex scale dynamics of spatio-temporal cues in an efficient way, this article proposes a single-stream, single-input architecture, the adaptive multi-granularity spatio-temporal network (AMS-Net), to model adaptive multi-granularity spatio-temporal cues for video action recognition. To this end, AMS-Net introduces two core components, namely a competitive progressive temporal modeling (CPTM) block and a collaborative spatio-temporal pyramid (CSTP) module. They respectively capture fine-grained temporal cues and fuse coarse-level spatio-temporal features in an adaptive manner. This design allows AMS-Net to handle subtle variations in visual tempos and fair-sized spatio-temporal dynamics within a unified architecture. Note that AMS-Net can be flexibly instantiated from existing deep convolutional neural networks (CNNs) with the proposed CPTM block and CSTP module. Experiments are conducted on eight video benchmarks, and the results show that AMS-Net establishes state-of-the-art (SOTA) performance on fine-grained action recognition (i.e., Diving48 and FineGym) while performing very competitively on the widely used Something-Something and Kinetics benchmarks.

5. Li G, Cheng D, Ding X, Wang N, Li J, Gao X. Weakly Supervised Temporal Action Localization With Bidirectional Semantic Consistency Constraint. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:13032-13045. [PMID: 37134038] [DOI: 10.1109/tnnls.2023.3266062]
Abstract
Weakly supervised temporal action localization (WTAL) aims to classify actions and localize their temporal boundaries in a video, given only video-level category labels in the training datasets. Due to the lack of boundary information during training, existing approaches formulate WTAL as a classification problem, i.e., generating the temporal class activation map (T-CAM) for localization. However, with only a classification loss, the model is suboptimized, i.e., the action-related scenes are enough to distinguish different class labels. Regarding other actions in the action-related scene (i.e., the same scene as the positive actions) as co-scene actions, this suboptimized model misclassifies the co-scene actions as positive actions. To address this misclassification, we propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions. The proposed Bi-SCC first adopts a temporal context augmentation to generate an augmented video that breaks the inter-video correlation between positive actions and their co-scene actions. Then, a semantic consistency constraint (SCC) is used to enforce the predictions of the original video and augmented video to be consistent, hence suppressing the co-scene actions. However, we find that this augmentation can destroy the original temporal context. Simply applying the consistency constraint would affect the completeness of localized positive actions. Hence, we boost the SCC in a bidirectional way to suppress co-scene actions while ensuring the integrity of positive actions, by cross-supervising the original and augmented videos. Finally, our proposed Bi-SCC can be applied to current WTAL approaches and improve their performance. Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet. The code is available at https://github.com/lgzlIlIlI/BiSCC.
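
A minimal sketch of the bidirectional consistency idea, under the assumption that both branches output snippet-level T-CAMs: each branch's prediction is treated as a fixed target for the other. The tensor shapes and the choice of KL divergence are illustrative, not the authors' exact loss.

```python
# Minimal sketch (hypothetical tensors, not the authors' code) of a bidirectional
# consistency constraint: the temporal class activation maps (T-CAMs) predicted for
# the original clip and for a context-augmented clip cross-supervise each other.
import torch
import torch.nn.functional as F

def bidirectional_consistency(tcam_orig, tcam_aug):
    # tcam_*: (B, T, num_classes) unnormalized class activation scores per snippet.
    p_orig = F.log_softmax(tcam_orig, dim=-1)
    p_aug = F.log_softmax(tcam_aug, dim=-1)
    # Each direction treats the other branch's prediction as a fixed target.
    loss_a = F.kl_div(p_aug, p_orig.detach().exp(), reduction='batchmean')
    loss_b = F.kl_div(p_orig, p_aug.detach().exp(), reduction='batchmean')
    return loss_a + loss_b

orig = torch.randn(2, 50, 20, requires_grad=True)
aug = torch.randn(2, 50, 20, requires_grad=True)
print(bidirectional_consistency(orig, aug))
```
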

6. Liu Y, Zhang Y, Wang Y, Hou F, Yuan J, Tian J, Zhang Y, Shi Z, Fan J, He Z. A Survey of Visual Transformers. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:7478-7498. [PMID: 37015131] [DOI: 10.1109/tnnls.2022.3227717]
Abstract
Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by such significant achievements, some pioneering works have recently been done on employing Transformer-like architectures in the computer vision (CV) field, which have demonstrated their effectiveness on three fundamental CV tasks (classification, detection, and segmentation) as well as multiple sensory data streams (images, point clouds, and vision-language data). Because of their competitive modeling capabilities, visual Transformers have achieved impressive performance improvements over multiple benchmarks as compared with modern convolutional neural networks (CNNs). In this survey, we have comprehensively reviewed over 100 different visual Transformers according to three fundamental CV tasks and different data stream types, and we propose a taxonomy to organize the representative methods according to their motivations, structures, and application scenarios. Because of their differences in training settings and dedicated vision tasks, we have also evaluated and compared all these existing visual Transformers under different configurations. Furthermore, we have revealed a series of essential but unexploited aspects that may empower such visual Transformers to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between the visual Transformers and the sequential ones. Finally, two promising research directions are suggested for future investigation. We will continue to update the latest articles and their released source code at https://github.com/liuyang-ict/awesome-visual-transformers.
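
For orientation, the block below is a minimal single-head scaled dot-product self-attention layer, the core operation shared by the visual Transformers reviewed in this survey; the shapes and the toy patch tokens are illustrative only.

```python
# Minimal single-head self-attention sketch over image patch tokens.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, tokens):                 # tokens: (B, N, D), e.g. image patches
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, N, N)
        return self.out(attn @ v)

patches = torch.randn(2, 196, 64)              # 14x14 patch tokens of dimension 64
print(SelfAttention(64)(patches).shape)
```
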

7. Song Q, Li J, Guo H, Huang R. Denoised Non-Local Neural Network for Semantic Segmentation. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:7162-7174. [PMID: 37021852] [DOI: 10.1109/tnnls.2022.3214216]
Abstract
The non-local (NL) network has become a widely used technique for semantic segmentation, which computes an attention map to measure the relationships of each pixel pair. However, most of the current popular NL models tend to ignore the phenomenon that the calculated attention map appears to be very noisy, containing interclass and intraclass inconsistencies, which lowers the accuracy and reliability of the NL methods. In this article, we figuratively denote these inconsistencies as attention noises and explore solutions to denoise them. Specifically, we propose a denoised NL network, which consists of two primary modules, i.e., the global rectifying (GR) block and the local retention (LR) block, to eliminate the interclass and intraclass noises, respectively. First, GR adopts the class-level predictions to capture a binary map that indicates whether two selected pixels belong to the same category. Second, LR captures the ignored local dependencies and further uses them to rectify the unwanted hollows in the attention map. The experimental results on two challenging semantic segmentation datasets demonstrate the superior performance of our model. Without any external training data, our proposed denoised NL achieves state-of-the-art performance of 83.5% and 46.69% mean of classwise intersection over union (mIoU) on Cityscapes and ADE20K, respectively.
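
The sketch below illustrates the global rectifying idea in isolation (hypothetical shapes, not the authors' code): coarse per-pixel class predictions yield a binary same-class map that keeps only intra-class pixel pairs in the non-local attention map.

```python
# Minimal sketch of the global rectifying idea: a binary map built from pixel-level
# class predictions keeps only same-class pixel pairs in the non-local attention map.
import torch

def rectify_attention(attn, class_logits):
    # attn: (B, N, N) raw non-local affinities over N = H*W pixels
    # class_logits: (B, N, num_classes) coarse per-pixel class predictions
    pred = class_logits.argmax(dim=-1)                     # (B, N)
    same_class = (pred.unsqueeze(2) == pred.unsqueeze(1))  # (B, N, N) binary map
    attn = attn.masked_fill(~same_class, float('-inf'))    # drop inter-class pairs
    return torch.softmax(attn, dim=-1)

attn = torch.randn(1, 64, 64)
logits = torch.randn(1, 64, 19)
print(rectify_attention(attn, logits).shape)
```
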

8. Liu T, Zhao R, Jia W, Lam KM, Kong J. Holistic-Guided Disentangled Learning With Cross-Video Semantics Mining for Concurrent First-Person and Third-Person Activity Recognition. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:5211-5225. [PMID: 36094995] [DOI: 10.1109/tnnls.2022.3202835]
Abstract
The popularity of wearable devices has increased the demand for research on first-person activity recognition. However, most current first-person activity datasets are built on the assumption that only the human-object interaction (HOI) activities performed by the camera-wearer are captured in the field of view. Since humans live in complicated scenarios, third-person activities performed by other people are likely to appear in addition to the first-person activities. Analyzing and recognizing these two types of activities simultaneously occurring in a scene is important for the camera-wearer to understand the surrounding environments. To facilitate the research on concurrent first- and third-person activity recognition (CFT-AR), we first created a new activity dataset, namely PolyU concurrent first- and third-person (CFT) Daily, which exhibits distinct properties and challenges compared with previous activity datasets. Since temporal asynchronism and an appearance gap usually exist between the first- and third-person activities, it is crucial to learn robust representations from all the activity-related spatio-temporal positions. Thus, we explore both holistic scene-level and local instance-level (person-level) features to provide comprehensive and discriminative patterns for recognizing both first- and third-person activities. On the one hand, the holistic scene-level features are extracted by a 3-D convolutional neural network, which is trained to mine shared and sample-unique semantics between video pairs, via two well-designed attention-based modules and a self-knowledge distillation (SKD) strategy. On the other hand, we further leverage the extracted holistic features to guide the learning of instance-level features in a disentangled fashion, which aims to discover both spatially conspicuous patterns and temporally varied, yet critical, cues. Experimental results on the PolyU CFT Daily dataset validate that our method achieves state-of-the-art performance.

9. Lei Q, Lu F. Global domain adaptation attention with data-dependent regulator for scene segmentation. PLoS One 2024; 19:e0295263. [PMID: 38354116] [PMCID: PMC10866527] [DOI: 10.1371/journal.pone.0295263]
Abstract
Most semantic segmentation works have obtained accurate segmentation results by exploring contextual dependencies. However, there are several major limitations that need further investigation. For example, most approaches rarely distinguish different types of contextual dependencies, which may pollute the scene understanding. Moreover, local convolutions are commonly used in deep learning models to learn attention and capture local patterns in the data. These convolutions operate on a small neighborhood of the input, focusing on nearby information and disregarding global structural patterns. To address these concerns, we propose a Global Domain Adaptation Attention with Data-Dependent Regulator (GDAAR) method to explore the contextual dependencies. Specifically, to effectively capture both the global distribution information and local appearance details, we suggest using a stacked relation approach. This involves incorporating the feature node itself and its pairwise affinities with all other feature nodes within the network, arranged in raster scan order. By doing so, we can learn a global domain adaptation attention mechanism. Meanwhile, to improve the similarity of features belonging to the same segment region while keeping the discriminative power of features belonging to different segments, we design a data-dependent regulator to adjust the global domain adaptation attention on the feature map during inference. Extensive ablation studies demonstrate that our GDAAR better captures the global distribution information for the contextual dependencies and achieves state-of-the-art performance on several popular benchmarks.
Affiliation(s)
- Qiuyuan Lei: School of Economics and Management, University of Science and Technology Beijing, Beijing, China
- Fei Lu: Institute of Information Engineering, Nanyang Vocational College of Agriculture, Nanyang, China

10. Zhou XH, Xie XL, Liu SQ, Ni ZL, Zhou YJ, Li RQ, Gui MJ, Fan CC, Feng ZQ, Bian GB, Hou ZG. Learning Skill Characteristics From Manipulations. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:9727-9741. [PMID: 35333726] [DOI: 10.1109/tnnls.2022.3160159]
Abstract
Percutaneous coronary intervention (PCI) has increasingly become the main treatment for coronary artery disease. The procedure requires highly experienced skills and dexterous manipulation. However, few techniques exist so far to model PCI skill. In this study, a learning framework with local and ensemble learning is proposed to learn the skill characteristics of subjects at different skill levels from their PCI manipulations. Ten interventional cardiologists (four experts and six novices) were recruited to deliver a medical guidewire to two target arteries on a porcine model for in vivo studies. Simultaneously, translation and twist manipulations of the thumb, forefinger, and wrist were acquired with electromagnetic (EM) and fiber-optic bend (FOB) sensors, respectively. These behavior data are then processed with wavelet packet decomposition (WPD) at levels 1-10 for feature extraction. The feature vectors are further fed into three candidate individual classifiers in the local learning layer. Furthermore, the local learning results from different manipulation behaviors are fused in the ensemble learning layer with three rule-based ensemble learning algorithms. In subject-dependent skill characteristics learning, the ensemble learning achieves 100% accuracy, significantly outperforming the best local result (90%). Furthermore, ensemble learning maintains 73% accuracy in subject-independent schemes. These promising results demonstrate the great potential of the proposed method to facilitate skill learning in surgical robotics and skill assessment in clinical practice.
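
A minimal sketch of the described pipeline on synthetic data (not the study's code or data): wavelet-packet energy features are extracted per manipulation signal, one local classifier is trained per behavior modality, and the local predictions are fused by a simple rule-based vote. Signal lengths, the wavelet, and the classifiers are assumptions.

```python
# Minimal sketch: WPD energy features per manipulation channel, one local classifier
# per behavior modality, and a majority-vote ensemble (synthetic data).
import numpy as np
import pywt
from sklearn.svm import SVC

def wpd_energy_features(signal, level=4, wavelet='db4'):
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, mode='symmetric', maxlevel=level)
    return np.array([np.sum(node.data ** 2) for node in wp.get_level(level, order='natural')])

rng = np.random.default_rng(0)
n_trials, sig_len = 40, 512
labels = rng.integers(0, 2, n_trials)                      # 0 = novice, 1 = expert
modalities = {'translation': rng.standard_normal((n_trials, sig_len)),
              'twist': rng.standard_normal((n_trials, sig_len))}

local_preds = []
for name, signals in modalities.items():                   # local learning layer
    X = np.stack([wpd_energy_features(s) for s in signals])
    clf = SVC().fit(X, labels)
    local_preds.append(clf.predict(X))

ensemble = (np.mean(local_preds, axis=0) >= 0.5).astype(int)  # rule-based fusion (majority)
print('ensemble training accuracy:', (ensemble == labels).mean())
```
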

11. Han H, Zhang Q, Li F, Du Y. Spatial oblivion channel attention targeting intra-class diversity feature learning. Neural Networks 2023; 167:10-21. [PMID: 37619510] [DOI: 10.1016/j.neunet.2023.07.032]
Abstract
Convolutional neural networks (CNNs) have successfully driven many visual recognition tasks, including image classification. However, when dealing with classification tasks with intra-class sample style diversity, the network tends to be disturbed by more diverse features, resulting in limited feature learning. In this article, a spatial oblivion channel attention (SOCA) for intra-class diversity feature learning is proposed. Specifically, SOCA performs spatial structure oblivion in a progressive regularization for each channel after convolution, so that the network is not restricted to limited feature learning and pays attention to more regionally detailed features. Further, SOCA reassigns channel weights in the progressively oblivious feature space from top to bottom along the channel direction, to ensure the network learns more image details in an orderly manner while not falling into feature redundancy. Experiments are conducted on the standard classification datasets CIFAR-10/100 and two garbage-classification datasets with diverse intra-class styles. SOCA improves SqueezeNet, MobileNet, BN-VGG-19, Inception and ResNet-50 in classification accuracy by 1.31%, 1.18%, 1.57%, 2.09% and 2.27% on average, respectively. The feasibility and effectiveness of intra-class diversity feature learning in SOCA-enhanced networks are verified. Besides, the class activation maps show that more local detail feature regions are activated by adding the SOCA module, which also demonstrates the interpretability of the method for intra-class diversity feature learning.
Affiliation(s)
- Honggui Han: Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China; Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing 100124, China; Engineering Research Center of Digital Community, Ministry of Education, Beijing University of Technology, Beijing 100124, China; Beijing Artificial Intelligence Institute and Beijing Laboratory for Intelligent Environmental Protection, Beijing 100124, China
- Qiyu Zhang: Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China; Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing 100124, China
- Fangyu Li: Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China; Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing 100124, China; Engineering Research Center of Digital Community, Ministry of Education, Beijing University of Technology, Beijing 100124, China; Beijing Artificial Intelligence Institute and Beijing Laboratory for Intelligent Environmental Protection, Beijing 100124, China
- Yongping Du: Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China; Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing University of Technology, Beijing 100124, China

12. Ullah H, Munir A. Human Activity Recognition Using Cascaded Dual Attention CNN and Bi-Directional GRU Framework. Journal of Imaging 2023; 9:130. [PMID: 37504807] [PMCID: PMC10381293] [DOI: 10.3390/jimaging9070130]
Abstract
Vision-based human activity recognition (HAR) has emerged as one of the essential research areas in video analytics. Over the last decade, numerous advanced deep learning algorithms have been introduced to recognize complex human actions from video streams. These deep learning algorithms have shown impressive performance for the video analytics task. However, these newly introduced methods focus either exclusively on model performance or on computational efficiency, resulting in a biased trade-off between robustness and computational efficiency when dealing with the challenging HAR problem. To enhance both accuracy and computational efficiency, this paper presents a computationally efficient yet generic spatial-temporal cascaded framework that exploits deep discriminative spatial and temporal features for HAR. For efficient representation of human actions, we propose an efficient dual attentional convolutional neural network (DA-CNN) architecture that leverages a unified channel-spatial attention mechanism to extract human-centric salient features in video frames. The dual channel-spatial attention layers, together with the convolutional layers, learn to be more selective about the spatial receptive fields that contain objects within the feature maps. The extracted discriminative salient features are then forwarded to a stacked bi-directional gated recurrent unit (Bi-GRU) for long-term temporal modeling and recognition of human actions using both forward and backward pass gradient learning. Extensive experiments are conducted on three publicly available human action datasets, where the obtained results verify the effectiveness of our proposed framework (DA-CNN+Bi-GRU) over the state-of-the-art methods in terms of model accuracy and inference runtime across each dataset. Experimental results show that the DA-CNN+Bi-GRU framework attains an improvement in execution time of up to 167× in terms of frames per second compared to most contemporary action-recognition methods.
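
The sketch below shows the overall cascade at a toy scale: per-frame CNN features feed a bidirectional GRU, whose output is classified. The plain convolutional stem stands in for the proposed DA-CNN (the dual attention layers are omitted), and all dimensions are illustrative.

```python
# Minimal sketch of the cascade: per-frame CNN features -> bidirectional GRU -> classifier.
import torch
import torch.nn as nn

class FrameCNNBiGRU(nn.Module):
    def __init__(self, num_classes, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                  # stand-in for the attention-augmented CNN
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).view(b, t, -1)   # (B, T, feat_dim)
        out, _ = self.bigru(feats)
        return self.fc(out[:, -1])                 # classify from the last time step

clip = torch.randn(2, 16, 3, 64, 64)
print(FrameCNNBiGRU(num_classes=51)(clip).shape)
```
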
Affiliation(s)
- Hayat Ullah: Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA
- Arslan Munir: Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA

13. Zheng Z, An G, Cao S, Wu D, Ruan Q. Collaborative and Multilevel Feature Selection Network for Action Recognition. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:1304-1318. [PMID: 34424850] [DOI: 10.1109/tnnls.2021.3105184]
Abstract
The feature pyramid has been widely used in many visual tasks, such as fine-grained image classification, instance segmentation, and object detection, and has achieved promising performance. Although many algorithms exploit different-level features to construct the feature pyramid, they usually treat them equally and do not investigate in depth the inherent complementary advantages of different-level features. In this article, to learn a pyramid feature with robust representational ability for action recognition, we propose a novel collaborative and multilevel feature selection network (FSNet) that applies feature selection and aggregation on multilevel features according to action context. Unlike previous works that learn the pattern of frame appearance by enhancing spatial encoding, the proposed network consists of a position selection module and a channel selection module that can adaptively aggregate multilevel features into a new informative feature from both the position and channel dimensions. The position selection module integrates the vectors at the same spatial location across multilevel features with positionwise attention. Similarly, the channel selection module selectively aggregates the channel maps at the same channel location across multilevel features with channelwise attention. Positionwise features with different receptive fields and channelwise features with different pattern-specific responses are emphasized respectively depending on their correlations to actions, and they are fused as a new informative feature for action recognition. The proposed FSNet can be inserted into different backbone networks flexibly, and extensive experiments are conducted on three benchmark action datasets, Kinetics, UCF101, and HMDB51. Experimental results show that FSNet is practical and can be collaboratively trained to boost the representational ability of existing networks. FSNet achieves superior performance against most top-tier models on Kinetics and all models on UCF101 and HMDB51.
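
As an illustration of position-wise selection across multilevel features (hypothetical shapes, not the authors' FSNet), the sketch below learns per-position weights over the levels and takes their weighted sum; channel-wise selection would follow the same pattern along the channel axis.

```python
# Minimal sketch of position-wise selection: per-position attention over L feature
# levels decides how much each level contributes at every spatial location.
import torch
import torch.nn as nn

class PositionSelection(nn.Module):
    def __init__(self, channels, num_levels):
        super().__init__()
        # One score per level at every spatial position.
        self.score = nn.Conv2d(num_levels * channels, num_levels, kernel_size=1)

    def forward(self, feats):                      # feats: list of L tensors (B, C, H, W)
        stacked = torch.stack(feats, dim=1)        # (B, L, C, H, W)
        weights = torch.softmax(self.score(torch.cat(feats, dim=1)), dim=1)  # (B, L, H, W)
        return (stacked * weights.unsqueeze(2)).sum(dim=1)   # (B, C, H, W)

levels = [torch.randn(2, 32, 28, 28) for _ in range(3)]      # already resized to a common size
print(PositionSelection(32, 3)(levels).shape)
```
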

14. Cui M, Wang W, Zhang K, Sun Z, Wang L. Pose-Appearance Relational Modeling for Video Action Recognition. IEEE Transactions on Image Processing 2022; 32:295-308. [PMID: 37015555] [DOI: 10.1109/tip.2022.3228156]
Abstract
Recent studies of video action recognition can be classified into two categories: appearance-based methods and pose-based methods. The appearance-based methods generally cannot model the temporal dynamics of large motions well, even with optical flow estimation, while the pose-based methods ignore visual context information such as typical scenes and objects, which are also important cues for action understanding. In this paper, we tackle these problems by proposing a Pose-Appearance Relational Network (PARNet), which models the correlation between human pose and image appearance, and combines the benefits of these two modalities to improve the robustness towards unconstrained real-world videos. There are three network streams in our model, namely the pose stream, the appearance stream and the relation stream. For the pose stream, a Temporal Multi-Pose RNN module is constructed to obtain dynamic representations through temporal modeling of 2D poses. For the appearance stream, a Spatial Appearance CNN module is employed to extract the global appearance representation of the video sequence. For the relation stream, a Pose-Aware RNN module is built to connect the pose and appearance streams by modeling action-sensitive visual context information. Through jointly optimizing the three modules, PARNet achieves superior performance compared with the state-of-the-art methods on both the pose-complete datasets (KTH, Penn-Action, UCF11) and the challenging pose-incomplete datasets (UCF101, HMDB51, JHMDB), demonstrating its robustness towards complex environments and noisy skeletons. Its effectiveness on the NTU-RGBD dataset is also validated, even compared with 3D skeleton-based methods. Furthermore, an appearance-enhanced PARNet equipped with an RGB-based I3D stream is proposed, which outperforms the Kinetics pre-trained competitors on UCF101 and HMDB51. These improved experimental results verify the potential of our framework when integrating various modules.

15. Hao Y, Wang S, Tan Y, He X, Liu Z, Wang M. Spatio-Temporal Collaborative Module for Efficient Action Recognition. IEEE Transactions on Image Processing 2022; 31:7279-7291. [PMID: 36378789] [DOI: 10.1109/tip.2022.3221292]
Abstract
Efficient action recognition aims to classify a video clip into a specific action category with a low computational cost. It is challenging since the integrated spatial-temporal calculation (e.g., 3D convolution) introduces intensive operations and increases complexity. This paper explores the feasibility of integrating channel splitting and filter decoupling for efficient architecture design and feature refinement by proposing a novel spatio-temporal collaborative (STC) module. STC splits the video feature channels into two groups and separately learns spatio-temporal representations in parallel with decoupled convolutional operators. In particular, STC consists of two computation-efficient blocks, S·T and T·S, which extract either spatial (S·) or temporal (T·) features and further refine them with either temporal (·T) or spatial (·S) contexts globally. The spatial/temporal context refers to information dynamics aggregated from the temporal/spatial axis. To thoroughly examine our method's performance on video action recognition tasks, we conduct extensive experiments using five video benchmark datasets requiring temporal reasoning. Experimental results show that the proposed STC networks achieve a competitive trade-off between model efficiency and effectiveness.
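
The sketch below shows only the channel-splitting skeleton (not the released STC module): one channel group receives a spatial 2D convolution and the other a temporal 1D convolution, both realized as decoupled 3D kernels; the global context-refinement step described above is omitted.

```python
# Minimal sketch of channel splitting with decoupled spatial/temporal convolutions.
import torch
import torch.nn as nn

class SplitSpatioTemporal(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.spatial = nn.Conv3d(half, half, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(half, half, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):                          # x: (B, C, T, H, W)
        xs, xt = x.chunk(2, dim=1)                 # split channels into two groups
        return torch.cat([self.spatial(xs), self.temporal(xt)], dim=1)

video_feat = torch.randn(2, 64, 8, 28, 28)
print(SplitSpatioTemporal(64)(video_feat).shape)
```
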

16. Xie J, Ma Z, Chang D, Zhang G, Guo J. GPCA: A Probabilistic Framework for Gaussian Process Embedded Channel Attention. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:8230-8248. [PMID: 34375278] [DOI: 10.1109/tpami.2021.3102955]
Abstract
Channel attention mechanisms have been commonly applied in many visual tasks for effective performance improvement. They are able to reinforce the informative channels as well as to suppress the useless ones. Recently, different channel attention modules have been proposed and implemented in various ways. Generally speaking, they are mainly based on convolution and pooling operations. In this paper, we propose a Gaussian process embedded channel attention (GPCA) module and further interpret channel attention schemes in a probabilistic way. The GPCA module intends to model the correlations among the channels, which are assumed to be captured by beta distributed variables. As the beta distribution cannot be integrated into the end-to-end training of convolutional neural networks (CNNs) with a mathematically tractable solution, we utilize an approximation of the beta distribution to solve this problem. Specifically, we adapt a Sigmoid-Gaussian approximation, in which the Gaussian distributed variables are transformed into the interval [0,1]. The Gaussian process is then utilized to model the correlations among different channels. In this case, a mathematically tractable solution is derived. The GPCA module can be efficiently implemented and integrated into the end-to-end training of CNNs. Experimental results demonstrate the promising performance of the proposed GPCA module. Codes are available at https://github.com/PRIS-CV/GPCA.
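
As a rough, simplified illustration of the Sigmoid-Gaussian idea (ignoring the full Gaussian-process covariance that the paper uses to correlate channels), the sketch below predicts a Gaussian mean and variance per channel, samples with the reparameterization trick during training, and squashes the sample through a sigmoid to obtain channel gates. All names and dimensions are assumptions.

```python
# Simplified sketch: per-channel Gaussian variables, reparameterized sampling,
# sigmoid squashing into [0, 1] channel gates (GP covariance omitted).
import torch
import torch.nn as nn

class SigmoidGaussianChannelAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.to_mu = nn.Linear(channels, channels)
        self.to_logvar = nn.Linear(channels, channels)

    def forward(self, x):                          # x: (B, C, H, W)
        desc = x.mean(dim=(2, 3))                  # channel descriptors
        mu, logvar = self.to_mu(desc), self.to_logvar(desc)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp() if self.training else mu
        gate = torch.sigmoid(z).view(*z.shape, 1, 1)
        return x * gate

attn = SigmoidGaussianChannelAttention(32)
print(attn(torch.randn(2, 32, 14, 14)).shape)
```
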

17. Hu W, Liu H, Du Y, Yuan C, Li B, Maybank S. Interaction-Aware Spatio-Temporal Pyramid Attention Networks for Action Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:7010-7028. [PMID: 34314355] [DOI: 10.1109/tpami.2021.3100277]
Abstract
For CNN-based visual action recognition, the accuracy may be increased if local key action regions are focused on. The task of self-attention is to focus on key features and ignore irrelevant information. So, self-attention is useful for action recognition. However, current self-attention methods usually ignore correlations among local feature vectors at spatial positions in CNN feature maps. In this paper, we propose an effective interaction-aware self-attention model which can extract information about the interactions between feature vectors to learn attention maps. Since the different layers in a network capture feature maps at different scales, we introduce a spatial pyramid with the feature maps at different layers for attention modeling. The multi-scale information is utilized to obtain more accurate attention scores. These attention scores are used to weight the local feature vectors of the feature maps and then calculate attentional feature maps. Since the number of feature maps input to the spatial pyramid attention layer is unrestricted, we easily extend this attention layer to a spatio-temporal version. Our model can be embedded in any general CNN to form a video-level end-to-end attention network for action recognition. Several methods are investigated to combine the RGB and flow streams to obtain accurate predictions of human actions. Experimental results show that our method achieves state-of-the-art results on the datasets UCF101, HMDB51, Kinetics-400, and untrimmed Charades.

18. Liu S, Li Y, Fu W. Human-centered attention-aware networks for action recognition. International Journal of Intelligent Systems 2022. [DOI: 10.1002/int.23029]
Affiliation(s)
- Shuai Liu: School of Educational Science, Hunan Normal University, Changsha, China; Key Laboratory of Big Data Research and Application for Basic Education, Hunan Normal University, Changsha, China; College of Information Science and Engineering, Hunan Normal University, Changsha, China
- Yating Li: College of Information Science and Engineering, Hunan Normal University, Changsha, China
- Weina Fu: College of Information Science and Engineering, Hunan Normal University, Changsha, China

19. Fu J, Liu J, Jiang J, Li Y, Bao Y, Lu H. Scene Segmentation With Dual Relation-Aware Attention Network. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:2547-2560. [PMID: 32745005] [DOI: 10.1109/tnnls.2020.3006524]
Abstract
In this article, we propose a Dual Relation-aware Attention Network (DRANet) to handle the task of scene segmentation. Efficiently exploiting context is essential for pixel-level recognition. To address this issue, we adaptively capture contextual information based on a relation-aware attention mechanism. Specifically, we append two types of attention modules on top of the dilated fully convolutional network (FCN), which model the contextual dependencies in the spatial and channel dimensions, respectively. In the attention modules, we adopt a self-attention mechanism to model semantic associations between any two pixels or channels. Each pixel or channel can adaptively aggregate context from all pixels or channels according to their correlations. To reduce the high cost of computation and memory caused by the abovementioned pairwise association computation, we further design two types of compact attention modules. In the compact attention modules, each pixel or channel is associated with only a small number of gathering centers and aggregates context over these gathering centers. Meanwhile, we add a cross-level gating decoder to selectively enhance spatial details, which boosts the performance of the network. We conduct extensive experiments to validate the effectiveness of our network and achieve new state-of-the-art segmentation performance on four challenging scene segmentation data sets, i.e., the Cityscapes, ADE20K, PASCAL Context, and COCO Stuff data sets. In particular, a Mean IoU score of 82.9% on the Cityscapes test set is achieved without using extra coarse annotated data.
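
For reference, the sketch below implements the two standard self-attention flavors the abstract builds on, position (pixel-pair) attention and channel-pair attention, without the compact gathering-center variants or the gating decoder; module names and shapes are illustrative.

```python
# Minimal sketch of standard position and channel self-attention: every pixel and
# every channel aggregates context weighted by its pairwise affinities.
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, C/8)
        k = self.k(x).flatten(2)                   # (B, C/8, HW)
        v = self.v(x).flatten(2)                   # (B, C, HW)
        attn = torch.softmax(q @ k, dim=-1)        # (B, HW, HW) pixel-pair affinities
        return (v @ attn.transpose(1, 2)).view(b, c, h, w) + x

def channel_attention(x):                          # (B, C, H, W) -> channel-pair affinities
    b, c = x.shape[:2]
    flat = x.flatten(2)                            # (B, C, HW)
    attn = torch.softmax(flat @ flat.transpose(1, 2), dim=-1)   # (B, C, C)
    return (attn @ flat).view_as(x) + x

feat = torch.randn(2, 64, 32, 32)
print(PositionAttention(64)(feat).shape, channel_attention(feat).shape)
```
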

20. Mazhar O, Ramdani S, Cherubini A. A Deep Learning Framework for Recognizing Both Static and Dynamic Gestures. Sensors (Basel) 2021; 21:2227. [PMID: 33806741] [PMCID: PMC8004797] [DOI: 10.3390/s21062227]
Abstract
Intuitive user interfaces are indispensable for interacting with human-centric smart environments. In this paper, we propose a unified framework that recognizes both static and dynamic gestures, using simple RGB vision (without depth sensing). This feature makes it suitable for inexpensive human-robot interaction in social or industrial settings. We employ a pose-driven spatial attention strategy, which guides our proposed Static and Dynamic gestures Network (StaDNet). From the image of the human upper body, we estimate his/her depth, along with the region-of-interest around his/her hands. The Convolutional Neural Network (CNN) in StaDNet is fine-tuned on a background-substituted hand gestures dataset. It is utilized to detect 10 static gestures for each hand as well as to obtain the hand image-embeddings. These are subsequently fused with the augmented pose vector and then passed to the stacked Long Short-Term Memory blocks. Thus, human-centred frame-wise information from the augmented pose vector and from the left/right hand image-embeddings is aggregated in time to predict the dynamic gestures of the performing person. In a number of experiments, we show that the proposed approach surpasses the state-of-the-art results on the large-scale Chalearn 2016 dataset. Moreover, we transfer the knowledge learned through the proposed methodology to the Praxis gestures dataset, and the obtained results also surpass the state-of-the-art on this dataset.
Affiliation(s)
- Osama Mazhar: LIRMM, Université de Montpellier, CNRS, 34392 Montpellier, France; Cognitive Robotics Department, Delft University of Technology, 2628 CD Delft, The Netherlands
- Sofiane Ramdani: LIRMM, Université de Montpellier, CNRS, 34392 Montpellier, France
- Andrea Cherubini: LIRMM, Université de Montpellier, CNRS, 34392 Montpellier, France

21. Ying Y, Zhang N, Shan P, Miao L, Sun P, Peng S. PSigmoid: Improving squeeze-and-excitation block with parametric sigmoid. Applied Intelligence 2021. [DOI: 10.1007/s10489-021-02247-z]