1. Tan B, Xiao Y, Li S, Tong X, Yan T, Cao Z, Zhou JT. Language-Guided 3-D Action Feature Learning Without Ground-Truth Sample Class Label. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:9356-9369. [PMID: 38865228 DOI: 10.1109/tnnls.2024.3409613]
Abstract
This work makes the first research effort to address point cloud sequence-based Self-supervised 3-D Action Feature Learning (S3AFL) under the cross-modality weak supervision of text. We intend to close the large performance gap between point cloud sequence-based and 3-D skeleton-based methods. The key intuition derives from the observation that skeleton-based methods implicitly hold high-level knowledge of the human pose, which directs attention to joint-aware local body parts. Inspired by this, we propose to introduce the text's weak supervision of high-level semantics into the point cloud sequence-based paradigm. Given paired RGB-point cloud sequences acquired with an RGB-D camera, a text sequence is first generated from the RGB component using a pretrained image captioning model and serves as auxiliary weak supervision. S3AFL then runs in a cross- and intra-modality contrastive learning (CL) manner. To resist the missing and redundant semantics in text, feature learning is conducted in a multistage way with semantic refinement. Essentially, text is required only for training. To strengthen the feature's representation power on fine-grained actions, a multirank max-pooling (MR-MP) scheme is also proposed for the point set network to better maintain discriminative clues. Experiments verify that the text's weak supervision can improve performance by up to 10.8%, 10.4%, and 8.0% on NTU RGB+D 60, NTU RGB+D 120, and N-UCLA, respectively. The performance gap between point cloud sequence-based and skeleton-based methods is remarkably narrowed. The idea of transferring the text's weak supervision to S3AFL can also be applied to skeleton-based methods, showing strong generality. The source code is available at https://github.com/tangent-T/W3AMT.
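As a rough illustration of the cross-modality contrastive learning the abstract describes, the sketch below pairs point-cloud-sequence embeddings with caption embeddings in a symmetric InfoNCE objective. It is a minimal sketch under assumed shapes; the function name, temperature value, and batch convention are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(point_feat, text_feat, temperature=0.07):
    """Symmetric InfoNCE between paired point-cloud-sequence and text embeddings.

    point_feat, text_feat: (N, D) tensors; row i of each comes from the same
    RGB-D sample, so the diagonal pairs are treated as positives.
    """
    p = F.normalize(point_feat, dim=1)
    t = F.normalize(text_feat, dim=1)
    logits = p @ t.t() / temperature               # (N, N) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    # point -> text and text -> point directions, averaged
    loss_pt = F.cross_entropy(logits, targets)
    loss_tp = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_pt + loss_tp)
```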
2. Gao L, Liu K, Guan L. A discriminative multi-modal adaptation neural network model for video action recognition. Neural Netw 2025; 185:107114. [PMID: 39827837 DOI: 10.1016/j.neunet.2024.107114]
Abstract
Research on video-based understanding and learning has attracted widespread interest and has been adopted in various real applications, such as e-healthcare, action recognition, affective computing, to name a few. Amongst them, video-based action recognition is one of the most representative examples. With the advancement of multi-sensory technology, action recognition using multi-modal data has recently drawn wide attention. However, the research community faces new challenges in effectively exploring and utilizing the discriminative and complementary information across different modalities. Although score level fusion approaches have been popularly employed for multi-modal action recognition, they simply add the scores derived separately from different modalities without proper consideration of cross-modality semantics amongst multiple input data sources, invariably causing sub-optimal performance. To address this issue, this paper presents a two-stream heterogeneous network to extract and jointly process complementary features derived from RGB and skeleton modalities, respectively. Then, a discriminative multi-modal adaptation neural network model (DMANNM) is proposed and applied to the heterogeneous network, by integrating statistical machine learning (SML) principles with convolutional neural network (CNN) architecture. In addition, to achieve high recognition accuracy by the generated multi-modal structure, an effective nonlinear classification algorithm is presented in this work. Leveraging the joint strength of SML and CNN architecture, the proposed model forms an adaptive platform for handling datasets of different scales. To demonstrate the effectiveness and the generic nature of the proposed model, we conducted experiments on four popular video-based action recognition datasets with different scales: NTU RGB+D, NTU RGB+D 120, Northwestern-UCLA (N-UCLA), and SYSU. The experimental results show the superiority of the proposed method over the compared state-of-the-art approaches.
Affiliation(s)
- Lei Gao
- Department of Electrical, Computer and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada
- Kai Liu
- Department of Electrical, Computer and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada.
- Ling Guan
- Department of Electrical, Computer and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada
3. Pang C, Gao X, Chen Z, Lyu L. Self-Adaptive Graph With Nonlocal Attention Network for Skeleton-Based Action Recognition. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:17057-17069. [PMID: 37703156 DOI: 10.1109/tnnls.2023.3298950]
Abstract
Graph convolutional networks (GCNs) have achieved encouraging progress in modeling human body skeletons as spatial-temporal graphs. However, existing methods still suffer from two inherent drawbacks. Firstly, these models process the input data based on the physical structure of the human body, which leads to some latent correlations among joints being ignored. Secondly, the key temporal relationships between nonadjacent frames are overlooked, preventing the network from fully learning how body joints change along the temporal dimension. To address these issues, we propose an innovative spatial-temporal model by introducing a self-adaptive GCN (SAGCN) with a global attention network, collectively termed SAGGAN. Specifically, the SAGCN module is proposed to construct two additional dynamic topological graphs to learn the common characteristics of all data and represent a unique pattern for each sample, respectively. Meanwhile, the global attention module (spatial attention (SA) and temporal attention (TA) modules) is designed to extract the global connections between different joints in a single frame and model temporal relationships between adjacent and nonadjacent frames in temporal sequences. In this manner, our network can capture richer features of actions for accurate action recognition and overcome the defects of standard graph convolution. Extensive experiments on three benchmark datasets (NTU-60, NTU-120, and Kinetics) have demonstrated the superiority of our proposed method.
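For readers unfamiliar with "dynamic topological graphs", the minimal PyTorch sketch below shows one common way to combine a fixed skeleton adjacency with a globally learned graph and a sample-specific graph inferred from feature embeddings. It is an assumption-laden illustration of the general idea, not the SAGGAN implementation; all layer names and shapes are hypothetical.

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Graph convolution over skeleton joints that augments the fixed physical
    adjacency with (i) a freely learned graph shared by all samples and
    (ii) a sample-specific graph inferred from the input features."""

    def __init__(self, in_channels, out_channels, adjacency, embed_dim=16):
        super().__init__()
        self.register_buffer("A_fixed", adjacency)            # (V, V) body graph
        self.A_learned = nn.Parameter(torch.zeros_like(adjacency))
        self.theta = nn.Conv1d(in_channels, embed_dim, 1)      # query embedding
        self.phi = nn.Conv1d(in_channels, embed_dim, 1)        # key embedding
        self.proj = nn.Conv1d(in_channels, out_channels, 1)

    def forward(self, x):                                       # x: (N, C, V)
        q, k = self.theta(x), self.phi(x)                       # (N, E, V)
        A_sample = torch.softmax(torch.einsum("nev,new->nvw", q, k), dim=-1)
        A = self.A_fixed + self.A_learned + A_sample            # broadcast to (N, V, V)
        y = torch.einsum("ncv,nvw->ncw", x, A)                  # aggregate neighbor features
        return self.proj(y)
```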
4. Tan B, Xiao Y, Wang Y, Li S, Yang J, Cao Z, Zhou JT, Yuan J. Beyond Pattern Variance: Unsupervised 3-D Action Representation Learning With Point Cloud Sequence. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:18186-18199. [PMID: 37729565 DOI: 10.1109/tnnls.2023.3312673]
Abstract
This work makes the first research effort to address unsupervised 3-D action representation learning with point cloud sequence, which is different from existing unsupervised methods that rely on 3-D skeleton information. Our proposition is built on the state-of-the-art 3-D action descriptor 3-D dynamic voxel (3DV) with contrastive learning (CL). The 3DV can compress the point cloud sequence into a compact point cloud of 3-D motion information. Spatiotemporal data augmentations are conducted on it to drive CL. However, we find that existing CL methods (e.g., SimCLR or MoCo v2) often suffer from high pattern variance toward the augmented 3DV samples from the same action instance, that is, the augmented 3DV samples are still of high feature complementarity after CL, while the complementary discriminative clues within them have not been well exploited yet. To address this, a feature augmentation adapted CL (FACL) approach is proposed, which facilitates 3-D action representation by jointly considering the features from all augmented 3DV samples, in the spirit of feature augmentation. FACL runs in a global-local way: one branch learns a global feature that involves the discriminative clues from the raw and augmented 3DV samples, and the other focuses on enhancing the discriminative power of the local feature learned from each augmented 3DV sample. The global and local features are fused to characterize 3-D action jointly via concatenation. To fit FACL, a series of spatiotemporal data augmentation approaches is also studied on 3DV. Extensive experiments verify the superiority of our unsupervised learning method for 3-D action feature learning. It outperforms the state-of-the-art skeleton-based counterparts by 6.4% and 3.6% with the cross-setup and cross-subject test settings on NTU RGB+D 120, respectively. The source code is available at https://github.com/tangent-T/FACL.
5. Chen J, Jiao L, Liu X, Liu F, Li L, Yang S. Multiresolution Interpretable Contourlet Graph Network for Image Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:17716-17729. [PMID: 37747859 DOI: 10.1109/tnnls.2023.3307721]
Abstract
Modeling contextual relationships in images as graph inference is an interesting and promising research topic. However, existing approaches only perform graph modeling of entities, ignoring the intrinsic geometric features of images. To overcome this problem, a novel multiresolution interpretable contourlet graph network (MICGNet) is proposed in this article. MICGNet delicately balances graph representation learning with the multiscale and multidirectional features of images, where contourlet is used to capture the hyperplanar directional singularities of images and multilevel sparse contourlet coefficients are encoded into graph for further graph representation learning. This process provides interpretable theoretical support for optimizing the model structure. Specifically, first, the superpixel-based region graph is constructed. Then, the region graph is applied to code the nonsubsampled contourlet transform (NSCT) coefficients of the image, which are considered as node features. Considering the statistical properties of the NSCT coefficients, we calculate the node similarity, i.e., the adjacency matrix, using Mahalanobis distance. Next, graph convolutional networks (GCNs) are employed to further learn more abstract multilevel NSCT-enhanced graph representations. Finally, the learnable graph assignment matrix is designed to get the geometric association representations, which accomplish the assignment of graph representations to grid feature maps. We conduct comparative experiments on six publicly available datasets, and the experimental analysis shows that MICGNet is significantly more effective and efficient than other algorithms of recent years.
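The Mahalanobis-distance adjacency mentioned in the abstract can be illustrated with a short NumPy sketch: the covariance of the node features (e.g., NSCT coefficients pooled per superpixel region) defines the metric, and a Gaussian kernel turns distances into similarity weights. The kernel choice and the epsilon regularization are assumptions for illustration, not details from the paper.

```python
import numpy as np

def mahalanobis_adjacency(node_feats, eps=1e-6):
    """Adjacency weights from pairwise Mahalanobis distances between node features.

    node_feats: (V, D) array, one feature vector per graph node.
    Returns a (V, V) similarity matrix where closer nodes get larger weights.
    """
    cov = np.cov(node_feats, rowvar=False) + eps * np.eye(node_feats.shape[1])
    cov_inv = np.linalg.inv(cov)
    diff = node_feats[:, None, :] - node_feats[None, :, :]       # (V, V, D)
    d2 = np.einsum("ijd,dk,ijk->ij", diff, cov_inv, diff)        # squared distances
    return np.exp(-d2)                                            # Gaussian kernel similarity
```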
6. Zhang X, Song D, Tao D. Ricci Curvature-Based Graph Sparsification for Continual Graph Representation Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:17398-17410. [PMID: 37603471 DOI: 10.1109/tnnls.2023.3303454]
Abstract
Memory replay, which stores a subset of historical data from previous tasks to replay while learning new tasks, exhibits state-of-the-art performance for various continual learning applications on Euclidean data. While topological information plays a critical role in characterizing graph data, existing memory replay-based graph learning techniques only store individual nodes for replay and do not consider their associated edge information. To this end, based on the message-passing mechanism in graph neural networks (GNNs), we present a Ricci curvature-based graph sparsification technique to perform continual graph representation learning. Specifically, we first develop the subgraph episodic memory (SEM) to store the topological information in the form of computation subgraphs. Next, we sparsify the subgraphs such that they only contain the most informative structures (nodes and edges). The informativeness is evaluated with the Ricci curvature, a theoretically justified metric to estimate the contribution of neighbors to represent a target node. In this way, we can reduce the memory consumption of a computation subgraph and enable GNNs to fully utilize the most informative topological information for memory replay. Besides, to ensure applicability to large graphs, we also provide a theoretically justified surrogate for the Ricci curvature in the sparsification process, which can greatly facilitate the computation. Finally, our empirical studies show that SEM outperforms state-of-the-art approaches significantly on four different public datasets. Unlike existing methods, which mainly focus on the task incremental learning (task-IL) setting, SEM also succeeds in the challenging class incremental learning (class-IL) setting, in which the model is required to distinguish all learned classes without task indicators, and even achieves comparable performance to joint training, which is the performance upper bound for continual learning.
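The abstract does not give the curvature formula or its surrogate, so the sketch below only illustrates the sparsification idea: score every edge of a computation subgraph and keep the top-scoring edges per node. It uses the simple Forman-style score 4 - deg(u) - deg(v) purely as a stand-in for the paper's Ricci-based informativeness measure, assumes sortable (e.g., integer) node ids, and uses networkx for convenience; none of these choices come from the paper.

```python
import networkx as nx

def sparsify_by_curvature(graph, keep_per_node=3):
    """Keep, for each node, only its highest-scoring incident edges, where the
    edge score is a Forman-style curvature surrogate 4 - deg(u) - deg(v)."""
    kept = set()
    for node in graph.nodes():
        scored = sorted(
            graph.edges(node),
            key=lambda e: 4 - graph.degree(e[0]) - graph.degree(e[1]),
            reverse=True,
        )
        kept.update(tuple(sorted(e)) for e in scored[:keep_per_node])
    sparse = nx.Graph()
    sparse.add_nodes_from(graph.nodes(data=True))   # keep all nodes and attributes
    sparse.add_edges_from(kept)                      # keep only the selected edges
    return sparse
```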
7. Lai Z, Zhang Y, Liang X. A Two-Stream Method for Human Action Recognition Using Facial Action Cues. SENSORS (BASEL, SWITZERLAND) 2024; 24:6817. [PMID: 39517714 PMCID: PMC11548224 DOI: 10.3390/s24216817]
Abstract
Human action recognition (HAR) is a critical area in computer vision with wide-ranging applications, including video surveillance, healthcare monitoring, and abnormal behavior detection. Current HAR methods predominantly rely on full-body data, which can limit their effectiveness in real-world scenarios where occlusion is common. In such situations, the face often remains visible, providing valuable cues for action recognition. This paper introduces Face in Action (FIA), a novel two-stream method that leverages facial action cues for robust action recognition under conditions of significant occlusion. FIA consists of an RGB stream and a landmark stream. The RGB stream processes facial image sequences using a fine-spatio-multitemporal (FSM) 3D convolution module, which employs smaller spatial receptive fields to capture detailed local facial movements and larger temporal receptive fields to model broader temporal dynamics. The landmark stream processes facial landmark sequences using a normalized temporal attention (NTA) module within an NTA-GCN block, enhancing the detection of key facial frames and improving overall recognition accuracy. We validate the effectiveness of FIA using the NTU RGB+D and NTU RGB+D 120 datasets, focusing on action categories related to medical conditions. Our experiments demonstrate that FIA significantly outperforms existing methods in scenarios with extensive occlusion, highlighting its potential for practical applications in surveillance and healthcare settings.
Affiliation(s)
- Zhimao Lai
- School of Immigration Administration (Guangzhou), China People’s Police University, Guangzhou 510663, China;
- Yan Zhang
- School of Immigration Administration, China People’s Police University, Langfang 065000, China;
- Xiubo Liang
- School of Immigration Administration, China People’s Police University, Langfang 065000, China;
8. Li G, Cheng D, Ding X, Wang N, Li J, Gao X. Weakly Supervised Temporal Action Localization With Bidirectional Semantic Consistency Constraint. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:13032-13045. [PMID: 37134038 DOI: 10.1109/tnnls.2023.3266062]
Abstract
Weakly supervised temporal action localization (WTAL) aims to classify and localize temporal boundaries of actions for the video, given only video-level category labels in the training datasets. Due to the lack of boundary information during training, existing approaches formulate WTAL as a classification problem, i.e., generating the temporal class activation map (T-CAM) for localization. However, with only classification loss, the model would be suboptimized, i.e., the action-related scenes are enough to distinguish different class labels. Regarding other actions in the action-related scene (i.e., the scene same as positive actions) as co-scene actions, this suboptimized model would misclassify the co-scene actions as positive actions. To address this misclassification, we propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions. The proposed Bi-SCC first adopts a temporal context augmentation to generate an augmented video that breaks the correlation between positive actions and their co-scene actions in the inter-video. Then, a semantic consistency constraint (SCC) is used to enforce the predictions of the original video and augmented video to be consistent, hence suppressing the co-scene actions. However, we find that this augmented video would destroy the original temporal context. Simply applying the consistency constraint would affect the completeness of localized positive actions. Hence, we boost the SCC in a bidirectional way to suppress co-scene actions while ensuring the integrity of positive actions, by cross-supervising the original and augmented videos. Finally, our proposed Bi-SCC can be applied to current WTAL approaches and improve their performance. Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet. The code is available at https://github.com/lgzlIlIlI/BiSCC.
9. Castro-Correa JA, Giraldo JH, Badiey M, Malliaros FD. Gegenbauer Graph Neural Networks for Time-Varying Signal Reconstruction. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:11734-11745. [PMID: 38598390 DOI: 10.1109/tnnls.2024.3381069]
Abstract
Reconstructing time-varying graph signals (or graph time-series imputation) is a critical problem in machine learning and signal processing with broad applications, ranging from missing data imputation in sensor networks to time-series forecasting. Accurately capturing the spatio-temporal information inherent in these signals is crucial for effectively addressing these tasks. However, existing approaches rely on smoothness assumptions about temporal differences and on simple convex optimization techniques that have inherent limitations. To address these challenges, we propose a novel approach that incorporates a learning module to enhance the accuracy of the downstream task. To this end, we introduce the Gegenbauer-based graph convolutional (GegenConv) operator, which is a generalization of the conventional Chebyshev graph convolution obtained by leveraging the theory of Gegenbauer polynomials. By deviating from traditional convex problems, we expand the complexity of the model and offer a more accurate solution for recovering time-varying graph signals. Building upon GegenConv, we design the Gegenbauer-based time graph neural network (GegenGNN) architecture, which adopts an encoder-decoder structure. Our approach also uses a dedicated loss function that incorporates a mean squared error (MSE) component alongside Sobolev smoothness regularization. This combination enables GegenGNN to capture both the fidelity to ground truth and the underlying smoothness properties of the signals, enhancing the reconstruction performance. We conduct extensive experiments on real datasets to evaluate the effectiveness of our proposed approach. The experimental results demonstrate that GegenGNN outperforms state-of-the-art methods, showcasing its superior capability in recovering time-varying graph signals.
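A minimal sketch of a Gegenbauer polynomial graph filter is given below. It applies the standard three-term Gegenbauer recurrence to a rescaled graph Laplacian, analogously to Chebyshev graph convolution; the layer structure, polynomial order, and alpha value are illustrative, and the actual GegenConv/GegenGNN design may differ.

```python
import torch
import torch.nn as nn

class GegenbauerConv(nn.Module):
    """Graph filter built on the Gegenbauer recurrence
    n*C_n(x) = 2x(n+alpha-1)*C_{n-1}(x) - (n+2*alpha-2)*C_{n-2}(x),
    applied with x replaced by a rescaled graph Laplacian."""

    def __init__(self, in_dim, out_dim, order=3, alpha=1.0):
        super().__init__()
        self.order, self.alpha = order, alpha
        self.weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(order + 1)]
        )

    def forward(self, x, L_hat):
        # x: (V, in_dim) node features; L_hat: (V, V) rescaled Laplacian; order >= 1
        polys = [x, 2.0 * self.alpha * (L_hat @ x)]              # C_0, C_1
        for n in range(2, self.order + 1):
            c_n = (2.0 * (n + self.alpha - 1) * (L_hat @ polys[-1])
                   - (n + 2.0 * self.alpha - 2) * polys[-2]) / n
            polys.append(c_n)
        return sum(w(p) for w, p in zip(self.weights, polys))
```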
10. Gao X, Yang Y, Wu Y, Du S. Learning Heterogeneous Spatial-Temporal Context for Skeleton-Based Action Recognition. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:12130-12141. [PMID: 37030786 DOI: 10.1109/tnnls.2023.3252172]
Abstract
Graph convolution networks (GCNs) have been widely used and achieved fruitful progress in the skeleton-based action recognition task. In GCNs, node interaction modeling dominates the context aggregation and, therefore, is crucial for a graph-based convolution kernel to extract representative features. In this article, we introduce a closer look at a powerful graph convolution formulation to capture rich movement patterns from these skeleton-based graphs. Specifically, we propose a novel heterogeneous graph convolution (HetGCN) that can be considered as the middle ground between the extremes of (2 + 1)-D and 3-D graph convolution. The core observation of HetGCN is that multiple information flows are jointly intertwined in a 3-D convolution kernel, including spatial, temporal, and spatial-temporal cues. Since spatial and temporal information flows characterize different cues for action recognition, HetGCN first dynamically analyzes pairwise interactions between each node and its cross-space-time neighbors and then encourages heterogeneous context aggregation among them. Considering the HetGCN as a generic convolution formulation, we further develop it into two specific instantiations (i.e., intra-scale and inter-scale HetGCN) that significantly facilitate cross-space-time and cross-scale learning on skeleton graphs. By integrating these modules, we propose a strong human action recognition system that outperforms state-of-the-art methods with the accuracy of 93.1% on NTU-60 cross-subject (X-Sub) benchmark, 88.9% on NTU-120 X-Sub benchmark, and 38.4% on kinetics skeleton.
11. Li R, Chen H, Feng F, Ma Z, Wang X, Hovy E. DualGCN: Exploring Syntactic and Semantic Information for Aspect-Based Sentiment Analysis. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:7642-7656. [PMID: 36374886 DOI: 10.1109/tnnls.2022.3219615]
Abstract
The task of aspect-based sentiment analysis aims to identify sentiment polarities of given aspects in a sentence. Recent advances have demonstrated the advantage of incorporating the syntactic dependency structure with graph convolutional networks (GCNs). However, the performance of these GCN-based methods largely depends on the dependency parsers, which may produce diverse parsing results for a sentence. In this article, we propose a dual GCN (DualGCN) that jointly considers the syntax structures and semantic correlations. Our DualGCN model mainly comprises four modules: 1) SynGCN: instead of explicitly encoding syntactic structure, the SynGCN module uses the dependency probability matrix as a graph structure to implicitly integrate the syntactic information; 2) SemGCN: we design the SemGCN module with multihead attention to enhance the performance of the syntactic structure with the semantic information; 3) Regularizers: we propose orthogonal and differential regularizers to precisely capture semantic correlations between words by constraining attention scores in the SemGCN module; and 4) Mutual BiAffine: we use the BiAffine module to bridge relevant information between the SynGCN and SemGCN modules. Extensive experiments are conducted compared with up-to-date pretrained language encoders on two groups of datasets, one including Restaurant14, Laptop14, and Twitter and the other including Restaurant15 and Restaurant16. The experimental results demonstrate that the parsing results of various dependency parsers affect the performance of the GCN-based models. Our DualGCN model achieves superior performance compared with the state-of-the-art approaches. The source code and preprocessed datasets are provided and publicly available on GitHub (see https://github.com/CCChenhao997/DualGCN-ABSA).
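The abstract names orthogonal and differential regularizers without giving their formulas. One plausible formulation, shown only for illustration, penalizes non-orthogonal attention rows and overlap between the semantic (attention-based) and syntactic (parser-based) adjacency matrices; these exact definitions are assumptions and may differ from the paper's.

```python
import torch

def orthogonal_regularizer(attn):
    """Encourage an (n, n) attention score matrix to be near-orthogonal,
    ||A A^T - I||_F, so different rows attend to distinct words."""
    eye = torch.eye(attn.size(-1), device=attn.device)
    return torch.norm(attn @ attn.transpose(-1, -2) - eye, p="fro")

def differential_regularizer(sem_adj, syn_adj):
    """Penalize element-wise overlap between the semantic and syntactic
    adjacency matrices so the two GCN branches stay complementary."""
    return torch.norm(sem_adj * syn_adj, p="fro")
```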
12. Wang Y, Chang D, Fu Z, Zhao Y. Seeing All From a Few: Nodes Selection Using Graph Pooling for Graph Clustering. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:7231-7237. [PMID: 36215388 DOI: 10.1109/tnnls.2022.3210370]
Abstract
Recently, there has been considerable research interest in graph clustering aimed at data partition using graph information. However, one limitation of most graph-based methods is that they assume the graph structure they operate on is reliable. In practice, there are inevitably some edges in the graph that are not conducive to graph clustering, which we call spurious edges. To the best of our knowledge, this brief is the first attempt to employ the graph pooling technique for node clustering. In this brief, we propose a novel dual graph embedding network (DGEN), which is designed as a two-step graph encoder connected by a graph pooling layer to learn the graph embedding. In DGEN, we assume that if a node and its nearest neighboring node are close to the same clustering center, this node is informative, and the edge between them can be considered a cluster-friendly edge. Based on this assumption, the neighbor cluster pooling (NCPool) is devised to select the most informative subset of nodes and the corresponding edges based on the distance of nodes and their nearest neighbors to the cluster centers. This can effectively alleviate the impact of the spurious edges on the clustering. Finally, to obtain the clustering assignment of all nodes, a classifier is trained using the clustering results of the selected nodes. Experiments on five benchmark graph datasets demonstrate the superiority of the proposed method over state-of-the-art algorithms.
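The NCPool selection rule can be sketched in a few lines: keep a node when it and its nearest neighbour are assigned to the same cluster center, ranking such nodes by their summed distances to the centers. The keep ratio and the scoring details below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def neighbor_cluster_pool(features, centers, keep_ratio=0.5):
    """Select 'informative' nodes: both the node and its nearest neighbour must
    lie close to the same cluster center.

    features: (N, D) node embeddings; centers: (K, D) cluster centers.
    Returns indices of the retained nodes.
    """
    d_node = torch.cdist(features, centers)                  # (N, K) node-to-center distances
    assign = d_node.argmin(dim=1)                             # closest center per node
    nn_idx = torch.cdist(features, features).fill_diagonal_(float("inf")).argmin(dim=1)
    same_center = assign == assign[nn_idx]                    # node agrees with its nearest neighbour
    score = d_node.min(dim=1).values + d_node[nn_idx].min(dim=1).values
    score[~same_center] = float("inf")                        # disagreeing nodes ranked last
    k = max(1, int(keep_ratio * features.size(0)))
    return torch.topk(-score, k).indices
```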
13. Xia G, Xue P, Zhang D, Liu Q, Sun Y. A Deep Learning Framework for Start-End Frame Pair-Driven Motion Synthesis. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:7021-7034. [PMID: 36264719 DOI: 10.1109/tnnls.2022.3213596]
Abstract
A start-end frame pair and a motion pattern-based motion synthesis scheme can provide more control to the synthesis process and produce content-various motion sequences. However, the data preparation for the motion training is intractable, and concatenating feature spaces of the start-end frame pair and the motion pattern lacks theoretical rationality in previous works. In this article, we propose a deep learning framework that completes automatic data preparation and learns the nonlinear mapping from start-end frame pairs to motion patterns. The proposed model consists of three modules: action detection, motion extraction, and motion synthesis networks. The action detection network extends the deep subspace learning framework to a supervised version, i.e., uses the local self-expression (LSE) of the motion data to supervise feature learning and complement the classification error. A long short-term memory (LSTM)-based network is used to efficiently extract the motion patterns to address the speed deficiency reflected in the previous optimization-based method. A motion synthesis network consists of a group of LSTM-based blocks, where each of them is to learn the nonlinear relation between the start-end frame pairs and the motion patterns of a certain joint. The superior performances in action detection accuracy, motion pattern extraction efficiency, and motion synthesis quality show the effectiveness of each module in the proposed framework.
14. Cui L, Bai L, Bai X, Wang Y, Hancock ER. Learning Aligned Vertex Convolutional Networks for Graph Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:4423-4437. [PMID: 34890333 DOI: 10.1109/tnnls.2021.3129649]
Abstract
Graph convolutional networks (GCNs) are powerful tools for graph structure data analysis. One main drawback of most existing GCN models is the oversmoothing problem, i.e., the vertex features abstracted by the graph convolution operation tend to become indistinguishable when the GCN model has many convolutional layers (e.g., more than two layers). To address this problem, in this article, we propose a family of aligned vertex convolutional network (AVCN) models that focus on learning multiscale features from local-level vertices for graph classification. This is done by adopting a transitive vertex alignment algorithm to transform arbitrary-sized graphs into fixed-size grid structures. Furthermore, we define a new aligned vertex convolution operation that can effectively learn multiscale vertex characteristics by gradually aggregating local-level neighboring aligned vertices residing on the original grid structures into a new packed aligned vertex. With the new vertex convolution operation to hand, we propose two architectures for the AVCN models to extract different hierarchical multiscale vertex feature representations for graph classification. We show that the proposed models can avoid iteratively propagating redundant information between specific neighboring vertices, restricting the notorious oversmoothing problem arising in most spatial-based GCN models. Experimental evaluations on benchmark datasets demonstrate the effectiveness of the proposed models.
15. Huang CQ, Jiang F, Huang QH, Wang XZ, Han ZM, Huang WY. Dual-Graph Attention Convolution Network for 3-D Point Cloud Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:4813-4825. [PMID: 35385393 DOI: 10.1109/tnnls.2022.3162301]
Abstract
Three-dimensional point cloud classification is fundamental but still challenging in 3-D vision. Existing graph-based deep learning methods fail to learn both low-level extrinsic and high-level intrinsic features together. These two levels of features are critical to improving classification accuracy. To this end, we propose a dual-graph attention convolution network (DGACN). The idea of DGACN is to use two types of graph attention convolution operations with a feedback graph feature fusion mechanism. Specifically, we exploit graph geometric attention convolution to capture low-level extrinsic features in 3-D space. Furthermore, we apply graph embedding attention convolution to learn multiscale low-level extrinsic and high-level intrinsic fused graph features together. Moreover, the points belonging to different parts in real-world 3-D point cloud objects are distinguished, which results in more robust performance for 3-D point cloud classification tasks than other competitive methods, in practice. Our extensive experimental results show that the proposed network achieves state-of-the-art performance on both the synthetic ModelNet40 and real-world ScanObjectNN datasets.
16. Du Z, Ye H, Cao F. A Novel Local-Global Graph Convolutional Method for Point Cloud Semantic Segmentation. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:4798-4812. [PMID: 35286267 DOI: 10.1109/tnnls.2022.3155282]
Abstract
Although convolutional neural networks (CNNs) have shown good performance on grid data, they are limited in the semantic segmentation of irregular point clouds. This article proposes a novel and effective graph CNN framework, referred to as the local-global graph convolutional method (LGGCM), which can achieve short- and long-range dependencies on point clouds. The key to this framework is the design of local spatial attention convolution (LSA-Conv). The design includes two parts: generating a weighted adjacency matrix of the local graph composed of neighborhood points, and updating and aggregating the features of nodes to obtain the spatial geometric features of the local point cloud. In addition, a smooth module for central points is incorporated into the process of LSA-Conv to enhance the robustness of the convolution against noise interference by adjusting the position coordinates of the points adaptively. The learned robust LSA-Conv features are then fed into a global spatial attention module with the gated unit to extract long-range contextual information and dynamically adjust the weights of features from different stages. The proposed framework, consisting of both encoding and decoding branches, is an end-to-end trainable network for semantic segmentation of 3-D point clouds. The theoretical analysis of the approximation capabilities of LSA-Conv is discussed to determine whether the features of the point cloud can be accurately represented. Experimental results on challenging benchmarks of the 3-D point cloud demonstrate that the proposed framework achieves excellent performance.
17. Zhu P, Li J, Wang Y, Xiao B, Zhao S, Hu Q. Collaborative Decision-Reinforced Self-Supervision for Attributed Graph Clustering. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:10851-10863. [PMID: 35584075 DOI: 10.1109/tnnls.2022.3171583]
Abstract
Attributed graph clustering aims to partition nodes of a graph structure into different groups. Recent works usually use variational graph autoencoder (VGAE) to make the node representations obey a specific distribution. Although they have shown promising results, how to introduce supervised information to guide the representation learning of graph nodes and improve clustering performance is still an open problem. In this article, we propose a Collaborative Decision-Reinforced Self-Supervision (CDRS) method to solve the problem, in which a pseudo node classification task collaborates with the clustering task to enhance the representation learning of graph nodes. First, a transformation module is used to enable end-to-end training of existing methods based on VGAE. Second, the pseudo node classification task is introduced into the network through multitask learning to make classification decisions for graph nodes. The graph nodes that have consistent decisions on clustering and pseudo node classification are added to a pseudo-label set, which can provide fruitful self-supervision for subsequent training. This pseudo-label set is gradually augmented during training, thus reinforcing the generalization capability of the network. Finally, we investigate different sorting strategies to further improve the quality of the pseudo-label set. Extensive experiments on multiple datasets show that the proposed method achieves outstanding performance compared with state-of-the-art methods. Our code is available at https://github.com/Jillian555/TNNLS_CDRS.
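To make the collaborative-decision idea concrete, the sketch below adds a node to the pseudo-label set only when the clustering assignment and the pseudo-classification prediction agree and the classifier is confident. It assumes the cluster ids and class ids live in the same label space and uses a simple confidence threshold in place of the paper's sorting strategies; both are illustrative choices, not the published method.

```python
import numpy as np

def build_pseudo_label_set(cluster_assign, class_pred, confidence, threshold=0.9):
    """Return indices and pseudo-labels of nodes whose clustering and
    pseudo-classification decisions are consistent.

    cluster_assign: (N,) cluster id per node
    class_pred:     (N,) predicted class id per node
    confidence:     (N,) classifier max softmax probability per node
    """
    agree = (cluster_assign == class_pred) & (confidence >= threshold)
    idx = np.flatnonzero(agree)          # nodes with consistent, confident decisions
    return idx, class_pred[idx]
```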
18. Zhang X, Song D, Tao D. Hierarchical Prototype Networks for Continual Graph Representation Learning. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2023; 45:4622-4636. [PMID: 37028338 DOI: 10.1109/tpami.2022.3186909]
Abstract
Despite significant advances in graph representation learning, little attention has been paid to the more practical continual learning scenario in which new categories of nodes (e.g., new research areas in citation networks, or new types of products in co-purchasing networks) and their associated edges are continuously emerging, causing catastrophic forgetting on previous categories. Existing methods either ignore the rich topological information or sacrifice plasticity for stability. To this end, we present Hierarchical Prototype Networks (HPNs) which extract different levels of abstract knowledge in the form of prototypes to represent the continuously expanded graphs. Specifically, we first leverage a set of Atomic Feature Extractors (AFEs) to encode both the elemental attribute information and the topological structure of the target node. Next, we develop HPNs to adaptively select relevant AFEs and represent each node with three levels of prototypes. In this way, whenever a new category of nodes is given, only the relevant AFEs and prototypes at each level will be activated and refined, while others remain uninterrupted to maintain the performance over existing nodes. Theoretically, we first demonstrate that the memory consumption of HPNs is bounded regardless of how many tasks are encountered. Then, we prove that under mild constraints, learning new tasks will not alter the prototypes matched to previous data, thereby eliminating the forgetting problem. The theoretical results are supported by experiments on five datasets, showing that HPNs not only outperform state-of-the-art baseline techniques but also consume relatively less memory. Code and datasets are available at https://github.com/QueuQ/HPNs.
19. Sun Z, Ke Q, Rahmani H, Bennamoun M, Wang G, Liu J. Human Action Recognition From Various Data Modalities: A Review. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2023; 45:3200-3225. [PMID: 35700242 DOI: 10.1109/tpami.2022.3183112]
Abstract
Human Action Recognition (HAR) aims to understand human behavior and assign a label to each action. It has a wide range of applications, and therefore has been attracting increasing attention in the field of computer vision. Human actions can be represented using various data modalities, such as RGB, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, radar, and WiFi signal, which encode different sources of useful yet distinct information and have various advantages depending on the application scenarios. Consequently, lots of existing works have attempted to investigate different types of approaches for HAR using various modalities. In this article, we present a comprehensive survey of recent progress in deep learning methods for HAR based on the type of input data modality. Specifically, we review the current mainstream deep learning methods for single data modalities and multiple data modalities, including the fusion-based and the co-learning-based frameworks. We also present comparative results on several benchmark datasets for HAR, together with insightful observations and inspiring future research directions.
20. Song YF, Zhang Z, Shan C, Wang L. Constructing Stronger and Faster Baselines for Skeleton-Based Action Recognition. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2023; 45:1474-1488. [PMID: 35254974 DOI: 10.1109/tpami.2022.3157033]
Abstract
One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, the recent State-Of-The-Art (SOTA) models for this task tend to be exceedingly sophisticated and over-parameterized. The low efficiency in model training and inference has increased the validation costs of model architectures on large-scale datasets. To address the above issue, recent advanced separable convolutional layers are embedded into an early fused Multiple Input Branches (MIB) network, constructing an efficient Graph Convolutional Network (GCN) baseline for skeleton-based action recognition. In addition, based on this baseline, we design a compound scaling strategy to expand the model's width and depth synchronously, and eventually obtain a family of efficient GCN baselines with high accuracy and small numbers of trainable parameters, termed EfficientGCN-Bx, where "x" denotes the scaling coefficient. On two large-scale datasets, i.e., NTU RGB+D 60 and 120, the proposed EfficientGCN-B4 baseline outperforms other SOTA methods, e.g., achieving 92.1% accuracy on the cross-subject benchmark of the NTU 60 dataset, while being 5.82× smaller and 5.85× faster than MS-G3D, which is one of the SOTA methods. The source code in PyTorch version and the pretrained models are available at https://github.com/yfsong0709/EfficientGCNv1.
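Compound scaling can be illustrated with a tiny helper that grows channel widths and stage depths together as the scaling coefficient increases, in the spirit of EfficientNet-style scaling. The multipliers and rounding rules here are placeholders, not the tuned values behind EfficientGCN-Bx.

```python
import math

def compound_scale(base_channels, base_layers, coeff, width_mult=1.2, depth_mult=1.35):
    """Scale a baseline's per-stage channel widths and layer counts together.

    base_channels, base_layers: lists describing the baseline stages.
    coeff: the scaling coefficient ("x"); larger values give wider, deeper models.
    """
    channels = [int(round(c * width_mult ** coeff)) for c in base_channels]
    layers = [int(math.ceil(l * depth_mult ** coeff)) for l in base_layers]
    return channels, layers

# Example (illustrative baseline): compound_scale([64, 128, 256], [2, 2, 2], coeff=4)
```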
21. Guo D, Xu C, Tao D. Bilinear Graph Networks for Visual Question Answering. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:1023-1034. [PMID: 34428156 DOI: 10.1109/tnnls.2021.3104937]
Abstract
This article revisits the bilinear attention networks (BANs) in the visual question answering task from a graph perspective. The classical BANs build a bilinear attention map to extract the joint representation of words in the question and objects in the image but lack fully exploring the relationship between words for complex reasoning. In contrast, we develop bilinear graph networks to model the context of the joint embeddings of words and objects. Two kinds of graphs are investigated, namely, image-graph and question-graph. The image-graph transfers features of the detected objects to their related query words, enabling the output nodes to have both semantic and factual information. The question-graph exchanges information between these output nodes from image-graph to amplify the implicit yet important relationship between objects. These two kinds of graphs cooperate with each other, and thus, our resulting model can build the relationship and dependency between objects, which leads to the realization of multistep reasoning. Experimental results on the VQA v2.0 validation dataset demonstrate the ability of our method to handle complex questions. On the test-std set, our best single model achieves state-of-the-art performance, boosting the overall accuracy to 72.56%, and we are one of the top-two entries in the VQA Challenge 2020.
22. Zhu Q, Deng H. Spatial adaptive graph convolutional network for skeleton-based action recognition. APPL INTELL 2023. [DOI: 10.1007/s10489-022-04442-y]
Abstract
In recent years, great achievements have been made in graph convolutional networks (GCNs) for feature extraction from non-Euclidean spatial data, especially skeleton-based feature extraction. However, a fixed graph structure determined by a fixed adjacency matrix usually causes problems such as weak spatial modeling ability, unsatisfactory generalization performance, and an excessively large number of model parameters. In this paper, a spatially adaptive residual graph convolutional network (SARGCN) is proposed for action recognition based on skeleton feature extraction. Firstly, a uniform and fixed topology is not required in our graph. Secondly, a learnable parameter matrix is added to the GCN operation, which can enhance the model's capabilities of feature extraction and generalization while reducing the number of parameters. Therefore, compared with the several existing models mentioned in this paper, our model uses the fewest parameters while ensuring comparable recognition accuracy. Finally, inspired by the ResNet architecture, a residual connection is introduced into the GCN to obtain higher accuracy at lower computational cost and learning difficulty. Extensive experimental results on two large-scale datasets, namely NTU RGB+D 60 and NTU RGB+D 120, validate the effectiveness of our proposed approach.
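The two ingredients highlighted in the abstract, a learnable matrix added to the adjacency and a residual connection around the graph convolution, can be sketched as follows. This is an illustrative layer with assumed shapes, not the SARGCN code.

```python
import torch
import torch.nn as nn

class ResidualAdaptiveGCNLayer(nn.Module):
    """Graph convolution whose topology is the sum of a given adjacency and a
    learned matrix, wrapped in a residual connection for easier stacking."""

    def __init__(self, channels, num_joints, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)                        # (V, V) prior topology
        self.B = nn.Parameter(torch.zeros(num_joints, num_joints))  # learned topology
        self.linear = nn.Linear(channels, channels)
        self.relu = nn.ReLU()

    def forward(self, x):                                            # x: (N, V, C)
        agg = torch.einsum("vw,nwc->nvc", self.A + self.B, x)        # message passing over joints
        return self.relu(x + self.linear(agg))                       # residual connection
```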
23. Yang Y, Sun Y, Ju F, Wang S, Gao J, Yin B. Multi-graph Fusion Graph Convolutional Networks with pseudo-label supervision. Neural Netw 2023; 158:305-317. [PMID: 36493533 DOI: 10.1016/j.neunet.2022.11.027]
Abstract
Graph convolutional networks (GCNs) have become a popular tool for learning unstructured graph data due to their powerful learning ability. Many researchers have been interested in fusing topological structures and node features to extract the correlation information for classification tasks. However, it is inadequate to integrate the embedding from topology and feature spaces to gain the most correlated information. At the same time, most GCN-based methods assume that the topology graph or feature graph is compatible with the properties of GCNs, but this is usually not satisfied since meaningless, missing, or even unreal edges are very common in actual graphs. To obtain a more robust and accurate graph structure, we intend to construct an adaptive graph with topology and feature graphs. We propose Multi-graph Fusion Graph Convolutional Networks with pseudo-label supervision (MFGCN), which learn a connected embedding by fusing the multi-graphs and node features. We can obtain the final node embedding for semi-supervised node classification by propagating node features over multi-graphs. Furthermore, to alleviate the problem of labels missing in semi-supervised classification, a pseudo-label generation mechanism is proposed to generate more reliable pseudo-labels based on the similarity of node features. Extensive experiments on six benchmark datasets demonstrate the superiority of MFGCN over state-of-the-art classification methods.
Affiliation(s)
- Yachao Yang
- Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
- Yanfeng Sun
- Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China.
- Fujiao Ju
- Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
- Shaofan Wang
- Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
- Junbin Gao
- Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney, NSW 2006, Australia
- Baocai Yin
- Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
24. Li Z, Gong X, Song R, Duan P, Liu J, Zhang W. SMAM: Self and Mutual Adaptive Matching for Skeleton-Based Few-Shot Action Recognition. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2022; 32:392-402. [PMID: 37015477 DOI: 10.1109/tip.2022.3226410]
Abstract
This paper focuses on skeleton-based few-shot action recognition. Since skeleton is essentially a sparse representation of human action, the feature maps extracted from it, through a standard encoder network in the few-shot condition, may not be sufficiently discriminative for some action sequences that look partially similar to each other. To address this issue, we propose a self and mutual adaptive matching (SMAM) module to convert such feature maps into more discriminative feature vectors. Our method, named as SMAM-Net, first leverages both the temporal information associated with each individual skeleton joint and the spatial relationship among them for feature extraction. Then, the SMAM module adaptively measures the similarity between labeled and query samples and further carries out feature matching within the query set to distinguish similar skeletons of various action categories. Experimental results show that the SMAM-Net outperforms other baselines on the large-scale NTU RGB + D 120 dataset in the tasks of one-shot and five-shot action recognition. We also report our results on smaller datasets including NTU RGB + D 60, SYSU and PKU-MMD to demonstrate that our method is reliable and generalises well on different datasets. Codes and the pretrained SMAM-Net will be made publicly available.
25. Jaramillo IE, Jeong JG, Lopez PR, Lee CH, Kang DY, Ha TJ, Oh JH, Jung H, Lee JH, Lee WH, Kim TS. Real-Time Human Activity Recognition with IMU and Encoder Sensors in Wearable Exoskeleton Robot via Deep Learning Networks. SENSORS (BASEL, SWITZERLAND) 2022; 22:9690. [PMID: 36560059 PMCID: PMC9783602 DOI: 10.3390/s22249690]
Abstract
Wearable exoskeleton robots have become a promising technology for supporting human motions in multiple tasks. Activity recognition in real-time provides useful information to enhance the robot's control assistance for daily tasks. This work implements a real-time activity recognition system based on the activity signals of an inertial measurement unit (IMU) and a pair of rotary encoders integrated into the exoskeleton robot. Five deep learning models have been trained and evaluated for activity recognition. As a result, a subset of optimized deep learning models was transferred to an edge device for real-time evaluation in a continuous action environment using eight common human tasks: stand, bend, crouch, walk, sit-down, sit-up, and ascend and descend stairs. These eight robot wearer's activities are recognized with an average accuracy of 97.35% in real-time tests, with an inference time under 10 ms and an overall latency of 0.506 s per recognition using the selected edge device.
Affiliation(s)
- Ismael Espinoza Jaramillo
- Department of Electronics and Information Convergence Engineering, Kyung Hee University, Yongin 17104, Republic of Korea
- Jin Gyun Jeong
- Department of Electronics and Information Convergence Engineering, Kyung Hee University, Yongin 17104, Republic of Korea
- Do-Yeon Kang
- Hyundai Rotem, Uiwang-si 16082, Republic of Korea
- Tae-Jun Ha
- Hyundai Rotem, Uiwang-si 16082, Republic of Korea
- Ji-Heon Oh
- Department of Electronics and Information Convergence Engineering, Kyung Hee University, Yongin 17104, Republic of Korea
- Hwanseok Jung
- Department of Electronics and Information Convergence Engineering, Kyung Hee University, Yongin 17104, Republic of Korea
- Jin Hyuk Lee
- Department of Electronics and Information Convergence Engineering, Kyung Hee University, Yongin 17104, Republic of Korea
- Won Hee Lee
- Department of Software Convergence, Kyung Hee University, Yongin 17104, Republic of Korea
- Tae-Seong Kim
- Department of Electronics and Information Convergence Engineering, Kyung Hee University, Yongin 17104, Republic of Korea
26. Meng J, Zheng WS, Lai JH, Wang L. Deep Graph Metric Learning for Weakly Supervised Person Re-Identification. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2022; 44:6074-6093. [PMID: 34048336 DOI: 10.1109/tpami.2021.3084613]
Abstract
In conventional person re-identification (re-id), the images used for model training in the training probe set and training gallery set are all assumed to be instance-level samples that are manually labeled from raw surveillance video (likely with the assistance of detection) in a frame-by-frame manner. This labeling across multiple non-overlapping camera views from raw video surveillance is expensive and time consuming. To overcome these issues, we consider a weakly supervised person re-id modeling that aims to find the raw video clips where a given target person appears. In our weakly supervised setting, during training, given a sample of a person captured in one camera view, our weakly supervised approach aims to train a re-id model without further instance-level labeling for this person in another camera view. The weak setting refers to matching a target person with an untrimmed gallery video where we only know that the identity appears in the video without the requirement of annotating the identity in any frame of the video during the training procedure. The weakly supervised person re-id is challenging since it not only suffers from the difficulties occurring in conventional person re-id (e.g., visual ambiguity and appearance variations caused by occlusions, pose variations, background clutter, etc.), but more importantly, is also challenged by weakly supervised information because the instance-level labels and the ground-truth locations for person instances (i.e., the ground-truth bounding boxes of person instances) are absent. To solve the weakly supervised person re-id problem, we develop deep graph metric learning (DGML). On the one hand, DGML measures the consistency between intra-video spatial graphs of consecutive frames, where the spatial graph captures neighborhood relationship about the detected person instances in each frame. On the other hand, DGML distinguishes the inter-video spatial graphs captured from different camera views at different sites simultaneously. To further explicitly embed weak supervision into the DGML and solve the weakly supervised person re-id problem, we introduce weakly supervised regularization (WSR), which utilizes multiple weak video-level labels to learn discriminative features by means of a weak identity loss and a cross-video alignment loss. We conduct extensive experiments to demonstrate the feasibility of the weakly supervised person re-id approach and its special cases (e.g., its bag-to-bag extension) and show that the proposed DGML is effective.
27. Li G, Li N, Chang F, Liu C. Adaptive Graph Convolutional Network With Adversarial Learning for Skeleton-Based Action Prediction. IEEE Trans Cogn Dev Syst 2022. [DOI: 10.1109/tcds.2021.3103960]
Affiliation(s)
- Guangxin Li
- School of Control Science and Engineering, Shandong University, Jinan, China
| | - Nanjun Li
- School of Control Science and Engineering, Shandong University, Jinan, China
| | - Faliang Chang
- School of Control Science and Engineering, Shandong University, Jinan, China
| | - Chunsheng Liu
- School of Control Science and Engineering, Shandong University, Jinan, China
| |
Collapse
|
28
|
Causality extraction model based on two-stage GCN. Soft comput 2022. [DOI: 10.1007/s00500-022-07370-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/15/2022]
|
29
|
Li Y, Zhang Y, Cui W, Lei B, Kuang X, Zhang T. Dual Encoder-Based Dynamic-Channel Graph Convolutional Network With Edge Enhancement for Retinal Vessel Segmentation. IEEE TRANSACTIONS ON MEDICAL IMAGING 2022; 41:1975-1989. [PMID: 35167444 DOI: 10.1109/tmi.2022.3151666] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
Retinal vessel segmentation with deep learning technology is a crucial auxiliary method for clinicians to diagnose fundus diseases. However, deep learning approaches inevitably lose edge information, which carries the spatial features of vessels, during down-sampling, limiting the segmentation performance on fine blood vessels. Furthermore, existing methods ignore the dynamic topological correlations among feature maps in the deep learning framework, resulting in inefficient use of channel information. To address these limitations, we propose a novel dual encoder-based dynamic-channel graph convolutional network with edge enhancement (DE-DCGCN-EE) for retinal vessel segmentation. Specifically, we first design an edge detection-based dual encoder to preserve the edges of vessels during down-sampling. Second, we investigate a dynamic-channel graph convolutional network that maps the image channels to a topological space and synthesizes the features of each channel on the topological map, which addresses the insufficient utilization of channel information. Finally, we design an edge enhancement block that fuses the edge and spatial features from the dual encoder, which is beneficial for improving the accuracy of fine blood vessel segmentation. Competitive experimental results on five retinal image datasets validate the efficacy of the proposed DE-DCGCN-EE, which achieves superior segmentation results compared with other state-of-the-art methods, indicating its potential for clinical application.
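The following minimal PyTorch sketch shows one way feature-map channels can be treated as graph nodes and mixed through a data-dependent adjacency, in the spirit of the dynamic-channel graph convolution described above; the similarity-based adjacency and the residual connection are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicChannelGraphConv(nn.Module):
    """Treats each channel of a CNN feature map as a graph node and mixes
    channels through a data-dependent adjacency (illustrative sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.weight = nn.Linear(channels, channels, bias=False)

    def forward(self, x):
        # x: (B, C, H, W)
        b, c, h, w = x.shape
        desc = x.mean(dim=(2, 3))                                       # (B, C) channel descriptors
        adj = F.softmax(desc.unsqueeze(2) * desc.unsqueeze(1), dim=-1)  # (B, C, C) channel graph
        mixed = torch.bmm(adj, x.view(b, c, h * w))                     # propagate over channels
        mixed = self.weight(mixed.transpose(1, 2)).transpose(1, 2)
        return F.relu(mixed.view(b, c, h, w) + x)                       # residual connection

x = torch.randn(2, 16, 32, 32)
print(DynamicChannelGraphConv(16)(x).shape)   # torch.Size([2, 16, 32, 32])
```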
Collapse
|
30
|
Zhu R. Research on the Evaluation of Moral Education Effectiveness and Student Behavior in Universities under the Environment of Big Data. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:2832661. [PMID: 35942466 PMCID: PMC9356784 DOI: 10.1155/2022/2832661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/01/2022] [Revised: 06/17/2022] [Accepted: 07/06/2022] [Indexed: 11/17/2022]
Abstract
Traditional moral education evaluation relies on manual, subjective assessment by teachers, which introduces subjective errors or biases. For a more objective evaluation, students' classroom behavior can be recognized and the effectiveness of moral education evaluated on that basis. Since student classroom behavior is random and uncertain, accurately evaluating its indicators requires a large amount of classroom behavior data as the basis for analysis, together with techniques that filter out the valuable information from it. In this paper, an improved graph convolutional network algorithm is proposed to study students' behaviors and further improve the accuracy of moral education evaluation in universities. Video-based recognition is used to identify student behavior, thereby helping to improve the quality of moral education evaluation in colleges and universities. First, the multi-stream data related to nodes and skeletons are fused to improve computing speed by reducing the number of network parameters. Second, a spatiotemporal attention module based on non-local operations is constructed to focus on the most action-discriminative nodes and improve recognition accuracy by reducing redundant information. Then, a spatiotemporal feature extraction module is constructed to obtain the spatiotemporal association information of the nodes of interest. Finally, action recognition is performed by a softmax layer. The experimental results show that the proposed action recognition algorithm is more accurate and can better support moral education evaluation.
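As a small, hedged example of how multiple skeleton-derived information streams can be fused at the input to cut network parameters (one reading of the multi-stream fusion mentioned above), consider the sketch below; the joint/motion/bone decomposition and the toy parent indices are illustrative assumptions, not the paper's exact fusion scheme.

```python
import torch

def fuse_skeleton_streams(joints, bone_pairs):
    """Fuse joint, motion and bone information into one multi-channel input,
    avoiding separate per-stream networks (illustrative sketch).
    joints: (T, V, C) coordinates over T frames and V joints."""
    motion = torch.zeros_like(joints)
    motion[1:] = joints[1:] - joints[:-1]              # frame-to-frame displacement
    bones = torch.zeros_like(joints)
    for child, parent in bone_pairs:                   # bone vector = child - parent
        bones[:, child] = joints[:, child] - joints[:, parent]
    return torch.cat([joints, motion, bones], dim=-1)  # (T, V, 3C)

seq = torch.randn(30, 25, 3)                           # 30 frames, 25 joints, xyz
pairs = [(i, max(i - 1, 0)) for i in range(25)]        # toy parent indices
print(fuse_skeleton_streams(seq, pairs).shape)         # torch.Size([30, 25, 9])
```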
Collapse
Affiliation(s)
- Rui Zhu
- Publicity Department, Shandong Management University, Jinan, Shandong 250000, China
| |
Collapse
|
31
|
Liu XL. Dance Movement Recognition Based on Multimodal Environmental Monitoring Data. JOURNAL OF ENVIRONMENTAL AND PUBLIC HEALTH 2022; 2022:1568930. [PMID: 35903182 PMCID: PMC9325569 DOI: 10.1155/2022/1568930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/29/2022] [Revised: 06/13/2022] [Accepted: 06/17/2022] [Indexed: 11/24/2022]
Abstract
Fine-grained motion recognition is a challenging topic in computer vision and has been an active research direction in recent years. This study applies motion recognition technology to dance movements, addressing problems such as their high complexity while fully accounting for self-occlusion of the human body. Existing motion recognition work in the dance domain was reviewed and analyzed. An effective feature extraction method is proposed for dance video datasets, based on video segmentation and an accumulated edge feature operation. Histogram of oriented gradients (HOG) features are extracted, and the resulting feature vectors characterize the shape of the movements in dance videos. A dance movement recognition method is then adopted that fuses the HOG features, histogram of optical flow features, and audio signature features; the three components are combined for dance movement recognition through a multiple-kernel learning method. Experimental results show that the accumulated edge feature algorithm proposed in this study outperforms traditional models when HOG features are extracted from the images. After edge features are added, the shape of the dance movements is described more effectively. The algorithm maintains a reliable recognition rate for complex dance movements. The results also verify the effectiveness of the proposed algorithm for dance movement recognition.
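A hedged sketch of the general recipe, combining HOG, optical-flow, and audio features through a kernel combination for an SVM, is given below; here the kernels are simply averaged rather than learned as in true multiple-kernel learning, and all feature values are random placeholders standing in for real dance-video clips.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy frames and labels stand in for segmented dance-video clips.
rng = np.random.default_rng(0)
frames = rng.random((40, 64, 64))            # 40 grey-level frames
labels = rng.integers(0, 2, size=40)         # two toy action classes

hog_feats = np.stack([hog(f, orientations=9,
                          pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2)) for f in frames])
flow_feats = rng.random((40, 32))            # placeholder optical-flow histogram features
audio_feats = rng.random((40, 16))           # placeholder audio signature features

# Equal-weight kernel combination; true multiple-kernel learning
# would learn these weights instead of fixing them.
K = (rbf_kernel(hog_feats) + rbf_kernel(flow_feats) + rbf_kernel(audio_feats)) / 3.0
clf = SVC(kernel="precomputed").fit(K, labels)
print(clf.score(K, labels))
```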
Collapse
Affiliation(s)
- Xiao Lei Liu
- Music and Dance College of Xinyang Normal University, Xinyang, Henan 464000, China
| |
Collapse
|
32
|
Li M, Chen S, Chen X, Zhang Y, Wang Y, Tian Q. Symbiotic Graph Neural Networks for 3D Skeleton-Based Human Action Recognition and Motion Prediction. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2022; 44:3316-3333. [PMID: 33481706 DOI: 10.1109/tpami.2021.3053765] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
3D skeleton-based action recognition and motion prediction are two essential problems of human activity understanding. In many previous works: 1) they studied two tasks separately, neglecting internal correlations; and 2) they did not capture sufficient relations inside the body. To address these issues, we propose a symbiotic model to handle two tasks jointly; and we propose two scales of graphs to explicitly capture relations among body-joints and body-parts. Together, we propose symbiotic graph neural networks, which contain a backbone, an action-recognition head, and a motion-prediction head. Two heads are trained jointly and enhance each other. For the backbone, we propose multi-branch multiscale graph convolution networks to extract spatial and temporal features. The multiscale graph convolution networks are based on joint-scale and part-scale graphs. The joint-scale graphs contain actional graphs, capturing action-based relations, and structural graphs, capturing physical constraints. The part-scale graphs integrate body-joints to form specific parts, representing high-level relations. Moreover, dual bone-based graphs and networks are proposed to learn complementary features. We conduct extensive experiments for skeleton-based action recognition and motion prediction with four datasets, NTU-RGB+D, Kinetics, Human3.6M, and CMU Mocap. Experiments show that our symbiotic graph neural networks achieve better performances on both tasks compared to the state-of-the-art methods.
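The sketch below illustrates, in much simplified form, the idea of one shared backbone feeding an action-recognition head and a motion-prediction head trained with a joint loss; the GRU backbone is a stand-in for the paper's multi-branch multiscale graph convolutional backbone, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SymbioticHeads(nn.Module):
    """Shared temporal backbone feeding a recognition head and a motion-prediction
    head that are trained jointly (illustrative sketch, not the published model)."""
    def __init__(self, in_dim, hid_dim, num_classes, num_joints):
        super().__init__()
        self.backbone = nn.GRU(in_dim * num_joints, hid_dim, batch_first=True)
        self.recognize = nn.Linear(hid_dim, num_classes)
        self.predict = nn.Linear(hid_dim, in_dim * num_joints)

    def forward(self, x):
        # x: (B, T, V, C) skeleton sequence
        b, t, v, c = x.shape
        h, _ = self.backbone(x.view(b, t, v * c))
        logits = self.recognize(h[:, -1])                  # action class logits
        next_pose = self.predict(h[:, -1]).view(b, v, c)   # one-step motion prediction
        return logits, next_pose

model = SymbioticHeads(in_dim=3, hid_dim=64, num_classes=10, num_joints=25)
x = torch.randn(4, 30, 25, 3)
logits, pose = model(x)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (4,))) \
     + nn.MSELoss()(pose, x[:, -1])                        # joint training objective
print(loss.item())
```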
Collapse
|
33
|
Feng M, Meunier J. Skeleton Graph-Neural-Network-Based Human Action Recognition: A Survey. SENSORS (BASEL, SWITZERLAND) 2022; 22:2091. [PMID: 35336262 PMCID: PMC8952863 DOI: 10.3390/s22062091] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/23/2022] [Revised: 02/21/2022] [Accepted: 02/24/2022] [Indexed: 02/01/2023]
Abstract
Human action recognition has been applied in many fields, such as video surveillance and human-computer interaction, where it helps to improve performance. Numerous reviews of the literature have been conducted, but these reviews have rarely concentrated on skeleton-graph-based approaches. Connecting the skeleton joints according to the body's physical structure naturally generates a graph. This paper provides an up-to-date review for readers on skeleton graph-neural-network-based human action recognition. After analyzing previous related studies, a new taxonomy for skeleton-GNN-based methods is proposed according to their designs, and their merits and demerits are analyzed. In addition, the datasets and codes are discussed. Finally, future research directions are suggested.
Collapse
Affiliation(s)
| | - Jean Meunier
- Department of Computer Science and Operations Research, University of Montreal, Montreal, QC H3C 3J7, Canada;
| |
Collapse
|
34
|
Yadav SK, Tiwari K, Pandey HM, Akbar SA. Skeleton-based human activity recognition using ConvLSTM and guided feature learning. Soft comput 2022. [DOI: 10.1007/s00500-021-06238-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Human activity recognition aims to determine the actions performed by a human in an image or video. Examples of human activity include standing, running, sitting, sleeping, etc. These activities may involve intricate motion patterns and undesired events such as falling. This paper proposes a novel deep convolutional long short-term memory (ConvLSTM) network for skeleton-based activity recognition and fall detection. The proposed ConvLSTM network is a sequential fusion of convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and fully connected layers. The acquisition system applies human detection and pose estimation to pre-calculate skeleton coordinates from the image/video sequence. The ConvLSTM model uses the raw skeleton coordinates along with their characteristic geometrical and kinematic features to construct novel guided features. The geometrical and kinematic features are built upon the raw skeleton coordinates using relative joint position values, differences between joints, spherical joint angles between selected joints, and their angular velocities. The novel spatiotemporal guided features are obtained using a trained multi-layer CNN-LSTM combination. A classification head consisting of fully connected layers is subsequently applied. The proposed model has been evaluated on the KinectHAR dataset, which contains 130,000 samples with 81 attribute values collected with the help of a Kinect (v2) sensor. Experimental results are compared against the performance of isolated CNNs and LSTM networks. The proposed ConvLSTM achieved an accuracy of 98.89%, better than the 93.89% and 92.75% achieved by isolated CNNs and LSTMs, respectively. The proposed system has been tested in real time and found to be independent of pose, camera orientation, individuals, clothing, etc. The code and dataset will be made publicly available.
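A minimal sketch of the overall CNN -> LSTM -> fully connected arrangement over per-frame skeleton feature vectors is shown below; layer sizes and the `SkeletonConvLSTM` name are assumptions for illustration, not the exact published architecture.

```python
import torch
import torch.nn as nn

class SkeletonConvLSTM(nn.Module):
    """Sequential CNN -> LSTM -> fully connected fusion over per-frame skeleton
    feature vectors (illustrative sketch of the general architecture)."""
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(num_features, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU())
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):
        # x: (B, T, F) sequences of per-frame skeleton features
        feats = self.cnn(x.transpose(1, 2)).transpose(1, 2)   # (B, T, 64)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])                            # class logits

model = SkeletonConvLSTM(num_features=81, num_classes=5)      # 81 attributes as in KinectHAR
print(model(torch.randn(2, 40, 81)).shape)                    # torch.Size([2, 5])
```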
Collapse
|
35
|
Relation Selective Graph Convolutional Network for Skeleton-Based Action Recognition. Symmetry (Basel) 2021. [DOI: 10.3390/sym13122275] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Graph convolutional networks (GCNs) have made significant progress in the skeleton-based action recognition task. However, the graphs constructed by these methods are too densely connected, and the same graphs are reused across channels. Redundant connections blur the useful interdependencies of joints, and overly repetitive graphs across channels cannot handle changes in joint relations between different actions. In this work, we propose a novel relation selective graph convolutional network (RS-GCN). We also design a trainable relation selection mechanism that encourages the model to choose reliable edges and build a stable, sparse topology of joints. Channel-wise graph convolution and multiscale temporal convolution are proposed to strengthen the model's representational power. Furthermore, we introduce an asymmetrical module, the spatial-temporal attention module, for more stable context modeling. Combining these changes, our model achieves state-of-the-art performance on three public benchmarks, namely NTU-RGB+D, NTU-RGB+D 120, and Northwestern-UCLA.
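The sketch below illustrates the core idea of a trainable edge gate that sparsifies a skeleton graph before graph convolution; the sigmoid gating and row normalization are illustrative assumptions rather than RS-GCN's exact relation selection mechanism.

```python
import torch
import torch.nn as nn

class RelationSelectiveGCLayer(nn.Module):
    """Graph convolution over skeleton joints with a trainable edge gate that
    can suppress unhelpful connections (illustrative sketch)."""
    def __init__(self, in_dim, out_dim, num_joints):
        super().__init__()
        self.edge_logits = nn.Parameter(torch.zeros(num_joints, num_joints))
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x, skeleton_adj):
        # x: (B, V, C); skeleton_adj: (V, V) physical-bone adjacency
        gate = torch.sigmoid(self.edge_logits)       # learned edge selection in [0, 1]
        adj = skeleton_adj * gate                     # keep only useful edges
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.proj(adj @ x))

layer = RelationSelectiveGCLayer(3, 16, num_joints=25)
x = torch.randn(8, 25, 3)
adj = torch.eye(25)                                   # toy adjacency (self-loops only)
print(layer(x, adj).shape)                            # torch.Size([8, 25, 16])
```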
Collapse
|
36
|
Local2Global: Unsupervised multi-view deep graph representation learning with Nearest Neighbor Constraint. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107439] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
37
|
Li M, Chen S, Zhao Y, Zhang Y, Wang Y, Tian Q. Multiscale Spatio-Temporal Graph Neural Networks for 3D Skeleton-Based Motion Prediction. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:7760-7775. [PMID: 34506281 DOI: 10.1109/tip.2021.3108708] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
We propose a multiscale spatio-temporal graph neural network (MST-GNN) to predict future 3D skeleton-based human poses in an action-category-agnostic manner. The core of MST-GNN is a multiscale spatio-temporal graph that explicitly models the relations in motions at various spatial and temporal scales. Different from many previous hierarchical structures, our multiscale spatio-temporal graph is built in a data-adaptive fashion, which captures nonphysical, yet motion-based relations. The key module of MST-GNN is a multiscale spatio-temporal graph computational unit (MST-GCU) based on the trainable graph structure. MST-GCU embeds underlying features at individual scales and then fuses features across scales to obtain a comprehensive representation. The overall architecture of MST-GNN follows an encoder-decoder framework, where the encoder consists of a sequence of MST-GCUs to learn the spatial and temporal features of motions, and the decoder uses a graph-based attention gated recurrent unit (GA-GRU) to generate future poses. Extensive experiments show that the proposed MST-GNN outperforms state-of-the-art methods in both short-term and long-term motion prediction on the Human 3.6M, CMU Mocap and 3DPW datasets: MST-GNN improves on previous works by 5.33% and 3.67% in mean angle error on average for short-term and long-term prediction on Human 3.6M, by 11.84% and 4.71% for short-term and long-term prediction on CMU Mocap, and by 1.13% on average on 3DPW. We further investigate the learned multiscale graphs for interpretability.
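As a hedged illustration of the part-scale graphs used alongside joint-scale graphs in multiscale models of this kind, the sketch below pools joint features into body-part features with an averaging assignment matrix; the five-part split is an arbitrary toy choice, not the paper's actual part definition.

```python
import torch

def joints_to_parts(joint_feats, part_assignment, num_parts):
    """Average joint-scale features into part-scale features, giving the coarser
    graph scale used alongside the joint scale (illustrative sketch).
    joint_feats: (B, V, C); part_assignment: length-V list of part indices."""
    b, v, c = joint_feats.shape
    pooling = torch.zeros(num_parts, v)
    for joint, part in enumerate(part_assignment):
        pooling[part, joint] = 1.0
    pooling = pooling / pooling.sum(dim=1, keepdim=True).clamp(min=1)
    return pooling @ joint_feats                      # (B, num_parts, C)

feats = torch.randn(2, 25, 64)
assignment = [0] * 5 + [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5   # toy 5-part split
print(joints_to_parts(feats, assignment, num_parts=5).shape)   # torch.Size([2, 5, 64])
```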
Collapse
|
38
|
Chen X, Pang A, Yang W, Ma Y, Xu L, Yu J. SportsCap: Monocular 3D Human Motion Capture and Fine-Grained Understanding in Challenging Sports Videos. Int J Comput Vis 2021. [DOI: 10.1007/s11263-021-01486-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
39
|
Liu W, Gong M, Tang Z, Qin AK, Sheng K, Xu M. Locality preserving dense graph convolutional networks with graph context-aware node representations. Neural Netw 2021; 143:108-120. [PMID: 34116289 DOI: 10.1016/j.neunet.2021.05.031] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2020] [Revised: 03/18/2021] [Accepted: 05/28/2021] [Indexed: 11/18/2022]
Abstract
Graph convolutional networks (GCNs) have been widely used for representation learning on graph data, capturing structural patterns on a graph via specifically designed convolution and readout operations. In many graph classification applications, GCN-based approaches have outperformed traditional methods. However, most existing GCNs are ineffective at preserving the local information of graphs, a limitation that is especially problematic for graph classification. In this work, we propose a locality-preserving dense GCN with graph context-aware node representations. Specifically, our proposed model incorporates a local node feature reconstruction module to preserve the initial node features in the node representations, realized via a simple but effective encoder-decoder mechanism. To capture local structural patterns in neighborhoods representing different ranges of locality, dense connectivity is introduced to connect each convolutional layer and its corresponding readout with all previous convolutional layers. To enhance node representativeness, the output of each convolutional layer is concatenated with the output of the previous layer's readout to form a global context-aware node representation. In addition, a self-attention module is introduced to aggregate the layer-wise representations into the final graph-level representation. Experiments on benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art methods in terms of classification accuracy.
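A minimal sketch of the locality-preserving idea, a graph convolution paired with a decoder that reconstructs the initial node features as an auxiliary loss, is given below; the single-layer design is an illustrative simplification of the paper's dense, multi-readout architecture.

```python
import torch
import torch.nn as nn

class LocalityPreservingGCN(nn.Module):
    """One GCN layer plus a decoder that reconstructs the initial node features,
    so local information is retained in the embeddings (illustrative sketch)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.encode = nn.Linear(in_dim, hid_dim)
        self.decode = nn.Linear(hid_dim, in_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) normalized adjacency
        h = torch.relu(self.encode(adj @ x))                       # graph convolution
        recon_loss = nn.functional.mse_loss(self.decode(h), x)    # feature reconstruction
        return h, recon_loss                                       # add recon_loss to the task loss

x = torch.randn(10, 8)
adj = torch.eye(10)                                                # toy adjacency
h, aux = LocalityPreservingGCN(8, 32)(x, adj)
print(h.shape, aux.item())
```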
Collapse
Affiliation(s)
- Wenfeng Liu
- School of Electronic Engineering, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi'an, Shaanxi Province 710071, China
| | - Maoguo Gong
- School of Electronic Engineering, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi'an, Shaanxi Province 710071, China.
| | - Zedong Tang
- Academy of Advanced Interdisciplinary Research, Xidian University, Xi'an, Shaanxi Province 710071, China
| | - A K Qin
- Department of Computer Science and Software Engineering, Swinburne University of Technology, Melbourne, VIC 3122, Australia
| | - Kai Sheng
- School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
| | - Mingliang Xu
- School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
| |
Collapse
|
40
|
Ishida S, Miyazaki T, Sugaya Y, Omachi S. Graph Neural Networks with Multiple Feature Extraction Paths for Chemical Property Estimation. Molecules 2021; 26:molecules26113125. [PMID: 34073745 PMCID: PMC8197261 DOI: 10.3390/molecules26113125] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2021] [Revised: 05/14/2021] [Accepted: 05/21/2021] [Indexed: 11/16/2022] Open
Abstract
Feature extraction is essential for the chemical property estimation of molecules using machine learning. Recently, graph neural networks have attracted attention for feature extraction from molecules. However, existing methods focus only on specific structural information, such as node relationships. In this paper, we propose a novel graph convolutional neural network that performs feature extraction while simultaneously considering multiple structures. Specifically, we propose feature extraction paths specialized for node, edge, and three-dimensional structures. Moreover, we propose an attention mechanism to aggregate the features extracted by these paths. The attention aggregation enables useful features to be selected dynamically. The experimental results show that the proposed method outperformed previous methods.
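The sketch below shows one simple form of attention aggregation over embeddings produced by several feature extraction paths; the scoring function and dimensions are illustrative assumptions, not the authors' exact mechanism.

```python
import torch
import torch.nn as nn

class PathAttentionAggregator(nn.Module):
    """Attention-weighted aggregation of embeddings coming from several feature
    extraction paths (e.g., node-, edge- and 3-D-focused); illustrative sketch."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, path_embeddings):
        # path_embeddings: (B, num_paths, dim)
        weights = torch.softmax(self.score(path_embeddings), dim=1)  # (B, P, 1)
        return (weights * path_embeddings).sum(dim=1)                # (B, dim)

paths = torch.randn(4, 3, 128)   # toy node, edge and 3-D path embeddings for a batch
print(PathAttentionAggregator(128)(paths).shape)   # torch.Size([4, 128])
```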
Collapse
|
41
|
Sun Y, Huang H, Yun X, Yang B, Dong K. Triplet attention multiple spacetime-semantic graph convolutional network for skeleton-based action recognition. APPL INTELL 2021. [DOI: 10.1007/s10489-021-02370-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
|
42
|
Nan M, Trăscău M, Florea AM, Iacob CC. Comparison between Recurrent Networks and Temporal Convolutional Networks Approaches for Skeleton-Based Action Recognition. SENSORS 2021; 21:s21062051. [PMID: 33803929 PMCID: PMC8001872 DOI: 10.3390/s21062051] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/31/2021] [Revised: 03/10/2021] [Accepted: 03/12/2021] [Indexed: 11/16/2022]
Abstract
Action recognition plays an important role in various applications such as video monitoring, automatic video indexing, crowd analysis, human-machine interaction, smart homes and personal assistive robotics. In this paper, we propose improvements to some methods for human action recognition from videos that work with data represented in the form of skeleton poses. These methods are based on the most widely used techniques for this problem—Graph Convolutional Networks (GCNs), Temporal Convolutional Networks (TCNs) and Recurrent Neural Networks (RNNs). Initially, the paper explores and compares different ways to extract the most relevant spatial and temporal characteristics for a sequence of frames describing an action. Based on this comparative analysis, we show how a TCN type unit can be extended to work even on the characteristics extracted from the spatial domain. To validate our approach, we test it against a benchmark often used for human action recognition problems and we show that our solution obtains comparable results to the state-of-the-art, but with a significant increase in the inference speed.
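For readers unfamiliar with TCN-style units, the following sketch shows a basic dilated temporal convolution block with a residual connection, the kind of building block such comparisons rest on; it is a generic illustration, not the specific extended unit proposed in the paper.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Dilated 1-D convolution with a residual connection, the basic building
    block of a temporal convolutional network (illustrative sketch)."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (B, C, T) per-frame skeleton features laid out along time
        return self.relu(self.conv(x) + x)

x = torch.randn(2, 64, 50)                     # batch of 50-frame feature sequences
block = TemporalBlock(64, dilation=2)
print(block(x).shape)                          # torch.Size([2, 64, 50])
```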
Collapse
|
43
|
Wang J, Liang K, Zhang N, Yao H, Ho TY, Sun L. Automated calibration of 3D-printed microfluidic devices based on computer vision. BIOMICROFLUIDICS 2021; 15:024102. [PMID: 33732409 PMCID: PMC7952140 DOI: 10.1063/5.0037274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/12/2020] [Accepted: 02/28/2021] [Indexed: 05/02/2023]
Abstract
With the development of 3D printing techniques, their application in microfluidic/Lab-on-a-Chip (LoC) fabrication is becoming increasingly attractive. However, to achieve a satisfying printing quality of the target devices, researchers usually need a considerable amount of calibration work, even with high-end 3D printers. To increase the calibration efficiency of average-priced printers and promote the application of 3D printing technology in the microfluidic community, this work presents a computer vision (CV)-based method for rapid and precise 3D printing calibration, with examples on cylindrical hole/post diameters of 0.2-2.4 mm and rectangular hole/post widths of 0.2-1.0 mm printed by a stereolithography-based 3D printer. Our method is fully automated; it contains five steps and only needs an ordinary camera to provide photos for convolutional neural network recognition. The experimental results showed that our CV-based method can provide calibrated dimensions that meet the user's requirements with just one print of the specific calibration ruler. Higher photo resolution yields higher calibration precision. Only one more print of the target device is then needed after the calibration process. Overall, this work provides a quick and precise calibration tool for researchers to apply 3D printing in the fabrication of their microfluidic/LoC devices with average-priced printers. In addition, with our open-source calibration software and calibration ruler design file, researchers can modify the settings to their needs and perform calibration on any type of 3D printer.
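A hedged sketch of the basic measurement step, estimating a printed hole diameter from a photo so that the deviation from the designed dimension can be compensated in the next print, is shown below using plain OpenCV contour analysis; the paper's actual pipeline uses CNN-based recognition and a dedicated calibration ruler, and the function name, file path, and pixel scale here are assumptions.

```python
import cv2

def measure_hole_diameter(image_path, pixels_per_mm):
    """Estimate the diameter of a printed circular hole from a photo, so the
    printed dimension can be compared with the designed one (illustrative sketch;
    the published pipeline relies on CNN-based recognition instead)."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)          # assume the hole dominates the photo
    (_, _), radius_px = cv2.minEnclosingCircle(largest)
    return 2.0 * radius_px / pixels_per_mm                # diameter in millimetres

# Example: a hole designed at 1.0 mm that measures 0.93 mm suggests the next
# print should be compensated by roughly +0.07 mm.
# print(measure_hole_diameter("calibration_photo.png", pixels_per_mm=120.0))
```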
Collapse
Affiliation(s)
- Junchao Wang
- Key Laboratory of RF Circuits and Systems, Ministry of Education, Hangzhou Dianzi University, Hangzhou 310018, China
- Author to whom correspondence should be addressed:
| | - Kaicong Liang
- Key Laboratory of RF Circuits and Systems, Ministry of Education, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Naiyin Zhang
- School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Hailong Yao
- Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
| | - Tsung-Yi Ho
- Department of Computer Science, National Tsing Hua University, Hsinchu 30071, Taiwan
| | - Lingling Sun
- Key Laboratory of RF Circuits and Systems, Ministry of Education, Hangzhou Dianzi University, Hangzhou 310018, China
| |
Collapse
|
44
|
Predicting Intentions of Pedestrians from 2D Skeletal Pose Sequences with a Representation-Focused Multi-Branch Deep Learning Network. ALGORITHMS 2020. [DOI: 10.3390/a13120331] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Understanding the behaviors and intentions of humans is still one of the main challenges for vehicle autonomy. More specifically, inferring the intentions and actions of vulnerable actors, namely pedestrians, in complex situations such as urban traffic scenes remains a difficult task and a blocking point towards more automated vehicles. Answering the question “Is the pedestrian going to cross?” is a good starting point in order to advance in the quest to the fifth level of autonomous driving. In this paper, we address the problem of real-time discrete intention prediction of pedestrians in urban traffic environments by linking the dynamics of a pedestrian’s skeleton to an intention. Hence, we propose SPI-Net (Skeleton-based Pedestrian Intention network): a representation-focused multi-branch network combining features from 2D pedestrian body poses for the prediction of pedestrians’ discrete intentions. Experimental results show that SPI-Net achieved 94.4% accuracy in pedestrian crossing prediction on the JAAD data set while being efficient for real-time scenarios since SPI-Net can reach around one inference every 0.25 ms on one GPU (i.e., RTX 2080ti), or every 0.67 ms on one CPU (i.e., Intel Core i7 8700K).
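The sketch below gives a much-simplified two-branch network over 2D pose sequences (raw joints plus frame-to-frame displacements) ending in a binary crossing-intention head; branch designs, sizes, and the `SPIStyleNet` name are illustrative assumptions, not SPI-Net's actual layers.

```python
import torch
import torch.nn as nn

class SPIStyleNet(nn.Module):
    """Two-branch network over 2-D pose sequences: one branch sees raw joint
    coordinates, the other sees frame-to-frame joint displacements, and the
    fused features predict a crossing / not-crossing intention (sketch only)."""
    def __init__(self, num_joints, seq_len):
        super().__init__()
        in_dim = num_joints * 2 * seq_len
        self.pose_branch = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.motion_branch = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.head = nn.Linear(256, 2)

    def forward(self, poses):
        # poses: (B, T, V, 2) sequences of 2-D skeleton keypoints
        motion = torch.zeros_like(poses)
        motion[:, 1:] = poses[:, 1:] - poses[:, :-1]
        b = poses.shape[0]
        fused = torch.cat([self.pose_branch(poses.reshape(b, -1)),
                           self.motion_branch(motion.reshape(b, -1))], dim=1)
        return self.head(fused)                        # crossing-intention logits

model = SPIStyleNet(num_joints=17, seq_len=16)
print(model(torch.randn(4, 16, 17, 2)).shape)          # torch.Size([4, 2])
```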
Collapse
|
45
|
Abstract
Facial emotion recognition (FER) has been an active research topic in the past several years. One of the difficulties in FER is effectively capturing geometric and temporal information from landmarks. In this paper, we propose a graph convolutional neural network that utilizes landmark features for FER, which we call a directed graph neural network (DGNN). Nodes in the graph structure are defined by landmarks, and edges in the directed graph are built by the Delaunay method. By using graph neural networks, we can capture emotional information through the face's inherent properties, such as geometric and temporal information. Also, to prevent the vanishing gradient problem, we further utilize a stable form of a temporal block in the graph framework. Our experimental results prove the effectiveness of the proposed method on datasets such as CK+ (96.02%), MMI (69.4%), and AFEW (32.64%). Also, a fusion network using image information as well as landmarks is presented and investigated for the CK+ (98.47% performance) and AFEW (50.65% performance) datasets.
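As a small illustration of building graph edges from facial landmarks with the Delaunay method mentioned above, the sketch below triangulates a toy landmark set with SciPy; the edges are returned without direction, and assigning directions or further processing would follow the paper's design.

```python
import numpy as np
from scipy.spatial import Delaunay

def landmark_edges(landmarks):
    """Build graph edges between facial landmarks via Delaunay triangulation,
    one way to turn a set of landmarks into a graph (illustrative sketch)."""
    tri = Delaunay(landmarks)                      # landmarks: (N, 2) array of (x, y)
    edges = set()
    for simplex in tri.simplices:                  # each triangle contributes 3 edges
        for i in range(3):
            a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
            edges.add((int(a), int(b)))
    return sorted(edges)

points = np.random.default_rng(0).random((68, 2))  # 68 toy facial landmark positions
print(len(landmark_edges(points)))
```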
Collapse
|