1. Wang Q, Hu Q, Gao Z, Li P, Hu Q. AMS-Net: Modeling Adaptive Multi-Granularity Spatio-Temporal Cues for Video Action Recognition. IEEE Transactions on Neural Networks and Learning Systems 2024;35:18731-18745. [PMID: 37824318] [DOI: 10.1109/tnnls.2023.3321141]
Abstract
Effective spatio-temporal modeling, as a core of video representation learning, is challenged by complex scale variations of spatio-temporal cues in videos, especially the different visual tempos of actions and the varying spatial sizes of moving objects. Most existing works handle these variations with input-level or feature-level pyramid mechanisms, which, however, rely on expensive multistream architectures or explore multiscale spatio-temporal features in a fixed manner. To capture the complex scale dynamics of spatio-temporal cues both effectively and efficiently, this article proposes a single-stream, single-input architecture, the adaptive multi-granularity spatio-temporal network (AMS-Net), which models adaptive multi-granularity spatio-temporal cues for video action recognition. To this end, AMS-Net introduces two core components: a competitive progressive temporal modeling (CPTM) block and a collaborative spatio-temporal pyramid (CSTP) module. They respectively capture fine-grained temporal cues and fuse coarse-level spatio-temporal features in an adaptive manner, allowing AMS-Net to handle subtle variations in visual tempos and spatio-temporal dynamics of varied size within a unified architecture. Notably, AMS-Net can be flexibly instantiated from existing deep convolutional neural networks (CNNs) using the proposed CPTM block and CSTP module. Experiments on eight video benchmarks show that AMS-Net establishes state-of-the-art (SOTA) performance on fine-grained action recognition (i.e., Diving48 and FineGym) while performing very competitively on the widely used Something-Something and Kinetics.
2. Duan W, Xuan J, Qiao M, Lu J. Graph Convolutional Neural Networks With Diverse Negative Samples via Decomposed Determinant Point Processes. IEEE Transactions on Neural Networks and Learning Systems 2024;35:18160-18171. [PMID: 37725742] [DOI: 10.1109/tnnls.2023.3312307]
Abstract
Graph convolutional neural networks (GCNs) have achieved great success in graph representation learning by extracting high-level features from nodes and their topology. Since GCNs generally follow a message-passing mechanism, each node aggregates information from its first-order neighbors to update its representation. Consequently, the representations of nodes connected by edges should be positively correlated and can be considered positive samples. However, the many non-neighbor nodes in the whole graph also provide diverse and useful information for the representation update: two non-adjacent nodes usually have different representations and can be seen as negative samples. Besides the node representations, the structural information of the graph is also crucial for learning. In this article, we use quality-diversity decomposition in determinantal point processes (DPPs) to obtain diverse negative samples. When defining a distribution over diverse subsets of all non-neighboring nodes, we incorporate both graph structure information and node representations. Since DPP sampling requires matrix eigenvalue decomposition, we propose a new shortest-path-based method to improve computational efficiency. Finally, we incorporate the obtained negative samples into the graph convolution operation. The ideas are evaluated empirically on node classification tasks, where the proposed methods not only improve the overall performance of standard representation learning but also significantly alleviate over-smoothing.
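The quality-diversity view makes the sampling step easy to prototype: with an L-ensemble kernel L[i,j] = q_i <phi_i, phi_j> q_j, a greedy MAP pass already yields a diverse negative set. The NumPy sketch below uses naive log-determinant scoring rather than the paper's shortest-path speed-up, and all function and variable names are illustrative.

```python
import numpy as np

def greedy_dpp_negatives(features, quality, k):
    """Greedily select k diverse negatives under an L-ensemble DPP.

    features: (n, d) candidate (non-neighbor) node representations.
    quality:  (n,) per-node quality scores (the "q" in quality-diversity).
    Returns indices whose kernel submatrix has (greedily) maximal determinant.
    """
    phi = features / np.linalg.norm(features, axis=1, keepdims=True)
    L = quality[:, None] * (phi @ phi.T) * quality[None, :]
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_score:   # skip degenerate subsets
                best, best_score = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected
```

The selected negatives would then enter the modified graph convolution as a repulsive term alongside the usual neighbor aggregation.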
3. Dong M, Xu C. Skeleton-Based Human Motion Prediction With Privileged Supervision. IEEE Transactions on Neural Networks and Learning Systems 2023;34:10419-10432. [PMID: 35446772] [DOI: 10.1109/tnnls.2022.3166861]
Abstract
Existing supervised methods have achieved impressive performance in forecasting skeleton-based human motion. However, they often rely on action class labels in both the training and inference phases. In practice, requesting action class labels at inference time is a burden, and even for training, the collected labels can be incomplete for sequences containing a mixture of multiple actions. In this article, we treat action class labels as privileged supervision that exists only in the training phase. We design a new architecture that includes motion classification as an auxiliary task alongside motion prediction. To deal with potentially missing labels of motion sequences, we propose a new classification loss function that exploits their relationships with the observed labels, and a perceptual loss that measures the difference between the ground-truth and generated sequences in the classification task. Experimental results on the challenging Human3.6M and Carnegie Mellon University (CMU) datasets demonstrate the effectiveness of the proposed algorithm in exploiting action class labels for improved modeling of human dynamics.
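At its simplest, the privileged-supervision idea reduces to a prediction loss plus an auxiliary classification term that silently drops sequences whose labels were never collected. The PyTorch sketch below is a minimal rendering of that pattern, not the paper's exact losses; names such as `label_mask` and the weight `alpha` are assumptions.

```python
import torch
import torch.nn.functional as F

def privileged_loss(pred_seq, gt_seq, class_logits, labels, label_mask, alpha=0.1):
    """Motion-prediction loss plus a masked auxiliary classification loss.

    labels: (B,) action indices, with any placeholder value where missing;
    label_mask: (B,) 1.0 where a label was collected, 0.0 where it is missing,
    so unlabeled sequences contribute only to the prediction term.
    """
    pred_loss = F.mse_loss(pred_seq, gt_seq)                      # primary task
    ce = F.cross_entropy(class_logits, labels, reduction="none")  # per-sequence
    cls_loss = (ce * label_mask).sum() / label_mask.sum().clamp(min=1.0)
    return pred_loss + alpha * cls_loss   # privileged term used only in training
```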
4. Zhou S, Xu H, Bai Z, Du Z, Zeng J, Wang Y, Wang Y, Li S, Wang M, Li Y, Li J, Xu J. A multidimensional feature fusion network based on MGSE and TAAC for video-based human action recognition. Neural Netw 2023;168:496-507. [PMID: 37827068] [DOI: 10.1016/j.neunet.2023.09.031]
Abstract
With the maturity of intelligent technologies such as human-computer interaction, human action recognition (HAR) has been widely used in virtual reality, video surveillance, and other fields. However, current video-based HAR methods still cannot fully extract abstract action features, and action collection and recognition for special groups such as prisoners and elderly people living alone remain lacking. To solve these problems, this paper proposes a multidimensional feature fusion network, called P-MTSC3D, a parallel network based on context modeling and a temporal adaptive attention module. It consists of three branches: the first serves as the basic network branch, extracting basic feature information; the second consists of a feature pre-extraction layer and two multiscale-convolution-based global context modeling combined squeeze-and-excitation (MGSE) modules, which extract spatial and channel features; and the third consists of two temporal adaptive attention units based on convolution (TAAC) that extract temporal features. To verify the validity of the proposed network, experiments are conducted on the University of Central Florida (UCF) 101 dataset and the human motion database (HMDB) 51 dataset. The recognition accuracy of P-MTSC3D is 97.92% on UCF101 and 75.59% on HMDB51. The network requires 30.85G FLOPs, with a test time of 2.83 s per 16 samples on UCF101. The experimental results demonstrate that P-MTSC3D has better overall performance than state-of-the-art networks. In addition, a prison action (PA) dataset is constructed to verify the network's applicability in real scenarios.
Affiliation(s)
All authors: School of Information Science and Engineering, Shandong University, 72 Binhai Road, Qingdao 266237, Shandong, China
5. Yao Y, Yu B, Gong C, Liu T. Understanding How Pretraining Regularizes Deep Learning Algorithms. IEEE Transactions on Neural Networks and Learning Systems 2023;34:5828-5840. [PMID: 34890343] [DOI: 10.1109/tnnls.2021.3131377]
Abstract
Deep learning algorithms have led to a series of breakthroughs in computer vision, acoustic signal processing, and other areas. However, they were popularized only recently, thanks to groundbreaking techniques developed for training deep architectures; understanding these training techniques is important if we want to improve them further. Through extensive experimentation, Erhan et al. (2010) empirically showed that unsupervised pretraining has a regularizing effect on deep learning algorithms, but theoretical justification for this observation has remained elusive. In this article, we provide theoretical support by analyzing how unsupervised pretraining regularizes deep learning algorithms. Specifically, we interpret deep learning algorithms as traditional Tikhonov-regularized batch learning algorithms that simultaneously learn predictors in the input feature space and the neural network parameters that produce the Tikhonov matrices. We prove that unsupervised pretraining helps in learning meaningful Tikhonov matrices, which make the deep learning algorithms uniformly stable, so that the learned predictor generalizes quickly with respect to the sample size. Unsupervised pretraining can therefore be interpreted as a form of regularization.
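For reference, a generic Tikhonov-regularized objective of the kind the article builds on can be written as follows; the notation is mine (a sketch of the standard form, with Γ(Θ) denoting the Tikhonov matrix produced by network parameters Θ), not the paper's exact formulation.

```latex
\min_{w,\;\Theta}\;\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i,\, \langle w,\, x_i\rangle\big)
\;+\; \lambda\,\big\| \Gamma(\Theta)\, w \big\|_2^2
```

Classical ridge regression is the special case Γ = I; the article's reading of pretraining is that it steers Γ(Θ) toward matrices that make this regularizer meaningful.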
6. Li J, Wei S, Dai W. Combination of Manifold Learning and Deep Learning Algorithms for Mid-Term Electrical Load Forecasting. IEEE Transactions on Neural Networks and Learning Systems 2023;34:2584-2593. [PMID: 34478386] [DOI: 10.1109/tnnls.2021.3106968]
Abstract
Mid-term load forecasting (MTLF) is of great significance for power system planning, operation, and power trading. However, the mid-term electrical load is affected by the coupling of multiple factors and exhibits complex characteristics, which leads to low prediction accuracy in MTLF. MTLF also faces the "curse of dimensionality" due to the large number of variables. This article proposes an MTLF method based on manifold learning, which extracts the underlying factors of load variations to improve the accuracy of MTLF while significantly reducing computation. Unlike linear dimensionality reduction methods, manifold learning has better nonlinear feature extraction ability and is more suitable for load data with nonlinear characteristics. Long short-term memory (LSTM) neural networks are then used to build forecasting models in the low-dimensional space obtained by manifold learning. The proposed method is tested on Independent System Operator (ISO) New England datasets, forecasting loads 24, 168, and 720 h ahead. The numerical results show that the proposed method attains higher prediction accuracy than many mature methods on the mid-term time scale.
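The pipeline is simple to prototype: embed the load history with a manifold learner, then train a sequence model in the embedded space. The sketch below uses scikit-learn's Isomap as a stand-in for whichever manifold method the paper adopts, with made-up data shapes; mapping forecasts back to load space would need a separate inverse-regression step.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.manifold import Isomap

# Reduce daily load profiles (n_days x 24) to a low-dimensional manifold,
# then forecast in that space with an LSTM. All sizes are illustrative.
loads = np.random.rand(365, 24)                 # placeholder for ISO-NE load data
embedded = Isomap(n_components=3).fit_transform(loads)   # (365, 3)

class LoadLSTM(nn.Module):
    def __init__(self, dim=3, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, dim)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])            # next-step embedding

model = LoadLSTM()
window = torch.tensor(embedded[:30], dtype=torch.float32).unsqueeze(0)
next_embedding = model(window)                  # forecast in the manifold space
```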
7. Zheng Z, An G, Cao S, Wu D, Ruan Q. Collaborative and Multilevel Feature Selection Network for Action Recognition. IEEE Transactions on Neural Networks and Learning Systems 2023;34:1304-1318. [PMID: 34424850] [DOI: 10.1109/tnnls.2021.3105184]
Abstract
The feature pyramid has been widely used in many visual tasks, such as fine-grained image classification, instance segmentation, and object detection, achieving promising performance. Although many algorithms exploit different-level features to construct the feature pyramid, they usually treat the levels equally and do not investigate in depth the inherent complementary advantages of different-level features. In this article, to learn a pyramid feature with robust representational ability for action recognition, we propose a novel collaborative and multilevel feature selection network (FSNet) that applies feature selection and aggregation to multilevel features according to action context. Unlike previous works that learn patterns of frame appearance by enhancing spatial encoding, the proposed network consists of a position selection module and a channel selection module that adaptively aggregate multilevel features into a new informative feature along both the position and channel dimensions. The position selection module integrates the vectors at the same spatial location across multilevel features with positionwise attention. Similarly, the channel selection module selectively aggregates the channel maps at the same channel location across multilevel features with channelwise attention. Positionwise features with different receptive fields and channelwise features with different pattern-specific responses are emphasized according to their correlations to actions and fused into a new informative feature for action recognition. FSNet can be inserted flexibly into different backbone networks, and extensive experiments are conducted on three benchmark action datasets: Kinetics, UCF101, and HMDB51. The results show that FSNet is practical and can be trained collaboratively to boost the representational ability of existing networks, achieving superior performance against most top-tier models on Kinetics and all compared models on UCF101 and HMDB51.
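The position selection idea can be sketched compactly: per spatial location, softmax attention across pyramid levels decides each level's contribution. The PyTorch module below is one reading of that mechanism, with the attention computation reduced to a shared 1x1 convolution; the paper's exact scoring and normalization may differ, and the channel selection module would follow the same pattern along the channel axis.

```python
import torch
import torch.nn as nn

class PositionSelection(nn.Module):
    """Fuse multilevel feature maps with position-wise attention (sketch)."""

    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # per-level score map

    def forward(self, feats):                   # list of (B, C, H, W), same shape
        scores = torch.stack([self.score(f) for f in feats], dim=1)  # (B, L, 1, H, W)
        attn = torch.softmax(scores, dim=1)     # weights across pyramid levels
        stacked = torch.stack(feats, dim=1)     # (B, L, C, H, W)
        return (attn * stacked).sum(dim=1)      # (B, C, H, W) fused feature
```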
8. Lin M, Ji R, Li S, Wang Y, Wu Y, Huang F, Ye Q. Network Pruning Using Adaptive Exemplar Filters. IEEE Transactions on Neural Networks and Learning Systems 2022;33:7357-7366. [PMID: 34101606] [DOI: 10.1109/tnnls.2021.3084856]
Abstract
Popular network pruning algorithms reduce redundant information by optimizing hand-crafted pruning models, which may cause suboptimal performance and long filter-selection times. We introduce adaptive exemplar filters to simplify the algorithm design, resulting in an automatic and efficient pruning approach called EPruner. Inspired by the face recognition community, we apply the message-passing algorithm Affinity Propagation to the weight matrices to obtain an adaptive number of exemplars, which then act as the preserved filters. EPruner breaks the dependence on training data when determining the "important" filters and runs on a CPU in seconds, an order of magnitude faster than GPU-based state-of-the-art methods. Moreover, we show that the weights of the exemplars provide a better initialization for fine-tuning. On VGGNet-16, EPruner achieves a 76.34% FLOPs reduction by removing 88.80% of parameters, with a 0.06% accuracy improvement on CIFAR-10. On ResNet-152, EPruner achieves a 65.12% FLOPs reduction by removing 64.18% of parameters, with only a 0.71% top-5 accuracy loss on ILSVRC-2012. Our code is available at https://github.com/lmbxmu/EPruner.
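Because Affinity Propagation is available off the shelf, the core selection step is a few lines: cluster the flattened filter weights and keep the exemplars. The sketch below mirrors the described procedure with scikit-learn defaults; the preference and damping settings the authors use (which control how many exemplars survive) are not reproduced here.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def exemplar_filter_indices(conv_weight):
    """Return indices of filters kept as Affinity Propagation exemplars.

    conv_weight: ndarray of shape (out_channels, in_channels, kh, kw);
    each filter is flattened into one row before message passing.
    """
    rows = conv_weight.reshape(conv_weight.shape[0], -1)
    ap = AffinityPropagation(random_state=0).fit(rows)
    return ap.cluster_centers_indices_   # adaptive number of preserved filters

# keep = exemplar_filter_indices(conv.weight.detach().cpu().numpy())
# prune all other filters in this layer, then fine-tune from the exemplars
```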
9. Yan R, Shu X, Yuan C, Tian Q, Tang J. Position-Aware Participation-Contributed Temporal Dynamic Model for Group Activity Recognition. IEEE Transactions on Neural Networks and Learning Systems 2022;33:7574-7588. [PMID: 34138718] [DOI: 10.1109/tnnls.2021.3085567]
Abstract
Group activity recognition (GAR), which aims to understand the behavior of a group of people in a video clip, has received increasing attention recently. Nevertheless, most existing solutions ignore that not all persons contribute equally to the group activity of the scene: the contributions of different individual behaviors differ, as do the contributions of people at different spatial positions. To this end, we propose a novel position-aware participation-contributed temporal dynamic model (P2CTDM), in which two types of key actors are constructed and learned. Specifically, we focus on the behaviors of key actors who either maintain steady motions (long moving time, called long motions) or display remarkable motions closely related to other people and the group activity (called flash motions) at a certain moment. To capture long motions, we rank individual motions according to their intensity measured by stacked optical flow. To capture flash motions closely related to other people, we design a position-aware interaction module (PIM) that simultaneously considers feature similarity and position information. Beyond that, to capture flash motions highly related to the group activity, we present an aggregation long short-term memory (Agg-LSTM) that fuses the outputs of PIM via time-varying trainable attention factors. Four widely used benchmarks are adopted to evaluate the performance of P2CTDM against the state of the art.
10. Hao Y, Wang S, Tan Y, He X, Liu Z, Wang M. Spatio-Temporal Collaborative Module for Efficient Action Recognition. IEEE Transactions on Image Processing 2022;31:7279-7291. [PMID: 36378789] [DOI: 10.1109/tip.2022.3221292]
Abstract
Efficient action recognition aims to classify a video clip into a specific action category at low computational cost. This is challenging because integrated spatial-temporal computation (e.g., 3D convolution) introduces intensive operations and increases complexity. This paper explores integrating channel splitting and filter decoupling for efficient architecture design and feature refinement, proposing a novel spatio-temporal collaborative (STC) module. STC splits the video feature channels into two groups and separately learns spatio-temporal representations in parallel with decoupled convolutional operators. In particular, STC consists of two computation-efficient blocks, S·T and T·S, which extract spatial (S·) or temporal (T·) features and then refine them with temporal (·T) or spatial (·S) context; the spatial/temporal context refers to information dynamics aggregated globally along the temporal/spatial axis. To thoroughly examine the method on video action recognition, we conduct extensive experiments on five video benchmarks requiring temporal reasoning. The results show that the proposed STC networks achieve a competitive trade-off between model efficiency and effectiveness.
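The channel-splitting half of the idea is easy to make concrete: route half the channels through a spatial (1x3x3) convolution and half through a temporal (3x1x1) convolution, then concatenate. The PyTorch sketch below shows only this split-and-decouple skeleton (assuming an even channel count) and omits the global context refinement that completes the paper's S·T and T·S blocks.

```python
import torch
import torch.nn as nn

class STCSketch(nn.Module):
    """Channel-split spatio-temporal convolution, loosely after STC."""

    def __init__(self, channels):
        super().__init__()
        half = channels // 2                    # assumes an even channel count
        self.spatial = nn.Conv3d(half, half, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(half, half, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):                       # x: (B, C, T, H, W)
        a, b = x.chunk(2, dim=1)                # split channels into two groups
        return torch.cat([self.spatial(a), self.temporal(b)], dim=1)
```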
11. Ghaffari M, Monneret A, Hammon H, Post C, Müller U, Frieten D, Gerbert C, Dusel G, Koch C. Deep convolutional neural networks for the detection of diarrhea and respiratory disease in preweaning dairy calves using data from automated milk feeders. J Dairy Sci 2022;105:9882-9895. [DOI: 10.3168/jds.2021-21547]
12. Ma Y, Bai S, Liu W, Wang S, Yu Y, Bai X, Liu X, Wang M. Transductive Relation-Propagation With Decoupling Training for Few-Shot Learning. IEEE Transactions on Neural Networks and Learning Systems 2022;33:6652-6664. [PMID: 34138714] [DOI: 10.1109/tnnls.2021.3082928]
Abstract
Few-shot learning, which aims to learn novel concepts from one or a few labeled examples, is an interesting and very challenging problem with many practical advantages. Existing few-shot methods usually train the feature embedding module and the few-shot classifier jointly on data from the same classes, and thus cannot learn to adapt to new tasks. Besides, traditional few-shot models fail to take advantage of the valuable relations of support-query pairs, leading to performance degradation. In this article, we propose a transductive relation-propagation graph neural network (GNN) with a decoupling training strategy (TRPN-D) to explicitly model and propagate such relations across support-query pairs, and to empower the few-shot module to transfer past knowledge to new tasks via decoupled training. Our few-shot module, TRPN, treats the relation of each support-query pair as a graph node, named a relational node, and exploits the known relations between support samples, including intraclass commonality and interclass uniqueness. Through relation propagation, the model generates discriminative relation embeddings for support-query pairs. To the best of our knowledge, this is the first work to decouple the training of the embedding network and the few-shot graph module into different tasks, which may offer a new way to solve the few-shot learning problem. Extensive experiments on several benchmark datasets demonstrate that our method significantly outperforms a variety of state-of-the-art few-shot learning methods.
13. Upadhya V, Sastry PS. Learning Gaussian-Bernoulli RBMs Using Difference of Convex Functions Optimization. IEEE Transactions on Neural Networks and Learning Systems 2022;33:5728-5738. [PMID: 33857001] [DOI: 10.1109/tnnls.2021.3071358]
Abstract
The Gaussian-Bernoulli restricted Boltzmann machine (GB-RBM) is a useful generative model that captures meaningful features from given n-dimensional continuous data. The difficulties associated with learning the GB-RBM are reported extensively in earlier studies, which indicate that training with the current standard algorithms, namely contrastive divergence (CD) and persistent contrastive divergence (PCD), needs a carefully chosen small learning rate to avoid divergence, which in turn results in slow learning. In this work, we alleviate such difficulties by showing that the negative log-likelihood of a GB-RBM can be expressed as a difference of convex functions if we keep the variance of the conditional distribution of the visible units (given the hidden unit states) and the biases of the visible units constant. Using this, we propose a stochastic difference-of-convex-functions programming (S-DCP) algorithm for learning the GB-RBM. Extensive empirical studies on several benchmark datasets show that S-DCP outperforms the CD and PCD algorithms in learning speed and in the quality of the learned generative model.
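The paper's key observation fits the standard difference-of-convex (DC) programming template, sketched here in generic notation (f₁ and f₂ are placeholder symbols, not the paper's): each iteration linearizes the concave part and solves the remaining convex subproblem, stochastically in the case of S-DCP.

```latex
-\log p(v;\theta) \;=\; f_1(\theta) \;-\; f_2(\theta),
\qquad f_1,\, f_2 \ \text{convex},
\qquad
\theta_{t+1} \;=\; \arg\min_{\theta}\; f_1(\theta) \;-\; \big\langle \nabla f_2(\theta_t),\, \theta \big\rangle .
```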
14. Wang Y, Xiao Y, Lu J, Tan B, Cao Z, Zhang Z, Zhou JT. Discriminative Multi-View Dynamic Image Fusion for Cross-View 3-D Action Recognition. IEEE Transactions on Neural Networks and Learning Systems 2022;33:5332-5345. [PMID: 33852396] [DOI: 10.1109/tnnls.2021.3070179]
Abstract
Dramatic variation in imaging viewpoint is the critical challenge for action recognition from depth video. One feasible way to address it is to enhance the view tolerance of the visual feature while maintaining strong discriminative capacity. The multi-view dynamic image (MVDI) is the most recently proposed 3-D action representation able to compactly encode human motion information and 3-D visual cues; however, it is still view-sensitive. To improve its performance, we propose a discriminative MVDI fusion method based on multi-instance learning (MIL). Specifically, the dynamic images (DIs) from different observation viewpoints are regarded as instances for 3-D action characterization. After being encoded with Fisher vectors (FVs), they are aggregated by sum-pooling to yield a representative 3-D action signature. Our insight is that viewpoint aggregation enhances view tolerance, while FV encoding maps the raw DI feature into a higher-dimensional feature space to promote discriminative power. Meanwhile, a discriminative viewpoint-instance discovery method is proposed to discard viewpoint instances unfavorable for action characterization. Wide-ranging experiments on five datasets demonstrate that our proposition significantly enhances cross-view 3-D action recognition and is also applicable to cross-view 3-D object recognition. The source code is available at https://github.com/3huo/ActionView.
15.
Abstract
Melanoma skin cancer is one of the most dangerous types of skin cancer, which, if not diagnosed early, may lead to death. Therefore, an accurate diagnosis is needed to detect melanoma. Traditionally, a dermatologist utilizes a microscope to inspect and then provide a report on a biopsy for diagnosis; however, this diagnosis process is not easy and requires experience. Hence, there is a need to facilitate the diagnosis process while still yielding an accurate diagnosis. For this purpose, artificial intelligence techniques can assist the dermatologist in carrying out diagnosis. In this study, we considered the detection of melanoma through deep learning based on cutaneous image processing. For this purpose, we tested several convolutional neural network (CNN) architectures, including DenseNet201, MobileNetV2, ResNet50V2, ResNet152V2, Xception, VGG16, VGG19, and GoogleNet, and evaluated the associated deep learning models on graphical processing units (GPUs). A dataset consisting of 7146 images was processed using these models, and we compared the obtained results. The experimental results showed that GoogleNet can obtain the highest performance accuracy on both the training and test sets (74.91% and 76.08%, respectively).
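A comparison of this kind is mostly boilerplate: load each ImageNet-pretrained backbone, swap the classification head for the two melanoma classes, and train identically. The torchvision sketch below shows the pattern for three of the listed architectures (torchvision's GoogLeNet standing in for the study's GoogleNet); dataset loading, augmentation, and training hyperparameters are omitted and would follow the usual fine-tuning recipe.

```python
import torch.nn as nn
from torchvision import models

def build_backbone(name, num_classes=2):
    """ImageNet-pretrained backbone with a fresh melanoma head (sketch only)."""
    if name == "GoogleNet":
        # note: in train mode GoogLeNet also returns auxiliary logits
        m = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
        m.fc = nn.Linear(m.fc.in_features, num_classes)
    elif name == "DenseNet201":
        m = models.densenet201(weights=models.DenseNet201_Weights.IMAGENET1K_V1)
        m.classifier = nn.Linear(m.classifier.in_features, num_classes)
    elif name == "MobileNetV2":
        m = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
        m.classifier[-1] = nn.Linear(m.classifier[-1].in_features, num_classes)
    else:
        raise ValueError(name)
    return m

# for name in ("GoogleNet", "DenseNet201", "MobileNetV2"):
#     model = build_backbone(name)   # then train each model identically
```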
16. Wang R, Wu XJ, Kittler J. SymNet: A Simple Symmetric Positive Definite Manifold Deep Learning Method for Image Set Classification. IEEE Transactions on Neural Networks and Learning Systems 2022;33:2208-2222. [PMID: 33784627] [DOI: 10.1109/tnnls.2020.3044176]
Abstract
By representing each image set as a nonsingular covariance matrix on the symmetric positive definite (SPD) manifold, visual classification with image sets has attracted much attention. Despite the success made so far, the large within-class variability of representations remains a key challenge. Recently, several SPD matrix learning methods have been proposed to assuage this problem by directly constructing an embedding mapping from the original SPD manifold to a lower-dimensional one. The advantage of this approach is that it not only implements discriminative feature selection but also preserves the Riemannian geometrical structure of the original data manifold. Inspired by this, we propose a simple SPD manifold deep learning network (SymNet) for image set classification. Specifically, we first design SPD matrix mapping layers to map the input SPD matrices into new ones of lower dimensionality. Rectifying layers are then devised to activate the input matrices so as to maintain a valid SPD manifold, injecting nonlinearity into SPD matrix learning with two nonlinear functions. Afterward, we introduce pooling layers to further compress the input SPD matrices, and a log-map layer finally embeds the resulting SPD matrices into the tangent space via log-Euclidean Riemannian computation, so that Euclidean learning applies. For SymNet, the two-directional two-dimensional principal component analysis ((2D)²PCA) technique is utilized to learn the multistage connection weights without complicated computations, making the network easy to build and train. On the tail of SymNet, the kernel discriminant analysis (KDA) algorithm is coupled with the output vectorized feature representations to perform discriminative subspace learning. Extensive experiments and comparisons with state-of-the-art methods on six typical visual classification tasks demonstrate the feasibility and validity of the proposed SymNet.
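The three layer types compose naturally as plain matrix operations, sketched below in NumPy. The rectification threshold and the random orthonormal projection are placeholders: in SymNet the projection weights come from (2D)²PCA rather than random draws, and the rectification uses the paper's own nonlinear functions.

```python
import numpy as np

def bimap(X, W):
    """SPD mapping layer: W (d, k) with full column rank keeps W^T X W SPD."""
    return W.T @ X @ W

def rectify(X, eps=1e-4):
    """Eigenvalue rectification keeps the output a valid SPD matrix."""
    vals, vecs = np.linalg.eigh(X)
    return (vecs * np.maximum(vals, eps)) @ vecs.T

def log_map(X):
    """Log-Euclidean projection to the tangent space, where Euclidean tools apply."""
    vals, vecs = np.linalg.eigh(X)
    return (vecs * np.log(vals)) @ vecs.T

d, k = 32, 8
W, _ = np.linalg.qr(np.random.randn(d, k))      # stand-in for (2D)^2-PCA weights
A = np.cov(np.random.randn(d, 100))             # an SPD "image set" representation
feature = log_map(rectify(bimap(A, W)))         # vectorize, then feed to KDA
```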
17. Gao Z, Guo L, Ren T, Liu AA, Cheng ZY, Chen S. Pairwise Two-Stream ConvNets for Cross-Domain Action Recognition With Small Data. IEEE Transactions on Neural Networks and Learning Systems 2022;33:1147-1161. [PMID: 33296313] [DOI: 10.1109/tnnls.2020.3041018]
Abstract
In this work, we target cross-domain action recognition (CDAR) in the video domain and propose a novel end-to-end pairwise two-stream ConvNets (PTC) algorithm for real-life conditions, in which only a few labeled samples are available. To cope with the limited-training-sample problem, we employ a pairwise network architecture that can leverage training samples from a source domain and, thus, requires only a few labeled samples per category from the target domain. In particular, a frame self-attention mechanism and an adaptive weight scheme are embedded into the PTC network to adaptively combine the RGB and flow features. This design can effectively learn domain-invariant features for both the source and target domains. In addition, we propose a sphere-boundary sample-selecting scheme that selects the training samples at the boundary of a class (in the feature space) to train the PTC model, achieving a well-enhanced generalization capability. To validate the effectiveness of our PTC model, we construct two CDAR datasets (SDAI Action I and SDAI Action II) covering indoor and outdoor environments; all actions and samples in these datasets were carefully collected from public action datasets. To the best of our knowledge, these are the first datasets specifically designed for the CDAR task. Extensive experiments on these two datasets show that PTC outperforms state-of-the-art video action recognition methods in both accuracy and training efficiency. Notably, when only two labeled training samples per category are used on the SDAI Action I dataset, PTC achieves 21.9% and 6.8% improvements in accuracy over the two-stream and temporal segment network models, respectively. As an added contribution, the SDAI Action I and SDAI Action II datasets will be released to facilitate future research on the CDAR task.
18. Sahani M, Dash PK. FPGA-Based Semisupervised Multifusion RDCNN of Process Robust VMD Data With Online Kernel RVFLN for Power Quality Events Recognition. IEEE Transactions on Neural Networks and Learning Systems 2022;33:515-527. [PMID: 33074830] [DOI: 10.1109/tnnls.2020.3027984]
Abstract
The improved particle swarm optimization algorithm is integrated with variational mode decomposition (VMD) to extract efficient band-limited intrinsic mode functions (BLIMFs) of single and combined power quality events (PQEs). The selected BLIMF of the robust VMD (RVMD) and the privileged Fourier magnitude spectrum (FMS) information are fed to the proposed reduced deep convolutional neural network (RDCNN) to extract the most discriminative unsupervised features. The RVMD-FMS-RDCNN method shows minimal feature overlap compared with the RDCNN and RVMD-RDCNN methods. The feature vector is passed to a novel supervised online kernel random vector functional link network (OKRVFLN) for quick and accurate categorization of complex PQEs. The proposed RVMD-FMS-RDCNN-OKRVFLN method achieves excellent recognition capability over the RDCNN, RVMD-RDCNN, and RVMD-RDCNN-OKRVFLN methods in both noise-free and noisy environments. Unique BLIMF selection, clear detection, descriptive feature extraction, high learning speed, superior classification accuracy, and robust antinoise performance are of considerable importance in the proposed method. Finally, the proposed architecture is developed and implemented on a very-high-speed ML506 Virtex-5 FPGA to test, examine, and validate its feasibility, performance, and practicability for online monitoring of PQEs.
19. AAU-Net: Attention-Based Asymmetric U-Net for Subject-Sensitive Hashing of Remote Sensing Images. Remote Sensing 2021. [DOI: 10.3390/rs13245109]
Abstract
The prerequisite for the use of remote sensing images is that their security must be guaranteed. As a special subset of perceptual hashing, subject-sensitive hashing overcomes the shortcomings of the existing perceptual hashing that cannot distinguish between “subject-related tampering” and “subject-unrelated tampering” of remote sensing images. However, the existing subject-sensitive hashing still has a large deficiency in robustness. In this paper, we propose a novel attention-based asymmetric U-Net (AAU-Net) for the subject-sensitive hashing of remote sensing (RS) images. Our AAU-Net demonstrates obvious asymmetric structure characteristics, which is important to improve the robustness of features by combining the attention mechanism and the characteristics of subject-sensitive hashing. On the basis of AAU-Net, a subject-sensitive hashing algorithm is developed to integrate the features of various bands of RS images. Our experimental results show that our AAU-Net-based subject-sensitive hashing algorithm is more robust than the existing deep learning models such as Attention U-Net and MUM-Net, and its tampering sensitivity remains at the same level as that of Attention U-Net and MUM-Net.
20. Chumachenko K, Iosifidis A, Gabbouj M. Feedforward neural networks initialization based on discriminant learning. Neural Netw 2021;146:220-229. [PMID: 34902796] [DOI: 10.1016/j.neunet.2021.11.020]
Abstract
In this paper, a novel data-driven method for weight initialization of Multilayer Perceptrons and Convolutional Neural Networks based on discriminant learning is proposed. The approach relaxes some of the limitations of competing data-driven methods, including unimodality assumptions, limitations on the architectures related to limited maximal dimensionalities of the corresponding projection spaces, as well as limitations related to high computational requirements due to the need of eigendecomposition on high-dimensional data. We also consider assumptions of the method on the data and propose a way to account for them in a form of a new normalization layer. The experiments on three large-scale image datasets show improved accuracy of the trained models compared to competing random-based and data-driven weight initialization methods, as well as better convergence properties in certain cases.
Affiliation(s)
- Kateryna Chumachenko, Moncef Gabbouj: Faculty of Information Technology and Communication Sciences, Tampere University, FI-33720 Tampere, Finland
- Alexandros Iosifidis: Department of Electrical and Computer Engineering, Aarhus University, DK-8200 Aarhus, Denmark
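The discriminant-learning initialization described above can be approximated with off-the-shelf LDA: fit discriminant directions on a calibration batch and seed part of a layer's weight matrix with them, leaving the remaining rows randomly initialized. The sketch below is a simplified stand-in for the paper's procedure, not the authors' exact method; `X` must have as many columns as the layer has inputs.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_init(layer, X, y):
    """Seed the first rows of a Linear layer with LDA projection directions."""
    n_components = min(len(np.unique(y)) - 1, layer.out_features)
    lda = LinearDiscriminantAnalysis(n_components=n_components).fit(X, y)
    directions = lda.scalings_[:, :n_components].T   # (n_components, in_features)
    with torch.no_grad():
        layer.weight[:n_components] = torch.tensor(directions, dtype=torch.float32)
    return layer

# layer = lda_init(nn.Linear(64, 32), X_train, y_train)  # X_train: (n, 64)
```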
21. Shao L, Zuo H, Zhang J, Xu Z, Yao J, Wang Z, Li H. Filter Pruning via Measuring Feature Map Information. Sensors (Basel) 2021;21:6601. [PMID: 34640921] [PMCID: PMC8512244] [DOI: 10.3390/s21196601]
Abstract
Neural network pruning, an important method for reducing the computational complexity of deep models, is well suited to devices with limited resources. However, most current methods focus on information about the filter itself, rarely exploring the relationship between feature maps and filters. In this paper, two novel pruning methods are proposed. First, we propose a pruning method that reflects the importance of filters by exploring the information in the feature maps. On the premise that the more information a feature map carries, the more important it is, the information entropy of feature maps is used to evaluate the importance of each filter in the current layer, and normalization enables cross-layer comparison. As a result, the network structure is efficiently pruned while its performance is well preserved. Second, we propose a parallel pruning method combining the above method with the slimming pruning method, which yields better results in computational cost. Our methods perform better in accuracy, parameters, and FLOPs compared with most advanced methods. On ImageNet, ResNet50 achieves 72.02% top-1 accuracy with merely 11.41M parameters and 1.12B FLOPs. On CIFAR-10, DenseNet40 obtains 94.04% accuracy with only 0.38M parameters and 110.72M FLOPs, and our parallel pruning method reduces the parameters and FLOPs to just 0.37M and 100.12M, respectively, with little loss of accuracy.
Affiliation(s)
- Linsong Shao: Key Laboratory of Optical Engineering, Chinese Academy of Sciences, Chengdu 610200, China; Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610200, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Haorui Zuo, Jianlin Zhang, Zhiyong Xu, Jinzhen Yao, Zhixing Wang, Hong Li: Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610200, China
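The entropy scoring at the heart of the first method above takes only a histogram per channel, as in the PyTorch sketch below; the bin count and the cross-layer normalization are left as simplifications of whatever the authors tuned.

```python
import torch

def filter_entropy_scores(feature_maps, bins=32):
    """Entropy of each filter's activations over a calibration batch.

    feature_maps: (N, C, H, W) outputs of one conv layer; returns a (C,)
    tensor where low-entropy (low-information) filters are pruning candidates.
    """
    scores = torch.zeros(feature_maps.shape[1])
    for j in range(feature_maps.shape[1]):
        x = feature_maps[:, j].flatten().float()
        hist = torch.histc(x, bins=bins, min=x.min().item(), max=x.max().item())
        p = hist / hist.sum()
        p = p[p > 0]                            # drop empty bins (0 log 0 := 0)
        scores[j] = -(p * p.log()).sum()        # Shannon entropy
    return scores

# normalize scores within each layer, then prune the lowest-scoring filters
```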
22. Hi-EADN: Hierarchical Excitation Aggregation and Disentanglement Frameworks for Action Recognition Based on Videos. Symmetry (Basel) 2021. [DOI: 10.3390/sym13040662]
Abstract
Most existing video action recognition methods rely mainly on high-level semantic information from convolutional neural networks (CNNs) but ignore the discrepancies between different information streams, and they do not normally consider both long-distance aggregations and short-range motions. To solve these problems, we propose hierarchical excitation aggregation and disentanglement networks (Hi-EADNs), which include a multiple-frame excitation aggregation (MFEA) module and a feature squeeze-and-excitation hierarchical disentanglement (SEHD) module. MFEA performs long-short-range motion modelling and calculates feature-level temporal differences. The SEHD module uses these differences to optimize the weights of each spatiotemporal feature and excite motion-sensitive channels. Moreover, without introducing additional parameters, this feature information is processed through a series of squeeze-and-excitation operations, and multiple temporal aggregations over neighbourhoods enhance the interaction of different motion frames. Extensive experimental results confirm the effectiveness of the proposed Hi-EADN method on the UCF101 and HMDB51 benchmark datasets, where the top-5 accuracy reaches 93.5% and 76.96%, respectively.
23. Zheng Z, An G, Wu D, Ruan Q. Global and Local Knowledge-Aware Attention Network for Action Recognition. IEEE Transactions on Neural Networks and Learning Systems 2021;32:334-347. [PMID: 32224465] [DOI: 10.1109/tnnls.2020.2978613]
Abstract
Convolutional neural networks (CNNs) have proven an effective way to learn spatiotemporal representations for action recognition in videos. However, most traditional action recognition algorithms do not employ attention mechanisms to focus on the parts of video frames that are essential to the action. In this article, we propose a novel global and local knowledge-aware attention network to address this challenge. The proposed network incorporates two attention mechanisms, statistic-based attention (SA) and learning-based attention (LA), to attach higher importance to the crucial elements in each video frame. Since global pooling (GP) models capture global information while attention models focus on significant details, the network adopts a three-stream architecture, comprising two attention streams and a GP stream, to make full use of their implicit complementary advantages. Each attention stream employs a fusion layer to combine global and local information and produce composite features. Furthermore, global-attention (GA) regularization is proposed to guide the two attention streams to better model the dynamics of composite features with reference to the global information. Fusion at the softmax layer makes the best use of the complementarity of the SA, LA, and GP streams to obtain the final comprehensive predictions. The proposed network is trained end to end and learns efficient video-level features both spatially and temporally. Extensive experiments on three challenging benchmarks, Kinetics, HMDB51, and UCF101, demonstrate that the proposed network outperforms most state-of-the-art methods.
24. Jung HG, Lee SW. Few-Shot Learning With Geometric Constraints. IEEE Transactions on Neural Networks and Learning Systems 2020;31:4660-4672. [PMID: 31902774] [DOI: 10.1109/tnnls.2019.2957187]
Abstract
In this article, we consider the problem of few-shot learning for classification. We assume a network trained for base categories with a large number of training examples, and we aim to add novel categories to it that have only a few, e.g., one or five, training examples. This is a challenging scenario because: 1) high performance is required in both the base and novel categories; and 2) training the network for the new categories with a few training examples can contaminate the feature space trained well for the base categories. To address these challenges, we propose two geometric constraints to fine-tune the network with a few training examples. The first constraint enables features of the novel categories to cluster near the category weights, and the second maintains the weights of the novel categories far from the weights of the base categories. By applying the proposed constraints, we extract discriminative features for the novel categories while preserving the feature space learned for the base categories. Using public data sets for few-shot learning that are subsets of ImageNet, we demonstrate that the proposed method outperforms prevalent methods by a large margin.
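The two constraints translate directly into two loss terms on normalized features and classifier weights: one pulling novel-class features toward their own class weights, one pushing novel-class weights away from base-class weights. The PyTorch sketch below is a plausible rendering under that reading; the margins, scales, and exact distance functions the paper uses are not reproduced.

```python
import torch
import torch.nn.functional as F

def geometric_constraints(novel_feats, novel_labels, weights, n_base):
    """Two few-shot fine-tuning constraints, in the spirit of the paper.

    weights: classifier weight matrix, rows 0..n_base-1 for base classes,
    the remaining rows for novel classes; novel_labels hold absolute indices.
    """
    w = F.normalize(weights, dim=1)
    f = F.normalize(novel_feats, dim=1)
    # (1) cluster novel features near their category weights
    pull = (1 - (f * w[novel_labels]).sum(dim=1)).mean()
    # (2) keep novel class weights far from all base class weights
    sim = w[n_base:] @ w[:n_base].t()            # cosine similarities
    push = F.relu(sim).mean()
    return pull + push
```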
25. Gao Z, Wu Y, Harandi M, Jia Y. A Robust Distance Measure for Similarity-Based Classification on the SPD Manifold. IEEE Transactions on Neural Networks and Learning Systems 2020;31:3230-3244. [PMID: 31567102] [DOI: 10.1109/tnnls.2019.2939177]
Abstract
The symmetric positive definite (SPD) matrices, forming a Riemannian manifold, are commonly used as visual representations. The non-Euclidean geometry of the manifold often makes developing learning algorithms (e.g., classifiers) difficult and complicated. The concept of similarity-based learning has been shown to be effective to address various problems on SPD manifolds. This is mainly because the similarity-based algorithms are agnostic to the geometry and purely work based on the notion of similarities/distances. However, existing similarity-based models on SPD manifolds opt for holistic representations, ignoring characteristics of information captured by SPD matrices. To circumvent this limitation, we propose a novel SPD distance measure for the similarity-based algorithm. Specifically, we introduce the concept of point-to-set transformation, which enables us to learn multiple lower dimensional and discriminative SPD manifolds from a higher dimensional one. For lower dimensional SPD manifolds obtained by the point-to-set transformation, we propose a tailored set-to-set distance measure by making use of the family of alpha-beta divergences. We further propose to learn the point-to-set transformation and the set-to-set distance measure jointly, yielding a powerful similarity-based algorithm on SPD manifolds. Our thorough evaluations on several visual recognition tasks (e.g., action classification and face recognition) suggest that our algorithm comfortably outperforms various state-of-the-art algorithms.
26. Zhang X, Xu C, Tian X, Tao D. Graph Edge Convolutional Neural Networks for Skeleton-Based Action Recognition. IEEE Transactions on Neural Networks and Learning Systems 2020;31:3047-3060. [PMID: 31722488] [DOI: 10.1109/tnnls.2019.2935173]
Abstract
Body joints, directly obtained from a pose estimation model, have proven effective for action recognition. Existing works focus on analyzing the dynamics of human joints; however, besides joints, humans also exploit the motions of limbs to understand actions. Given this observation, we investigate the dynamics of human limbs for skeleton-based action recognition. Specifically, we represent an edge in a human skeleton graph by integrating its spatial neighboring edges (encoding the cooperation between different limbs) and its temporal neighboring edges (capturing the consistency of movements within an action). Based on this new edge representation, we devise a graph edge convolutional neural network (CNN). Considering the complementarity between graph node convolution and edge convolution, we further construct two hybrid networks by introducing different shared intermediate layers to integrate graph node and edge CNNs. Our contributions are twofold: graph edge convolution, and hybrid networks integrating the proposed edge convolution with conventional node convolution. Experimental results on the Kinetics and NTU-RGB+D datasets demonstrate that our graph edge convolution effectively captures the characteristics of actions and that our graph edge CNN significantly outperforms existing state-of-the-art skeleton-based action recognition methods.
27. Wu P, Liu J, Shen F. A Deep One-Class Neural Network for Anomalous Event Detection in Complex Scenes. IEEE Transactions on Neural Networks and Learning Systems 2020;31:2609-2622. [PMID: 31494560] [DOI: 10.1109/tnnls.2019.2933554]
Abstract
How can one build a generic deep one-class (DeepOC) model for one-class classification problems in anomaly detection, such as anomalous event detection in complex scenes? The characteristics of existing one-class labels lead to a dilemma: it is hard to directly use a multi-class deep neural network classifier to solve one-class classification problems. In this article, we therefore propose a novel DeepOC neural network that simultaneously learns compact feature representations and trains a one-class classifier. Using only the given normal samples, a stacked convolutional encoder generates low-dimensional high-level features, and a one-class classifier is trained to make these features as compact as possible. Meanwhile, for the sake of a correct mapping relation and diverse feature representations, a decoder reconstructs the raw samples from these low-dimensional features. This structure is gradually established through an adversarial mechanism during training, which is the key to our model: it organically combines two seemingly contradictory components and lets them exploit each other, making the model robust and effective. Unlike methods that use handcrafted features or that are separated into two stages (feature extraction, then classifier training), DeepOC is a one-stage model with reliable features extracted automatically by neural networks. Experiments on various benchmark datasets show that DeepOC is feasible and achieves state-of-the-art anomaly detection results compared with a dozen existing methods.
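Stripped of the adversarial coupling, the skeleton is an encoder trained for compactness around a one-class center plus a decoder trained for reconstruction. The sketch below simply sums the two objectives on single-channel inputs; layer sizes, the fixed center, and the loss weighting are all placeholder choices rather than the paper's design.

```python
import torch
import torch.nn as nn

class DeepOCSketch(nn.Module):
    """Encoder-decoder with a one-class compactness objective (sketch only)."""

    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))
        self.register_buffer("center", torch.zeros(32))      # one-class center

    def forward(self, x):                    # x: (B, 1, H, W), normal samples
        z = self.enc(x)
        z_vec = z.mean(dim=(2, 3))                            # global feature
        compact = ((z_vec - self.center) ** 2).sum(dim=1).mean()
        recon = ((self.dec(z) - x) ** 2).mean()
        return compact + recon   # at test time, distance to center scores anomalies
```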
28. Lin S, Ji R, Li Y, Deng C, Li X. Toward Compact ConvNets via Structure-Sparsity Regularized Filter Pruning. IEEE Transactions on Neural Networks and Learning Systems 2020;31:574-588. [PMID: 30990448] [DOI: 10.1109/tnnls.2019.2906563]
Abstract
The success of convolutional neural networks (CNNs) in computer vision has been accompanied by significantly increased computation and memory costs, which prohibits their use in resource-limited environments such as mobile or embedded devices. To this end, research on CNN compression has recently emerged. In this paper, we propose a novel filter pruning scheme, termed structured sparsity regularization (SSR), to simultaneously speed up computation and reduce the memory overhead of CNNs; it is well supported by various off-the-shelf deep learning libraries. Concretely, the scheme incorporates two different structured-sparsity regularizers into the original filter pruning objective, fully coordinating the global output and local pruning operations to prune filters adaptively. We further propose an alternative updating with Lagrange multipliers (AULM) scheme to solve the optimization efficiently. AULM follows the principle of the alternating direction method of multipliers (ADMM) and alternates between promoting the structured sparsity of CNNs and optimizing the recognition loss, leading to a very efficient solver (2.5x faster than the most recent work that directly solves the group-sparsity-based regularization). Moreover, by imposing structured sparsity, online inference is extremely memory-light, since the numbers of filters and output feature maps are reduced simultaneously. The scheme has been deployed on a variety of state-of-the-art CNN structures, including LeNet, AlexNet, VGGNet, ResNet, and GoogLeNet, over different datasets. Quantitative results demonstrate that the proposed scheme outperforms state-of-the-art methods. We further demonstrate the compression scheme on transfer learning tasks, including domain adaptation and object detection, where it also shows exciting gains over state-of-the-art filter pruning methods.
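The simplest member of the structured-sparsity family is a group-lasso penalty over whole filters, which already induces filter-level zeros when added to the task loss; the sketch below shows that baseline, not SSR's two coordinated regularizers or the AULM solver.

```python
import torch

def filter_group_lasso(conv_weight, lam=1e-4):
    """Group-lasso penalty over whole filters (one flavor of structured sparsity).

    conv_weight: (out_channels, in_channels, kh, kw); each output filter is one
    group, so shrinking a group's L2 norm to zero removes the filter entirely.
    """
    groups = conv_weight.flatten(1)            # one row per output filter
    return lam * groups.norm(dim=1).sum()      # sum of per-filter L2 norms

# loss = task_loss + sum(filter_group_lasso(m.weight) for m in model.modules()
#                        if isinstance(m, torch.nn.Conv2d))
```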
29. Fan H, Pei J, Zhao Y. An optimized probabilistic neural network with unit hyperspherical crown mapping and adaptive kernel coverage. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.09.029]
30. Passalis N, Tefas A. Training Lightweight Deep Convolutional Neural Networks Using Bag-of-Features Pooling. IEEE Transactions on Neural Networks and Learning Systems 2019;30:1705-1715. [PMID: 30369453] [DOI: 10.1109/tnnls.2018.2872995]
Abstract
Convolutional neural networks (CNNs) are predominantly used for several challenging computer vision tasks, achieving state-of-the-art performance. However, CNNs are complex models that require powerful hardware, both for training and for deployment. To this end, a quantization-based pooling method is proposed in this paper. The proposed method is inspired by the bag-of-features model and can be used for learning more lightweight deep neural networks. Trainable radial basis function neurons are used to quantize the activations of the final convolutional layer, reducing the number of parameters in the network and natively allowing classification of images of various sizes. The proposed method employs differentiable quantization and aggregation layers, leading to an end-to-end trainable CNN architecture. Furthermore, a fast linear variant of the proposed method is introduced and discussed, providing new insight into understanding convolutional neural architectures. The ability of the method to reduce the size of CNNs and increase performance over other competitive methods is demonstrated using seven datasets and three different learning tasks (classification, regression, and retrieval).
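The pooling layer itself is compact: each spatial feature vector is softly assigned to trainable RBF codewords, and the assignments are averaged into a fixed-length histogram, which is what makes the output size independent of the input resolution. The module below is a minimal PyTorch rendering; the paper's exact kernel and normalization choices may differ.

```python
import torch
import torch.nn as nn

class BoFPooling(nn.Module):
    """Trainable bag-of-features pooling over the last conv layer (sketch)."""

    def __init__(self, channels, codewords):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(codewords, channels))
        self.log_sigma = nn.Parameter(torch.zeros(codewords))  # per-codeword width

    def forward(self, x):                        # x: (B, C, H, W), any H and W
        b, c, h, w = x.shape
        feats = x.permute(0, 2, 3, 1).reshape(b, h * w, c)
        d2 = torch.cdist(feats, self.centers.unsqueeze(0).expand(b, -1, -1)) ** 2
        memb = torch.softmax(-d2 * torch.exp(-self.log_sigma), dim=-1)  # RBF memberships
        return memb.mean(dim=1)                  # (B, codewords) histogram
```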
31. Li C, Zhang B, Chen C, Ye Q, Han J, Guo G, Ji R. Deep Manifold Structure Transfer for Action Recognition. IEEE Transactions on Image Processing 2019;28:4646-4658. [PMID: 31034413] [DOI: 10.1109/tip.2019.2912357]
Abstract
While the intrinsic data structure in subspace provides useful information for visual recognition, it has not been well studied in deep feature learning for action recognition. In this paper, we introduce a new spatio-temporal manifold network (STMN) that leverages data manifold structures to regularize deep action feature learning, aiming to simultaneously minimize the intra-class variation of the learned deep features and alleviate over-fitting. To this end, the manifold prior is imposed from the top layer of a convolutional neural network (CNN) and propagated across convolutional layers during forward-backward propagation. The observed correspondence of manifold structures in the data space and the feature space validates that the manifold prior can be transferred across CNN layers. STMN theoretically recasts the problem of transferring the data-structure prior into deep learning architectures as a projection onto the manifold via an embedding method, which can be easily solved by an alternating direction method of multipliers with backward propagation (ADMM-BP) algorithm. STMN is generic in that it can be plugged into various backbone architectures to learn more discriminative representations for action recognition. Extensive experimental results show that our method achieves comparable or even better performance than state-of-the-art approaches on four benchmark datasets.