1. Xu L, Bennamoun M, Boussaid F, Ouyang W, Sohel F, Xu D. Auxiliary Tasks Enhanced Dual-Affinity Learning for Weakly Supervised Semantic Segmentation. IEEE Transactions on Neural Networks and Learning Systems 2025;36:5082-5096. PMID: 38478447. DOI: 10.1109/tnnls.2024.3373566.
Abstract:
Most existing weakly supervised semantic segmentation (WSSS) methods rely on class activation mapping (CAM) to extract coarse class-specific localization maps using image-level labels. Prior works have commonly used an off-line heuristic thresholding process that combines the CAM maps with off-the-shelf saliency maps produced by a general pretrained saliency model to produce more accurate pseudo-segmentation labels. We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from these saliency maps and the significant intertask correlation between saliency detection and semantic segmentation. In the proposed AuxSegNet+, saliency detection and multilabel image classification are used as auxiliary tasks to improve the primary task of semantic segmentation with only image-level ground-truth labels. We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps. In particular, we propose a cross-task dual-affinity learning module to learn both pairwise and unary affinities, which are used to enhance the task-specific features and predictions by aggregating both query-dependent and query-independent global context for both saliency detection and semantic segmentation. The learned cross-task pairwise affinity can also be used to refine and propagate CAM maps to provide better pseudo labels for both tasks. Iterative improvement of segmentation performance is enabled by cross-task affinity learning and pseudo-label updating. Extensive experiments demonstrate the effectiveness of the proposed approach with new state-of-the-art WSSS results on the challenging PASCAL VOC and MS COCO benchmarks.
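The offline thresholding heuristic that this line of work builds on — combining CAM maps with an off-the-shelf saliency map to produce pseudo-segmentation labels — can be sketched as follows. The function name, array shapes, and threshold value are illustrative assumptions, not details from the paper:

```python
import numpy as np

def pseudo_labels_from_cam(cams, saliency, fg_thresh=0.3):
    """Heuristic pseudo-label generation, as commonly used in WSSS pipelines.

    cams:     (C, H, W) class activation maps for the C image-level classes.
    saliency: (H, W) off-the-shelf saliency map in [0, 1].
    Returns an (H, W) label map: 0 = background, 1..C = foreground classes.
    """
    c, h, w = cams.shape
    labels = np.zeros((h, w), dtype=np.int64)
    fg = saliency > fg_thresh                  # salient pixels are candidate foreground
    best_class = cams.argmax(axis=0)           # strongest class per pixel
    best_score = cams.max(axis=0)
    confident = fg & (best_score > fg_thresh)  # require both saliency and CAM support
    labels[confident] = best_class[confident] + 1
    return labels

cams = np.random.rand(3, 8, 8)
sal = np.random.rand(8, 8)
lab = pseudo_labels_from_cam(cams, sal)
```

AuxSegNet+ replaces this fixed heuristic with learned cross-task affinities, but the sketch shows the baseline process the paper improves upon.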
2. Huang W, Zhang C, Wu J, He X, Zhang J, Lv C. Sampling Efficient Deep Reinforcement Learning Through Preference-Guided Stochastic Exploration. IEEE Transactions on Neural Networks and Learning Systems 2024;35:18553-18564. PMID: 37788189. DOI: 10.1109/tnnls.2023.3317628.
Abstract:
Stochastic exploration is key to the success of the deep Q-network (DQN) algorithm. However, most existing stochastic exploration approaches either explore actions heuristically regardless of their Q-values or couple the sampling with Q-values, which inevitably introduces bias into the learning process. In this article, we propose a novel preference-guided ε-greedy exploration algorithm that can efficiently facilitate exploration for DQN without introducing additional bias. Specifically, we design a dual architecture consisting of two branches, one of which is a copy of DQN, namely, the Q branch. The other branch, which we call the preference branch, learns the action preference that the DQN implicitly follows. We theoretically prove that the policy improvement theorem holds for the preference-guided ε-greedy policy and experimentally show that the inferred action preference distribution aligns with the landscape of the corresponding Q-values. Intuitively, preference-guided ε-greedy exploration motivates the DQN agent to take diverse actions, so that actions with larger Q-values can be sampled more frequently while those with smaller Q-values still have a chance to be explored, thus encouraging exploration. We comprehensively evaluate the proposed method by benchmarking it against well-known DQN variants in nine different environments. Extensive results confirm the superiority of our proposed method in terms of performance and convergence speed.
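A minimal sketch of the exploration rule described above, assuming a softmax form for the preference distribution; the paper's preference branch is a learned network, so all names and the softmax choice here are illustrative, not the authors' exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

def preference_guided_eps_greedy(q_values, preference_logits, epsilon=0.1):
    """With probability epsilon, sample an action from the learned preference
    distribution instead of uniformly; otherwise act greedily on Q-values."""
    if rng.random() < epsilon:
        # explore: sample proportionally to the learned action preference
        p = np.exp(preference_logits - preference_logits.max())
        p /= p.sum()
        return int(rng.choice(len(q_values), p=p))
    # exploit: greedy action under the Q branch
    return int(np.argmax(q_values))

q = np.array([0.1, 0.9, 0.3])
prefs = np.array([0.0, 2.0, 1.0])
a = preference_guided_eps_greedy(q, prefs, epsilon=0.0)  # greedy case
```

Compared with uniform ε-greedy, exploration mass concentrates on actions the preference branch rates highly, while low-preference actions keep a nonzero probability of being tried.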
3. Zhang K, Chen C, Yuan C, Chen S, Wang X, He X. PatchNet: Maximize the Exploration of Congeneric Semantics for Weakly Supervised Semantic Segmentation. IEEE Transactions on Neural Networks and Learning Systems 2024;35:10984-10995. PMID: 37314912. DOI: 10.1109/tnnls.2023.3246109.
Abstract:
With the rapid growth of image data and the lack of corresponding labels, weakly supervised learning has recently drawn a lot of attention in computer vision tasks, especially for the fine-grained semantic segmentation problem. To spare humans from expensive pixel-by-pixel annotations, our method focuses on weakly supervised semantic segmentation (WSSS) with image-level labels, which are much easier to obtain. As a considerable gap exists between pixel-level segmentation and image-level labels, how to reflect image-level semantic information on each pixel is an important question. To explore the congeneric semantic regions of the same class to the maximum, we construct a patch-level semantic augmentation network (PatchNet) based on self-detected patches from different images that contain the same class labels. Patches are intended to frame the objects as completely as possible while including as little background as possible. The patch-level semantic augmentation network, established with patches as the nodes, can maximize the mutual learning of similar objects. We regard the embedding vectors of patches as nodes and use a transformer-based complementary learning module to construct weighted edges according to the embedding similarity between different nodes. Moreover, to better supplement semantic information, we propose soft complementary loss functions matched with the whole network structure. We conduct experiments on the popular PASCAL VOC 2012 and MS COCO 2014 benchmarks, and our model yields state-of-the-art performance.
4. Wu Z, Xu Y, Yang J, Li X. Misclassification in Weakly Supervised Object Detection. IEEE Transactions on Image Processing 2024;33:3413-3427. PMID: 38787668. DOI: 10.1109/tip.2024.3402981.
Abstract:
Weakly supervised object detection (WSOD) aims to train detectors using only image-category labels. Current methods typically first generate dense class-agnostic proposals and then select objects based on the classification scores of these proposals. These methods mainly focus on selecting proposals with high Intersection-over-Union with the true object location, while ignoring the problem of misclassification, which occurs when some proposals exhibit semantic similarities with objects from other categories due to viewing perspective and background interference. We observe that a misclassified positive class typically has the following two characteristics: 1) it is usually misclassified as one or a few specific negative classes, and the scores of these negative classes are high; and 2) compared with other negative classes, the score of the positive class is relatively high. Based on these two characteristics, we propose misclassification correction (MCC) and misclassification tolerance (MCT), respectively. In MCC, we establish a misclassification memory bank to record and summarize the class pairs with high frequencies of potential misclassification in the early stage of training, that is, cases where the score of a negative class is significantly higher than that of the positive class. In the later stage of training, when such a case occurs and corresponds to one of the summarized class pairs, we select the top-scoring negative-class proposal as the positive training example. In MCT, we decrease the loss weights of misclassified classes in the later stage of training to prevent them from dominating training and causing misclassification of objects from other, semantically similar classes during inference. Extensive experiments on PASCAL VOC and MS COCO demonstrate that our method can alleviate the problem of misclassification and achieves state-of-the-art results.
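The memory-bank idea behind MCC can be illustrated with a toy counter over class pairs; the margin threshold, class names, and API below are assumptions for illustration, not the paper's implementation:

```python
from collections import Counter

class MisclassificationBank:
    """Records how often a positive class's top proposal score is
    out-scored by a specific negative class during early training."""

    def __init__(self, margin=0.2):
        self.margin = margin
        self.pairs = Counter()

    def record(self, pos_class, scores):
        # scores: dict mapping class name -> top proposal score for one image
        pos = scores[pos_class]
        for cls, s in scores.items():
            if cls != pos_class and s > pos + self.margin:
                self.pairs[(pos_class, cls)] += 1

    def frequent_pairs(self, min_count=2):
        # class pairs seen often enough to be treated as systematic confusions
        return {p for p, c in self.pairs.items() if c >= min_count}

bank = MisclassificationBank()
bank.record("cat", {"cat": 0.3, "dog": 0.8, "car": 0.1})
bank.record("cat", {"cat": 0.2, "dog": 0.7})
pairs = bank.frequent_pairs()
```

In the later training stage, when a confusion matching a frequent pair recurs, the top-scoring negative-class proposal would be promoted to a positive training example, per the MCC description above.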
5. Chen J, Guo Z, Li H, Chen CLP. Regularizing Scale-Adaptive Central Moment Sharpness for Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 2024;35:6452-6466. PMID: 36215387. DOI: 10.1109/tnnls.2022.3210045.
Abstract:
In deep learning, finding flat minima of the loss function is a hot research topic for improving generalization. Existing methods usually find flat minima with sharpness minimization algorithms. However, these methods suffer from insufficient flexibility for optimization and generalization because they ignore the loss value. This article theoretically and experimentally explores sharpness minimization algorithms for neural networks. First, a novel scale-invariant sharpness, called scale-adaptive central moment sharpness (SA-CMS), is proposed. This sharpness is not only scale-invariant but can also clearly characterize the nature of the loss surface. Based on the proposed sharpness, this article further derives a new regularization term by integrating the different orders of the sharpness. In particular, a host of sharpness minimization functions, such as local entropy, can be covered by this regularization term. Then the central moment sharpness generating function is introduced as a new objective function. Moreover, theoretical analyses indicate that the new objective function has a smoother landscape and prefers to converge to flat local minima. Furthermore, a computationally efficient two-stage algorithm is developed to minimize the objective function. Compared with other algorithms, the two-stage loss-sharpness minimization (TSLSM) algorithm offers a more flexible optimization target for different training stages. On a variety of learning tasks with both small and large batch sizes, this algorithm is more universal and effective, and it matches or surpasses the generalization performance of state-of-the-art sharpness minimization algorithms.
6. Ye S, Peng Q, Sun W, Xu J, Wang Y, You X, Cheung YM. Discriminative Suprasphere Embedding for Fine-Grained Visual Categorization. IEEE Transactions on Neural Networks and Learning Systems 2024;35:5092-5102. PMID: 36107889. DOI: 10.1109/tnnls.2022.3202534.
Abstract:
Despite the great success of existing work in fine-grained visual categorization (FGVC), there are still several unsolved challenges, e.g., poor interpretability and vague contributions. To circumvent these drawbacks, motivated by the hypersphere embedding method, we propose a discriminative suprasphere embedding (DSE) framework, which can provide an intuitive geometric interpretation and effectively extract discriminative features. Specifically, DSE consists of three modules. The first module is a suprasphere embedding (SE) block, which learns discriminative information by emphasizing weight and phase. The second module is a phase activation map (PAM) used to analyze the contribution of local descriptors to the suprasphere feature representation, which uniformly highlights the object region and exhibits remarkable object localization capability. The last module is a class contribution map (CCM), which quantitatively analyzes the network's classification decision and provides insight into the domain knowledge about the classified objects. Comprehensive experiments on three benchmark datasets demonstrate the effectiveness of our proposed method in comparison with state-of-the-art methods.
7. Zhang D, Guo G, Zeng W, Li L, Han J. Generalized Weakly Supervised Object Localization. IEEE Transactions on Neural Networks and Learning Systems 2024;35:5395-5406. PMID: 36129872. DOI: 10.1109/tnnls.2022.3204337.
Abstract:
With the goal of learning to localize specific object semantics using low-cost image-level annotation, weakly supervised object localization (WSOL) has been receiving increasing attention in recent years. Although the existing literature has studied a number of major issues in this field, one important yet challenging scenario, in which the test object semantics may have appeared in the training phase (seen categories) or may never have been observed before (unseen categories), is still beyond the exploration of existing works. We define this scenario as generalized WSOL (GWSOL) and make a pioneering effort to study it in this article. By leveraging attribute vectors to associate seen and unseen categories, we incorporate three modeling components, i.e., class-sensitive modeling, semantic-agnostic modeling, and content-aware modeling, into a unified end-to-end learning framework. This design enables our model to recognize and localize unconstrained object semantics, learn compact and discriminative features that can represent the potential unseen categories, and customize content-aware attribute weights to avoid localizing on misleading attribute elements. To advance this research direction, we contribute bounding-box manual annotations to the widely used AwA2 dataset and benchmark GWSOL methods. Comprehensive experiments demonstrate the effectiveness of our proposed learning framework and each of the considered modeling components.
8. Ye D, Zhu T, Zhu C, Zhou W, Yu PS. Model-Based Self-Advising for Multi-Agent Learning. IEEE Transactions on Neural Networks and Learning Systems 2023;34:7934-7945. PMID: 35157599. DOI: 10.1109/tnnls.2022.3147221.
Abstract:
In multi-agent learning, one of the main ways to improve learning performance is to ask for advice from another agent. Contemporary advising methods share a common limitation: a teacher agent can only advise a student agent if the teacher has experience with an identical state. However, in highly complex learning scenarios, such as autonomous driving, it is rare for two agents to experience exactly the same state, which makes the advice less a learning aid and more a one-time instruction. In these scenarios, with contemporary methods, agents do not really help each other learn, and the main outcome of their back-and-forth requests for advice is an exorbitant communication overhead. In human interactions, teachers are often asked for advice on what to do in situations that students are personally unfamiliar with, and we generally draw on similar experiences to formulate advice. This inspired us to give agents the same ability when asked for advice on an unfamiliar state. Hence, we propose a model-based self-advising method that allows an agent to train a model on states similar to the state in question to inform its response. As a result, the advice given can be used to resolve not only the current dilemma but also many other similar situations that the student may come across in the future via self-advising. Compared with contemporary methods, our method brings a significant improvement in learning performance with much lower communication overhead.
9. Feng X, Yao X, Shen H, Cheng G, Xiao B, Han J. Learning an Invariant and Equivariant Network for Weakly Supervised Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023;45:11977-11992. PMID: 37167047. DOI: 10.1109/tpami.2023.3275142.
Abstract:
Weakly Supervised Object Detection (WSOD) is of increasing importance in the computer vision community owing to its extensive applications and low annotation cost. Most advanced WSOD approaches build upon an indefinite and quality-agnostic framework, leading to unstable and incomplete object detectors. This paper attributes these issues to inconsistent learning of object variations and unawareness of localization quality, and it constructs a novel end-to-end Invariant and Equivariant Network (IENet). IENet is implemented with a flexible multi-branch online refinement, making it naturally more perceptive of various objects. Specifically, IENet first performs label propagation from the predicted instances to their transformed ones in a progressive manner, achieving affine-invariant learning. Meanwhile, IENet utilizes rotation-equivariant learning as a pretext task and derives an instance-level rotation-equivariant branch to be aware of the localization quality. With affine-invariant learning and rotation-equivariant learning, IENet encourages consistent and holistic feature learning for WSOD without additional annotations. On challenging datasets of both natural and aerial scenes, we substantially boost WSOD to new state-of-the-art performance. The code has been released at: https://github.com/XiaoxFeng/IENet.
10. Wu Z, Liu C, Wen J, Xu Y, Yang J, Li X. Selecting High-Quality Proposals for Weakly Supervised Object Detection With Bottom-Up Aggregated Attention and Phase-Aware Loss. IEEE Transactions on Image Processing 2023;32:682-693. PMID: 37015622. DOI: 10.1109/tip.2022.3231744.
Abstract:
Weakly supervised object detection (WSOD) has received widespread attention since it requires only image-category annotations for detector training. Many advanced approaches solve this problem with a two-phase learning framework, that is, instance mining, which classifies generated proposals via multiple instance learning, and instance refinement, which iteratively refines bounding boxes using the supervision produced by the preceding stage. In this paper, we observe that detection performance is usually limited by imprecise supervision, including part domination and untight boxes. To mitigate their adverse effects, we focus on selecting high-quality proposals as the supervision for WSOD. Specifically, for the issue of part domination, we propose bottom-up aggregated attention, which incorporates low-level features from shallow layers to improve the location representation of top-level features. In this manner, proposals corresponding to entire objects can receive high scores. Its advantage is that it can be flexibly plugged into the WSOD framework, since there is no need to attach learnable parameters or learning branches. As regards the problem of untight boxes, we propose a phase-aware loss, the first work to measure supervision quality by the loss in the instance mining phase, to highlight correct boxes and suppress untight ones. In this work, we unify the proposed two modules into the framework of online instance classifier refinement. Extensive experiments on PASCAL VOC and MS COCO demonstrate that our method can significantly improve the performance of WSOD and achieves state-of-the-art results. The code is available at https://github.com/Horatio9702/BUAA_PALoss.
11. Ye Q, Wan F, Liu C, Huang Q, Ji X. Continuation Multiple Instance Learning for Weakly and Fully Supervised Object Detection. IEEE Transactions on Neural Networks and Learning Systems 2022;33:5452-5466. PMID: 33861707. DOI: 10.1109/tnnls.2021.3070801.
Abstract:
Weakly supervised object detection (WSOD) is a challenging task that requires simultaneously learning object detectors and estimating object locations under the supervision of image category labels. Many WSOD methods that adopt multiple instance learning (MIL) have nonconvex objective functions and are therefore prone to getting stuck in local minima (falsely localizing object parts) while missing the full object extent during training. In this article, we introduce classical continuation optimization into MIL, thereby creating continuation MIL (C-MIL), with the aim of alleviating the nonconvexity problem in a systematic way. To this end, we partition instances into class-related and spatially related subsets and approximate MIL's objective function with a series of smoothed objective functions defined within the subsets. We further propose a parametric strategy to implement continuation smooth functions, which enables C-MIL to be applied to instance selection tasks in a uniform manner. Optimizing smoothed loss functions prevents the training procedure from falling prematurely into local minima and facilitates learning the full object extent. Extensive experiments demonstrate the superiority of C-MIL over conventional MIL methods. As a general instance selection method, C-MIL is also applied to supervised object detection to optimize anchors/features, improving detection performance by a significant margin.
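The continuation idea — replacing MIL's hard max over instance scores with progressively sharper smooth surrogates — can be illustrated with a generic log-sum-exp relaxation. This is a common smoothing device, not C-MIL's exact subset-based objective; the temperature schedule is an assumption:

```python
import numpy as np

def smoothed_max(scores, t):
    """Log-sum-exp relaxation of max over instance scores.
    As the temperature t shrinks toward 0, this approaches the
    hard max used in standard MIL instance selection."""
    return t * np.log(np.sum(np.exp(scores / t)))

scores = np.array([0.2, 1.0, 0.5])
hard = scores.max()
# anneal the temperature: each objective is smoother than the last is sharp
soft_sequence = [smoothed_max(scores, t) for t in (2.0, 1.0, 0.1)]
```

Training on the smooth objective first and tightening it over time is what lets continuation methods avoid committing prematurely to a discriminative object part.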
12. Li Y, Xue Y, Li L, Zhang X, Qian X. Domain Adaptive Box-Supervised Instance Segmentation Network for Mitosis Detection. IEEE Transactions on Medical Imaging 2022;41:2469-2485. PMID: 35389862. DOI: 10.1109/tmi.2022.3165518.
Abstract:
The number of mitotic cells present in histopathological slides is an important predictor of tumor proliferation in the diagnosis of breast cancer. However, current approaches can hardly perform precise pixel-level prediction for mitosis datasets with only weak labels (i.e., labels that provide only the centroid location of mitotic cells), and they take no account of the large domain gap across histopathological slides from different pathology laboratories. In this work, we propose a Domain adaptive Box-supervised Instance segmentation Network (DBIN) to address the above issues. In DBIN, we propose a high-performance Box-supervised Instance-Aware (BIA) head whose core idea is to redesign three box-supervised mask loss terms. Furthermore, we add a Pseudo-Mask-supervised Semantic (PMS) head to enrich the characteristics extracted from the underlying feature maps. Besides, we align the pixel-level feature distributions between source and target domains with a Cross-Domain Adaptive Module (CDAM), so that a detector learned in one lab can work well on unlabeled data from another lab. The proposed method achieves state-of-the-art performance across four mainstream datasets. A series of analyses and experiments shows that our proposed BIA and PMS heads can accomplish pixel-wise mitosis localization under weak supervision and that CDAM boosts the generalization ability of our model.
13. Zhang D, Han J, Cheng G, Yang MH. Weakly Supervised Object Localization and Detection: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022;44:5866-5885. PMID: 33877967. DOI: 10.1109/tpami.2021.3074313.
Abstract:
As an emerging and challenging problem in the computer vision community, weakly supervised object localization and detection plays an important role in developing new-generation computer vision systems and has received significant attention in the past decade. As numerous methods have been proposed, a comprehensive survey of these topics is of great importance. In this work, we review (1) classic models, (2) approaches with feature representations from off-the-shelf deep networks, (3) approaches solely based on deep learning, and (4) publicly available datasets and standard evaluation metrics that are widely used in this field. We also discuss the key challenges in this field, its development history, the advantages/disadvantages of the methods in each category, the relationships between methods in different categories, applications of weakly supervised object localization and detection methods, and potential future directions to further promote the development of this research field.
14. Wu Z, Wen J, Xu Y, Yang J, Li X, Zhang D. Enhanced Spatial Feature Learning for Weakly Supervised Object Detection. IEEE Transactions on Neural Networks and Learning Systems 2022;PP:961-972. PMID: 35675239. DOI: 10.1109/tnnls.2022.3178180.
Abstract:
Weakly supervised object detection (WSOD) has become an effective paradigm, requiring only class labels to train object detectors. However, WSOD detectors are prone to learning highly discriminative features corresponding to local object parts rather than complete objects, resulting in imprecise object localization. To address this issue, designing backbones specifically for WSOD is a feasible solution. However, a redesigned backbone generally needs to be pretrained on large-scale ImageNet or trained from scratch, both of which require much more time and computational cost than fine-tuning. In this article, we explore optimizing the backbone without losing the availability of the original pretrained model. Since the pooling layer summarizes neighborhood features, it is crucial to spatial feature learning. In addition, it has no learnable parameters, so modifying it does not change the pretrained model. Based on the above analysis, we propose enhanced spatial feature learning (ESFL) for WSOD, which first takes full advantage of multiple kernels in a single pooling layer to handle multiscale objects and then enhances above-average activations within the rectangular neighborhood to alleviate the problem of ignoring unsalient object parts. Experimental results on the PASCAL VOC and MS COCO benchmarks demonstrate that ESFL brings significant performance improvement for the WSOD method and achieves state-of-the-art results.
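The parameter-free, multi-kernel pooling idea can be sketched as stride-1 "same" max pooling averaged over several kernel sizes; the kernel sizes and the averaging rule here are illustrative assumptions, not ESFL's exact design:

```python
import numpy as np

def multi_kernel_max_pool(fmap, kernel_sizes=(1, 3)):
    """Pool a 2-D feature map with several (odd) kernel sizes and average
    the results. No learnable parameters are introduced, so a pretrained
    backbone is unaffected by swapping in such a pooling layer."""
    h, w = fmap.shape
    pooled = np.zeros_like(fmap, dtype=float)
    for k in kernel_sizes:
        out = np.empty_like(fmap, dtype=float)
        r = k // 2
        for i in range(h):
            for j in range(w):
                # clipped k x k neighborhood around (i, j)
                out[i, j] = fmap[max(0, i - r): i + r + 1,
                                 max(0, j - r): j + r + 1].max()
        pooled += out
    return pooled / len(kernel_sizes)

f = np.arange(16, dtype=float).reshape(4, 4)
p = multi_kernel_max_pool(f)
```

Averaging responses across kernel sizes lets small and large receptive fields both contribute, which is one plausible reading of "multiple kernels in a single pooling layer" for multiscale objects.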
15. Zhang D, Zeng W, Yao J, Han J. Weakly Supervised Object Detection Using Proposal- and Semantic-Level Relationships. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022;44:3349-3363. PMID: 33351751. DOI: 10.1109/tpami.2020.3046647.
Abstract:
In recent years, weakly supervised object detection has attracted great attention in the computer vision community. Although numerous deep learning-based approaches have been proposed in the past few years, such an ill-posed problem is still challenging, and the learning performance still falls short of expectations. In fact, most existing approaches only consider the visual appearance of each proposal region but fail to make use of helpful context information. To this end, this paper introduces two levels of context into the weakly supervised learning framework. The first is proposal-level context, i.e., the relationships among spatially adjacent proposals. The second is semantic-level context, i.e., the relationships among co-occurring object categories. The proposed weakly supervised learning framework therefore contains not only a cognition process on visual appearance but also a reasoning process on proposal- and semantic-level relationships, which leads to a novel deep multiple instance reasoning framework. Specifically, built upon a conventional CNN-based network architecture, the proposed framework is equipped with two additional graph convolutional network-based reasoning models to implement object location reasoning and multi-label reasoning within an end-to-end network training procedure. Comprehensive experiments on the widely used PASCAL VOC and MS COCO benchmarks demonstrate the superior capacity of the proposed approach compared with other state-of-the-art methods and baseline models.
16. Sun X, Wang H, He B. MABAN: Multi-Agent Boundary-Aware Network for Natural Language Moment Retrieval. IEEE Transactions on Image Processing 2021;30:5589-5599. PMID: 34110992. DOI: 10.1109/tip.2021.3086591.
Abstract:
The amount of video on the Internet and from surveillance cameras is growing dramatically, and paired sentence descriptions are significant clues for selecting attentional content from videos. The task of natural language moment retrieval (NLMR), which aims to associate specific video moments with text descriptions of complex scenarios and multiple activities, has drawn great interest from both academia and industry. In general, NLMR requires temporal context to be properly comprehended, and existing studies suffer from two problems: (1) limited moment selection and (2) insufficient comprehension of structural context. To address these issues, a multi-agent boundary-aware network (MABAN) is proposed in this work. To guarantee flexible and goal-oriented moment selection, MABAN utilizes multi-agent reinforcement learning to decompose NLMR into localizing the two temporal boundary points of each moment. In particular, MABAN employs a two-phase cross-modal interaction to exploit the rich contextual semantic information. Moreover, temporal distance regression is used to deduce the temporal boundaries, with which the agents can enhance their comprehension of structural context. Extensive experiments are carried out on two challenging benchmark datasets, ActivityNet Captions and Charades-STA, which demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods. The project page can be found at https://mic.tongji.edu.cn/e5/23/c9778a189731/page.htm.
17. SODA: Weakly Supervised Temporal Action Localization Based on Astute Background Response and Self-Distillation Learning. International Journal of Computer Vision 2021. DOI: 10.1007/s11263-021-01473-9.