1. Pei J, Jiang T, Tang H, Liu N, Jin Y, Fan DP, Heng PA. CalibNet: Dual-Branch Cross-Modal Calibration for RGB-D Salient Instance Segmentation. IEEE Transactions on Image Processing 2024; 33:4348-4362. PMID: 39074016. DOI: 10.1109/TIP.2024.3432328.
Abstract: In this study, we propose a novel approach for RGB-D salient instance segmentation using a dual-branch cross-modal feature calibration architecture called CalibNet. Our method simultaneously calibrates depth and RGB features in the kernel and mask branches to generate instance-aware kernels and mask features. CalibNet consists of three simple modules: a dynamic interactive kernel (DIK) and a weight-sharing fusion (WSF), which work together to generate effective instance-aware kernels and integrate cross-modal features, and a depth similarity assessment (DSA) module placed before DIK and WSF to improve the quality of the depth features. In addition, we further contribute a new DSIS dataset, which contains 1,940 images with elaborate instance-level annotations. Extensive experiments on three challenging benchmarks show that CalibNet yields promising results, i.e., 58.0% AP with a 320×480 input size on the COME15K-E test set, significantly surpassing alternative frameworks. Our code and dataset will be publicly available at https://github.com/PJLallen/CalibNet.

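The instance-aware dynamic kernel idea mentioned above can be pictured with a minimal sketch (my own simplification, not the authors' code): one branch predicts a few per-instance 1×1 kernels, which are then convolved with shared mask features to produce instance masks. The pooling-based kernel head, shapes, and names are hypothetical.

```python
import torch
import torch.nn.functional as F

def dynamic_instance_masks(kernel_feats, mask_feats, num_instances=8):
    """Generate instance masks by convolving mask features with
    instance-aware kernels predicted from another branch.

    kernel_feats: (B, C, Hk, Wk) features of the kernel branch
    mask_feats:   (B, C, H, W)   shared mask features
    """
    B, C, H, W = mask_feats.shape
    # Hypothetical kernel head: pool the kernel branch into N kernel vectors.
    pooled = F.adaptive_avg_pool2d(kernel_feats, (num_instances, 1))  # (B, C, N, 1)
    kernels = pooled.squeeze(-1).permute(0, 2, 1)                     # (B, N, C)

    masks = []
    for b in range(B):
        w = kernels[b].view(num_instances, C, 1, 1)       # N dynamic 1x1 kernels
        m = F.conv2d(mask_feats[b:b + 1], w)              # (1, N, H, W)
        masks.append(torch.sigmoid(m))
    return torch.cat(masks, dim=0)                        # (B, N, H, W)

if __name__ == "__main__":
    kf = torch.randn(2, 64, 16, 16)
    mf = torch.randn(2, 64, 80, 120)
    print(dynamic_instance_masks(kf, mf).shape)  # torch.Size([2, 8, 80, 120])
```
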
2. Cong R, Yang N, Li C, Fu H, Zhao Y, Huang Q, Kwong S. Global-and-Local Collaborative Learning for Co-Salient Object Detection. IEEE Transactions on Cybernetics 2023; 53:1920-1931. PMID: 35867373. DOI: 10.1109/TCYB.2022.3169431.
Abstract: The goal of co-salient object detection (CoSOD) is to discover salient objects that commonly appear in a query group containing two or more relevant images. How to effectively extract inter-image correspondence is therefore crucial for the CoSOD task. In this article, we propose a global-and-local collaborative learning (GLNet) architecture, which includes a global correspondence modeling (GCM) module and a local correspondence modeling (LCM) module to capture the comprehensive inter-image corresponding relationship among different images from global and local perspectives. First, we treat different images as different time slices and use 3-D convolution to integrate all intra-image features intuitively, which can more fully extract the global group semantics. Second, we design a pairwise correlation transformation (PCT) to explore similarity correspondence between pairwise images, and combine the multiple local pairwise correspondences to generate the local inter-image relationship. Third, the inter-image relationships of the GCM and LCM are integrated through a global-and-local correspondence aggregation (GLA) module to explore more comprehensive inter-image collaboration cues. Finally, the intra- and inter-image features are adaptively integrated by an intra-and-inter weighting fusion (AEWF) module to learn co-saliency features and predict the co-saliency map. The proposed GLNet is evaluated on three prevailing CoSOD benchmark datasets, demonstrating that our model trained on a small dataset (about 3k images) still outperforms 11 state-of-the-art competitors trained on much larger datasets (about 8k-200k images).

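A minimal sketch of the "images as time slices" idea described above, under illustrative channel and group sizes: stack the per-image feature maps of a group along a depth dimension and apply a 3-D convolution so that every output location mixes information from all images in the group.

```python
import torch
import torch.nn as nn

class GroupSemanticBlock(nn.Module):
    """Treat the N images of a co-saliency group as N 'time slices'
    and fuse them with a 3-D convolution (illustrative sketch)."""

    def __init__(self, channels=64, group_size=5):
        super().__init__()
        # The kernel spans the whole group along the slice dimension.
        self.conv3d = nn.Conv3d(channels, channels,
                                kernel_size=(group_size, 3, 3),
                                padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feats):
        # feats: (N, C, H, W) -- one feature map per image in the group
        x = feats.permute(1, 0, 2, 3).unsqueeze(0)   # (1, C, N, H, W)
        g = self.relu(self.conv3d(x))                # (1, C, 1, H, W)
        return g.squeeze(2).squeeze(0)               # (C, H, W) group semantics

feats = torch.randn(5, 64, 32, 32)
print(GroupSemanticBlock()(feats).shape)  # torch.Size([64, 32, 32])
```
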
3. Li J, Ji W, Zhang M, Piao Y, Lu H, Cheng L. Delving into Calibrated Depth for Accurate RGB-D Salient Object Detection. International Journal of Computer Vision 2022. DOI: 10.1007/s11263-022-01734-1.

4. Wang Y, Li S. Similarity activation map for co-salient object detection. Pattern Recognition Letters 2022. DOI: 10.1016/j.patrec.2022.10.009.

5. Zhang N, Han J, Liu N. Learning Implicit Class Knowledge for RGB-D Co-Salient Object Detection With Transformers. IEEE Transactions on Image Processing 2022; 31:4556-4570. PMID: 35763477. DOI: 10.1109/TIP.2022.3185550.
Abstract: RGB-D co-salient object detection aims to segment co-occurring salient objects given a group of relevant images and depth maps. Previous methods often adopt separate pipelines and hand-crafted features, which makes it hard to capture the patterns of co-occurring salient objects and leads to unsatisfactory results. Using end-to-end CNN models is a straightforward idea, but they are less effective in exploiting global cues due to their intrinsic limitations. Thus, in this paper, we propose an end-to-end transformer-based model, denoted CTNet, which uses class tokens to explicitly capture implicit class knowledge for RGB-D co-salient object detection. Specifically, we first design adaptive class tokens for individual images to explore intra-saliency cues and then develop common class tokens for the whole group to explore inter-saliency cues. Besides, we also leverage the complementary cues between RGB images and depth maps to promote the learning of these two types of class tokens. In addition, to promote model evaluation, we construct a challenging and large-scale benchmark dataset, named RGBD CoSal1k, which collects 106 groups containing 1,000 pairs of RGB-D images with complex scenarios and diverse appearances. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed method.

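A minimal sketch of the class-token mechanism described above, assuming a standard transformer encoder: a learnable token is prepended to the patch tokens of an image, and after encoding it serves as a saliency query whose similarity to each patch gives a coarse map. All dimensions and module choices below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class ClassTokenEncoder(nn.Module):
    """Prepend a learnable class token to patch tokens and encode them with a
    transformer; the encoded class token summarizes saliency cues."""

    def __init__(self, dim=256, depth=2, heads=8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):
        # patch_tokens: (B, L, dim) flattened image features
        B = patch_tokens.size(0)
        cls = self.cls_token.expand(B, -1, -1)           # (B, 1, dim)
        tokens = torch.cat([cls, patch_tokens], dim=1)   # (B, 1+L, dim)
        encoded = self.encoder(tokens)
        cls_out, patches_out = encoded[:, 0], encoded[:, 1:]
        # Similarity between the class token and each patch gives a coarse map.
        sal = torch.einsum('bd,bld->bl', cls_out, patches_out)
        return cls_out, sal

tokens = torch.randn(4, 14 * 14, 256)
cls_out, sal = ClassTokenEncoder()(tokens)
print(cls_out.shape, sal.shape)  # torch.Size([4, 256]) torch.Size([4, 196])
```
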
6. Liu Y, Zhang D, Zhang Q, Han J. Part-Object Relational Visual Saliency. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:3688-3704. PMID: 33481705. DOI: 10.1109/TPAMI.2021.3053577.
Abstract: Recent years have witnessed a big leap in automatic visual saliency detection attributed to advances in deep learning, especially convolutional neural networks (CNNs). However, inferring the saliency of each image part separately, as adopted by most CNN-based methods, inevitably leads to an incomplete segmentation of the salient object. In this paper, we describe how to use the part-object relations endowed by the Capsule Network (CapsNet) to solve problems that fundamentally hinge on relational inference for visual saliency detection. Concretely, we put in place a two-stream strategy, termed Two-Stream Part-Object RelaTional Network (TSPORTNet), to implement CapsNet, aiming to reduce both the network complexity and the possible redundancy during capsule routing. Additionally, taking into account the correlations of capsule types from the preceding training images, a correlation-aware capsule routing algorithm is developed for more accurate capsule assignments at the training stage, which also speeds up training dramatically. By exploring part-object relationships, TSPORTNet produces a capsule wholeness map, which in turn aids multi-level features in generating the final saliency map. Experimental results on five widely used benchmarks show that our framework consistently achieves state-of-the-art performance. The code can be found at https://github.com/liuyi1989/TSPORTNet.

7. Sun Y, Li L, Yao T, Lu T, Zheng B, Yan C, Zhang H, Bao Y, Ding G, Slabaugh G. Bidirectional difference locating and semantic consistency reasoning for change captioning. International Journal of Intelligent Systems 2022. DOI: 10.1002/int.22821.
Affiliations:
- Yaoqi Sun, Tingting Yao, Tongyv Lu, Bolun Zheng, Chenggang Yan, Hua Zhang: School of Automation, College of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
- Liang Li: Institute of Computing Technology, CAS, Beijing, China
- Yongjun Bao: Data and Intelligence Department, JD.com, Beijing, China
- Guiguang Ding: School of Software, Tsinghua University, Beijing, China
- Gregory Slabaugh: Digital Environment Research Institute (DERI), Queen Mary University of London, London, UK

8. Wang X, Zhu L, Tang S, Fu H, Li P, Wu F, Yang Y, Zhuang Y. Boosting RGB-D Saliency Detection by Leveraging Unlabeled RGB Images. IEEE Transactions on Image Processing 2022; 31:1107-1119. PMID: 34990359. DOI: 10.1109/TIP.2021.3139232.
Abstract: Training deep models for RGB-D salient object detection (SOD) often requires a large number of labeled RGB-D images. However, RGB-D data is not easily acquired, which limits the development of RGB-D SOD techniques. To alleviate this issue, we present a Dual-Semi RGB-D Salient Object Detection Network (DS-Net) that leverages unlabeled RGB images to boost RGB-D saliency detection. We first devise a depth decoupling convolutional neural network (DDCNN), which contains a depth estimation branch and a saliency detection branch. The depth estimation branch is trained with RGB-D images and then used to estimate pseudo depth maps for all unlabeled RGB images to form paired data. The saliency detection branch fuses the RGB and depth features to predict RGB-D saliency. Then, the whole DDCNN is used as the backbone in a teacher-student framework for semi-supervised learning. Moreover, we introduce a consistency loss on the intermediate attention and saliency maps for the unlabeled data, as well as supervised depth and saliency losses for labeled data. Experimental results on seven widely used benchmark datasets demonstrate that our DDCNN outperforms state-of-the-art methods both quantitatively and qualitatively. We also demonstrate that our semi-supervised DS-Net can further improve the performance, even when using RGB images with pseudo depth maps.

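The semi-supervised part can be pictured with a small teacher-student sketch, assuming any RGB-D saliency network: the teacher is an exponential-moving-average copy of the student, and a consistency loss aligns their predictions on unlabeled RGB images paired with pseudo depth. This is a generic mean-teacher recipe, not the authors' exact training code; names such as `lambda_u` are placeholders.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """Teacher weights follow the student as an exponential moving average."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def consistency_loss(student, teacher, rgb, pseudo_depth):
    """Unlabeled branch: student and teacher should agree on the saliency map."""
    with torch.no_grad():
        target = torch.sigmoid(teacher(rgb, pseudo_depth))
    pred = torch.sigmoid(student(rgb, pseudo_depth))
    return F.mse_loss(pred, target)

# Usage sketch (student/teacher are any RGB-D saliency networks):
# teacher = copy.deepcopy(student)
# loss = bce_on_labeled + lambda_u * consistency_loss(student, teacher, rgb_u, depth_u)
# loss.backward(); optimizer.step(); ema_update(teacher, student)
```
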
9. Lin Z, Jia J, Huang F, Gao W. Feature Correlation-Steered Capsule Network for object detection. Neural Networks 2021; 147:25-41. PMID: 34953299. DOI: 10.1016/j.neunet.2021.12.003.
Abstract: Although Convolutional Neural Network (CNN)-based approaches have been successful in object detection, they predominantly focus on positioning discriminative regions while overlooking the internal holistic part-whole associations within objects. This ultimately leads to the neglect of feature relationships between an object and its parts, as well as among those parts, both of which are significantly helpful for detecting discriminative parts. In this paper, we propose to "look inside the objects" by digging into part-whole feature correlations, and attempt to leverage the correlations endowed by the Capsule Network (CapsNet) for robust object detection. Highly correlated capsules across adjacent layers share high familiarity and are more likely to be routed together. In light of this, we take such correlations between different capsules of the preceding training samples as a cue to constrain the subsequent candidate voting scope during the routing procedure, and propose a Feature Correlation-Steered CapsNet (FCS-CapsNet) with Locally-Constrained Expectation-Maximum (EM) Routing Agreement (LCEMRA). Different from conventional EM routing, LCEMRA stipulates that only those relevant low-level capsules (parts) meeting the requirement of quantified intra-object cohesiveness can be clustered to make up high-level capsules (objects). In doing so, part-object associations can be uncovered by the transformation weighting matrices between capsule layers during this "part backtracking" procedure. LCEMRA enables low-level capsules to selectively gather projections from a non-spatially-fixed set of high-level capsules. Experiments on VOC2007, VOC2012, HKU-IS, DUTS, and COCO show that FCS-CapsNet achieves promising object detection results across multiple evaluation metrics, on par with the state of the art.
Affiliations:
- Zhongqi Lin, Wanlin Gao: College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China; Key Laboratory of Agricultural Informatization Standardization, Ministry of Agriculture and Rural Affairs, Beijing 100083, China
- Jingdun Jia: Key Laboratory of Agricultural Informatization Standardization, Ministry of Agriculture and Rural Affairs, Beijing 100083, China
- Feng Huang: College of Science, China Agricultural University, Beijing 100083, China

10. Chen Z, Cong R, Xu Q, Huang Q. DPANet: Depth Potentiality-Aware Gated Attention Network for RGB-D Salient Object Detection. IEEE Transactions on Image Processing 2021; 30:7012-7024. PMID: 33141667. DOI: 10.1109/TIP.2020.3028289.
Abstract: There are two main issues in RGB-D salient object detection: (1) how to effectively integrate the complementarity of cross-modal RGB-D data; (2) how to prevent contamination from an unreliable depth map. These two problems are linked and intertwined, but previous methods tend to focus only on the first problem and ignore depth map quality, which may cause the model to fall into a sub-optimal state. In this paper, we address these two issues in a holistic model and propose a novel network named DPANet to explicitly model the potentiality of the depth map and effectively integrate the cross-modal complementarity. By introducing depth potentiality perception, the network can perceive the potentiality of depth information in a learning-based manner and guide the fusion of the two modalities to prevent contamination. The gated multi-modality attention module in the fusion process exploits an attention mechanism with a gate controller to capture long-range dependencies from a cross-modal perspective. Experimental results compared with 16 state-of-the-art methods on 8 datasets demonstrate the validity of the proposed approach both quantitatively and qualitatively. https://github.com/JosephChenHub/DPANet.

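A minimal sketch of a depth-potentiality gate in the spirit of the description above (my own simplification, not the authors' module): a small head predicts a scalar reliability score from the depth features, which then scales the depth contribution before cross-modal fusion.

```python
import torch
import torch.nn as nn

class DepthGatedFusion(nn.Module):
    """Scale the depth stream by a learned reliability score before fusing
    it with the RGB stream (illustrative sketch)."""

    def __init__(self, channels=64):
        super().__init__()
        self.quality_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, 1),
            nn.Sigmoid(),            # scalar in (0, 1): depth potentiality
        )
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1)

    def forward(self, rgb_feat, depth_feat):
        gate = self.quality_head(depth_feat).view(-1, 1, 1, 1)
        fused = self.fuse(torch.cat([rgb_feat, gate * depth_feat], dim=1))
        return fused, gate

rgb, dep = torch.randn(2, 64, 40, 40), torch.randn(2, 64, 40, 40)
fused, gate = DepthGatedFusion()(rgb, dep)
print(fused.shape, gate.view(-1))  # torch.Size([2, 64, 40, 40]) and two gate values
```
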
11. Yuan Y, Ning H, Lu X. Bio-Inspired Representation Learning for Visual Attention Prediction. IEEE Transactions on Cybernetics 2021; 51:3562-3575. PMID: 31484145. DOI: 10.1109/TCYB.2019.2931735.
Abstract: Visual attention prediction (VAP) is a significant and imperative issue in the field of computer vision. Most existing VAP methods are based on deep learning, yet they do not fully exploit low-level contrast features while generating the visual attention map. In this article, a novel VAP method is proposed to generate the visual attention map via bio-inspired representation learning, which combines low-level contrast and high-level semantic features simultaneously, motivated by the fact that the human eye is sensitive to patches with high contrast and objects with high semantics. The proposed method is composed of three main steps: 1) feature extraction; 2) bio-inspired representation learning; and 3) visual attention map generation. First, the high-level semantic feature is extracted from a refined VGG16, while the low-level contrast feature is extracted by the proposed contrast feature extraction block in a deep network. Second, during bio-inspired representation learning, the extracted low-level contrast and high-level semantic features are combined by the designed densely connected block, which concatenates various features scale by scale. Finally, a weighted-fusion layer is exploited to generate the ultimate visual attention map from the representations obtained after bio-inspired representation learning. Extensive experiments are performed to demonstrate the effectiveness of the proposed method.

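One way to picture the low-level contrast feature mentioned above is a center-surround operation: subtract a locally averaged version of the feature map from the map itself, so regions that differ from their neighbourhood respond strongly. This is a generic sketch, not the paper's exact block; the surround size and projection layer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastBlock(nn.Module):
    """Center-surround contrast: response = feature - local average."""

    def __init__(self, channels=64, surround=7):
        super().__init__()
        self.surround = surround
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        local_mean = F.avg_pool2d(x, kernel_size=self.surround,
                                  stride=1, padding=self.surround // 2)
        return F.relu(self.proj(x - local_mean))

x = torch.randn(1, 64, 56, 56)
print(ContrastBlock()(x).shape)  # torch.Size([1, 64, 56, 56])
```
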
12. CNN-Based RGB-D Salient Object Detection: Learn, Select, and Fuse. International Journal of Computer Vision 2021. DOI: 10.1007/s11263-021-01452-0.

13. Fan DP, Lin Z, Zhang Z, Zhu M, Cheng MM. Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:2075-2089. PMID: 32491986. DOI: 10.1109/TNNLS.2020.2996406.
Abstract: The use of RGB-D information for salient object detection (SOD) has been extensively explored in recent years. However, relatively few efforts have been put toward modeling SOD in real-world human activity scenes with RGB-D. In this article, we fill the gap by making the following contributions to RGB-D SOD: 1) we carefully collect a new Salient Person (SIP) data set that consists of ~1K high-resolution images covering diverse real-world scenes from various viewpoints, poses, occlusions, illuminations, and backgrounds; 2) we conduct a large-scale (and, so far, the most comprehensive) benchmark comparing contemporary methods, which has long been missing in the field and can serve as a baseline for future research; we systematically summarize 32 popular models and evaluate 18 of them on seven data sets containing a total of about 97k images; and 3) we propose a simple general architecture, called the deep depth-depurator network (D3Net). It consists of a depth depurator unit (DDU) and a three-stream feature learning module (FLM), which perform low-quality depth map filtering and cross-modal feature learning, respectively. These components form a nested structure and are elaborately designed to be learned jointly. D3Net exceeds the performance of all prior contenders across all five metrics under consideration, thus serving as a strong model to advance research in this field. We also demonstrate that D3Net can be used to efficiently extract salient object masks from real scenes, enabling effective background-changing applications at 65 frames/s on a single GPU. All the saliency maps, our new SIP data set, the D3Net model, and the evaluation tools are publicly available at https://github.com/DengPingFan/D3NetBenchmark.

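The depth depurator idea can be sketched as a hard gate, under the assumption that a small scorer rates the depth map's quality and the model falls back to an RGB-only prediction when the score is low. The threshold, scorer, and two prediction heads below are hypothetical; the real DDU and FLM are more elaborate.

```python
import torch
import torch.nn as nn

class DepthDepurator(nn.Module):
    """Score depth-map quality and discard the depth stream when it is low
    (illustrative gate, not the authors' exact unit)."""

    def __init__(self, threshold=0.5):
        super().__init__()
        self.threshold = threshold
        self.scorer = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, 1),
            nn.Sigmoid(),
        )

    def forward(self, depth, pred_rgbd, pred_rgb):
        # depth: (B, 1, H, W); pred_*: (B, 1, H, W) saliency maps from two heads
        quality = self.scorer(depth).view(-1, 1, 1, 1)
        keep = (quality > self.threshold).float()
        return keep * pred_rgbd + (1.0 - keep) * pred_rgb

d = torch.rand(2, 1, 224, 224)
p_rgbd, p_rgb = torch.rand(2, 1, 224, 224), torch.rand(2, 1, 224, 224)
print(DepthDepurator()(d, p_rgbd, p_rgb).shape)  # torch.Size([2, 1, 224, 224])
```
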
14. Stereo superpixel: An iterative framework based on parallax consistency and collaborative optimization. Information Sciences 2021. DOI: 10.1016/j.ins.2020.12.031.

15. Ma G, Li S, Chen C, Hao A, Qin H. Rethinking Image Salient Object Detection: Object-Level Semantic Saliency Reranking First, Pixelwise Saliency Refinement Later. IEEE Transactions on Image Processing 2021; 30:4238-4252. PMID: 33819154. DOI: 10.1109/TIP.2021.3068649.
Abstract: Human attention is an interactive activity between our visual system and our brain, using both low-level visual stimulus and high-level semantic information. Previous image salient object detection (SOD) studies conduct their saliency predictions via a multitask methodology in which pixelwise saliency regression and segmentation-like saliency refinement are conducted simultaneously. However, this multitask methodology has one critical limitation: the semantic information embedded in feature backbones might be degenerated during the training process. Our visual attention is determined mainly by semantic information, which is evidenced by our tendency to pay more attention to semantically salient regions even if these regions are not the most perceptually salient at first glance. This fact clearly contradicts the widely used multitask methodology mentioned above. To address this issue, this paper divides the SOD problem into two sequential steps. First, we devise a lightweight, weakly supervised deep network to coarsely locate the semantically salient regions. Next, as a postprocessing refinement, we selectively fuse multiple off-the-shelf deep models on the semantically salient regions identified by the previous step to formulate a pixelwise saliency map. Compared with the state-of-the-art (SOTA) models that focus on learning the pixelwise saliency in single images using only perceptual clues, our method aims at investigating the object-level semantic ranks between multiple images, of which the methodology is more consistent with the human attention mechanism. Our method is simple yet effective, and it is the first attempt to consider salient object detection as mainly an object-level semantic reranking problem.

16. Wang X, Li S, Chen C, Hao A, Qin H. Depth quality-aware selective saliency fusion for RGB-D image salient object detection. Neurocomputing 2021. DOI: 10.1016/j.neucom.2020.12.071.

17. Chen C, Wei J, Peng C, Qin H. Depth-Quality-Aware Salient Object Detection. IEEE Transactions on Image Processing 2021; 30:2350-2363. PMID: 33481710. DOI: 10.1109/TIP.2021.3052069.
Abstract: Existing fusion-based RGB-D salient object detection methods usually adopt a bistream structure to strike a balance in the fusion trade-off between RGB and depth (D). While depth quality usually varies across scenes, the state-of-the-art bistream approaches are depth-quality-unaware, which makes it difficult to achieve complementary fusion between RGB and D and leads to poor fusion results for low-quality D. Thus, this paper integrates a novel depth-quality-aware subnet into the classic bistream structure to assess the depth quality prior to conducting selective RGB-D fusion. Compared to the SOTA bistream methods, the major advantage of our method is its ability to lessen the importance of low-quality, no-contribution, or even negative-contribution D regions during RGB-D fusion, achieving a much improved complementary status between RGB and D. Our source code and data are available online at https://github.com/qdu1995/DQSD.

18. Zhou T, Fan DP, Cheng MM, Shen J, Shao L. RGB-D salient object detection: A survey. Computational Visual Media 2021; 7:37-69. PMID: 33432275. PMCID: PMC7788385. DOI: 10.1007/s41095-020-0199-z.
Abstract: Salient object detection, which simulates human visual perception in locating the most significant object(s) in a scene, has been widely applied to various computer vision tasks. Now, the advent of depth sensors means that depth maps can easily be captured; this additional spatial information can boost the performance of salient object detection. Although various RGB-D based salient object detection models with promising performance have been proposed over the past several years, an in-depth understanding of these models and the challenges in this field remains lacking. In this paper, we provide a comprehensive survey of RGB-D based salient object detection models from various perspectives, and review related benchmark datasets in detail. Further, as light fields can also provide depth maps, we review salient object detection models and popular benchmark datasets from this domain too. Moreover, to investigate the ability of existing models to detect salient objects, we have carried out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models. Finally, we discuss several challenges and open directions of RGB-D based salient object detection for future research. All collected models, benchmark datasets, datasets constructed for attribute-based evaluation, and related code are publicly available at https://github.com/taozh2017/RGBD-SODsurvey.
Affiliations:
- Tao Zhou, Deng-Ping Fan, Jianbing Shen, Ling Shao: Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates

19. Wang W, Shen J, Xie J, Cheng MM, Ling H, Borji A. Revisiting Video Saliency Prediction in the Deep Learning Era. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021; 43:220-237. PMID: 31247542. DOI: 10.1109/TPAMI.2019.2924417.
Abstract: Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, relatively less effort has been spent on understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during dynamic scene free-viewing, a long-standing need in this field. DHF1K consists of 1K high-quality, elaborately selected video sequences annotated by 17 observers using an eye tracker. The videos span a wide range of scenes, motions, object types, and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), that augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF Sports. An attribute-based analysis of previous saliency models and cross-dataset generalization are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms other contenders and has a fast processing speed (40 fps on a single GPU). Our code and all the results are available at https://github.com/wenguanwang/DHF1K.

20. Li C, Cong R, Kwong S, Hou J, Fu H, Zhu G, Zhang D, Huang Q. ASIF-Net: Attention Steered Interweave Fusion Network for RGB-D Salient Object Detection. IEEE Transactions on Cybernetics 2021; 51:88-100. PMID: 32078571. DOI: 10.1109/TCYB.2020.2969255.
Abstract: Salient object detection from RGB-D images is an important yet challenging vision task, which aims at detecting the most distinctive objects in a scene by combining color information and depth constraints. Unlike prior fusion manners, we propose an attention steered interweave fusion network (ASIF-Net) to detect salient objects, which progressively integrates cross-modal and cross-level complementarity from the RGB image and corresponding depth map via steering of an attention mechanism. Specifically, the complementary features from RGB-D images are jointly extracted and hierarchically fused in a dense and interweaved manner. Such a manner breaks down the barriers of inconsistency existing in the cross-modal data and also sufficiently captures the complementarity. Meanwhile, an attention mechanism is introduced to locate potential salient regions in an attention-weighted fashion, which helps highlight the salient objects and suppress cluttered background regions. Instead of focusing only on pixelwise saliency, we also ensure that the detected salient objects have objectness characteristics (e.g., complete structure and sharp boundary) by incorporating adversarial learning, which provides a global semantic constraint for RGB-D salient object detection. Quantitative and qualitative experiments demonstrate that the proposed method performs favorably against 17 state-of-the-art saliency detectors on four publicly available RGB-D salient object detection datasets. The code and results of our method are available at https://github.com/Li-Chongyi/ASIF-Net.

21. Zhang Q, Cong R, Li C, Cheng MM, Fang Y, Cao X, Zhao Y, Kwong S. Dense Attention Fluid Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Transactions on Image Processing 2020; 30:1305-1317. PMID: 33306467. DOI: 10.1109/TIP.2020.3042084.
Abstract: Despite the remarkable advances in visual saliency analysis for natural scene images (NSIs), salient object detection (SOD) for optical remote sensing images (RSIs) still remains an open and challenging problem. In this paper, we propose an end-to-end Dense Attention Fluid Network (DAFNet) for SOD in optical RSIs. A Global Context-aware Attention (GCA) module is proposed to adaptively capture long-range semantic context relationships, and is further embedded in a Dense Attention Fluid (DAF) structure that enables shallow attention cues to flow into deep layers to guide the generation of high-level feature attention maps. Specifically, the GCA module is composed of two key components, where the global feature aggregation module achieves mutual reinforcement of salient feature embeddings from any two spatial locations, and the cascaded pyramid attention module tackles the scale variation issue by building up a cascaded pyramid framework to progressively refine the attention map in a coarse-to-fine manner. In addition, we construct a new and challenging optical RSI dataset for SOD that contains 2,000 images with pixel-wise saliency annotations, which is currently the largest publicly available benchmark. Extensive experiments demonstrate that our proposed DAFNet significantly outperforms the existing state-of-the-art SOD competitors. https://github.com/rmcong/DAFNet_TIP20.

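The global feature aggregation step described above resembles a non-local (self-attention) block, sketched below in a generic form: every spatial location attends to every other location, so long-range context reinforces salient features. Channel sizes and the residual connection are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextAttention(nn.Module):
    """Non-local style attention over all spatial positions (generic sketch)."""

    def __init__(self, channels=64, reduced=32):
        super().__init__()
        self.query = nn.Conv2d(channels, reduced, 1)
        self.key = nn.Conv2d(channels, reduced, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, r)
        k = self.key(x).flatten(2)                     # (B, r, HW)
        v = self.value(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = F.softmax(q @ k / (q.size(-1) ** 0.5), dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
        return x + out   # residual connection keeps local detail

x = torch.randn(2, 64, 24, 24)
print(GlobalContextAttention()(x).shape)  # torch.Size([2, 64, 24, 24])
```
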
22. Zhang X, Wang Z, Hu Q, Ren J, Sun M. Boundary-aware High-resolution Network with region enhancement for salient object detection. Neurocomputing 2020. DOI: 10.1016/j.neucom.2020.08.038.

23. Li C, Cong R, Guo C, Li H, Zhang C, Zheng F, Zhao Y. A parallel down-up fusion network for salient object detection in optical remote sensing images. Neurocomputing 2020. DOI: 10.1016/j.neucom.2020.05.108.

24. Chen H, Li Y, Su D. Discriminative Cross-Modal Transfer Learning and Densely Cross-Level Feedback Fusion for RGB-D Salient Object Detection. IEEE Transactions on Cybernetics 2020; 50:4808-4820. PMID: 31484153. DOI: 10.1109/TCYB.2019.2934986.
Abstract: This article addresses two key issues in RGB-D salient object detection based on convolutional neural networks (CNNs): 1) How to bridge the gap between the "data-hungry" nature of CNNs and the insufficient labeled training data in the depth modality? 2) How to take full advantage of the complementary information between the two modalities? To solve the first problem, we model depth-induced saliency detection as a CNN-based cross-modal transfer learning problem. Instead of directly adopting the RGB CNN as initialization, we additionally train a modality classification network (MCNet) to encourage discriminative modality-specific representations by minimizing the modality classification loss. To solve the second problem, we propose a densely cross-level feedback topology, in which the cross-modal complements are combined in each level and then densely fed back to all shallower layers for sufficient cross-level interactions. Compared to traditional two-stream frameworks, the proposed one can better explore, select, and fuse cross-modal cross-level complements. Experiments show significant and consistent improvements of the proposed CNN framework over other state-of-the-art methods.

25. Cong R, Lei J, Fu H, Hou J, Huang Q, Kwong S. Going From RGB to RGBD Saliency: A Depth-Guided Transformation Model. IEEE Transactions on Cybernetics 2020; 50:3627-3639. PMID: 31443060. DOI: 10.1109/TCYB.2019.2932005.
Abstract: Depth information has been demonstrated to be useful for saliency detection. However, the existing methods for RGBD saliency detection mainly focus on designing straightforward and comprehensive models, while ignoring the transferable ability of the existing RGB saliency detection models. In this article, we propose a novel depth-guided transformation model (DTM) going from RGB saliency to RGBD saliency. The proposed model includes three components: 1) multilevel RGBD saliency initialization; 2) depth-guided saliency refinement; and 3) saliency optimization with depth constraints. The explicit depth feature is first utilized in the multilevel RGBD saliency model to initialize the RGBD saliency by combining the global compactness saliency cue and local geodesic saliency cue. The depth-guided saliency refinement is used to further highlight the salient objects and suppress the background regions by introducing the prior depth domain knowledge and prior refined depth shape. Benefiting from the consistency of the entire object in the depth map, we formulate an optimization model to attain more consistent and accurate saliency results via an energy function, which integrates the unary data term, color smooth term, and depth consistency term. Experiments on three public RGBD saliency detection benchmarks demonstrate the effectiveness and performance improvement of the proposed DTM from RGB to RGBD saliency.

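The optimization step described above can be written in a generic form consistent with the abstract (the symbols, weights, and exact potentials below are my own, not reproduced from the paper): an energy over region saliency values combining a unary data term, a color smoothness term, and a depth consistency term,

```latex
E(\mathbf{s}) \;=\; \sum_{i} \big(s_i - \tilde{s}_i\big)^2
\;+\; \lambda_c \sum_{(i,j)\in\mathcal{N}} w^{c}_{ij}\,\big(s_i - s_j\big)^2
\;+\; \lambda_d \sum_{(i,j)\in\mathcal{N}} w^{d}_{ij}\,\big(s_i - s_j\big)^2 ,
```

where $\tilde{s}_i$ is the refined initial saliency of region $i$, $w^{c}_{ij}$ and $w^{d}_{ij}$ measure color and depth affinity between neighboring regions, and $\lambda_c$, $\lambda_d$ balance the terms. Minimizing a quadratic energy of this kind reduces to solving a sparse linear system.
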
26. Wang W, Shen J, Dong X, Borji A, Yang R. Inferring Salient Objects from Human Fixations. IEEE Transactions on Pattern Analysis and Machine Intelligence 2020; 42:1913-1927. PMID: 30892201. DOI: 10.1109/TPAMI.2019.2905607.
Abstract: Previous research in visual saliency has focused on two major types of models, namely fixation prediction and salient object detection. The relationship between the two, however, has been less explored. In this work, we propose to employ the former model type to identify salient objects. We build a novel Attentive Saliency Network (ASNet, available at https://github.com/wenguanwang/ASNet) that learns to detect salient objects from fixations. The fixation map, derived at the upper network layers, mimics human visual attention mechanisms and captures a high-level understanding of the scene from a global view. Salient object detection is then viewed as fine-grained object-level saliency segmentation and is progressively optimized with the guidance of the fixation map in a top-down manner. ASNet is based on a hierarchy of convLSTMs that offers an efficient recurrent mechanism to sequentially refine the saliency features over multiple steps. Several loss functions, derived from existing saliency evaluation metrics, are incorporated to further boost the performance. Extensive experiments on several challenging datasets show that our ASNet outperforms existing methods and is capable of generating accurate segmentation maps with the help of the computed fixation prior. Our work offers deeper insight into the mechanisms of attention and narrows the gap between salient object detection and fixation prediction.

27. Recent Advances in Saliency Estimation for Omnidirectional Images, Image Groups, and Video Sequences. Applied Sciences (Basel) 2020. DOI: 10.3390/app10155143.
Abstract: We present a review of methods for automatic estimation of visual saliency: the perceptual property that makes specific elements in a scene stand out and grab the attention of the viewer. We focus on domains that are especially recent and relevant, as they make saliency estimation particularly useful and/or effective: omnidirectional images, image groups for co-saliency, and video sequences. For each domain, we perform a selection of recent methods, we highlight their commonalities and differences, and describe their unique approaches. We also report and analyze the datasets involved in the development of such methods, in order to reveal additional peculiarities of each domain, such as the representation used for the ground truth saliency information (scanpaths, saliency maps, or salient object regions). We define domain-specific evaluation measures, and provide quantitative comparisons on the basis of common datasets and evaluation criteria, highlighting the different impact of existing approaches on each domain. We conclude by synthesizing the emerging directions for research in the specialized literature, which include novel representations for omnidirectional images, inter- and intra-image saliency decomposition for co-saliency, and saliency shift for video saliency estimation.

28. Ji W, Li X, Wei L, Wu F, Zhuang Y. Context-aware Graph Label Propagation Network for Saliency Detection. IEEE Transactions on Image Processing 2020; PP:8177-8186. PMID: 32746240. DOI: 10.1109/TIP.2020.3002083.
Abstract: Recently, a large number of methods for saliency detection have mainly focused on designing complex network architectures to aggregate powerful features from backbone networks. However, contextual information is not well utilized, which often causes false background regions and blurred object boundaries. Motivated by these issues, we propose an easy-to-implement module that utilizes the edge-preserving ability of superpixels and a graph neural network to propagate context among superpixel nodes. In more detail, we first extract features from the backbone network and obtain the superpixel information of images. This step is followed by superpixel pooling, in which we transfer the irregular superpixel information to a structured feature representation. To propagate information among the foreground and background regions, we use a graph neural network and a self-attention layer to better evaluate the degree of saliency. Additionally, an affinity loss is proposed to regularize the affinity matrix and constrain the propagation path. Moreover, we extend our module to a multiscale structure with different numbers of superpixels. Experiments on five challenging datasets show that our approach can improve the performance of three baseline methods in terms of popular evaluation metrics.

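Superpixel pooling, as used above, can be sketched as a scatter-mean: average the CNN features of all pixels that share a superpixel label, giving one node feature per superpixel for the graph network. The sketch below is a generic implementation of that idea, not the authors' code.

```python
import torch

def superpixel_pool(features, labels, num_superpixels):
    """Average pixel features within each superpixel.

    features: (C, H, W) backbone feature map (already resized to label size)
    labels:   (H, W) integer superpixel id per pixel, in [0, num_superpixels)
    returns:  (num_superpixels, C) node features
    """
    C = features.size(0)
    flat_feat = features.reshape(C, -1).t()          # (H*W, C)
    flat_lab = labels.reshape(-1)                    # (H*W,)

    sums = torch.zeros(num_superpixels, C, dtype=flat_feat.dtype)
    counts = torch.zeros(num_superpixels, dtype=flat_feat.dtype)
    sums.index_add_(0, flat_lab, flat_feat)
    counts.index_add_(0, flat_lab, torch.ones_like(flat_lab, dtype=flat_feat.dtype))
    return sums / counts.clamp(min=1).unsqueeze(1)   # (N, C)

feat = torch.randn(64, 32, 32)
lab = torch.randint(0, 100, (32, 32))
print(superpixel_pool(feat, lab, 100).shape)  # torch.Size([100, 64])
```
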
29. Chen C, Wei J, Peng C, Zhang W, Qin H. Improved Saliency Detection in RGB-D Images Using Two-phase Depth Estimation and Selective Deep Fusion. IEEE Transactions on Image Processing 2020; 29:4296-4307. PMID: 32012011. DOI: 10.1109/TIP.2020.2968250.
Abstract: To solve the saliency detection problem in RGB-D images, depth information plays a critical role in distinguishing salient objects or foregrounds from cluttered backgrounds. As the complementary component to color information, the depth quality directly dictates the subsequent saliency detection performance. However, due to artifacts and the limitations of depth acquisition devices, the quality of the obtained depth varies tremendously across different scenarios. Consequently, conventional selective fusion-based RGB-D saliency detection methods may produce degraded detection results for salient objects with low color contrast coupled with low depth quality. To solve this problem, we make an initial attempt to estimate additional high-quality depth information, denoted Depth+. Serving as a complement to the original depth, Depth+ is fed into our newly designed selective fusion network to boost the detection performance. To achieve this aim, we first retrieve a small group of images that are similar to the given input, and the inter-image, nonlocal correspondences are built accordingly. By using these inter-image correspondences, the overall depth can be coarsely estimated with our newly designed depth-transferring strategy. Next, we build fine-grained, object-level correspondences coupled with a saliency prior to further improve the depth quality of the previous estimate. Compared to the original depth, our newly estimated Depth+ is potentially more informative for detection improvement. Finally, we feed both the original depth and the newly estimated Depth+ into our selective deep fusion network, whose key novelty is to achieve an optimal complementary balance to make better decisions toward improving saliency boundaries.

30. Cong R, Lei J, Fu H, Porikli F, Huang Q, Hou C. Video Saliency Detection via Sparsity-Based Reconstruction and Propagation. IEEE Transactions on Image Processing 2019; 28:4819-4831. PMID: 31059438. DOI: 10.1109/TIP.2019.2910377.
Abstract: Video saliency detection aims to continuously discover the motion-related salient objects from video sequences. Since it needs to consider the spatial and temporal constraints jointly, video saliency detection is more challenging than image saliency detection. In this paper, we propose a new method to detect the salient objects in video based on sparse reconstruction and propagation. With the assistance of novel static and motion priors, a single-frame saliency model is first designed to represent the spatial saliency in each individual frame via sparsity-based reconstruction. Then, through a progressive sparsity-based propagation, the sequential correspondence in the temporal space is captured to produce the inter-frame saliency map. Finally, these two maps are incorporated into a global optimization model to achieve spatio-temporal smoothness and global consistency of the salient object in the whole video. Experiments on three large-scale video saliency datasets demonstrate that the proposed method outperforms the state-of-the-art algorithms both qualitatively and quantitatively.

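The sparsity-based reconstruction idea can be written in a standard generic form (symbols of my choosing, not the paper's exact notation): each region feature $\mathbf{x}_i$ is reconstructed from a prior/background dictionary $\mathbf{D}$, and a large reconstruction residual indicates saliency,

```latex
\boldsymbol{\alpha}_i^{*} \;=\; \arg\min_{\boldsymbol{\alpha}_i}
\;\lVert \mathbf{x}_i - \mathbf{D}\boldsymbol{\alpha}_i \rVert_2^2
\;+\; \lambda \lVert \boldsymbol{\alpha}_i \rVert_1 ,
\qquad
S(i) \;=\; \lVert \mathbf{x}_i - \mathbf{D}\boldsymbol{\alpha}_i^{*} \rVert_2 .
```
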
31. Lai Q, Wang W, Sun H, Shen J. Video Saliency Prediction using Spatiotemporal Residual Attentive Networks. IEEE Transactions on Image Processing 2019; 29:1113-1126. PMID: 31449021. DOI: 10.1109/TIP.2019.2936112.
Abstract: This paper proposes a novel residual attentive learning network architecture for predicting dynamic eye-fixation maps. The proposed model emphasizes two essential issues, i.e., effective spatiotemporal feature integration and multi-scale saliency learning. For the first problem, appearance and motion streams are tightly coupled via dense residual cross connections, which integrate appearance information with multi-layer, comprehensive motion features in a residual and dense way. Beyond traditional two-stream models that learn appearance and motion features separately, such a design allows early, multi-path information exchange between different domains, leading to a unified and powerful spatiotemporal learning architecture. For the second, we propose a composite attention mechanism that learns multi-scale local attentions and global attention priors end-to-end. It is used to enhance the fused spatiotemporal features by emphasizing important features at multiple scales. A lightweight convolutional Gated Recurrent Unit (convGRU), which is flexible for small-training-data situations, is used for long-term temporal characteristics modeling. Extensive experiments over four benchmark datasets clearly demonstrate the advantage of the proposed video saliency model over other competitors and the effectiveness of each component of our network. Our code and all the results are available at https://github.com/ashleylqx/STRA-Net.

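For reference, a convolutional GRU cell of the kind mentioned above can be written compactly as below. This is the standard convGRU formulation with arbitrarily chosen channel sizes, not the paper's specific configuration.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Standard convolutional GRU: gates computed with 2-D convolutions."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)
        self.hid_ch = hid_ch

    def forward(self, x, h=None):
        if h is None:
            h = torch.zeros(x.size(0), self.hid_ch, x.size(2), x.size(3),
                            device=x.device, dtype=x.dtype)
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)                 # update and reset gates
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde          # new hidden state

cell = ConvGRUCell(64, 32)
h = None
for t in range(4):                                # iterate over frames
    h = cell(torch.randn(1, 64, 28, 28), h)
print(h.shape)  # torch.Size([1, 32, 28, 28])
```
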
32. Chi J, Wu C, Yu X, Chu H, Ji P. Saliency detection via integrating deep learning architecture and low-level features. Neurocomputing 2019. DOI: 10.1016/j.neucom.2019.03.070.

33. Liu Y, Han J, Zhang Q, Shan C. Deep Salient Object Detection with Contextual Information Guidance. IEEE Transactions on Image Processing 2019; 29:360-374. PMID: 31380760. DOI: 10.1109/TIP.2019.2930906.
Abstract: Integration of multi-level contextual information, such as feature maps and side outputs, is crucial for convolutional neural network (CNN)-based salient object detection. However, most existing methods either simply concatenate multi-level feature maps or calculate element-wise addition of multi-level side outputs, thus failing to take full advantage of them. In this work, we propose a new strategy for guiding multi-level contextual information integration, where feature maps and side outputs across layers are fully engaged. Specifically, shallower-level feature maps are guided by the deeper-level side outputs to learn more accurate properties of the salient object. In turn, the deeper-level side outputs can be propagated to high-resolution versions with spatial details complemented by means of shallower-level feature maps. Moreover, a group convolution module is proposed with the aim of achieving highly discriminative feature maps, in which the backbone feature maps are divided into a number of groups and the convolution is applied to the channels of backbone feature maps within each group. Eventually, the group convolution module is incorporated in the guidance module to further promote the guidance role. Experiments on three public benchmark datasets verify the effectiveness and superiority of the proposed method over the state-of-the-art methods.

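A minimal sketch of the guidance idea described above, with hypothetical module names: a deeper side output is upsampled, squashed to (0, 1), and used to modulate a shallower feature map, which is then refined with a group convolution (channels split into groups, each group convolved independently).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedGroupConv(nn.Module):
    """Deeper side output guides a shallower feature map; a group convolution
    then refines the guided features (illustrative sketch)."""

    def __init__(self, channels=64, groups=8):
        super().__init__()
        self.group_conv = nn.Conv2d(channels, channels, kernel_size=3,
                                    padding=1, groups=groups)

    def forward(self, shallow_feat, deep_side_out):
        # deep_side_out: (B, 1, h, w) coarse saliency from a deeper layer
        guide = torch.sigmoid(
            F.interpolate(deep_side_out, size=shallow_feat.shape[-2:],
                          mode='bilinear', align_corners=False))
        guided = shallow_feat * guide + shallow_feat   # emphasize salient regions
        return F.relu(self.group_conv(guided))

shallow = torch.randn(2, 64, 56, 56)
side = torch.randn(2, 1, 14, 14)
print(GuidedGroupConv()(shallow, side).shape)  # torch.Size([2, 64, 56, 56])
```
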
34. Wang W, Shen J, Ling H. A Deep Network Solution for Attention and Aesthetics Aware Photo Cropping. IEEE Transactions on Pattern Analysis and Machine Intelligence 2019; 41:1531-1544. PMID: 29993710. DOI: 10.1109/TPAMI.2018.2840724.
Abstract: We study the problem of photo cropping, which aims to find a cropping window of an input image that preserves its important parts as much as possible while being aesthetically pleasing. Seeking a deep learning-based solution, we design a neural network that has two branches for attention box prediction (ABP) and aesthetics assessment (AA), respectively. Given the input image, the ABP network predicts an attention bounding box as an initial minimum cropping window, around which a set of cropping candidates are generated with little loss of important information. Then, the AA network is employed to select the final cropping window with the best aesthetic quality among the candidates. The two sub-networks share the same full-image convolutional feature map and are thus computationally efficient. By leveraging attention prediction and aesthetics assessment, the cropping model produces high-quality cropping results, even with the limited availability of training data for photo cropping. Experimental results on benchmark datasets clearly validate the effectiveness of the proposed approach. In addition, our approach runs at 5 fps, outperforming most previous solutions. The code and results are available at https://github.com/shenjianbing/DeepCropping.

35. Zhu C, Zhang W, Li TH, Liu S, Li G. Exploiting the Value of the Center-dark Channel Prior for Salient Object Detection. ACM Transactions on Intelligent Systems and Technology 2019. DOI: 10.1145/3319368.
Abstract: Saliency detection aims to detect the most attractive objects in images and is widely used as a foundation for various applications. In this article, we propose a novel salient object detection algorithm for RGB-D images using center-dark channel priors. First, we generate an initial saliency map based on a color saliency map and a depth saliency map of a given RGB-D image. Then, we generate a center-dark channel map based on the center saliency and dark channel priors. Finally, we fuse the initial saliency map with the center-dark channel map to generate the final saliency map. Extensive evaluations over four benchmark datasets demonstrate that our proposed method performs favorably against most state-of-the-art approaches. We further discuss the application of the proposed algorithm to small target detection and demonstrate the universal value of center-dark channel priors in the field of object detection.
Affiliations:
- Chunbiao Zhu, Wenhao Zhang, Ge Li: Shenzhen Graduate School, Peking University, Shenzhen, Guangdong, China
- Thomas H. Li: Shenzhen Graduate School, China and Advanced Institute of Information Technology, Peking University, China

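The dark channel component of the prior can be sketched directly (the center prior and the RGB-D fusion are omitted): for each pixel, take the minimum over the color channels and then a local minimum over a patch, which is the classic dark channel operation. Patch size is an arbitrary choice here.

```python
import torch
import torch.nn.functional as F

def dark_channel(image, patch=15):
    """Classic dark channel: per-pixel min over RGB, then a local min filter.

    image: (B, 3, H, W) tensor with values in [0, 1]
    """
    per_pixel_min = image.min(dim=1, keepdim=True).values      # (B, 1, H, W)
    # Local minimum = negated max-pool of the negated map (erosion).
    return -F.max_pool2d(-per_pixel_min, kernel_size=patch,
                         stride=1, padding=patch // 2)

img = torch.rand(1, 3, 120, 160)
print(dark_channel(img).shape)  # torch.Size([1, 1, 120, 160])
```
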
36. Chen H, Li Y. Three-stream Attention-aware Network for RGB-D Salient Object Detection. IEEE Transactions on Image Processing 2019; 28:2825-2835. PMID: 30624216. DOI: 10.1109/TIP.2019.2891104.
Abstract: Previous RGB-D fusion systems based on convolutional neural networks (CNNs) typically employ a two-stream architecture, in which RGB and depth inputs are learned independently. The multi-modal fusion stage is typically performed by concatenating the deep features from each stream during inference. The traditional two-stream architecture may suffer from insufficient multi-modal fusion due to two limitations: (1) the cross-modal complementarity is rarely studied in the bottom-up path, where we believe the cross-modal complements can be combined to learn new discriminative features and enlarge the RGB-D representation community; (2) the cross-modal channels are typically combined by undifferentiated concatenation, which makes it ambiguous to select cross-modal complementary features. In this work, we address these two limitations by proposing a novel three-stream attention-aware multi-modal fusion network. In the proposed architecture, a cross-modal distillation stream, accompanying the RGB-specific and depth-specific streams, is introduced to extract new RGB-D features at each level in the bottom-up path. Furthermore, a channel-wise attention mechanism is introduced to the cross-modal cross-level fusion problem to adaptively select complementary feature maps from each modality at each level. Extensive experiments report the effectiveness of the proposed architecture and significant improvement over state-of-the-art RGB-D salient object detection methods.

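Channel-wise attention for cross-modal selection can be sketched as a squeeze-and-excitation style gate over the concatenated RGB and depth channels; the layout below is a generic illustration, not the exact module of the paper, and the reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Re-weight concatenated RGB-D channels so complementary channels
    are emphasized before fusion (SE-style sketch)."""

    def __init__(self, channels=64, reduction=4):
        super().__init__()
        c2 = channels * 2
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(c2, c2 // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(c2 // reduction, c2),
            nn.Sigmoid(),
        )
        self.merge = nn.Conv2d(c2, channels, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):
        x = torch.cat([rgb_feat, depth_feat], dim=1)        # (B, 2C, H, W)
        w = self.attn(x).unsqueeze(-1).unsqueeze(-1)        # (B, 2C, 1, 1)
        return self.merge(x * w)                            # (B, C, H, W)

rgb, dep = torch.randn(2, 64, 28, 28), torch.randn(2, 64, 28, 28)
print(ChannelAttentionFusion()(rgb, dep).shape)  # torch.Size([2, 64, 28, 28])
```
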
37. Ji Y, Zhang H, Tseng KK, Chow TW, Wu QMJ. Graph model-based salient object detection using objectness and multiple saliency cues. Neurocomputing 2019. DOI: 10.1016/j.neucom.2018.09.081.

38
|
Cong R, Lei J, Fu H, Lin W, Huang Q, Cao X, Hou C. An Iterative Co-Saliency Framework for RGBD Images. IEEE TRANSACTIONS ON CYBERNETICS 2019; 49:233-246. [PMID: 29990261 DOI: 10.1109/tcyb.2017.2771488] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
As a newly emerging and significant topic in the computer vision community, co-saliency detection aims at discovering the common salient objects in multiple related images. Existing methods often generate the co-saliency map through a direct forward pipeline based on designed cues or an initialization, but lack a refinement-cycle scheme. Moreover, they mainly focus on RGB images and ignore the depth information in RGBD images. In this paper, we propose an iterative RGBD co-saliency framework, which utilizes existing single-image saliency maps as the initialization and generates the final RGBD co-saliency map using a refinement-cycle model. Three schemes are employed in the proposed RGBD co-saliency framework: an addition scheme, a deletion scheme, and an iteration scheme. The addition scheme highlights the salient regions based on intra-image depth propagation and saliency propagation, while the deletion scheme filters the salient regions and removes the non-common salient regions based on an inter-image constraint. The iteration scheme is proposed to obtain a more homogeneous and consistent co-saliency map. Furthermore, a novel descriptor, named the depth shape prior, is proposed in the addition scheme to introduce depth information and enhance the identification of co-salient objects. The proposed method can effectively exploit any existing 2-D saliency model to work well in RGBD co-saliency scenarios. Experiments on two RGBD co-saliency datasets demonstrate the effectiveness of the proposed framework.
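The refinement-cycle idea (addition, deletion, and iteration) can be sketched as a short loop, shown below under strong simplifying assumptions: the depth-affinity boost stands in for the paper's depth shape prior and propagation, and the histogram-consistency check stands in for its inter-image constraint.

```python
# Simplified addition/deletion/iteration loop for RGBD co-saliency (illustrative stand-ins).
import numpy as np
from scipy.ndimage import gaussian_filter

def addition_step(sal, depth, sigma=2.0):
    """Intra-image propagation: smooth saliency and boost pixels whose depth is
    close to that of the currently salient region (a crude depth shape prior)."""
    fg_depth = (depth * sal).sum() / (sal.sum() + 1e-8)
    affinity = np.exp(-((depth - fg_depth) ** 2) / 0.02)
    return np.clip(gaussian_filter(sal, sigma) * (0.5 + 0.5 * affinity), 0, 1)

def fg_histogram(gray, sal, bins=16):
    """Saliency-weighted intensity histogram of the (soft) foreground."""
    hist, _ = np.histogram(gray, bins=bins, range=(0, 1), weights=sal)
    return hist / (hist.sum() + 1e-8)

def deletion_step(grays, sal_maps, bins=16):
    """Inter-image constraint: shrink maps whose foreground appearance disagrees
    with the group-average foreground histogram."""
    hists = [fg_histogram(g, s, bins) for g, s in zip(grays, sal_maps)]
    mean_hist = np.mean(hists, axis=0)
    sims = [1.0 - 0.5 * np.abs(h - mean_hist).sum() for h in hists]
    return [np.clip(s * w, 0, 1) for s, w in zip(sal_maps, sims)]

def iterate_cosaliency(grays, depths, init_sal, n_iter=3):
    sal_maps = list(init_sal)
    for _ in range(n_iter):
        sal_maps = [addition_step(s, d) for s, d in zip(sal_maps, depths)]
        sal_maps = deletion_step(grays, sal_maps)
    return sal_maps
```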
Collapse
|
39
|
Tsai CC, Li W, Hsu KJ, Qian X, Lin YY. Image Co-Saliency Detection and Co-Segmentation via Progressive Joint Optimization. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2019; 28:56-71. [PMID: 30059305 DOI: 10.1109/tip.2018.2861217] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
We present a novel computational model for simultaneous image co-saliency detection and co-segmentation that concurrently explores the concepts of saliency and objectness in multiple images. It has been shown that co-saliency detection via aggregating multiple saliency proposals from diverse visual cues can better highlight the salient objects; however, the optimal proposals are typically region-dependent, and the fusion process often leads to blurred results. Co-segmentation can help preserve object boundaries, but it may struggle in complex scenes. To address these issues, we develop a unified method that handles co-saliency detection and co-segmentation jointly by solving an energy minimization problem over a graph. Our method iteratively carries out region-wise adaptive saliency map fusion and object segmentation to transfer useful information between the two complementary tasks. Through the optimization iterations, sharp saliency maps are gradually obtained that recover entire salient objects by referring to the object segmentations, while these segmentations are progressively improved owing to the better saliency priors. We evaluate our method on four public benchmark datasets and compare it with state-of-the-art methods. Extensive experiments demonstrate that our method provides consistently higher-quality results on both co-saliency detection and co-segmentation.
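The alternating structure described here can be pictured as the short loop below. It is a deliberately simplified sketch: the soft-IoU agreement weights and the threshold-based "segmentation" step are stand-ins for the paper's graph-based energy minimization, and all function names are assumptions.

```python
# Alternating co-saliency fusion and co-segmentation (heavily simplified stand-in).
import numpy as np

def fuse_proposals(proposals, seg):
    """Weight each saliency proposal by how well it agrees with the current mask."""
    weights = []
    for p in proposals:
        inter = np.minimum(p, seg).sum()
        union = np.maximum(p, seg).sum() + 1e-8
        weights.append(inter / union)                  # soft IoU as agreement score
    weights = np.asarray(weights) / (np.sum(weights) + 1e-8)
    return sum(w * p for w, p in zip(weights, proposals))

def segment(sal, thresh=None):
    """Re-estimate the object mask from the fused saliency (adaptive threshold)."""
    t = 1.5 * sal.mean() if thresh is None else thresh
    return (sal > t).astype(np.float32)

def joint_refine(proposals, n_iter=5):
    seg = segment(np.mean(proposals, axis=0))          # initialize from the average proposal
    for _ in range(n_iter):
        fused = fuse_proposals(proposals, seg)         # co-saliency step
        seg = segment(fused)                           # co-segmentation step
    return fused, seg
```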
Collapse
|
40
|
Xi X, Luo Y, Wang P, Qiao H. Salient object detection based on an efficient End-to-End Saliency Regression Network. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.10.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
41
|
Guo C, Li C, Guo J, Cong R, Fu H, Han P. Hierarchical Features Driven Residual Learning for Depth Map Super-Resolution. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2018; 28:2545-2557. [PMID: 30571635 DOI: 10.1109/tip.2018.2887029] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/09/2023]
Abstract
The rapid development of affordable and portable consumer depth cameras facilitates the use of depth information in many computer vision tasks, such as intelligent vehicles and 3D reconstruction. However, depth maps captured by low-cost depth sensors (e.g., Kinect) usually suffer from low spatial resolution, which limits their potential applications. In this paper, we propose a novel deep network for depth map super-resolution (SR), called DepthSR-Net. DepthSR-Net automatically infers a high-resolution (HR) depth map from its low-resolution (LR) version by hierarchical-features-driven residual learning. Specifically, DepthSR-Net is built on a residual U-Net architecture. Given an LR depth map, we first obtain the desired HR resolution by bicubic interpolation upsampling and then construct an input pyramid to achieve multiple levels of receptive fields. Next, we extract hierarchical features from the input pyramid, the intensity image, and the encoder-decoder structure of the U-Net. Finally, we learn the residual between the interpolated depth map and the corresponding HR one using the rich hierarchical features. The final HR depth map is obtained by adding the learned residual to the interpolated depth map. We conduct an ablation study to demonstrate the effectiveness of each component in the proposed network. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods. Additionally, the potential use of the proposed network in other low-level vision problems is discussed.
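The residual-learning formulation itself is compact and can be sketched as follows in PyTorch: the network predicts only the residual between the bicubic-upsampled LR depth map and the HR target, guided by the intensity image. The tiny convolutional stack and the class name are assumptions standing in for the paper's U-Net backbone and input pyramid.

```python
# Minimal residual-learning depth SR sketch (not the paper's full DepthSR-Net).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDepthSR(nn.Module):
    def __init__(self, scale=4, feats=32):
        super().__init__()
        self.scale = scale
        # 2 input channels: bicubic-upsampled depth + guidance intensity image.
        self.body = nn.Sequential(
            nn.Conv2d(2, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feats, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feats, 1, 3, padding=1),
        )

    def forward(self, lr_depth, intensity):
        up = F.interpolate(lr_depth, scale_factor=self.scale,
                           mode='bicubic', align_corners=False)   # coarse HR estimate
        residual = self.body(torch.cat([up, intensity], dim=1))   # learn only the residual
        return up + residual                                      # final HR depth map

# Usage with dummy tensors (LR depth at 1/4 the resolution of the intensity image):
net = TinyDepthSR(scale=4)
hr = net(torch.randn(1, 1, 32, 32), torch.randn(1, 1, 128, 128))
```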
Collapse
|
42
|
Liu C, Wang W, Shen J, Shao L. Stereo Video Object Segmentation Using Stereoscopic Foreground Trajectories. IEEE TRANSACTIONS ON CYBERNETICS 2018; 49:3665-3676. [PMID: 29994416 DOI: 10.1109/tcyb.2018.2846361] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
We present an unsupervised segmentation framework for stereo videos using stereoscopic trajectories. The proposed stereo trajectory shows favorable properties for modeling the long-term motion information through the whole sequence and explicitly capturing the corresponding relationships between two stereo views. The stereo prior is important for inferring the desired object and guarantees the consistent spatial-temporal segmentation, which contributes to an enjoyable stereo experience. We start by deriving stereo trajectories from left and right views simultaneously, which are represented via a graph structure. Then we detect object-like stereo trajectories via the graph structure to efficiently infer the desired object. Finally, an energy optimization function is proposed to produce the stereo segmentation results via leveraging the object information from stereo trajectories. To benefit potential research, we collected a new stereoscopic video benchmark, which consists of a total of 50 stereo video clips and includes many challenges in segmentation. Extensive experimental results demonstrate that our stereo segmentation method achieves higher performance and preserves better stereo structures, compared with prevailing competitors. The source code and results are available at: https://github.com/shenjianbing/StereoSeg.
Collapse
|
43
|
Zhang D, Fu H, Han J, Borji A, Li X. A Review of Co-Saliency Detection Algorithms. ACM T INTEL SYST TEC 2018. [DOI: 10.1145/3158674] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
Abstract
Co-saliency detection is a newly emerging and rapidly growing research area in the computer vision community. As a novel branch of visual saliency, co-saliency detection refers to the discovery of common and salient foregrounds from two or more relevant images, and it can be widely used in many computer vision tasks. The existing co-saliency detection algorithms mainly consist of three components: extracting effective features to represent the image regions, exploring the informative cues or factors to characterize co-saliency, and designing effective computational frameworks to formulate co-saliency. Although numerous methods have been developed, the literature is still lacking a deep review and evaluation of co-saliency detection techniques. In this article, we aim at providing a comprehensive review of the fundamentals, challenges, and applications of co-saliency detection. Specifically, we provide an overview of some related computer vision works, review the history of co-saliency detection, summarize and categorize the major algorithms in this research area, discuss some open issues in this area, present the potential applications of co-saliency detection, and finally point out some unsolved challenges and promising future works. We expect this review to be beneficial to both fresh and senior researchers in this field and to give insights to researchers in other related areas regarding the utility of co-saliency detection algorithms.
Collapse
Affiliation(s)
- Huazhu Fu
- Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore
- Junwei Han
- Northwestern Polytechnical University, Xi'an, China
- Ali Borji
- University of Central Florida, Orlando, USA
- Xuelong Li
- Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, China
Collapse
|
44
|
Multi-Saliency Aggregation-Based Approach for Insulator Flashover Fault Detection Using Aerial Images. ENERGIES 2018. [DOI: 10.3390/en11020340] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|