51. Wu Z, Su L, Huang Q. Decomposition and Completion Network for Salient Object Detection. IEEE Transactions on Image Processing 2021; 30:6226-6239. [PMID: 34242166] [DOI: 10.1109/tip.2021.3093380]
Abstract
Recently, fully convolutional networks (FCNs) have made great progress in the task of salient object detection, and existing state-of-the-art methods mainly focus on how to integrate edge information into deep aggregation models. In this paper, we propose a novel Decomposition and Completion Network (DCN), which integrates edge and skeleton as complementary information and models the integrity of salient objects in two stages. In the decomposition network, we propose a cross multi-branch decoder, which iteratively takes advantage of cross-task aggregation and cross-layer aggregation to integrate multi-level multi-task features and predict saliency, edge, and skeleton maps simultaneously. In the completion network, edge and skeleton maps are further utilized to fill flaws and suppress noise in saliency maps via hierarchical structure-aware feature learning and multi-scale feature completion. By jointly learning with edge and skeleton information to localize the boundaries and interiors of salient objects, respectively, the proposed network generates precise saliency maps with uniformly and completely segmented salient objects. Experiments conducted on five benchmark datasets demonstrate that the proposed model outperforms existing networks. Furthermore, we extend the proposed model to the task of RGB-D salient object detection, and it also achieves state-of-the-art performance. The code is available at https://github.com/wuzhe71/DCN.
52. CNN-Based RGB-D Salient Object Detection: Learn, Select, and Fuse. International Journal of Computer Vision 2021. [DOI: 10.1007/s11263-021-01452-0]
53. Tu Z, Li Z, Li C, Lang Y, Tang J. Multi-Interactive Dual-Decoder for RGB-Thermal Salient Object Detection. IEEE Transactions on Image Processing 2021; 30:5678-5691. [PMID: 34125680] [DOI: 10.1109/tip.2021.3087412]
Abstract
RGB-thermal salient object detection (SOD), which we refer to as RGB-T SOD, aims to segment the common prominent regions of a visible image and the corresponding thermal infrared image. Existing methods do not fully explore and exploit the potential complementarity of the different modalities and the multi-type cues of image contents, which play a vital role in achieving accurate results. In this paper, we propose a multi-interactive dual-decoder to mine and model the multi-type interactions for accurate RGB-T SOD. Specifically, we first encode the two modalities into multi-level multi-modal feature representations. Then, we design a novel dual-decoder to conduct the interactions among multi-level features, the two modalities, and global contexts. With these interactions, our method works well in diversely challenging scenarios, even in the presence of an invalid modality. Finally, we carry out extensive experiments on public RGB-T and RGB-D SOD datasets, and the results show that the proposed method achieves outstanding performance against state-of-the-art algorithms. The source code has been released at: https://github.com/lz118/Multi-interactive-Dual-decoder.
54. Towards accurate RGB-D saliency detection with complementary attention and adaptive integration. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.12.125]
55. Deng Y, Chen H, Chen H, Li Y. Learning From Images: A Distillation Learning Framework for Event Cameras. IEEE Transactions on Image Processing 2021; 30:4919-4931. [PMID: 33961557] [DOI: 10.1109/tip.2021.3077136]
Abstract
Event cameras have recently drawn massive attention in the computer vision community because of their low power consumption and high response speed. These cameras produce sparse and non-uniform spatiotemporal representations of a scene, and these characteristics make it difficult for event-based models to extract discriminative cues (such as textures and geometric relationships). Consequently, event-based methods usually perform poorly compared to their conventional image counterparts. Considering that traditional images and event signals share considerable visual information, this paper aims to improve the feature extraction ability of event-based models by using knowledge distilled from the image domain to provide additional explicit feature-level supervision for the learning of event data. Specifically, we propose a simple yet effective distillation learning framework, including multi-level customized knowledge distillation constraints. Our framework can significantly boost the feature extraction process for event data and is applicable to various downstream tasks. We evaluate our framework on high-level and low-level tasks, i.e., object classification and optical flow prediction. Experimental results show that our framework can effectively improve the performance of event-based models on both tasks by a large margin. Furthermore, we present a 10K-sample dataset (CEP-DVS) for event-based object classification. This dataset consists of samples recorded under random motion trajectories, which better evaluates the motion robustness of event-based models, and it is compatible with multi-modality vision tasks.
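As an illustration of the feature-level distillation idea described in this abstract, the following is a minimal PyTorch sketch in which an image-based teacher supervises an event-based student at a single feature level. The module names, the 1×1 projection, and the single-level MSE constraint are assumptions for illustration, not the authors' multi-level framework.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    """Feature-level distillation: an event-based student mimics an image-based teacher (illustrative)."""
    def __init__(self, student_dim: int, teacher_dim: int, alpha: float = 0.5):
        super().__init__()
        # 1x1 conv projects student features into the teacher's channel space.
        self.proj = nn.Conv2d(student_dim, teacher_dim, kernel_size=1)
        self.alpha = alpha  # balance between the task loss and the distillation loss

    def forward(self, student_feat, teacher_feat, logits, labels):
        # Align channels and spatial size before comparing feature maps.
        s = self.proj(student_feat)
        s = F.interpolate(s, size=teacher_feat.shape[-2:], mode="bilinear", align_corners=False)
        distill = F.mse_loss(s, teacher_feat.detach())  # teacher features act as fixed targets
        task = F.cross_entropy(logits, labels)          # e.g. event-based object classification
        return task + self.alpha * distill

# Toy usage with random tensors standing in for real backbone outputs.
loss_fn = DistillationLoss(student_dim=64, teacher_dim=128)
student_feat = torch.randn(2, 64, 28, 28)
teacher_feat = torch.randn(2, 128, 28, 28)
logits, labels = torch.randn(2, 10), torch.randint(0, 10, (2,))
loss = loss_fn(student_feat, teacher_feat, logits, labels)
```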
56. Fan DP, Lin Z, Zhang Z, Zhu M, Cheng MM. Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:2075-2089. [PMID: 32491986] [DOI: 10.1109/tnnls.2020.2996406]
Abstract
The use of RGB-D information for salient object detection (SOD) has been extensively explored in recent years. However, relatively few efforts have been put toward modeling SOD in real-world human activity scenes with RGB-D. In this article, we fill the gap by making the following contributions to RGB-D SOD: 1) we carefully collect a new Salient Person (SIP) data set that consists of ~1K high-resolution images covering diverse real-world scenes from various viewpoints, poses, occlusions, illuminations, and backgrounds; 2) we conduct a large-scale (and, so far, the most comprehensive) benchmark comparing contemporary methods, which has long been missing in the field and can serve as a baseline for future research, and we systematically summarize 32 popular models and evaluate 18 of the 32 models on seven data sets containing a total of about 97k images; and 3) we propose a simple general architecture, called the deep depth-depurator network (D3Net). It consists of a depth depurator unit (DDU) and a three-stream feature learning module (FLM), which perform low-quality depth map filtering and cross-modal feature learning, respectively. These components form a nested structure and are elaborately designed to be learned jointly. D3Net exceeds the performance of any prior contender across all five metrics under consideration, thus serving as a strong model to advance research in this field. We also demonstrate that D3Net can be used to efficiently extract salient object masks from real scenes, enabling an effective background-changing application at a speed of 65 frames/s on a single GPU. All the saliency maps, our new SIP data set, the D3Net model, and the evaluation tools are publicly available at https://github.com/DengPingFan/D3NetBenchmark.
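The depth depurator unit is only described at a high level above; the sketch below shows one plausible way to gate a low-quality depth stream before cross-modal fusion. The learned per-image quality score and the additive fusion are hypothetical stand-ins, not the released D3Net DDU/FLM design.

```python
import torch
import torch.nn as nn

class DepthGate(nn.Module):
    """Gate the depth stream by a learned per-image quality score (illustrative, not the D3Net DDU)."""
    def __init__(self, channels: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # global context of the depth features
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),                            # quality score in [0, 1]
        )

    def forward(self, rgb_feat, depth_feat):
        q = self.scorer(depth_feat)                  # shape (B, 1, 1, 1)
        # Low-quality depth (q -> 0) contributes little to the fused representation.
        return rgb_feat + q * depth_feat

gate = DepthGate(channels=256)
fused = gate(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))
```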
57. Fu K, Fan DP, Ji GP, Zhao Q, Shen J, Zhu C. Siamese Network for RGB-D Salient Object Detection and Beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021. [PMID: 33861691] [DOI: 10.1109/tpami.2021.3073689]
Abstract
Existing RGB-D salient object detection (SOD) models usually treat RGB and depth as independent information and design separate networks for feature extraction from each. Such schemes can easily be constrained by a limited amount of training data or over-reliance on an elaborately designed training process. Inspired by the observation that RGB and depth modalities actually present certain commonality in distinguishing salient objects, a novel joint learning and densely cooperative fusion (JL-DCF) architecture is designed to learn from both RGB and depth inputs through a shared network backbone, known as the Siamese architecture. In this paper, we propose two effective components: joint learning (JL) and densely cooperative fusion (DCF). The JL module provides robust saliency feature learning by exploiting cross-modal commonality via a Siamese network, while the DCF module is introduced for complementary feature discovery. Comprehensive experiments using 5 popular metrics show that the designed framework yields a robust RGB-D saliency detector with good generalization. As a result, JL-DCF significantly advances the state of the art by an average of ~2.0% (F-measure) across 7 challenging datasets. In addition, we show that JL-DCF is readily applicable to other related multi-modal detection tasks, including RGB-T SOD and video SOD, achieving comparable or better performance.
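To make the Siamese (shared-backbone) idea concrete, here is a minimal sketch in which one tiny encoder processes both the RGB image and the depth map (replicated to three channels) with identical weights. The toy encoder and the 1×1 fusion layer are placeholders, not the actual JL-DCF backbone or DCF module.

```python
import torch
import torch.nn as nn

class SiameseSOD(nn.Module):
    """Shared-backbone (Siamese) feature extraction for RGB and depth (illustrative encoder)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(               # stand-in for a VGG/ResNet backbone
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(128, 1, kernel_size=1)  # cooperative fusion reduced to a 1x1 conv

    def forward(self, rgb, depth):
        d3 = depth.repeat(1, 3, 1, 1)                 # replicate single-channel depth to 3 channels
        f_rgb = self.backbone(rgb)                    # identical weights for both modalities
        f_d = self.backbone(d3)
        return torch.sigmoid(self.fuse(torch.cat([f_rgb, f_d], dim=1)))

net = SiameseSOD()
saliency = net(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
```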
58. Ma G, Li S, Chen C, Hao A, Qin H. Rethinking Image Salient Object Detection: Object-Level Semantic Saliency Reranking First, Pixelwise Saliency Refinement Later. IEEE Transactions on Image Processing 2021; 30:4238-4252. [PMID: 33819154] [DOI: 10.1109/tip.2021.3068649]
Abstract
Human attention is an interactive activity between our visual system and our brain, using both low-level visual stimuli and high-level semantic information. Previous image salient object detection (SOD) studies conduct their saliency predictions via a multitask methodology in which pixelwise saliency regression and segmentation-like saliency refinement are performed simultaneously. However, this multitask methodology has one critical limitation: the semantic information embedded in the feature backbones might be degraded during the training process. Our visual attention is determined mainly by semantic information, which is evidenced by our tendency to pay more attention to semantically salient regions even if these regions are not the most perceptually salient at first glance. This fact clearly contradicts the widely used multitask methodology mentioned above. To address this issue, this paper divides the SOD problem into two sequential steps. First, we devise a lightweight, weakly supervised deep network to coarsely locate the semantically salient regions. Next, as a postprocessing refinement, we selectively fuse multiple off-the-shelf deep models on the semantically salient regions identified by the previous step to formulate a pixelwise saliency map. Compared with the state-of-the-art (SOTA) models that focus on learning pixelwise saliency in single images using only perceptual clues, our method investigates the object-level semantic ranks between multiple images, a methodology that is more consistent with the human attention mechanism. Our method is simple yet effective, and it is the first attempt to treat salient object detection mainly as an object-level semantic reranking problem.
59. Wang X, Li S, Chen C, Hao A, Qin H. Depth quality-aware selective saliency fusion for RGB-D image salient object detection. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.12.071]
60. Li L, Zhao S, Sun R, Chai X, Zheng S, Chen X, Lv Z. AFI-Net: Attention-Guided Feature Integration Network for RGBD Saliency Detection. Computational Intelligence and Neuroscience 2021; 2021:8861446. [PMID: 33859681] [PMCID: PMC8026315] [DOI: 10.1155/2021/8861446]
Abstract
This article proposes an innovative RGBD saliency model, the attention-guided feature integration network (AFI-Net), which can extract and fuse features and perform saliency inference. Specifically, the model first extracts multimodal, multilevel deep features. Then, a series of attention modules are deployed to the multilevel RGB and depth features, yielding enhanced deep features. Next, the enhanced multimodal deep features are hierarchically fused. Lastly, the RGB and depth boundary features, that is, low-level spatial details, are added to the integrated feature to perform saliency inference. The key points of the AFI-Net are the attention-guided feature enhancement and the boundary-aware saliency inference, where the attention module indicates salient objects coarsely, and the boundary information is used to equip the deep feature with more spatial details. Therefore, salient objects are well characterized and well highlighted. The comprehensive experiments on five challenging public RGBD datasets clearly exhibit the superiority and effectiveness of the proposed AFI-Net.
Affiliation(s)
- Liming Li: School of Information Science and Technology, Donghua University, Shanghai 201620, China; School of Urban Railway Transportation, Shanghai University of Engineering Science, Shanghai 201620, China
- Shuguang Zhao: School of Information Science and Technology, Donghua University, Shanghai 201620, China
- Rui Sun: School of Urban Railway Transportation, Shanghai University of Engineering Science, Shanghai 201620, China
- Xiaodong Chai: School of Urban Railway Transportation, Shanghai University of Engineering Science, Shanghai 201620, China
- Shubin Zheng: School of Urban Railway Transportation, Shanghai University of Engineering Science, Shanghai 201620, China
- Xingjie Chen: School of Urban Railway Transportation, Shanghai University of Engineering Science, Shanghai 201620, China
- Zhaomin Lv: School of Urban Railway Transportation, Shanghai University of Engineering Science, Shanghai 201620, China
61. Li G, Liu Z, Chen M, Bai Z, Lin W, Ling H. Hierarchical Alternate Interaction Network for RGB-D Salient Object Detection. IEEE Transactions on Image Processing 2021; 30:3528-3542. [PMID: 33667161] [DOI: 10.1109/tip.2021.3062689]
Abstract
Existing RGB-D Salient Object Detection (SOD) methods take advantage of depth cues to improve detection accuracy, while paying insufficient attention to the quality of the depth information. In practice, a depth map often has uneven quality and sometimes suffers from distractors, due to various factors in the acquisition procedure. In this article, to mitigate distractors in depth maps and highlight salient objects in RGB images, we propose a Hierarchical Alternate Interaction Network (HAINet) for RGB-D SOD. Specifically, HAINet consists of three key stages: feature encoding, cross-modal alternate interaction, and saliency reasoning. The main innovation in HAINet is the Hierarchical Alternate Interaction Module (HAIM), which plays a key role in the second stage for cross-modal feature interaction. HAIM first uses RGB features to filter distractors in depth features, and then the purified depth features are exploited to enhance RGB features in turn. This alternate RGB-depth-RGB interaction proceeds in a hierarchical manner, which progressively integrates local and global contexts within a single feature scale. In addition, we adopt a hybrid loss function to facilitate the training of HAINet. Extensive experiments on seven datasets demonstrate that our HAINet not only achieves competitive performance compared with 19 relevant state-of-the-art methods, but also reaches a real-time processing speed of 43 fps on a single NVIDIA Titan X GPU. The code and results of our method are available at https://github.com/MathLee/HAINet.
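A minimal sketch of the alternate RGB-depth-RGB interaction described above: RGB features first produce a mask that purifies the depth features, and the purified depth features then re-weight the RGB features. The attention layers and shapes are assumptions, not the released HAIM.

```python
import torch
import torch.nn as nn

class AlternateInteraction(nn.Module):
    """RGB features first purify depth features, then purified depth enhances RGB (illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        self.rgb_to_mask = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.depth_to_mask = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, f_rgb, f_depth):
        purified_depth = f_depth * self.rgb_to_mask(f_rgb)                # RGB filters depth distractors
        enhanced_rgb = f_rgb * (1 + self.depth_to_mask(purified_depth))   # purified depth re-weights RGB
        return enhanced_rgb, purified_depth

haim = AlternateInteraction(channels=64)
out_rgb, out_depth = haim(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))
```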
62. Jin WD, Xu J, Han Q, Zhang Y, Cheng MM. CDNet: Complementary Depth Network for RGB-D Salient Object Detection. IEEE Transactions on Image Processing 2021; 30:3376-3390. [PMID: 33646949] [DOI: 10.1109/tip.2021.3060167]
Abstract
Current RGB-D salient object detection (SOD) methods utilize the depth stream as complementary information to the RGB stream. However, the depth maps in existing RGB-D SOD datasets are usually of low quality, and most RGB-D SOD networks trained with these datasets produce error-prone results. In this paper, we propose a novel Complementary Depth Network (CDNet) to exploit saliency-informative depth features for RGB-D SOD. To alleviate the influence of low-quality depth maps on RGB-D SOD, we propose to select saliency-informative depth maps as the training targets and leverage RGB features to estimate meaningful depth maps. In addition, to learn robust depth features for accurate prediction, we propose a new dynamic scheme that fuses the depth features extracted from the original and estimated depth maps with adaptive weights. Moreover, we design a two-stage cross-modal feature fusion scheme to integrate the depth features with the RGB ones, further improving the performance of our CDNet on RGB-D SOD. Experiments on seven benchmark datasets demonstrate that our CDNet outperforms state-of-the-art RGB-D SOD methods. The code is publicly available at https://github.com/blanclist/CDNet.
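The dynamic fusion scheme can be sketched as a learned convex combination of features from the original and the RGB-estimated depth maps; the tiny weighting subnet below is an illustrative stand-in, not CDNet's actual design.

```python
import torch
import torch.nn as nn

class DynamicDepthFusion(nn.Module):
    """Adaptively weight features from the original and the RGB-estimated depth map (illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2, kernel_size=1),   # one logit per depth source
        )

    def forward(self, f_orig, f_est):
        w = torch.softmax(self.weight_net(torch.cat([f_orig, f_est], dim=1)), dim=1)
        return w[:, 0:1] * f_orig + w[:, 1:2] * f_est    # convex combination of the two sources

fuse = DynamicDepthFusion(channels=128)
f = fuse(torch.randn(2, 128, 28, 28), torch.randn(2, 128, 28, 28))
```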
63. Chen C, Wei J, Peng C, Qin H. Depth-Quality-Aware Salient Object Detection. IEEE Transactions on Image Processing 2021; 30:2350-2363. [PMID: 33481710] [DOI: 10.1109/tip.2021.3052069]
Abstract
Existing fusion-based RGB-D salient object detection methods usually adopt a bistream structure to strike a balance in the fusion trade-off between RGB and depth (D). However, depth quality usually varies across scenes, and the state-of-the-art bistream approaches are depth-quality-unaware, which makes it difficult to achieve a complementary fusion status between RGB and D and leads to poor fusion results for low-quality D. Thus, this paper attempts to integrate a novel depth-quality-aware subnet into the classic bistream structure in order to assess the depth quality prior to conducting the selective RGB-D fusion. Compared to the SOTA bistream methods, the major advantage of our method is its ability to lessen the importance of low-quality, no-contribution, or even negative-contribution D regions during RGB-D fusion, achieving a much improved complementary status between RGB and D. Our source code and data are available online at https://github.com/qdu1995/DQSD.
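As a rough illustration of depth-quality-aware fusion, the sketch below predicts a scalar quality score from the RGB-D pair and uses it to down-weight the depth-stream saliency; the toy quality subnet and the normalization are assumptions, not the DQSD subnet.

```python
import torch
import torch.nn as nn

class QualityAwareFusion(nn.Module):
    """Weight the depth-stream saliency by a predicted depth-quality score (illustrative subnet)."""
    def __init__(self):
        super().__init__()
        self.quality = nn.Sequential(                    # toy depth-quality subnet over the RGB-D input
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(16, 1, 1), nn.Sigmoid(),
        )

    def forward(self, rgb, depth, sal_rgb, sal_depth):
        q = self.quality(torch.cat([rgb, depth], dim=1))  # scalar quality score per image
        # Low-quality depth contributes less to the final saliency map.
        return (sal_rgb + q * sal_depth) / (1.0 + q)

fusion = QualityAwareFusion()
sal = fusion(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224),
             torch.rand(1, 1, 224, 224), torch.rand(1, 1, 224, 224))
```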
64. Zhang Z, Lin Z, Xu J, Jin WD, Lu SP, Fan DP. Bilateral Attention Network for RGB-D Salient Object Detection. IEEE Transactions on Image Processing 2021; 30:1949-1961. [PMID: 33439842] [DOI: 10.1109/tip.2021.3049959]
Abstract
RGB-D salient object detection (SOD) aims to segment the most attractive objects in a pair of cross-modal RGB and depth images. Currently, most existing RGB-D SOD methods focus on the foreground region when utilizing the depth images. However, the background also provides important information for promising performance in traditional SOD methods. To better explore salient information in both the foreground and background regions, this paper proposes a Bilateral Attention Network (BiANet) for the RGB-D SOD task. Specifically, we introduce a Bilateral Attention Module (BAM) with a complementary attention mechanism: foreground-first (FF) attention and background-first (BF) attention. The FF attention focuses on the foreground region with a gradual refinement style, while the BF attention recovers potentially useful salient information in the background region. Benefiting from the proposed BAM, our BiANet can capture more meaningful foreground and background cues and shift more attention to refining the uncertain details between the foreground and background regions. Additionally, we extend our BAM by leveraging multi-scale techniques for better SOD performance. Extensive experiments on six benchmark datasets demonstrate that our BiANet outperforms other state-of-the-art RGB-D SOD methods in terms of objective metrics and subjective visual comparison. Our BiANet runs at up to 80 fps on 224×224 RGB-D images on an NVIDIA GeForce RTX 2080Ti GPU. Comprehensive ablation studies also validate our contributions.
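The bilateral attention mechanism is simple enough to sketch directly: a coarse prediction gates the features once toward the foreground and once toward the background, and the two branches are merged. The merge layer below is an illustrative simplification of the full BAM.

```python
import torch
import torch.nn as nn

class BilateralAttention(nn.Module):
    """Foreground-first and background-first attention on a coarse saliency prediction (illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, feat, coarse_pred):
        p = torch.sigmoid(coarse_pred)     # coarse saliency probability
        ff = feat * p                      # foreground-first attention
        bf = feat * (1.0 - p)              # background-first attention recovers missed regions
        return self.merge(torch.cat([ff, bf], dim=1))

bam = BilateralAttention(channels=64)
refined = bam(torch.randn(1, 64, 56, 56), torch.randn(1, 1, 56, 56))
```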
65. Zhou T, Fan DP, Cheng MM, Shen J, Shao L. RGB-D salient object detection: A survey. Computational Visual Media 2021; 7:37-69. [PMID: 33432275] [PMCID: PMC7788385] [DOI: 10.1007/s41095-020-0199-z]
Abstract
Salient object detection, which simulates human visual perception in locating the most significant object(s) in a scene, has been widely applied to various computer vision tasks. Now, the advent of depth sensors means that depth maps can easily be captured; this additional spatial information can boost the performance of salient object detection. Although various RGB-D based salient object detection models with promising performance have been proposed over the past several years, an in-depth understanding of these models and the challenges in this field remains lacking. In this paper, we provide a comprehensive survey of RGB-D based salient object detection models from various perspectives, and review related benchmark datasets in detail. Further, as light fields can also provide depth maps, we review salient object detection models and popular benchmark datasets from this domain too. Moreover, to investigate the ability of existing models to detect salient objects, we have carried out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models. Finally, we discuss several challenges and open directions of RGB-D based salient object detection for future research. All collected models, benchmark datasets, datasets constructed for attribute-based evaluation, and related code are publicly available at https://github.com/taozh2017/RGBD-SODsurvey.
Affiliation(s)
- Tao Zhou: Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates
- Deng-Ping Fan: Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates
- Jianbing Shen: Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates
- Ling Shao: Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates
66. Li C, Cong R, Kwong S, Hou J, Fu H, Zhu G, Zhang D, Huang Q. ASIF-Net: Attention Steered Interweave Fusion Network for RGB-D Salient Object Detection. IEEE Transactions on Cybernetics 2021; 51:88-100. [PMID: 32078571] [DOI: 10.1109/tcyb.2020.2969255]
Abstract
Salient object detection from RGB-D images is an important yet challenging vision task, which aims at detecting the most distinctive objects in a scene by combining color information and depth constraints. Unlike prior fusion schemes, we propose an attention steered interweave fusion network (ASIF-Net) to detect salient objects, which progressively integrates cross-modal and cross-level complementarity from the RGB image and the corresponding depth map via steering of an attention mechanism. Specifically, the complementary features from RGB-D images are jointly extracted and hierarchically fused in a dense and interweaved manner. Such a scheme breaks down the barriers of inconsistency in the cross-modal data and sufficiently captures their complementarity. Meanwhile, an attention mechanism is introduced to locate the potential salient regions in an attention-weighted fashion, which helps highlight the salient objects and suppress cluttered background regions. Instead of focusing only on pixelwise saliency, we also ensure that the detected salient objects have objectness characteristics (e.g., complete structure and sharp boundary) by incorporating adversarial learning, which provides a global semantic constraint for RGB-D salient object detection. Quantitative and qualitative experiments demonstrate that the proposed method performs favorably against 17 state-of-the-art saliency detectors on four publicly available RGB-D salient object detection datasets. The code and results of our method are available at https://github.com/Li-Chongyi/ASIF-Net.
67. Zhou W, Pan S, Lei J, Yu L. TMFNet: Three-Input Multilevel Fusion Network for Detecting Salient Objects in RGB-D Images. IEEE Transactions on Emerging Topics in Computational Intelligence 2021. [DOI: 10.1109/tetci.2021.3097393]
68. Zhang YF, Zheng J, Li L, Liu N, Jia W, Fan X, Xu C, He X. Rethinking feature aggregation for deep RGB-D salient object detection. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.10.079]
69. Wang X, Li S, Chen C, Fang Y, Hao A, Qin H. Data-Level Recombination and Lightweight Fusion Scheme for RGB-D Salient Object Detection. IEEE Transactions on Image Processing 2020; 30:458-471. [PMID: 33201813] [DOI: 10.1109/tip.2020.3037470]
Abstract
Existing RGB-D salient object detection methods treat depth information as an independent component to complement RGB and widely follow the bistream parallel network architecture. To selectively fuse the CNN features extracted from both RGB and depth as a final result, the state-of-the-art (SOTA) bistream networks usually consist of two independent subbranches: one subbranch is used for RGB saliency and the other for depth saliency. However, depth saliency is persistently inferior to RGB saliency because the RGB component is intrinsically more informative than the depth component. The bistream architecture easily biases its subsequent fusion procedure toward the RGB subbranch, leading to a performance bottleneck. In this paper, we propose a novel data-level recombination strategy to fuse RGB with D (depth) before deep feature extraction, where we cyclically convert the original 4-dimensional RGB-D into DGB, RDB, and RGD. Then, a newly designed lightweight triple-stream network is applied to these newly formulated data to achieve an optimal channel-wise complementary fusion status between RGB and D, achieving new SOTA performance.
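The data-level recombination itself is straightforward to reproduce: each of the three recombined inputs replaces one color channel with the depth channel, so every stream keeps a standard 3-channel layout. A minimal sketch (channel order and normalization are assumptions):

```python
import torch

def recombine_rgbd(rgb: torch.Tensor, depth: torch.Tensor):
    """Cyclically replace one color channel with depth to form DGB, RDB, and RGD inputs.

    rgb: (B, 3, H, W), depth: (B, 1, H, W), both assumed to be normalized to the same range.
    """
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    dgb = torch.cat([depth, g, b], dim=1)   # depth replaces the R channel
    rdb = torch.cat([r, depth, b], dim=1)   # depth replaces the G channel
    rgd = torch.cat([r, g, depth], dim=1)   # depth replaces the B channel
    return dgb, rdb, rgd                     # each stream keeps a standard 3-channel layout

dgb, rdb, rgd = recombine_rgbd(torch.rand(2, 3, 224, 224), torch.rand(2, 1, 224, 224))
```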
70. A Hybrid Attention-Aware Fusion Network (HAFNet) for Building Extraction from High-Resolution Imagery and LiDAR Data. Remote Sensing 2020. [DOI: 10.3390/rs12223764]
Abstract
Automated extraction of buildings from earth observation (EO) data has long been a fundamental but challenging research topic. Combining data from different modalities (e.g., high-resolution imagery (HRI) and light detection and ranging (LiDAR) data) has shown great potential in building extraction. Recent studies have examined the role that deep learning (DL) could play in both multimodal data fusion and urban object extraction. However, DL-based multimodal fusion networks may encounter the following limitations: (1) the individual modal and cross-modal features, which we consider both useful and important for the final prediction, cannot be sufficiently learned and utilized, and (2) the multimodal features are fused by simple summation or concatenation, which is ambiguous when selecting cross-modal complementary information. In this paper, we address these two limitations by proposing a hybrid attention-aware fusion network (HAFNet) for building extraction. It consists of RGB-specific, digital surface model (DSM)-specific, and cross-modal streams to sufficiently learn and utilize both individual modal and cross-modal features. Furthermore, an attention-aware multimodal fusion block (Att-MFBlock) is introduced to overcome the fusion problem by adaptively selecting and combining complementary features from each modality. Extensive experiments conducted on two publicly available datasets demonstrated the effectiveness of the proposed HAFNet for building extraction.
71. Li C, Cong R, Guo C, Li H, Zhang C, Zheng F, Zhao Y. A parallel down-up fusion network for salient object detection in optical remote sensing images. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.05.108]
72. Chen H, Li Y, Su D. Discriminative Cross-Modal Transfer Learning and Densely Cross-Level Feedback Fusion for RGB-D Salient Object Detection. IEEE Transactions on Cybernetics 2020; 50:4808-4820. [PMID: 31484153] [DOI: 10.1109/tcyb.2019.2934986]
Abstract
This article addresses two key issues in RGB-D salient object detection based on the convolutional neural network (CNN): 1) How to bridge the gap between the "data-hungry" nature of CNNs and the insufficient labeled training data in the depth modality? 2) How to take full advantage of the complementary information between the two modalities? To solve the first problem, we model depth-induced saliency detection as a CNN-based cross-modal transfer learning problem. Instead of directly adopting the RGB CNN as initialization, we additionally train a modality classification network (MCNet) to encourage discriminative modality-specific representations by minimizing the modality classification loss. To solve the second problem, we propose a densely cross-level feedback topology, in which the cross-modal complements are combined in each level and then densely fed back to all shallower layers for sufficient cross-level interactions. Compared to traditional two-stream frameworks, the proposed one can better explore, select, and fuse cross-modal cross-level complements. Experiments show the significant and consistent improvements of the proposed CNN framework over other state-of-the-art methods.
73. Chen H, Deng Y, Li Y, Hung TY, Lin G. RGBD Salient Object Detection via Disentangled Cross-modal Fusion. IEEE Transactions on Image Processing 2020; 29:8407-8416. [PMID: 32784141] [DOI: 10.1109/tip.2020.3014734]
Abstract
Depth is beneficial for salient object detection (SOD) because of its additional saliency cues. Existing RGBD SOD methods focus on tailoring complicated cross-modal fusion topologies, which, although achieving encouraging performance, carry a high risk of over-fitting and are ambiguous in studying cross-modal complementarity. Different from these conventional approaches, which combine cross-modal features entirely without differentiation, we concentrate on decoupling the diverse cross-modal complements to simplify the fusion process and enhance the fusion sufficiency. We argue that if cross-modal heterogeneous representations can be disentangled explicitly, the cross-modal fusion process can hold less uncertainty while enjoying better adaptability. To this end, we design a disentangled cross-modal fusion network to expose structural and content representations from both modalities via cross-modal reconstruction. For different scenes, the disentangled representations allow the fusion module to easily identify and incorporate desired complements for informative multi-modal fusion. Extensive experiments show the effectiveness of our designs and a large improvement over state-of-the-art methods.
74. Cong R, Lei J, Fu H, Hou J, Huang Q, Kwong S. Going From RGB to RGBD Saliency: A Depth-Guided Transformation Model. IEEE Transactions on Cybernetics 2020; 50:3627-3639. [PMID: 31443060] [DOI: 10.1109/tcyb.2019.2932005]
Abstract
Depth information has been demonstrated to be useful for saliency detection. However, existing methods for RGBD saliency detection mainly focus on designing straightforward and comprehensive models, while ignoring the transferable ability of existing RGB saliency detection models. In this article, we propose a novel depth-guided transformation model (DTM) going from RGB saliency to RGBD saliency. The proposed model includes three components: 1) multilevel RGBD saliency initialization; 2) depth-guided saliency refinement; and 3) saliency optimization with depth constraints. The explicit depth feature is first utilized in the multilevel RGBD saliency model to initialize the RGBD saliency by combining the global compactness saliency cue and the local geodesic saliency cue. The depth-guided saliency refinement is used to further highlight the salient objects and suppress the background regions by introducing prior depth domain knowledge and the prior refined depth shape. Benefiting from the consistency of the entire object in the depth map, we formulate an optimization model to attain more consistent and accurate saliency results via an energy function, which integrates a unary data term, a color smoothness term, and a depth consistency term. Experiments on three public RGBD saliency detection benchmarks demonstrate the effectiveness and performance improvement of the proposed DTM from RGB to RGBD saliency.
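The optimization stage described above can be written schematically as a quadratic energy over per-region saliency values; the form below is only an indicative sketch, with the weights, the neighborhood, and the affinity terms standing in for the paper's exact definitions.

```latex
E(S) = \underbrace{\sum_{i}\bigl(S_i - S_i^{\mathrm{init}}\bigr)^2}_{\text{unary data term}}
     + \lambda_c \underbrace{\sum_{(i,j)\in\mathcal{N}} w^{\mathrm{color}}_{ij}\bigl(S_i - S_j\bigr)^2}_{\text{color smoothness term}}
     + \lambda_d \underbrace{\sum_{(i,j)\in\mathcal{N}} w^{\mathrm{depth}}_{ij}\bigl(S_i - S_j\bigr)^2}_{\text{depth consistency term}}
```

Minimizing E(S) yields saliency values that stay close to the initialization while remaining smooth in color space and consistent within depth-coherent regions.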
75. Zhou S, Wang J, Zhang J, Wang L, Huang D, Du S, Zheng N. Hierarchical U-shape Attention Network for Salient Object Detection. IEEE Transactions on Image Processing 2020; 29:8417-8428. [PMID: 32746244] [DOI: 10.1109/tip.2020.3011554]
Abstract
Salient object detection aims at locating the most conspicuous objects in natural images and usually acts as a very important pre-processing procedure in many computer vision tasks. In this paper, we propose a simple yet effective Hierarchical U-shape Attention Network (HUAN) to learn a robust mapping function for salient object detection. Firstly, a novel attention mechanism is formulated to improve the well-known U-shape network [1], in which memory consumption can be greatly reduced and mask quality can be significantly improved by the resulting U-shape Attention Network (UAN). Secondly, a novel hierarchical structure is constructed to bridge the low-level and high-level feature representations between different UANs, in which both intra-network and inter-network connections are considered to explore the salient patterns from a local to a global view. Thirdly, a novel Mask Fusion Network (MFN) is designed to fuse the intermediate prediction results, so as to generate a saliency mask of higher quality than any of those inputs. Our HUAN can be trained together with any backbone network in an end-to-end manner, and high-quality masks can finally be learned to represent the salient objects. Extensive experimental results on several benchmark datasets show that our method significantly outperforms most of the state-of-the-art approaches.
76. Zhang M, Ji W, Piao Y, Li J, Zhang Y, Xu S, Lu H. LFNet: Light Field Fusion Network for Salient Object Detection. IEEE Transactions on Image Processing 2020; 29:6276-6287. [PMID: 32365027] [DOI: 10.1109/tip.2020.2990341]
Abstract
In this work, we propose a novel light field fusion network (LFNet), a CNN-based light field saliency model using 4D light field data that contains abundant spatial and contextual information. The proposed method can reliably locate and identify salient objects even in a complex scene. Our LFNet contains a light field refinement module (LFRM) and a light field integration module (LFIM), which can fully refine and integrate focusness, depth, and objectness cues from light field images. The LFRM learns the light field residual between light field and RGB images to refine features with useful light field cues, and the LFIM then weights each refined light field feature and learns the spatial correlations between them to predict saliency maps. Our method can take full advantage of light field information and achieves excellent performance, especially in complex scenes, e.g., similar foreground and background, multiple or transparent objects, and low-contrast environments. Experiments show our method outperforms the state-of-the-art 2D, 3D, and 4D methods across three light field datasets.
77. Liu Z, Zhang W, Zhao P. A cross-modal adaptive gated fusion generative adversarial network for RGB-D salient object detection. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.01.045]
78. Li G, Liu Z, Ling H. ICNet: Information Conversion Network for RGB-D Based Salient Object Detection. IEEE Transactions on Image Processing 2020; 29:4873-4884. [PMID: 32149689] [DOI: 10.1109/tip.2020.2976689]
Abstract
RGB-D based salient object detection (SOD) methods leverage the depth map as valuable complementary information for better SOD performance. Previous methods mainly exploit the correlation between the RGB image and the depth map in three fusion domains: input images, extracted features, and output results. However, these fusion strategies cannot fully capture the complex correlation between the RGB image and the depth map. Besides, these methods do not fully explore the cross-modal complementarity and the cross-level continuity of information, and treat information from different sources without discrimination. In this paper, to address these problems, we propose a novel Information Conversion Network (ICNet) for RGB-D based SOD by employing a Siamese structure with an encoder-decoder architecture. To fuse high-level RGB and depth features in an interactive and adaptive way, we propose a novel Information Conversion Module (ICM), which contains concatenation operations and correlation layers. Furthermore, we design a Cross-modal Depth-weighted Combination (CDC) block to discriminate the cross-modal features from different sources and to enhance RGB features with depth features at each level. Extensive experiments on five commonly tested datasets demonstrate the superiority of our ICNet over 15 state-of-the-art RGB-D based SOD methods, and validate the effectiveness of the proposed ICM and CDC block.
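One way to picture a depth-weighted combination of cross-modal features is sketched below: the raw depth map, resized to the feature resolution, spatially re-weights the depth features before they enhance the RGB features. This is an illustrative reading of the CDC idea, not the released block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthWeightedCombination(nn.Module):
    """Enhance RGB features with depth features re-weighted by the resized depth map (illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_rgb, f_depth, depth_map):
        # Resize the raw depth map to the feature resolution and use it as a spatial weight.
        w = F.interpolate(depth_map, size=f_rgb.shape[-2:], mode="bilinear", align_corners=False)
        return self.conv(f_rgb + w * f_depth)

cdc = DepthWeightedCombination(channels=256)
out = cdc(torch.randn(1, 256, 14, 14), torch.randn(1, 256, 14, 14), torch.rand(1, 1, 224, 224))
```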
79. Yang G, He Y, Yang Y, Xu B. Fine-Grained Image Classification for Crop Disease Based on Attention Mechanism. Frontiers in Plant Science 2020; 11:600854. [PMID: 33414798] [PMCID: PMC7783357] [DOI: 10.3389/fpls.2020.600854]
Abstract
Fine-grained image classification is a challenging task because of the difficulty of identifying discriminant features: it is not easy to find the subtle features that fully represent the object. In the fine-grained classification of crop disease, visual disturbances such as light, fog, overlap, and jitter are frequently encountered. To explore the influence of crop leaf image features on the classification results, a classification model should focus on the more discriminative regions of the image while improving its classification accuracy in complex scenes. This paper proposes a novel attention mechanism that effectively utilizes the informative regions of an image, and describes the use of transfer learning to quickly construct several fine-grained image classification models of crop disease based on this attention mechanism. This study uses 58,200 crop leaf images as a dataset, including 14 different crops and 37 different categories of healthy/diseased crops. Among them, different diseases of the same crop have strong similarities. The NASNetLarge fine-grained classification model based on the proposed attention mechanism achieves the best classification effect, with an F1 score of up to 93.05%. The results show that the proposed attention mechanism effectively improves the fine-grained classification of crop disease images.
Affiliation(s)
- Guofeng Yang: College of Biosystems Engineering and Food Science, Zhejiang University, Hangzhou, China; Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing, China; Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing, China
- Yong He (correspondence): College of Biosystems Engineering and Food Science, Zhejiang University, Hangzhou, China
- Yong Yang: Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing, China; Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing, China
- Beibei Xu: Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing, China; Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing, China
82. Zhang Q, Huang N, Yao L, Zhang D, Shan C, Han J. RGB-T Salient Object Detection via Fusing Multi-level CNN Features. IEEE Transactions on Image Processing 2019; 29:3321-3335. [PMID: 31869791] [DOI: 10.1109/tip.2019.2959253]
Abstract
RGB-induced salient object detection has recently witnessed substantial progress, which is attributed to the superior feature learning capability of deep convolutional neural networks (CNNs). However, such detection suffers from challenging scenarios characterized by cluttered backgrounds, low-light conditions, and variations in illumination. Instead of improving RGB-based saliency detection, this paper takes advantage of the complementary benefits of RGB and thermal infrared images. Specifically, we propose a novel end-to-end network for multi-modal salient object detection, which turns the challenge of RGB-T saliency detection into a CNN feature fusion problem. To this end, a backbone network (e.g., VGG-16) is first adopted to extract coarse features from each RGB or thermal infrared image individually, and then several adjacent-depth feature combination (ADFC) modules are designed to extract multi-level refined features for each single-modal input image, considering that features captured at different depths differ in semantic information and visual details. Subsequently, a multi-branch group fusion (MGF) module is employed to capture the cross-modal features by fusing the features from the ADFC modules for an RGB-T image pair at each level. Finally, a joint attention guided bi-directional message passing (JABMP) module undertakes the task of saliency prediction by integrating the multi-level fused features from the MGF modules. Experimental results on several public RGB-T salient object detection datasets demonstrate the superiority of our proposed algorithm over state-of-the-art approaches, especially under challenging conditions such as poor illumination, complex backgrounds, and low contrast.
83. Cong R, Lei J, Fu H, Porikli F, Huang Q, Hou C. Video Saliency Detection via Sparsity-Based Reconstruction and Propagation. IEEE Transactions on Image Processing 2019; 28:4819-4831. [PMID: 31059438] [DOI: 10.1109/tip.2019.2910377]
Abstract
Video saliency detection aims to continuously discover the motion-related salient objects from the video sequences. Since it needs to consider the spatial and temporal constraints jointly, video saliency detection is more challenging than image saliency detection. In this paper, we propose a new method to detect the salient objects in video based on sparse reconstruction and propagation. With the assistance of novel static and motion priors, a single-frame saliency model is first designed to represent the spatial saliency in each individual frame via the sparsity-based reconstruction. Then, through a progressive sparsity-based propagation, the sequential correspondence in the temporal space is captured to produce the inter-frame saliency map. Finally, these two maps are incorporated into a global optimization model to achieve spatio-temporal smoothness and global consistency of the salient object in the whole video. The experiments on three large-scale video saliency datasets demonstrate that the proposed method outperforms the state-of-the-art algorithms both qualitatively and quantitatively.