1
|
Dang Z, Luo M, Wang J, Jia C, Han H, Wan H, Dai G, Chang X, Wang J. Disentangled Noisy Correspondence Learning. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2025; 34:2602-2615. [PMID: 40257891 DOI: 10.1109/tip.2025.3559457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/23/2025]
Abstract
Cross-modal retrieval is crucial in understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical as real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predictions influenced by modality-exclusive information (MEI), e.g., background noise in images and abstract definitions in texts. This issue arises as MEI is not shared across modalities, thus aligning it in training can markedly mislead similarity predictions. Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy. Inspired by the robustness of information bottlenecks against noise, we introduce DisNCL, a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning, to adaptively balance the extraction of modality-invariant information (MII) and MEI with certifiable optimal cross-modal disentanglement efficacy. DisNCL then enhances similarity predictions in modality-invariant subspace, thereby greatly boosting similarity-based alleviation strategy for noisy correspondences. Furthermore, DisNCL introduces soft matching targets to model noisy many-to-many relationships inherent in multi-modal inputs for noise-robust and accurate cross-modal alignment. Extensive experiments confirm DisNCL's efficacy by 2% average recall improvement. Mutual information estimation and visualization results show that DisNCL learns meaningful MII/MEI subspaces, validating our theoretical analyses.
Collapse
|
2
|
Zhou W, Guo Q, Lei J, Yu L, Hwang JN. IRFR-Net: Interactive Recursive Feature-Reshaping Network for Detecting Salient Objects in RGB-D Images. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:4132-4144. [PMID: 34415839 DOI: 10.1109/tnnls.2021.3105484] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Using attention mechanisms in saliency detection networks enables effective feature extraction, and using linear methods can promote proper feature fusion, as verified in numerous existing models. Current networks usually combine depth maps with red-green-blue (RGB) images for salient object detection (SOD). However, fully leveraging depth information complementary to RGB information by accurately highlighting salient objects deserves further study. We combine a gated attention mechanism and a linear fusion method to construct a dual-stream interactive recursive feature-reshaping network (IRFR-Net). The streams for RGB and depth data communicate through a backbone encoder to thoroughly extract complementary information. First, we design a context extraction module (CEM) to obtain low-level depth foreground information. Subsequently, the gated attention fusion module (GAFM) is applied to the RGB depth (RGB-D) information to obtain advantageous structural and spatial fusion features. Then, adjacent depth information is globally integrated to obtain complementary context features. We also introduce a weighted atrous spatial pyramid pooling (WASPP) module to extract the multiscale local information of depth features. Finally, global and local features are fused in a bottom-up scheme to effectively highlight salient objects. Comprehensive experiments on eight representative datasets demonstrate that the proposed IRFR-Net outperforms 11 state-of-the-art (SOTA) RGB-D approaches in various evaluation indicators.
Collapse
|
3
|
Lv C, Wan B, Zhou X, Sun Y, Zhang J, Yan C. Lightweight Cross-Modal Information Mutual Reinforcement Network for RGB-T Salient Object Detection. ENTROPY (BASEL, SWITZERLAND) 2024; 26:130. [PMID: 38392385 PMCID: PMC10888287 DOI: 10.3390/e26020130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/14/2023] [Revised: 01/26/2024] [Accepted: 01/29/2024] [Indexed: 02/24/2024]
Abstract
RGB-T salient object detection (SOD) has made significant progress in recent years. However, most existing works are based on heavy models, which are not applicable to mobile devices. Additionally, there is still room for improvement in the design of cross-modal feature fusion and cross-level feature fusion. To address these issues, we propose a lightweight cross-modal information mutual reinforcement network for RGB-T SOD. Our network consists of a lightweight encoder, the cross-modal information mutual reinforcement (CMIMR) module, and the semantic-information-guided fusion (SIGF) module. To reduce the computational cost and the number of parameters, we employ the lightweight module in both the encoder and decoder. Furthermore, to fuse the complementary information between two-modal features, we design the CMIMR module to enhance the two-modal features. This module effectively refines the two-modal features by absorbing previous-level semantic information and inter-modal complementary information. In addition, to fuse the cross-level feature and detect multiscale salient objects, we design the SIGF module, which effectively suppresses the background noisy information in low-level features and extracts multiscale information. We conduct extensive experiments on three RGB-T datasets, and our method achieves competitive performance compared to the other 15 state-of-the-art methods.
Collapse
Affiliation(s)
- Chengtao Lv
- School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Bin Wan
- School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Xiaofei Zhou
- School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Yaoqi Sun
- School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
- Lishui Institute, Hangzhou Dianzi University, Lishui 323000, China
| | - Jiyong Zhang
- School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Chenggang Yan
- School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
| |
Collapse
|
4
|
Xu K, Guo J. RGB-D salient object detection via convolutional capsule network based on feature extraction and integration. Sci Rep 2023; 13:17652. [PMID: 37848501 PMCID: PMC10582015 DOI: 10.1038/s41598-023-44698-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2023] [Accepted: 10/11/2023] [Indexed: 10/19/2023] Open
Abstract
Fully convolutional neural network has shown advantages in the salient object detection by using the RGB or RGB-D images. However, there is an object-part dilemma since most fully convolutional neural network inevitably leads to an incomplete segmentation of the salient object. Although the capsule network is capable of recognizing a complete object, it is highly computational demand and time consuming. In this paper, we propose a novel convolutional capsule network based on feature extraction and integration for dealing with the object-part relationship, with less computation demand. First and foremost, RGB features are extracted and integrated by using the VGG backbone and feature extraction module. Then, these features, integrating with depth images by using feature depth module, are upsampled progressively to produce a feature map. In the next step, the feature map is fed into the feature-integrated convolutional capsule network to explore the object-part relationship. The proposed capsule network extracts object-part information by using convolutional capsules with locally-connected routing and predicts the final salient map based on the deconvolutional capsules. Experimental results on four RGB-D benchmark datasets show that our proposed method outperforms 23 state-of-the-art algorithms.
Collapse
Affiliation(s)
- Kun Xu
- School of Electrical and Information Engineering, Tianjin University, Tianjin, 300000, People's Republic of China
- Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, 250014, People's Republic of China
- Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan, China
| | - Jichang Guo
- School of Electrical and Information Engineering, Tianjin University, Tianjin, 300000, People's Republic of China.
| |
Collapse
|
5
|
Gao R, Tao Y, Yu Y, Wu J, Shao X, Li J, Ye Z. Self-supervised Dual Hypergraph learning with Intent Disentanglement for session-based recommendation. Knowl Based Syst 2023. [DOI: 10.1016/j.knosys.2023.110528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/05/2023]
|
6
|
Shan D, Zhang Y, Liu X, Liu S, Coleman SA, Kerr D. MMPL-Net: multi-modal prototype learning for one-shot RGB-D segmentation. Neural Comput Appl 2023. [DOI: 10.1007/s00521-023-08235-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/04/2023]
|
7
|
Pang Y, Zhao X, Zhang L, Lu H. CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2023; 32:892-904. [PMID: 37018701 DOI: 10.1109/tip.2023.3234702] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]
Abstract
Most of the existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation constrains the performance of the convolution-based methods to a ceiling. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed cross-modal view-mixed transformer (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats the multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when it is equipped with the proposed components.
Collapse
|
8
|
Wei L, Zong G. EGA-Net: Edge Feature Enhancement and Global Information Attention Network for RGB-D Salient Object Detection. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.01.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023]
|
9
|
Wen H, Song K, Huang L, Wang H, Yan Y. Cross-modality salient object detection network with universality and anti-interference. Knowl Based Syst 2023. [DOI: 10.1016/j.knosys.2023.110322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023]
|
10
|
Li J, Ji W, Zhang M, Piao Y, Lu H, Cheng L. Delving into Calibrated Depth for Accurate RGB-D Salient Object Detection. Int J Comput Vis 2022. [DOI: 10.1007/s11263-022-01734-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
|
11
|
Li Z, Lang C, Li G, Wang T, Li Y. Depth Guided Feature Selection for RGBD Salient Object Detection. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.11.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
|
12
|
Zong G, Wei L, Guo S, Wang Y. A cascaded refined rgb-d salient object detection network based on the attention mechanism. APPL INTELL 2022. [DOI: 10.1007/s10489-022-04186-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]
|
13
|
Bi H, Wu R, Liu Z, Zhang J, Zhang C, Xiang TZ, Wang X. PSNet: Parallel symmetric network for RGB-T salient object detection. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.09.052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/14/2022]
|
14
|
Song M, Song W, Yang G, Chen C. Improving RGB-D Salient Object Detection via Modality-Aware Decoder. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2022; 31:6124-6138. [PMID: 36112559 DOI: 10.1109/tip.2022.3205747] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Most existing RGB-D salient object detection (SOD) methods are primarily focusing on cross-modal and cross-level saliency fusion, which has been proved to be efficient and effective. However, these methods still have a critical limitation, i.e., their fusion patterns - typically the combination of selective characteristics and its variations, are too highly dependent on the network's non-linear adaptability. In such methods, the balances between RGB and D (Depth) are formulated individually considering the intermediate feature slices, but the relation at the modality level may not be learned properly. The optimal RGB-D combinations differ depending on the RGB-D scenarios, and the exact complementary status is frequently determined by multiple modality-level factors, such as D quality, the complexity of the RGB scene, and degree of harmony between them. Therefore, given the existing approaches, it may be difficult for them to achieve further performance breakthroughs, as their methodologies belong to some methods that are somewhat less modality sensitive. To conquer this problem, this paper presents the Modality-aware Decoder (MaD). The critical technical innovations include a series of feature embedding, modality reasoning, and feature back-projecting and collecting strategies, all of which upgrade the widely-used multi-scale and multi-level decoding process to be modality-aware. Our MaD achieves competitive performance over other state-of-the-art (SOTA) models without using any fancy tricks in the decoder's design. Codes and results will be publicly available at https://github.com/MengkeSong/MaD.
Collapse
|
15
|
Zhang M, Xu S, Piao Y, Lu H. Exploring Spatial Correlation for Light Field Saliency Detection: Expansion From a Single View. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2022; 31:6152-6163. [PMID: 36112561 DOI: 10.1109/tip.2022.3205749] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Previous 2D saliency detection methods extract salient cues from a single view and directly predict the expected results. Both traditional and deep-learning-based 2D methods do not consider geometric information of 3D scenes. Therefore the relationship between scene understanding and salient objects cannot be effectively established. This limits the performance of 2D saliency detection in challenging scenes. In this paper, we show for the first time that saliency detection problem can be reformulated as two sub-problems: light field synthesis from a single view and light-field-driven saliency detection. This paper first introduces a high-quality light field synthesis network to produce reliable 4D light field information. Then a novel light-field-driven saliency detection network is proposed, in which a Direction-specific Screening Unit (DSU) is tailored to exploit the spatial correlation among multiple viewpoints. The whole pipeline can be trained in an end-to-end fashion. Experimental results demonstrate that the proposed method outperforms the state-of-the-art 2D, 3D and 4D saliency detection methods. Our code is publicly available at https://github.com/OIPLab-DUT/ESCNet.
Collapse
|
16
|
Sun P, Zhang W, Li S, Guo Y, Song C, Li X. Learnable Depth-Sensitive Attention for Deep RGB-D Saliency Detection with Multi-modal Fusion Architecture Search. Int J Comput Vis 2022. [DOI: 10.1007/s11263-022-01646-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
|
17
|
Guan W, Wen H, Song X, Wang C, Yeh CH, Chang X, Nie L. Partially Supervised Compatibility Modeling. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2022; 31:4733-4745. [PMID: 35793293 DOI: 10.1109/tip.2022.3187290] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Fashion Compatibility Modeling (FCM), which aims to automatically evaluate whether a given set of fashion items makes a compatible outfit, has attracted increasing research attention. Recent studies have demonstrated the benefits of conducting the item representation disentanglement towards FCM. Although these efforts have achieved prominent progress, they still perform unsatisfactorily, as they mainly investigate the visual content of fashion items, while overlooking the semantic attributes of items (e.g., color and pattern), which could largely boost the model performance and interpretability. To address this issue, we propose to comprehensively explore the visual content and attributes of fashion items towards FCM. This problem is non-trivial considering the following challenges: a) how to utilize the irregular attribute labels of items to partially supervise the attribute-level representation learning of fashion items; b) how to ensure the intact disentanglement of attribute-level representations; and c) how to effectively sew the multiple granulairites (i.e, coarse-grained item-level and fine-grained attribute-level) information to enable performance improvement and interpretability. To address these challenges, in this work, we present a partially supervised outfit compatibility modeling scheme (PS-OCM). In particular, we first devise a partially supervised attribute-level embedding learning component to disentangle the fine-grained attribute embeddings from the entire visual feature of each item. We then introduce a disentangled completeness regularizer to prevent the information loss during disentanglement. Thereafter, we design a hierarchical graph convolutional network, which seamlessly integrates the attribute- and item-level compatibility modeling, and enables the explainable compatibility reasoning. Extensive experiments on the real-world dataset demonstrate that our PS-OCM significantly outperforms the state-of-the-art baselines. We have released our source codes and well-trained models to benefit other researchers (https://site2750.wixsite.com/ps-ocm).
Collapse
|
18
|
A2TPNet: Alternate Steered Attention and Trapezoidal Pyramid Fusion Network for RGB-D Salient Object Detection. ELECTRONICS 2022. [DOI: 10.3390/electronics11131968] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
Abstract
RGB-D salient object detection (SOD) aims at locating the most eye-catching object in visual input by fusing complementary information of RGB modality and depth modality. Most of the existing RGB-D SOD methods integrate multi-modal features to generate the saliency map indiscriminately, ignoring the ambiguity between different modalities. To better use multi-modal complementary information and alleviate the negative impact of ambiguity among different modalities, this paper proposes a novel Alternate Steered Attention and Trapezoidal Pyramid Fusion Network (A2TPNet) for RGB-D SOD composed of Cross-modal Alternate Fusion Module (CAFM) and Trapezoidal Pyramid Fusion Module (TPFM). CAFM is focused on fusing cross-modal features, taking full consideration of the ambiguity between cross-modal data by an Alternate Steered Attention (ASA), and it reduces the interference of redundant information and non-salient features in the interactive process through a collaboration mechanism containing channel attention and spatial attention. TPFM endows the RGB-D SOD model with more powerful feature expression capabilities by combining multi-scale features to enhance the expressive ability of contextual semantics of the model. Extensive experimental results on five publicly available datasets demonstrate that the proposed model consistently outperforms 17 state-of-the-art methods.
Collapse
|
19
|
Zhao Z, Huang Z, Chai X, Wang J. Depth Enhanced Cross-Modal Cascaded Network for RGB-D Salient Object Detection. Neural Process Lett 2022. [DOI: 10.1007/s11063-022-10886-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
20
|
Xu Y, Yu X, Zhang J, Zhu L, Wang D. Weakly Supervised RGB-D Salient Object Detection With Prediction Consistency Training and Active Scribble Boosting. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2022; 31:2148-2161. [PMID: 35196231 DOI: 10.1109/tip.2022.3151999] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
RGB-D salient object detection (SOD) has attracted increasingly more attention as it shows more robust results in complex scenes compared with RGB SOD. However, state-of-the-art RGB-D SOD approaches heavily rely on a large amount of pixel-wise annotated data for training. Such densely labeled annotations are often labor-intensive and costly. To reduce the annotation burden, we investigate RGB-D SOD from a weakly supervised perspective. More specifically, we use annotator-friendly scribble annotations as supervision signals for model training. Since scribble annotations are much sparser compared to ground-truth masks, some critical object structure information might be neglected. To preserve such structure information, we explicitly exploit the complementary edge information from two modalities (i.e., RGB and depth). Specifically, we leverage the dual-modal edge guidance and introduce a new network architecture with a dual-edge detection module and a modality-aware feature fusion module. In order to use the useful information of unlabeled pixels, we introduce a prediction consistency training scheme by comparing the predictions of two networks optimized by different strategies. Moreover, we develop an active scribble boosting strategy to provide extra supervision signals with negligible annotation cost, leading to significant SOD performance improvement. Extensive experiments on seven benchmarks validate the superiority of our proposed method. Remarkably, the proposed method with scribble annotations achieves competitive performance in comparison to fully supervised state-of-the-art methods.
Collapse
|
21
|
Zhao Y, Zhao J, Li J, Chen X. RGB-D Salient Object Detection With Ubiquitous Target Awareness. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:7717-7731. [PMID: 34478368 DOI: 10.1109/tip.2021.3108412] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Conventional RGB-D salient object detection methods aim to leverage depth as complementary information to find the salient regions in both modalities. However, the salient object detection results heavily rely on the quality of captured depth data which sometimes are unavailable. In this work, we make the first attempt to solve the RGB-D salient object detection problem with a novel depth-awareness framework. This framework only relies on RGB data in the testing phase, utilizing captured depth data as supervision for representation learning. To construct our framework as well as achieving accurate salient detection results, we propose a Ubiquitous Target Awareness (UTA) network to solve three important challenges in RGB-D SOD task: 1) a depth awareness module to excavate depth information and to mine ambiguous regions via adaptive depth-error weights, 2) a spatial-aware cross-modal interaction and a channel-aware cross-level interaction, exploiting the low-level boundary cues and amplifying high-level salient channels, and 3) a gated multi-scale predictor module to perceive the object saliency in different contextual scales. Besides its high performance, our proposed UTA network is depth-free for inference and runs in real-time with 43 FPS. Experimental evidence demonstrates that our proposed network not only surpasses the state-of-the-art methods on five public RGB-D SOD benchmarks by a large margin, but also verifies its extensibility on five public RGB SOD benchmarks.
Collapse
|
22
|
Deng Y, Chen H, Chen H, Li Y. Learning From Images: A Distillation Learning Framework for Event Cameras. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:4919-4931. [PMID: 33961557 DOI: 10.1109/tip.2021.3077136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Event cameras have recently drawn massive attention in the computer vision community because of their low power consumption and high response speed. These cameras produce sparse and non-uniform spatiotemporal representations of a scene. These characteristics of representations make it difficult for event-based models to extract discriminative cues (such as textures and geometric relationships). Consequently, event-based methods usually perform poorly compared to their conventional image counterparts. Considering that traditional images and event signals share considerable visual information, this paper aims to improve the feature extraction ability of event-based models by using knowledge distilled from the image domain to additionally provide explicit feature-level supervision for the learning of event data. Specifically, we propose a simple yet effective distillation learning framework, including multi-level customized knowledge distillation constraints. Our framework can significantly boost the feature extraction process for event data and is applicable to various downstream tasks. We evaluate our framework on high-level and low-level tasks, i.e., object classification and optical flow prediction. Experimental results show that our framework can effectively improve the performance of event-based models on both tasks by a large margin. Furthermore, we present a 10K dataset (CEP-DVS) for event-based object classification. This dataset consists of samples recorded under random motion trajectories that can better evaluate the motion robustness of the event-based model and is compatible with multi-modality vision tasks.
Collapse
|
23
|
Li G, Liu Z, Chen M, Bai Z, Lin W, Ling H. Hierarchical Alternate Interaction Network for RGB-D Salient Object Detection. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:3528-3542. [PMID: 33667161 DOI: 10.1109/tip.2021.3062689] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Existing RGB-D Salient Object Detection (SOD) methods take advantage of depth cues to improve the detection accuracy, while pay insufficient attention to the quality of depth information. In practice, a depth map is often with uneven quality and sometimes suffers from distractors, due to various factors in the acquisition procedure. In this article, to mitigate distractors in depth maps and highlight salient objects in RGB images, we propose a Hierarchical Alternate Interactions Network (HAINet) for RGB-D SOD. Specifically, HAINet consists of three key stages: feature encoding, cross-modal alternate interaction, and saliency reasoning. The main innovation in HAINet is the Hierarchical Alternate Interaction Module (HAIM), which plays a key role in the second stage for cross-modal feature interaction. HAIM first uses RGB features to filter distractors in depth features, and then the purified depth features are exploited to enhance RGB features in turn. The alternate RGB-depth-RGB interaction proceeds in a hierarchical manner, which progressively integrates local and global contexts within a single feature scale. In addition, we adopt a hybrid loss function to facilitate the training of HAINet. Extensive experiments on seven datasets demonstrate that our HAINet not only achieves competitive performance as compared with 19 relevant state-of-the-art methods, but also reaches a real-time processing speed of 43 fps on a single NVIDIA Titan X GPU. The code and results of our method are available at https://github.com/MathLee/HAINet.
Collapse
|
24
|
Jin WD, Xu J, Han Q, Zhang Y, Cheng MM. CDNet: Complementary Depth Network for RGB-D Salient Object Detection. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:3376-3390. [PMID: 33646949 DOI: 10.1109/tip.2021.3060167] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Current RGB-D salient object detection (SOD) methods utilize the depth stream as complementary information to the RGB stream. However, the depth maps are usually of low-quality in existing RGB-D SOD datasets. Most RGB-D SOD networks trained with these datasets would produce error-prone results. In this paper, we propose a novel Complementary Depth Network (CDNet) to well exploit saliency-informative depth features for RGB-D SOD. To alleviate the influence of low-quality depth maps to RGB-D SOD, we propose to select saliency-informative depth maps as the training targets and leverage RGB features to estimate meaningful depth maps. Besides, to learn robust depth features for accurate prediction, we propose a new dynamic scheme to fuse the depth features extracted from the original and estimated depth maps with adaptive weights. What's more, we design a two-stage cross-modal feature fusion scheme to well integrate the depth features with the RGB ones, further improving the performance of our CDNet on RGB-D SOD. Experiments on seven benchmark datasets demonstrate that our CDNet outperforms state-of-the-art RGB-D SOD methods. The code is publicly available at https://github.com/blanclist/CDNet.
Collapse
|
25
|
Zhou T, Fan DP, Cheng MM, Shen J, Shao L. RGB-D salient object detection: A survey. COMPUTATIONAL VISUAL MEDIA 2021; 7:37-69. [PMID: 33432275 PMCID: PMC7788385 DOI: 10.1007/s41095-020-0199-z] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/31/2020] [Accepted: 10/07/2020] [Indexed: 06/12/2023]
Abstract
Salient object detection, which simulates human visual perception in locating the most significant object(s) in a scene, has been widely applied to various computer vision tasks. Now, the advent of depth sensors means that depth maps can easily be captured; this additional spatial information can boost the performance of salient object detection. Although various RGB-D based salient object detection models with promising performance have been proposed over the past several years, an in-depth understanding of these models and the challenges in this field remains lacking. In this paper, we provide a comprehensive survey of RGB-D based salient object detection models from various perspectives, and review related benchmark datasets in detail. Further, as light fields can also provide depth maps, we review salient object detection models and popular benchmark datasets from this domain too. Moreover, to investigate the ability of existing models to detect salient objects, we have carried out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models. Finally, we discuss several challenges and open directions of RGB-D based salient object detection for future research. All collected models, benchmark datasets, datasets constructed for attribute-based evaluation, and related code are publicly available at https://github.com/taozh2017/RGBD-SODsurvey.
Collapse
Affiliation(s)
- Tao Zhou
- Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates
| | - Deng-Ping Fan
- Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates
| | | | - Jianbing Shen
- Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates
| | - Ling Shao
- Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates
| |
Collapse
|