1. Duan S, Yang X, Wang N, Gao X. Lightweight RGB-D Salient Object Detection From a Speed-Accuracy Tradeoff Perspective. IEEE Trans Image Process 2025;34:2529-2543. PMID: 40249695. DOI: 10.1109/tip.2025.3560488.
Abstract
Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency, while several existing lightweight methods struggle to achieve high-precision performance. To balance efficiency and performance, we propose a Speed-Accuracy Tradeoff Network (SATNet) for lightweight RGB-D SOD that addresses three fundamental aspects: depth quality, modality fusion, and feature representation. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps, which effectively alleviates the multi-modal gaps in current datasets. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities; here, the multi-modal features are decoupled into dual-view feature vectors to project the discriminable information of the feature maps. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework to enlarge the limited feature space generated by lightweight backbones. DIRM models texture features and saliency features to enrich the feature space and employs two-way prediction heads to optimize its parameters through bi-directional backpropagation. Finally, we design a Dual Feature Aggregation Module (DFAM) in the decoder to aggregate texture and saliency features. Extensive experiments on five public RGB-D SOD datasets indicate that the proposed SATNet outperforms state-of-the-art (SOTA) CNN-based heavyweight models while remaining a lightweight framework with 5.2 M parameters running at 415 FPS. The code is available at https://github.com/duan-song/SATNet.
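As an illustration of the decoupled, dual-view attention idea described in this abstract, the following PyTorch sketch re-weights RGB and depth features with a channel-view vector and a spatial-view map before fusion. It is a minimal, hypothetical reading of the DAM, not the authors' implementation; the layer sizes, pooling choices, and additive fusion are assumptions.
```python
import torch
import torch.nn as nn

class DualViewAttention(nn.Module):
    """Decouple a feature map into a channel-view vector and a spatial-view map,
    then use both to re-weight RGB and depth features before fusion (illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = rgb + depth                                   # shared evidence from both modalities
        # Channel view: global average pooling -> per-channel gate
        c = self.channel_fc(x.mean(dim=(2, 3))).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        # Spatial view: mean/max over channels -> per-pixel gate
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        s = self.spatial_conv(s)                          # (B, 1, H, W)
        # Re-weight each modality with both views, then fuse additively
        return rgb * c * s + depth * c * s

if __name__ == "__main__":
    fuse = DualViewAttention(64)
    out = fuse(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```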
2. Yang Z, Cao Z, Cao J, Chen Z, Peng C. Multibranch semantic image segmentation model based on edge optimization and category perception. PLoS One 2024;19:e0315621. PMID: 39700236. DOI: 10.1371/journal.pone.0315621.
Abstract
In semantic image segmentation tasks, most methods fail to fully use the characteristics of different scales and levels and instead perform upsampling directly. This may cause effective information to be mistaken for redundant information and discarded, which in turn causes object segmentation confusion. As a convolutional layer deepens, the loss of spatial detail information makes segmentation at object boundaries insufficiently accurate. To address the above problems, we propose an edge optimization and category-aware multibranch semantic segmentation network (ECMNet). First, an attention-guided multibranch fusion backbone network connects features with different resolutions in parallel and performs multiscale information interaction to reduce the loss of spatial detail information. Second, a category perception module learns category feature representations and guides the pixel classification process through an attention mechanism to improve segmentation accuracy. Finally, an edge optimization module integrates edge features into the middle and deep supervision layers of the network through an adaptive algorithm to enhance the network's ability to express edge features and optimize edge segmentation. The experimental results show that the MIoU reaches 79.2% on the Cityscapes dataset and 79.6% on the CamVid dataset, that the number of parameters is significantly lower than those of other models, and that the proposed method effectively improves semantic image segmentation performance and alleviates the partial category segmentation confusion problem, giving it promising application prospects.
Affiliation(s)
- Zhuolin Yang: Department of Computer Science and Technology, Xinzhou Normal University, Xinzhou, China; School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, China
- Zhen Cao: Department of Computer Science and Technology, Xinzhou Normal University, Xinzhou, China; School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, China
- Jianfang Cao: Department of Computer Science and Technology, Xinzhou Normal University, Xinzhou, China; School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, China
- Zhiqiang Chen: Department of Big Data and Intelligent Engineering, Shanxi Institute of Technology, Yangquan, China
- Cunhe Peng: Department of Computer Science and Technology, Xinzhou Normal University, Xinzhou, China; School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, China
3. Tang Y, Li M. DMGNet: Depth mask guiding network for RGB-D salient object detection. Neural Netw 2024;180:106751. PMID: 39332209. DOI: 10.1016/j.neunet.2024.106751.
Abstract
Though depth images can provide supplementary spatial structural cues for the salient object detection (SOD) task, inappropriate utilization of depth features may introduce noisy or misleading features that greatly degrade SOD performance. To address this issue, we propose a depth mask guiding network (DMGNet) for RGB-D SOD. In this network, a depth mask guidance module (DMGM) is designed to pre-segment the salient objects from the depth images and then create masks from the pre-segmented objects to guide the RGB subnetwork in extracting more discriminative features. Furthermore, a feature fusion pyramid module (FFPM) is employed to acquire more informative fused features using multi-branch convolutional channels with varying receptive fields, further enhancing the fusion of cross-modal features. Extensive experiments on nine benchmark datasets demonstrate the effectiveness of the proposed network.
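A minimal sketch of the depth-mask-guidance idea: a small head pre-segments a coarse salient mask from depth features and uses it to gate the RGB features. This is not the DMGM from the paper; the head architecture and the residual gating form are assumptions.
```python
import torch
import torch.nn as nn

class DepthMaskGuidance(nn.Module):
    def __init__(self, depth_channels: int):
        super().__init__()
        # Coarse pre-segmentation head on depth features (assumed architecture)
        self.mask_head = nn.Sequential(
            nn.Conv2d(depth_channels, depth_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(depth_channels, 1, 1),
            nn.Sigmoid())

    def forward(self, depth_feat: torch.Tensor, rgb_feat: torch.Tensor):
        mask = self.mask_head(depth_feat)        # (B, 1, H, W) coarse salient-object mask
        guided_rgb = rgb_feat * (1.0 + mask)     # residual gating: emphasize masked regions
        return guided_rgb, mask                  # the mask can also receive auxiliary supervision
```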
Affiliation(s)
- Yinggan Tang: School of Electrical Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China; Key Laboratory of Intelligent Rehabilitation and Neuromodulation of Hebei Province, Yanshan University, Qinhuangdao, Hebei 066004, China; Key Laboratory of Industrial Computer Control Engineering of Hebei Province, Yanshan University, Qinhuangdao, Hebei 066004, China
- Mengyao Li: School of Electrical Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China
4. Tong Y, Chen Z, Zhou Z, Hu Y, Li X, Qiao X. An Edge-Enhanced Network for Polyp Segmentation. Bioengineering (Basel) 2024;11:959. PMID: 39451335; PMCID: PMC11504364. DOI: 10.3390/bioengineering11100959.
Abstract
Colorectal cancer remains a leading cause of cancer-related deaths worldwide, with early detection and removal of polyps being critical in preventing disease progression. Automated polyp segmentation, particularly in colonoscopy images, is a challenging task due to the variability in polyp appearance and the low contrast between polyps and surrounding tissues. In this work, we propose an edge-enhanced network (EENet) designed to address these challenges by integrating two novel modules: the covariance edge-enhanced attention (CEEA) and cross-scale edge enhancement (CSEE) modules. The CEEA module leverages covariance-based attention to enhance boundary detection, while the CSEE module bridges multi-scale features to preserve fine-grained edge details. To further improve the accuracy of polyp segmentation, we introduce a hybrid loss function that combines cross-entropy loss with edge-aware loss. Extensive experiments show that the EENet achieves a Dice score of 0.9208 and an IoU of 0.8664 on the Kvasir-SEG dataset, surpassing state-of-the-art models such as Polyp-PVT and PraNet. Furthermore, it records a Dice score of 0.9316 and an IoU of 0.8817 on the CVC-ClinicDB dataset, demonstrating its strong potential for clinical application in polyp segmentation. Ablation studies further validate the contribution of the CEEA and CSEE modules.
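The hybrid loss described here (cross-entropy plus an edge-aware term) can be sketched as below. The Laplacian-based boundary extraction and the weighting are assumptions, not the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def edge_map(mask: torch.Tensor) -> torch.Tensor:
    """Approximate boundary map of a binary mask (B, 1, H, W) via a Laplacian kernel."""
    k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                     device=mask.device).view(1, 1, 3, 3)
    return (F.conv2d(mask, k, padding=1).abs() > 0).float()

def hybrid_loss(logits: torch.Tensor, target: torch.Tensor, edge_weight: float = 1.0):
    bce = F.binary_cross_entropy_with_logits(logits, target)
    edges = edge_map(target)
    # Edge-aware term: extra BCE penalty restricted to boundary pixels
    per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    edge_loss = (per_pixel * edges).sum() / (edges.sum() + 1e-6)
    return bce + edge_weight * edge_loss
```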
Affiliation(s)
- Yao Tong: School of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing 210023, China; Jiangsu Province Engineering Research Center of TCM Intelligence Health Service, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Ziqi Chen: Vanke School of Public Health, Tsinghua University, Beijing 100084, China
- Zuojian Zhou: School of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing 210023, China; Jiangsu Province Engineering Research Center of TCM Intelligence Health Service, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Yun Hu: School of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing 210023, China; Jiangsu Province Engineering Research Center of TCM Intelligence Health Service, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Xin Li: College of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China
- Xuebin Qiao: Jiangsu Province Engineering Research Center of TCM Intelligence Health Service, Nanjing University of Chinese Medicine, Nanjing 210023, China; School of Elderly Care Services and Management, Nanjing University of Chinese Medicine, Nanjing 210023, China
5. Pei J, Jiang T, Tang H, Liu N, Jin Y, Fan DP, Heng PA. CalibNet: Dual-Branch Cross-Modal Calibration for RGB-D Salient Instance Segmentation. IEEE Trans Image Process 2024;33:4348-4362. PMID: 39074016. DOI: 10.1109/tip.2024.3432328.
Abstract
In this study, we propose a novel approach for RGB-D salient instance segmentation using a dual-branch cross-modal feature calibration architecture called CalibNet. Our method simultaneously calibrates depth and RGB features in the kernel and mask branches to generate instance-aware kernels and mask features. CalibNet consists of three simple modules: a dynamic interactive kernel (DIK) module and a weight-sharing fusion (WSF) module, which work together to generate effective instance-aware kernels and integrate cross-modal features, and a depth similarity assessment (DSA) module placed before DIK and WSF to improve the quality of the depth features. In addition, we contribute a new DSIS dataset, which contains 1,940 images with elaborate instance-level annotations. Extensive experiments on three challenging benchmarks show that CalibNet yields promising results, e.g., 58.0% AP with a 320×480 input size on the COME15K-E test set, significantly surpassing alternative frameworks. Our code and dataset will be publicly available at: https://github.com/PJLallen/CalibNet.
6. Peng D, Zhou W, Pan J, Wang D. MSEDNet: Multi-scale fusion and edge-supervised network for RGB-T salient object detection. Neural Netw 2024;171:410-422. PMID: 38141476. DOI: 10.1016/j.neunet.2023.12.031.
Abstract
RGB-T salient object detection (SOD) aims to accurately segment salient regions in both visible-light images and thermal infrared images. However, most existing SOD methods neglect the critical complementarity between the two modalities, which could further improve detection accuracy. Therefore, this work introduces MSEDNet, an RGB-T SOD method. We utilize an encoder to extract multi-level features from both visible-light and thermal infrared images, which are subsequently categorized into high, medium, and low levels. Additionally, we propose three separate feature fusion modules to comprehensively extract complementary information between the modalities during fusion. These modules are applied to specific feature levels: the Edge Dilation Sharpening module for low-level features, the Spatial and Channel-Aware module for mid-level features, and the Cross-Residual Fusion module for high-level features. Finally, we introduce an edge fusion loss function for supervised learning, which effectively extracts edge information from the different modalities and suppresses background noise. Comparative experiments demonstrate the superiority of the proposed MSEDNet over other state-of-the-art methods. The code and results can be found at the following link: https://github.com/Zhou-wy/MSEDNet.
Affiliation(s)
- Daogang Peng: College of Automation Engineering, Shanghai University of Electric Power, 2588 Changyang Road, Yangpu, Shanghai 200090, China
- Weiyi Zhou: College of Automation Engineering, Shanghai University of Electric Power, 2588 Changyang Road, Yangpu, Shanghai 200090, China
- Junzhen Pan: College of Automation Engineering, Shanghai University of Electric Power, 2588 Changyang Road, Yangpu, Shanghai 200090, China
- Danhao Wang: College of Automation Engineering, Shanghai University of Electric Power, 2588 Changyang Road, Yangpu, Shanghai 200090, China
7. Zhou T, Zhang X, Lu H, Li Q, Liu L, Zhou H. GMRE-iUnet: Isomorphic Unet fusion model for PET and CT lung tumor images. Comput Biol Med 2023;166:107514. PMID: 37826951. DOI: 10.1016/j.compbiomed.2023.107514.
Abstract
Lung tumor PET and CT image fusion is a key technology in clinical diagnosis. However, existing fusion methods struggle to produce fused images with high contrast, prominent morphological features, and accurate spatial localization. In this paper, an isomorphic Unet fusion model (GMRE-iUnet) for lung tumor PET and CT images is proposed to address these problems. The main ideas of this network are as follows. First, we construct an isomorphic Unet fusion network containing two independent multiscale dual-encoder Unets, which captures lesion-region features and spatial localization and enriches morphological information. Second, a hybrid CNN-Transformer feature extraction module (HCTrans) is constructed to effectively integrate local lesion features and global contextual information; in addition, a residual axial attention feature compensation module (RAAFC) is embedded into the Unet to capture fine-grained information as compensation features, making the model focus on local connections between neighboring pixels. Third, a hybrid attention feature fusion module (HAFF) is designed for multiscale feature fusion; it aggregates edge information and detail representations using local entropy and Gaussian filtering. Finally, experimental results on a multimodal lung tumor medical image dataset show that the proposed model achieves excellent fusion performance compared with eight other fusion models. In comparison experiments on CT mediastinal window images and PET images, the AG, EI, QAB/F, SF, SD, and IE indexes are improved by 16.19%, 26%, 3.81%, 1.65%, 3.91%, and 8.01%, respectively. GMRE-iUnet can highlight the information and morphological features of lesion areas and provide practical help for the computer-aided diagnosis of lung tumors.
Affiliation(s)
- Tao Zhou: School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China; Key Laboratory of Image and Graphics Intelligent Processing of State Ethnic Affairs Commission, North Minzu University, Yinchuan 750021, China
- Xiangxiang Zhang: School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China
- Huiling Lu: School of Medical Information & Engineering, Ningxia Medical University, Yinchuan 750004, China
- Qi Li: School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China
- Long Liu: School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China
- Huiyu Zhou: School of Computing and Mathematical Sciences, University of Leicester, LE1 7RH, United Kingdom
8. Liu Z, Hayat M, Yang H, Peng D, Lei Y. Deep Hypersphere Feature Regularization for Weakly Supervised RGB-D Salient Object Detection. IEEE Trans Image Process 2023;32:5423-5437. PMID: 37773910. DOI: 10.1109/tip.2023.3318953.
Abstract
We propose a weakly supervised approach for salient object detection from multi-modal RGB-D data. Our approach relies only on scribble labels, which are much easier to annotate than the dense labels used in the conventional fully supervised setting. In contrast to existing methods that apply supervision signals on the output space, our design regularizes the intermediate latent space to enhance discrimination between salient and non-salient objects. We further introduce a contour detection branch to implicitly constrain the semantic boundaries and achieve precise edges for detected salient objects. To enhance the long-range dependencies among local features, we introduce a Cross-Padding Attention Block (CPAB). Extensive experiments on seven benchmark datasets demonstrate that our method not only outperforms existing weakly supervised methods but is also on par with several fully supervised state-of-the-art models. Code is available at https://github.com/leolyj/DHFR-SOD.
9. Yu H, Li Z, Li W, Guo W, Li D, Wang L, Wu M, Wang Y. A Tiny Object Detection Approach for Maize Cleaning Operations. Foods 2023;12:2885. PMID: 37569154; PMCID: PMC10418751. DOI: 10.3390/foods12152885.
Abstract
Real-time and accurate awareness of the grain situation is beneficial for making targeted and dynamic adjustments to cleaning parameters and strategies, leading to efficient and effective removal of impurities with minimal losses. In this study, harvested maize was employed as the raw material, and a specialized object detection network for impurity-containing maize images was developed to determine the types and distribution of impurities during cleaning operations. On the basis of the classic Faster Region-based Convolutional Neural Network (Faster R-CNN), EfficientNetB7 was introduced as the backbone of the feature learning network, and a cross-stage feature integration mechanism was embedded to obtain global features containing multi-scale mappings. The spatial information and semantic descriptions of feature matrices from different hierarchies were fused through continuous convolution and upsampling operations. At the same time, taking into account the geometric properties of the objects to be detected and the images' resolution, an adaptive region proposal network (ARPN) was designed to generate candidate boxes with appropriate sizes for the detectors, which benefits the capture and localization of tiny objects. The effectiveness of the proposed tiny object detection model and each improved component were validated through ablation experiments on the constructed RGB impurity-containing image datasets.
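The adaptive proposal idea, sizing candidate boxes from the measured geometry of the objects to be detected, can be illustrated with a simple anchor generator. The percentile-based scales, aspect ratios, and stride below are illustrative assumptions rather than the ARPN's actual design.
```python
import itertools
import numpy as np

def adaptive_anchors(feature_h, feature_w, stride, object_sizes_px, ratios=(0.5, 1.0, 2.0)):
    """Build anchor boxes whose base scales come from percentiles of observed object sizes."""
    sizes = np.asarray(object_sizes_px, dtype=np.float32)
    scales = np.percentile(sizes, [25, 50, 75])          # data-driven base sizes
    anchors = []
    for y, x in itertools.product(range(feature_h), range(feature_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
        for s, r in itertools.product(scales, ratios):
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.asarray(anchors, dtype=np.float32)         # (H*W*len(scales)*len(ratios), 4)

# Example: anchors for a 50x50 feature map (stride 16) from hypothetical measured impurity sizes
boxes = adaptive_anchors(50, 50, 16, object_sizes_px=[12, 18, 24, 30, 36])
print(boxes.shape)
```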
Affiliation(s)
- Haoze Yu: Beijing Advanced Innovation Center for Food Nutrition and Human Health, College of Engineering, China Agricultural University, 17 Qinghua Donglu, P.O. Box 50, Beijing 100083, China
- Zhuangzi Li: School of Electronic and Computer Engineering, Peking University, Shenzhen 518055, China
- Wei Li: Beijing Advanced Innovation Center for Food Nutrition and Human Health, College of Engineering, China Agricultural University, 17 Qinghua Donglu, P.O. Box 50, Beijing 100083, China
- Wenbo Guo: Beijing Advanced Innovation Center for Food Nutrition and Human Health, College of Engineering, China Agricultural University, 17 Qinghua Donglu, P.O. Box 50, Beijing 100083, China
- Dong Li: Beijing Advanced Innovation Center for Food Nutrition and Human Health, College of Engineering, China Agricultural University, 17 Qinghua Donglu, P.O. Box 50, Beijing 100083, China
- Lijun Wang: Beijing Key Laboratory of Functional Food from Plant Resources, College of Food Science and Nutritional Engineering, China Agricultural University, Beijing 100083, China
- Min Wu: Beijing Advanced Innovation Center for Food Nutrition and Human Health, College of Engineering, China Agricultural University, 17 Qinghua Donglu, P.O. Box 50, Beijing 100083, China
- Yong Wang: School of Chemical Engineering, University of New South Wales, Sydney, NSW 2052, Australia
10. Xiao J, Chen T, Hu X, Zhang G, Wang S. Boundary-guided context-aware network for camouflaged object detection. Neural Comput Appl 2023. DOI: 10.1007/s00521-023-08502-3.
11. Zhai Q, Li X, Yang F, Jiao Z, Luo P, Cheng H, Liu Z. MGL: Mutual Graph Learning for Camouflaged Object Detection. IEEE Trans Image Process 2023;32:1897-1910. PMID: 36417725. DOI: 10.1109/tip.2022.3223216.
Abstract
Camouflaged object detection, which aims to detect/segment objects that blend in with their surroundings, remains challenging for deep models due to the intrinsic similarities between foreground objects and background surroundings. Ideally, an effective model should be capable of finding valuable clues in the given scene and integrating them into a joint learning framework to co-enhance the representation. Inspired by this observation, we propose a novel Mutual Graph Learning (MGL) model by shifting the conventional perspective of mutual learning from regular grids to the graph domain. Specifically, an image is decoupled by MGL into two task-specific feature maps: one for finding the rough location of the target and the other for capturing its accurate boundary details. The mutual benefits are then fully exploited by recurrently reasoning about their high-order relations through graphs. Note that our method differs from most mutual learning models, which model all between-task interactions with a shared function; to increase information interaction, MGL is built with typed functions for handling different complementary relations. To overcome the accuracy loss caused by interpolation to higher resolution and the computational redundancy resulting from recurrent learning, S-MGL is equipped with a multi-source attention contextual recovery module, called R-MGL_v2, which uses the pixel feature information iteratively. Experiments on challenging datasets, including CHAMELEON, CAMO, COD10K, and NC4K, demonstrate the effectiveness of our MGL, with superior performance to existing state-of-the-art methods. The code can be found at https://github.com/fanyang587/MGL.
12. Pang Y, Zhao X, Zhang L, Lu H. CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection. IEEE Trans Image Process 2023;32:892-904. PMID: 37018701. DOI: 10.1109/tip.2023.3234702.
Abstract
Most of the existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation constrains the performance of the convolution-based methods to a ceiling. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed cross-modal view-mixed transformer (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats the multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when it is equipped with the proposed components.
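The parameter-free patch-wise token re-embedding mentioned here can be sketched as average pooling of one modality's tokens before cross-attention, which shrinks the quadratic attention cost. This single-head sketch only illustrates the general mechanism; the pooling window and attention form are assumptions, not the CAVER design.
```python
import torch
import torch.nn.functional as F

def cross_modal_attention(rgb_feat, depth_feat, patch=2):
    """rgb_feat, depth_feat: (B, C, H, W). Depth tokens are re-embedded by average
    pooling over patch x patch windows, reducing the attention cost by patch**2."""
    B, C, H, W = rgb_feat.shape
    q = rgb_feat.flatten(2).transpose(1, 2)                 # (B, HW, C) queries
    kv = F.avg_pool2d(depth_feat, patch)                    # parameter-free token re-embedding
    kv = kv.flatten(2).transpose(1, 2)                      # (B, HW/patch^2, C) keys/values
    attn = torch.softmax(q @ kv.transpose(1, 2) / C ** 0.5, dim=-1)
    out = attn @ kv                                         # (B, HW, C)
    return out.transpose(1, 2).reshape(B, C, H, W)
```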
13. Wu Z, Allibert G, Meriaudeau F, Ma C, Demonceaux C. HiDAnet: RGB-D Salient Object Detection via Hierarchical Depth Awareness. IEEE Trans Image Process 2023;32:2160-2173. PMID: 37027289. DOI: 10.1109/tip.2023.3263111.
Abstract
RGB-D saliency detection aims to fuse multi-modal cues to accurately localize salient regions. Existing works often adopt attention modules for feature modeling, with few methods explicitly leveraging fine-grained details to merge with semantic cues. Thus, despite the auxiliary depth information, it is still challenging for existing models to distinguish objects with similar appearances but at distinct camera distances. In this paper, from a new perspective, we propose a novel Hierarchical Depth Awareness network (HiDAnet) for RGB-D saliency detection. Our motivation comes from the observation that the multi-granularity properties of geometric priors correlate well with the neural network hierarchies. To realize multi-modal and multi-level fusion, we first use a granularity-based attention scheme to strengthen the discriminatory power of RGB and depth features separately. Then we introduce a unified cross dual-attention module for multi-modal and multi-level fusion in a coarse-to-fine manner. The encoded multi-modal features are gradually aggregated into a shared decoder. Further, we exploit a multi-scale loss to take full advantage of the hierarchical information. Extensive experiments on challenging benchmark datasets demonstrate that our HiDAnet performs favorably over the state-of-the-art methods by large margins. The source code can be found in https://github.com/Zongwei97/HIDANet/.
14. Zhou X, Shen K, Weng L, Cong R, Zheng B, Zhang J, Yan C. Edge-Guided Recurrent Positioning Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Trans Cybern 2023;53:539-552. PMID: 35417369. DOI: 10.1109/tcyb.2022.3163152.
Abstract
Optical remote sensing images (RSIs) have been widely used in many applications, and one of the interesting issues in optical RSIs is salient object detection (SOD). However, due to diverse object types, various object scales, numerous object orientations, and cluttered backgrounds in optical RSIs, the performance of existing SOD models often degrades considerably. Meanwhile, cutting-edge SOD models targeting optical RSIs typically focus on suppressing cluttered backgrounds while neglecting the importance of edge information, which is crucial for obtaining precise saliency maps. To address this dilemma, this article proposes an edge-guided recurrent positioning network (ERPNet) to pop out salient objects in optical RSIs, whose key component is the edge-aware position attention unit (EPAU). First, the encoder gives salient objects a good representation, i.e., multilevel deep features, which are then delivered into two parallel decoders: 1) an edge extraction part and 2) a feature fusion part. The edge extraction module and the encoder form a U-shaped architecture, which not only provides accurate salient edge clues but also ensures the integrity of edge information by additionally deploying intra-connections. That is to say, edge features can be generated and reinforced by incorporating object features from the encoder. Meanwhile, each decoding step of the feature fusion module provides position attention for salient objects, where position cues are sharpened by the effective edge information and used to recurrently calibrate the misaligned decoding process. After that, we obtain the final saliency map by fusing all position attention cues. Extensive experiments are conducted on two public optical RSI datasets, and the results show that the proposed ERPNet can accurately and completely pop out salient objects, consistently outperforming state-of-the-art SOD models.
15. Wu YH, Liu Y, Xu J, Bian JW, Gu YC, Cheng MM. MobileSal: Extremely Efficient RGB-D Salient Object Detection. IEEE Trans Pattern Anal Mach Intell 2022;44:10261-10269. PMID: 34898430. DOI: 10.1109/tpami.2021.3134684.
Abstract
The high computational cost of neural networks has prevented recent successes in RGB-D salient object detection (SOD) from benefiting real-world applications. Hence, this article introduces a novel network, MobileSal, which focuses on efficient RGB-D SOD using mobile networks for deep feature extraction. However, mobile networks are less powerful in feature representation than cumbersome networks. To this end, we observe that the depth information of color images can strengthen the feature representation related to SOD if leveraged properly. Therefore, we propose an implicit depth restoration (IDR) technique to strengthen the mobile networks' feature representation capability for RGB-D SOD. IDR is only adopted in the training phase and is omitted during testing, so it is computationally free. Besides, we propose compact pyramid refinement (CPR) for efficient multi-level feature aggregation to derive salient objects with clear boundaries. With IDR and CPR incorporated, MobileSal performs favorably against state-of-the-art methods on six challenging RGB-D SOD datasets with much faster speed (450fps for the input size of 320×320) and fewer parameters (6.5M). The code is released at https://mmcheng.net/mobilesal.
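The implicit depth restoration idea, an auxiliary depth-regression head used only during training and dropped at inference, can be sketched as follows; the head design and the L1 objective are assumptions, not the paper's exact IDR implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDRHead(nn.Module):
    """Training-only auxiliary head that regresses the depth map from RGB features."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // 2, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // 2, 1, 1))

    def forward(self, rgb_feat: torch.Tensor, depth_gt: torch.Tensor) -> torch.Tensor:
        pred = self.head(rgb_feat)
        pred = F.interpolate(pred, size=depth_gt.shape[-2:], mode="bilinear",
                             align_corners=False)
        return F.l1_loss(torch.sigmoid(pred), depth_gt)   # auxiliary loss, training only

# Training step (sketch): total = saliency_loss + lambda_idr * idr_head(feat, depth_gt)
# At test time the head is simply not called, so it adds no inference cost.
```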
16. Chen T, Xiao J, Hu X, Zhang G, Wang S. Adaptive Fusion Network For RGB-D Salient Object Detection. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.12.004.
17. SGC-ARANet: scale-wise global contextual axile reverse attention network for automatic brain tumor segmentation. Appl Intell 2022. DOI: 10.1007/s10489-022-04209-5.
18. Few-shot learning-based RGB-D salient object detection: A case study. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.09.019.
19. Gao L, Liu B, Fu P, Xu M. Depth-aware Inverted Refinement Network for RGB-D Salient Object Detection. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.11.031.
20. Song M, Song W, Yang G, Chen C. Improving RGB-D Salient Object Detection via Modality-Aware Decoder. IEEE Trans Image Process 2022;31:6124-6138. PMID: 36112559. DOI: 10.1109/tip.2022.3205747.
Abstract
Most existing RGB-D salient object detection (SOD) methods focus primarily on cross-modal and cross-level saliency fusion, which has proved to be efficient and effective. However, these methods still have a critical limitation: their fusion patterns, typically the combination of selective characteristics and its variations, depend too heavily on the network's non-linear adaptability. In such methods, the balances between RGB and D (depth) are formulated individually for the intermediate feature slices, but the relation at the modality level may not be learned properly. The optimal RGB-D combination differs depending on the scenario, and the exact complementary status is frequently determined by multiple modality-level factors, such as depth quality, the complexity of the RGB scene, and the degree of harmony between them. Therefore, it may be difficult for existing approaches, which are somewhat less modality sensitive, to achieve further performance breakthroughs. To conquer this problem, this paper presents the Modality-aware Decoder (MaD). The critical technical innovations include a series of feature embedding, modality reasoning, and feature back-projecting and collecting strategies, all of which upgrade the widely used multi-scale and multi-level decoding process to be modality-aware. Our MaD achieves competitive performance over other state-of-the-art (SOTA) models without using any fancy tricks in the decoder's design. Codes and results will be publicly available at https://github.com/MengkeSong/MaD.
21. Wang Z, Wang P, Han Y, Zhang X, Sun M, Tian Q. Curiosity-Driven Salient Object Detection With Fragment Attention. IEEE Trans Image Process 2022;31:5989-6001. PMID: 36099213. DOI: 10.1109/tip.2022.3203605.
Abstract
Recent deep learning based salient object detection methods with attention mechanisms have achieved great success. However, existing attention mechanisms generally fall into two categories. One calculates weights indiscriminately, which yields computational redundancy; the other, such as hard attention, focuses on a small, randomly chosen part of the image, which leads to errors owing to insufficiently targeted selection of a subset of tokens. To alleviate these problems, we design a Curiosity-driven Network (CNet) and a Curiosity-driven Learning Algorithm (CLA) based on the fragment attention (FA) mechanism newly defined in this paper. FA imitates the process of cognitive perception driven by human curiosity and divides the degree of curiosity into three levels, i.e., curious, a little curious, and not curious. These three levels correspond to five saliency degrees: salient and non-salient, likewise salient and likewise non-salient, and completely uncertain. With more knowledge gained by the network, CLA transforms the curiosity degree of each pixel to yield enhanced, detail-enriched saliency maps. To extract more context-aware information about potential salient objects and lay a better foundation for CLA, a high-level feature extraction module (HFEM) is further proposed. Based on the much better high-level features extracted by HFEM, FA can classify the curiosity degree of each pixel more reasonably and accurately. Extensive experiments on five popular datasets clearly demonstrate that our method outperforms state-of-the-art approaches without any pre-processing or post-processing operations.
22. Yue H, Guo J, Yin X, Zhang Y, Zheng S, Zhang Z, Li C. Salient object detection in low-light images via functional optimization-inspired feature polishing. Knowl Based Syst 2022. DOI: 10.1016/j.knosys.2022.109938.
23. Zhang J, Fan DP, Dai Y, Anwar S, Saleh F, Aliakbarian S, Barnes N. Uncertainty Inspired RGB-D Saliency Detection. IEEE Trans Pattern Anal Mach Intell 2022;44:5761-5779. PMID: 33856982. DOI: 10.1109/tpami.2021.3073564.
Abstract
We propose the first stochastic framework to employ uncertainty for RGB-D saliency detection by learning from the data labeling process. Existing RGB-D saliency detection models treat this task as a point estimation problem by predicting a single saliency map following a deterministic learning pipeline. We argue that, however, the deterministic solution is relatively ill-posed. Inspired by the saliency data labeling process, we propose a generative architecture to achieve probabilistic RGB-D saliency detection which utilizes a latent variable to model the labeling variations. Our framework includes two main models: 1) a generator model, which maps the input image and latent variable to stochastic saliency prediction, and 2) an inference model, which gradually updates the latent variable by sampling it from the true or approximate posterior distribution. The generator model is an encoder-decoder saliency network. To infer the latent variable, we introduce two different solutions: i) a Conditional Variational Auto-encoder with an extra encoder to approximate the posterior distribution of the latent variable; and ii) an Alternating Back-Propagation technique, which directly samples the latent variable from the true posterior distribution. Qualitative and quantitative results on six challenging RGB-D benchmark datasets show our approach's superior performance in learning the distribution of saliency maps. The source code is publicly available via our project page: https://github.com/JingZhang617/UCNet.
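The conditional-VAE machinery behind such a stochastic framework can be sketched with a latent encoder, the reparameterization trick, and a KL term; the shapes and latent dimension below are assumptions, not the paper's configuration.
```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Infers a latent code from its input (e.g., image concatenated with ground truth)."""
    def __init__(self, in_channels: int, latent_dim: int = 8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1))
        self.mu = nn.Linear(32, latent_dim)
        self.logvar = nn.Linear(32, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

# Usage sketch: z, kl = posterior(torch.cat([image, gt_mask], dim=1)); a decoder tiles z over
# the feature map and predicts a stochastic saliency map; total loss = saliency loss + beta * kl.
```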
24. Image classification based on self-distillation. Appl Intell 2022. DOI: 10.1007/s10489-022-04008-y.
25. Zheng L, Xiao G, Shi Z, Wang S, Ma J. MSA-Net: Establishing Reliable Correspondences by Multiscale Attention Network. IEEE Trans Image Process 2022;31:4598-4608. PMID: 35776808. DOI: 10.1109/tip.2022.3186535.
Abstract
In this paper, we propose a novel multi-scale attention based network (called MSA-Net) for feature matching problems. Current deep network based feature matching methods suffer from limited effectiveness and robustness when applied to different scenarios, due to random distributions of outliers and insufficient information learning. To address this issue, we propose a multi-scale attention block to enhance robustness to outliers and improve the representational ability of the feature map. In addition, we design a novel context channel refine block and a context spatial refine block to mine contextual information with fewer parameters along the channel and spatial dimensions, respectively. The proposed MSA-Net is able to effectively infer the probability of correspondences being inliers with fewer parameters. Extensive experiments on outlier removal and relative pose estimation show performance improvements of our network over current state-of-the-art methods with fewer parameters on both outdoor and indoor datasets. Notably, our proposed network achieves an 11.7% improvement at an error threshold of 5° without RANSAC over the state-of-the-art method on the relative pose estimation task when trained on the YFCC100M dataset.
26. Pei J, Zhou T, Tang H, Liu C, Chen C. FGO-Net: Feature and Gaussian Optimization Network for visual saliency prediction. Appl Intell 2022. DOI: 10.1007/s10489-022-03647-5.
27. A2TPNet: Alternate Steered Attention and Trapezoidal Pyramid Fusion Network for RGB-D Salient Object Detection. Electronics 2022. DOI: 10.3390/electronics11131968.
Abstract
RGB-D salient object detection (SOD) aims at locating the most eye-catching object in visual input by fusing complementary information of RGB modality and depth modality. Most of the existing RGB-D SOD methods integrate multi-modal features to generate the saliency map indiscriminately, ignoring the ambiguity between different modalities. To better use multi-modal complementary information and alleviate the negative impact of ambiguity among different modalities, this paper proposes a novel Alternate Steered Attention and Trapezoidal Pyramid Fusion Network (A2TPNet) for RGB-D SOD composed of Cross-modal Alternate Fusion Module (CAFM) and Trapezoidal Pyramid Fusion Module (TPFM). CAFM is focused on fusing cross-modal features, taking full consideration of the ambiguity between cross-modal data by an Alternate Steered Attention (ASA), and it reduces the interference of redundant information and non-salient features in the interactive process through a collaboration mechanism containing channel attention and spatial attention. TPFM endows the RGB-D SOD model with more powerful feature expression capabilities by combining multi-scale features to enhance the expressive ability of contextual semantics of the model. Extensive experimental results on five publicly available datasets demonstrate that the proposed model consistently outperforms 17 state-of-the-art methods.
28. Spatiotemporal context-aware network for video salient object detection. Neural Comput Appl 2022. DOI: 10.1007/s00521-022-07330-1.
29. Object Detection by Attention-Guided Feature Fusion Network. Symmetry (Basel) 2022. DOI: 10.3390/sym14050887.
Abstract
One of the most noticeable characteristics of security issues is the prevalence of “Security Asymmetry”. The safety of production and even the lives of workers can be jeopardized if risk factors are not detected in time. Today, object detection technology plays a vital role in actual operating conditions. To warn of danger and ensure work safety, we propose the Attention-guided Feature Fusion Network (AFFN) method and apply it to helmet detection in this paper. The AFFN method, which reliably detects objects over a wider range of sizes, outperforms previous methods with an mAP of 85.3% and achieves an excellent result in helmet detection with an mAP of 62.4%. From objects of limited sizes to a wider range of sizes, the proposed method achieves “symmetry” in the sense of detection.
30. An Accurate Refinement Pathway for Visual Tracking. Information 2022. DOI: 10.3390/info13030147.
Abstract
Recently, in the field of visual object tracking, algorithms that combine tracking with visual object segmentation have achieved impressive results, using masks to label targets in the VOT2020 dataset. Most trackers obtain the object mask by increasing the resolution through multiple upsampling modules and gradually refining the mask by summation with features from the backbone network. However, this refinement pathway does not fully consider the spatial information of the backbone features, and therefore the segmentation results are imperfect. In this paper, the cross-stage and cross-resolution (CSCR) module is proposed to optimize the segmentation. This module makes full use of the semantic information of high-level features and the spatial information of low-level features and fuses them through skip connections to achieve very accurate segmentation. Experiments were conducted on the VOT dataset, and the results outperformed other excellent trackers and verified the effectiveness of the proposed algorithm.
31. Xu Y, Yu X, Zhang J, Zhu L, Wang D. Weakly Supervised RGB-D Salient Object Detection With Prediction Consistency Training and Active Scribble Boosting. IEEE Trans Image Process 2022;31:2148-2161. PMID: 35196231. DOI: 10.1109/tip.2022.3151999.
Abstract
RGB-D salient object detection (SOD) has attracted increasingly more attention as it shows more robust results in complex scenes compared with RGB SOD. However, state-of-the-art RGB-D SOD approaches heavily rely on a large amount of pixel-wise annotated data for training. Such densely labeled annotations are often labor-intensive and costly. To reduce the annotation burden, we investigate RGB-D SOD from a weakly supervised perspective. More specifically, we use annotator-friendly scribble annotations as supervision signals for model training. Since scribble annotations are much sparser compared to ground-truth masks, some critical object structure information might be neglected. To preserve such structure information, we explicitly exploit the complementary edge information from two modalities (i.e., RGB and depth). Specifically, we leverage the dual-modal edge guidance and introduce a new network architecture with a dual-edge detection module and a modality-aware feature fusion module. In order to use the useful information of unlabeled pixels, we introduce a prediction consistency training scheme by comparing the predictions of two networks optimized by different strategies. Moreover, we develop an active scribble boosting strategy to provide extra supervision signals with negligible annotation cost, leading to significant SOD performance improvement. Extensive experiments on seven benchmarks validate the superiority of our proposed method. Remarkably, the proposed method with scribble annotations achieves competitive performance in comparison to fully supervised state-of-the-art methods.
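Scribble supervision combined with prediction-consistency training can be sketched as a partial cross-entropy on annotated pixels plus an agreement term between two networks on the remaining pixels; the squared-difference consistency below is an assumption, not the paper's exact scheme.
```python
import torch
import torch.nn.functional as F

def partial_bce(logits, scribble, labeled_mask):
    """scribble: 0/1 labels on annotated pixels; labeled_mask: 1 where a scribble exists."""
    loss = F.binary_cross_entropy_with_logits(logits, scribble, reduction="none")
    return (loss * labeled_mask).sum() / (labeled_mask.sum() + 1e-6)

def consistency(logits_a, logits_b, labeled_mask):
    """Agreement term on unlabeled pixels between the two networks' predictions."""
    unlabeled = 1.0 - labeled_mask
    diff = (torch.sigmoid(logits_a) - torch.sigmoid(logits_b)) ** 2
    return (diff * unlabeled).sum() / (unlabeled.sum() + 1e-6)

# total = partial_bce(pred_a, s, m) + partial_bce(pred_b, s, m) + lam * consistency(pred_a, pred_b, m)
```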
32. Wang F, Pan J, Xu S, Tang J. Learning Discriminative Cross-Modality Features for RGB-D Saliency Detection. IEEE Trans Image Process 2022;31:1285-1297. PMID: 35015637. DOI: 10.1109/tip.2022.3140606.
Abstract
How to extract useful information from depth is key to the success of RGB-D saliency detection methods. Because the RGB and depth images come from different domains, the modality gap leads to unsatisfactory results with simple feature concatenation. Towards better performance, most methods focus on bridging this gap and designing different cross-modal fusion modules for features, while ignoring the explicit extraction of useful consistent information from them. To overcome this problem, we develop a simple yet effective RGB-D saliency detection method that learns discriminative cross-modality features with a deep neural network. The proposed method first learns modality-specific features for the RGB and depth inputs. Then we separately calculate the correlations of every pixel pair in a cross-modality consistent way, i.e., the distribution ranges are consistent for the correlations calculated from features extracted from the RGB input (RGB correlation) or the depth input (depth correlation). From different perspectives, color or spatial, the RGB and depth correlations end up at the same point to depict how tightly each pixel pair is related. Second, to complementarily gather RGB and depth information, we propose a novel correlation fusion to fuse the RGB and depth correlations, resulting in a cross-modality correlation. Finally, the features are refined with both long-range cross-modality correlations and local depth correlations to predict saliency maps, in which the long-range cross-modality correlation provides context information for accurate localization, and the local depth correlation preserves subtle structures for fine segmentation. In addition, a lightweight DepthNet is designed for efficient depth feature extraction. The proposed network is trained in an end-to-end manner. Both quantitative and qualitative experimental results demonstrate that the proposed algorithm achieves favorable performance against state-of-the-art methods.
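The cross-modality-consistent pixel-pair correlation can be sketched as cosine similarity between every pair of pixel descriptors, computed identically for RGB and depth features and then fused; the normalization and weighted-sum fusion are assumptions, not the paper's formulation.
```python
import torch
import torch.nn.functional as F

def pixel_pair_correlation(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) -> cosine similarity between every pixel pair, shape (B, HW, HW)."""
    f = F.normalize(feat.flatten(2), dim=1)        # unit-norm descriptor per pixel
    return f.transpose(1, 2) @ f                   # values in [-1, 1] for both modalities

def fused_correlation(rgb_feat, depth_feat, alpha=0.5):
    corr_rgb = pixel_pair_correlation(rgb_feat)
    corr_depth = pixel_pair_correlation(depth_feat)
    return alpha * corr_rgb + (1 - alpha) * corr_depth   # cross-modality correlation

# The fused correlation can then refine features non-locally, e.g. by multiplying a
# row-softmaxed correlation matrix with the flattened feature map.
```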
33. RGB-T salient object detection via CNN feature and result saliency map fusion. Appl Intell 2022. DOI: 10.1007/s10489-021-02984-1.
34.
Abstract
Deep learning has recently attracted extensive attention and developed significantly in remote sensing image super-resolution. Although remote sensing images are composed of various scenes, most existing methods consider each part equally. These methods ignore the salient objects (e.g., buildings, airplanes, and vehicles) that have more complex structures and require more attention in recovery processing. This paper proposes a saliency-guided remote sensing image super-resolution (SG-GAN) method to alleviate the above issue while maintaining the merits of GAN-based methods for the generation of perceptual-pleasant details. More specifically, we exploit the salient maps of images to guide the recovery in two aspects: On the one hand, the saliency detection network in SG-GAN learns more high-resolution saliency maps to provide additional structure priors. On the other hand, the well-designed saliency loss imposes a second-order restriction on the super-resolution process, which helps SG-GAN concentrate more on the salient objects of remote sensing images. Experimental results show that SG-GAN achieves competitive PSNR and SSIM compared with the advanced super-resolution methods. Visual results demonstrate our superiority in restoring structures while generating remote sensing super-resolution images.
35.
36. Zhai Y, Fan DP, Yang J, Borji A, Shao L, Han J, Wang L. Bifurcated Backbone Strategy for RGB-D Salient Object Detection. IEEE Trans Image Process 2021;30:8727-8742. PMID: 34613915. DOI: 10.1109/tip.2021.3116793.
Abstract
Multi-level feature fusion is a fundamental topic in computer vision. It has been exploited to detect, segment and classify objects at various scales. When multi-level features meet multi-modal cues, the optimal feature aggregation and multi-modal learning strategy become a hot potato. In this paper, we leverage the inherent multi-modal and multi-level nature of RGB-D salient object detection to devise a novel Bifurcated Backbone Strategy Network (BBS-Net). Our architecture is simple, efficient, and backbone-independent. In particular, we first propose to regroup the multi-level features into teacher and student features using a bifurcated backbone strategy (BBS). Second, we introduce a depth-enhanced module (DEM) to excavate informative depth cues from the channel and spatial views. Then, RGB and depth modalities are fused in a complementary way. Extensive experiments show that BBS-Net significantly outperforms 18 state-of-the-art (SOTA) models on eight challenging datasets under five evaluation measures, demonstrating the superiority of our approach (~4% improvement in S-measure vs. the top-ranked model DMRA). In addition, we provide a comprehensive analysis of the generalization ability of different RGB-D datasets and provide a powerful training set for future research. The complete algorithm, benchmark results, and post-processing toolbox are publicly available at https://github.com/zyjwuyan/BBS-Net.
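A depth-enhanced module that excavates depth cues from the channel and spatial views can be sketched with a squeeze-excitation-style channel gate followed by a spatial gate; the exact layer layout and the additive fusion with RGB are assumptions, not the DEM as published.
```python
import torch
import torch.nn as nn

class DepthEnhancedModule(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        depth = depth * self.channel_att(depth)    # channel view
        depth = depth * self.spatial_att(depth)    # spatial view
        return rgb + depth                         # complementary fusion with RGB
```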
37. Mao A, Huang E, Gan H, Parkes RSV, Xu W, Liu K. Cross-Modality Interaction Network for Equine Activity Recognition Using Imbalanced Multi-Modal Data. Sensors (Basel) 2021;21:5818. PMID: 34502709; PMCID: PMC8434387. DOI: 10.3390/s21175818.
Abstract
With the recent advances in deep learning, wearable sensors have increasingly been used in automated animal activity recognition. However, there are two major challenges in improving recognition performance—multi-modal feature fusion and imbalanced data modeling. In this study, to improve classification performance for equine activities while tackling these two challenges, we developed a cross-modality interaction network (CMI-Net) involving a dual convolution neural network architecture and a cross-modality interaction module (CMIM). The CMIM adaptively recalibrated the temporal- and axis-wise features in each modality by leveraging multi-modal information to achieve deep intermodality interaction. A class-balanced (CB) focal loss was adopted to supervise the training of CMI-Net to alleviate the class imbalance problem. Motion data was acquired from six neck-attached inertial measurement units from six horses. The CMI-Net was trained and verified with leave-one-out cross-validation. The results demonstrated that our CMI-Net outperformed the existing algorithms with high precision (79.74%), recall (79.57%), F1-score (79.02%), and accuracy (93.37%). The adoption of CB focal loss improved the performance of CMI-Net, with increases of 2.76%, 4.16%, and 3.92% in precision, recall, and F1-score, respectively. In conclusion, CMI-Net and CB focal loss effectively enhanced the equine activity classification performance using imbalanced multi-modal sensor data.
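The class-balanced (CB) focal loss used here follows the standard effective-number weighting; the sketch below assumes that common formulation rather than reproducing the paper's code.
```python
import torch
import torch.nn.functional as F

def cb_focal_loss(logits, targets, samples_per_class, beta=0.9999, gamma=2.0):
    """logits: (B, K); targets: (B,) int64 class indices; samples_per_class: list of K counts."""
    counts = torch.tensor(samples_per_class, dtype=torch.float32, device=logits.device)
    weights = (1.0 - beta) / (1.0 - beta ** counts)          # effective-number class weights
    weights = weights / weights.sum() * len(counts)          # normalize so weights sum to K
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    focal = (1.0 - log_pt.exp()) ** gamma * (-log_pt)        # focal modulation of cross-entropy
    return (weights[targets] * focal).mean()

# Example with six hypothetical activity classes and imbalanced counts
loss = cb_focal_loss(torch.randn(8, 6), torch.randint(0, 6, (8,)),
                     samples_per_class=[500, 300, 120, 60, 30, 10])
print(loss.item())
```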
Collapse
Affiliation(s)
- Axiu Mao: Department of Infectious Diseases and Public Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, China
- Endai Huang: Department of Computer Science, City University of Hong Kong, Hong Kong, China
- Haiming Gan: Department of Infectious Diseases and Public Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, China; College of Electronic Engineering, South China Agricultural University, Guangzhou 510642, China
- Rebecca S. V. Parkes: Department of Veterinary Clinical Sciences, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, China; Centre for Companion Animal Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, China
- Weitao Xu: Department of Computer Science, City University of Hong Kong, Hong Kong, China
- Kai Liu (correspondence): Department of Infectious Diseases and Public Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, China; Animal Health Research Centre, Chengdu Research Institute, City University of Hong Kong, Chengdu 610000, China
Collapse
|
38
|
Tu Z, Li Z, Li C, Lang Y, Tang J. Multi-Interactive Dual-Decoder for RGB-Thermal Salient Object Detection. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:5678-5691. [PMID: 34125680 DOI: 10.1109/tip.2021.3087412] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
RGB-thermal salient object detection (RGBT SOD) aims to segment the prominent regions common to a visible image and its corresponding thermal infrared image. Existing methods do not fully explore and exploit the complementarity of the two modalities or the multiple types of cues in image content, both of which play a vital role in achieving accurate results. In this paper, we propose a multi-interactive dual-decoder to mine and model these multi-type interactions for accurate RGBT SOD. Specifically, we first encode the two modalities into multi-level multi-modal feature representations. We then design a novel dual-decoder to model the interactions among multi-level features, the two modalities, and global contexts. With these interactions, our method performs well in diverse challenging scenarios, even when one modality is invalid. Finally, we carry out extensive experiments on public RGBT and RGBD SOD datasets, and the results show that the proposed method achieves outstanding performance against state-of-the-art algorithms. The source code has been released at: https://github.com/lz118/Multi-interactive-Dual-decoder.
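As a rough illustration of the kind of multi-level, cross-modal, global-context interaction a dual-decoder performs, the hypothetical block below refines an RGB decoder state and a thermal decoder state against each other and against a pooled global context at one decoding level. The structure, names, and fusion choices are assumptions for exposition, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualDecoderBlock(nn.Module):
    """Hypothetical sketch of one level of a dual-decoder: each modality's
    decoder state is refined by the other modality plus a global context."""
    def __init__(self, channels: int):
        super().__init__()
        self.rgb_conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.thermal_conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.context_proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb_state, thermal_state, global_context):
        # Broadcast the pooled global context to the current spatial size.
        ctx = self.context_proj(global_context)
        ctx = F.interpolate(ctx, size=rgb_state.shape[2:], mode="bilinear",
                            align_corners=False)
        # Cross-modal interaction: each stream sees the other stream plus context.
        rgb_out = self.rgb_conv(torch.cat([rgb_state, thermal_state], dim=1)) + ctx
        thermal_out = self.thermal_conv(torch.cat([thermal_state, rgb_state], dim=1)) + ctx
        return rgb_out, thermal_out
```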
Collapse
|
39
|
Fu K, Fan DP, Ji GP, Zhao Q, Shen J, Zhu C. Siamese Network for RGB-D Salient Object Detection and Beyond. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2021; PP:1-1. [PMID: 33861691 DOI: 10.1109/tpami.2021.3073689] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Existing RGB-D salient object detection (SOD) models usually treat RGB and depth as independent information and design separate networks for feature extraction from each. Such schemes can easily be constrained by a limited amount of training data or over-reliance on an elaborately designed training process. Inspired by the observation that the RGB and depth modalities actually present certain commonalities in distinguishing salient objects, a novel joint learning and densely cooperative fusion (JL-DCF) architecture is designed to learn from both RGB and depth inputs through a shared network backbone, known as a Siamese architecture. In this paper, we propose two effective components: joint learning (JL) and densely cooperative fusion (DCF). The JL module provides robust saliency feature learning by exploiting cross-modal commonality via a Siamese network, while the DCF module is introduced for complementary feature discovery. Comprehensive experiments using five popular metrics show that the designed framework yields a robust RGB-D saliency detector with good generalization. As a result, JL-DCF significantly advances the state of the art by an average of ~2.0% (F-measure) across seven challenging datasets. In addition, we show that JL-DCF is readily applicable to other related multi-modal detection tasks, including RGB-T SOD and video SOD, achieving comparable or better performance.
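The Siamese idea above, one shared backbone processing both RGB and depth, can be sketched as follows: the depth map is replicated to three channels and stacked with the RGB batch so a single forward pass yields both feature sets. The backbone choice and the toy fusion head are placeholders, not the authors' JL or DCF modules.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SiameseRGBDSketch(nn.Module):
    """Hypothetical sketch: a shared (Siamese) backbone for RGB and depth."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)  # no pretrained weights for this sketch
        # Drop the classification head; keep the convolutional trunk (2048 channels).
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])
        self.fuse = nn.Conv2d(2 * 2048, 1, kernel_size=1)  # toy saliency head

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Replicate single-channel depth to 3 channels so it fits the shared trunk.
        depth3 = depth.repeat(1, 3, 1, 1) if depth.shape[1] == 1 else depth
        batch = torch.cat([rgb, depth3], dim=0)      # stack along the batch axis
        feats = self.trunk(batch)                    # one forward pass, shared weights
        rgb_feat, depth_feat = feats.chunk(2, dim=0)
        return self.fuse(torch.cat([rgb_feat, depth_feat], dim=1))
```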
Collapse
|
40
|
Li G, Liu Z, Chen M, Bai Z, Lin W, Ling H. Hierarchical Alternate Interaction Network for RGB-D Salient Object Detection. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:3528-3542. [PMID: 33667161 DOI: 10.1109/tip.2021.3062689] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Existing RGB-D Salient Object Detection (SOD) methods take advantage of depth cues to improve detection accuracy while paying insufficient attention to the quality of depth information. In practice, a depth map is often of uneven quality and sometimes contains distractors, due to various factors in the acquisition procedure. In this article, to mitigate distractors in depth maps and highlight salient objects in RGB images, we propose a Hierarchical Alternate Interaction Network (HAINet) for RGB-D SOD. Specifically, HAINet consists of three key stages: feature encoding, cross-modal alternate interaction, and saliency reasoning. The main innovation in HAINet is the Hierarchical Alternate Interaction Module (HAIM), which plays a key role in the second stage for cross-modal feature interaction. HAIM first uses RGB features to filter distractors in the depth features, and the purified depth features are then exploited to enhance the RGB features in turn. This alternate RGB-depth-RGB interaction proceeds in a hierarchical manner, progressively integrating local and global contexts within a single feature scale. In addition, we adopt a hybrid loss function to facilitate the training of HAINet. Extensive experiments on seven datasets demonstrate that HAINet not only achieves competitive performance compared with 19 relevant state-of-the-art methods, but also reaches a real-time processing speed of 43 fps on a single NVIDIA Titan X GPU. The code and results of our method are available at https://github.com/MathLee/HAINet.
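The alternate RGB-depth-RGB interaction described above, where RGB features first suppress distractors in the depth features and the purified depth features then enhance the RGB features, might look roughly like the sketch below. The gating form and module name are assumptions for illustration, not the released HAINet code.

```python
import torch
import torch.nn as nn

class AlternateInteractionSketch(nn.Module):
    """Hypothetical sketch of RGB -> depth -> RGB alternate interaction."""
    def __init__(self, channels: int):
        super().__init__()
        self.rgb_gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.depth_gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, rgb_feat, depth_feat):
        # Step 1: use RGB features to suppress distractors in the depth features.
        purified_depth = depth_feat * self.rgb_gate(rgb_feat)
        # Step 2: use the purified depth features to enhance the RGB features.
        enhanced_rgb = rgb_feat + rgb_feat * self.depth_gate(purified_depth)
        return enhanced_rgb, purified_depth
```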
Collapse
|
41
|
Jin WD, Xu J, Han Q, Zhang Y, Cheng MM. CDNet: Complementary Depth Network for RGB-D Salient Object Detection. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:3376-3390. [PMID: 33646949 DOI: 10.1109/tip.2021.3060167] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Current RGB-D salient object detection (SOD) methods utilize the depth stream as complementary information to the RGB stream. However, the depth maps in existing RGB-D SOD datasets are usually of low quality, and most RGB-D SOD networks trained on these datasets produce error-prone results. In this paper, we propose a novel Complementary Depth Network (CDNet) to effectively exploit saliency-informative depth features for RGB-D SOD. To alleviate the influence of low-quality depth maps on RGB-D SOD, we propose to select saliency-informative depth maps as the training targets and leverage RGB features to estimate meaningful depth maps. Besides, to learn robust depth features for accurate prediction, we propose a new dynamic scheme that fuses the depth features extracted from the original and estimated depth maps with adaptive weights. Moreover, we design a two-stage cross-modal feature fusion scheme to integrate the depth features with the RGB ones, further improving the performance of CDNet on RGB-D SOD. Experiments on seven benchmark datasets demonstrate that CDNet outperforms state-of-the-art RGB-D SOD methods. The code is publicly available at https://github.com/blanclist/CDNet.
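The dynamic fusion step described above combines depth features from the original depth map and from the RGB-estimated depth map using adaptive weights. A minimal sketch of one way to do this follows; the pooled weight predictor and softmax normalization are guessed design choices, not the CDNet implementation.

```python
import torch
import torch.nn as nn

class AdaptiveDepthFusionSketch(nn.Module):
    """Hypothetical sketch: fuse features from the original and the estimated
    depth map with per-sample adaptive weights."""
    def __init__(self, channels: int):
        super().__init__()
        self.weight_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2, kernel_size=1),  # one logit per depth source
        )

    def forward(self, feat_orig: torch.Tensor, feat_est: torch.Tensor) -> torch.Tensor:
        logits = self.weight_head(torch.cat([feat_orig, feat_est], dim=1))
        w = torch.softmax(logits, dim=1)                 # (N, 2, 1, 1), sums to 1
        return w[:, :1] * feat_orig + w[:, 1:] * feat_est
```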
Collapse
|
42
|
Zhou T, Fan DP, Cheng MM, Shen J, Shao L. RGB-D salient object detection: A survey. COMPUTATIONAL VISUAL MEDIA 2021; 7:37-69. [PMID: 33432275 PMCID: PMC7788385 DOI: 10.1007/s41095-020-0199-z] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/31/2020] [Accepted: 10/07/2020] [Indexed: 06/12/2023]
Abstract
Salient object detection, which simulates human visual perception in locating the most significant object(s) in a scene, has been widely applied to various computer vision tasks. Now, the advent of depth sensors means that depth maps can easily be captured; this additional spatial information can boost the performance of salient object detection. Although various RGB-D based salient object detection models with promising performance have been proposed over the past several years, an in-depth understanding of these models and the challenges in this field remains lacking. In this paper, we provide a comprehensive survey of RGB-D based salient object detection models from various perspectives, and review related benchmark datasets in detail. Further, as light fields can also provide depth maps, we review salient object detection models and popular benchmark datasets from this domain too. Moreover, to investigate the ability of existing models to detect salient objects, we have carried out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models. Finally, we discuss several challenges and open directions of RGB-D based salient object detection for future research. All collected models, benchmark datasets, datasets constructed for attribute-based evaluation, and related code are publicly available at https://github.com/taozh2017/RGBD-SODsurvey.
Collapse
Affiliation(s)
- Tao Zhou: Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates
- Deng-Ping Fan: Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates
- Jianbing Shen: Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates
- Ling Shao: Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates
Collapse
|
43
|
Zhou W, Pan S, Lei J, Yu L. TMFNet: Three-Input Multilevel Fusion Network for Detecting Salient Objects in RGB-D Images. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE 2021. [DOI: 10.1109/tetci.2021.3097393] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
|