1. Yuan Y, Gao P, Dai Q, Qin J, Xiang W. Uncertainty-Guided Refinement for Fine-Grained Salient Object Detection. IEEE Transactions on Image Processing 2025; 34:2301-2314. [PMID: 40202876] [DOI: 10.1109/tip.2025.3557562]
Abstract
Recently, salient object detection (SOD) methods have achieved impressive performance. However, the salient regions predicted by existing methods usually contain unsaturated regions and shadows, which limits the model's ability to make reliable fine-grained predictions. To address this, we introduce an uncertainty-guided learning approach to SOD that enhances the model's perception of uncertain regions. Specifically, we design a novel Uncertainty Guided Refinement Attention Network (UGRAN), which incorporates three key components: the Multilevel Interaction Attention (MIA) module, the Scale Spatial-Consistent Attention (SSCA) module, and the Uncertainty Refinement Attention (URA) module. Unlike conventional methods dedicated to enhancing features, the proposed MIA facilitates the interaction and perception of multilevel features, leveraging their complementary characteristics. Through the proposed SSCA, salient information across diverse scales within the aggregated features is integrated more comprehensively. In the subsequent steps, we utilize the uncertainty map generated from the saliency prediction to enhance the model's perception of uncertain regions, producing a highly saturated, fine-grained saliency prediction map. Additionally, we devise an adaptive dynamic partition (ADP) mechanism to minimize the computational overhead of the URA module and improve the utilization of uncertainty guidance. Experiments on seven benchmark datasets demonstrate the superiority of the proposed UGRAN over state-of-the-art methods. Code will be released at https://github.com/I2-Multimedia-Lab/UGRAN.
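The abstract above hinges on an uncertainty map derived from the saliency prediction. A minimal sketch of that idea in PyTorch, assuming the common formulation in which pixels with scores near 0.5 are treated as uncertain (the paper's exact definition and the URA module itself may differ):

```python
import torch

def uncertainty_map(saliency: torch.Tensor) -> torch.Tensor:
    """Pixels with saliency near 0.5 are most uncertain; near 0 or 1 are confident.

    saliency: (B, 1, H, W) tensor of probabilities in [0, 1].
    Returns an uncertainty map of the same shape, in [0, 1].
    """
    return 1.0 - torch.abs(2.0 * saliency - 1.0)

def uncertainty_guided_refine(features: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
    """Re-weight decoder features so the network attends to uncertain regions."""
    u = uncertainty_map(saliency)                     # (B, 1, H, W)
    return features * (1.0 + u)                       # boost uncertain locations, keep the rest

if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)
    pred = torch.sigmoid(torch.randn(2, 1, 32, 32))   # stand-in saliency prediction
    refined = uncertainty_guided_refine(feats, pred)
    print(refined.shape)                              # torch.Size([2, 64, 32, 32])
```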
2. Liu F, Gao C, Chen F, Meng D, Zuo W, Gao X. Infrared Small and Dim Target Detection With Transformer Under Complex Backgrounds. IEEE Transactions on Image Processing 2023; 32:5921-5932. [PMID: 37883292] [DOI: 10.1109/tip.2023.3326396]
Abstract
Infrared small and dim (S&D) target detection is one of the key techniques in infrared search and tracking systems. Since local regions similar to infrared S&D targets are spread over the whole background, exploring the correlation among image features over long-range dependencies to mine the difference between target and background is crucial for robust detection. However, existing deep learning-based methods are limited by the locality of convolutional neural networks, which impairs their ability to capture long-range dependencies. Additionally, the small and dim appearance of infrared targets makes the detection model prone to missed detections. To this end, we propose a robust and general infrared S&D target detection method based on the transformer. We adopt the self-attention mechanism of the transformer to learn the correlation of image features over a larger range. Moreover, we design a feature enhancement module to learn discriminative features of S&D targets and avoid missed detections. To prevent the loss of target information, we adopt a decoder with U-Net-like skip connections to retain more information about S&D targets. Finally, we obtain the detection result with a segmentation head. Extensive experiments on two public datasets show the clear superiority of the proposed method over state-of-the-art methods, as well as its stronger generalization ability and better noise tolerance.
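The core mechanism described above is self-attention over all spatial positions, so that a dim target can be related to background regions anywhere in the image. A minimal single-head sketch in PyTorch; the module name, dimensions, and placement in the network are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class GlobalSelfAttention(nn.Module):
    """Single-head self-attention over all spatial positions of a feature map."""

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        self.scale = channels ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> sequence of H*W tokens
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)              # (B, H*W, C)
        attn = (self.q(tokens) @ self.k(tokens).transpose(1, 2)) * self.scale
        attn = attn.softmax(dim=-1)                        # every position attends to every other
        out = attn @ self.v(tokens)                        # (B, H*W, C)
        return out.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    feat = torch.randn(1, 32, 24, 24)                      # toy infrared feature map
    print(GlobalSelfAttention(32)(feat).shape)             # torch.Size([1, 32, 24, 24])
```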
3. Zhou W, Zhu Y, Lei J, Yang R, Yu L. LSNet: Lightweight Spatial Boosting Network for Detecting Salient Objects in RGB-Thermal Images. IEEE Transactions on Image Processing 2023; 32:1329-1340. [PMID: 37022901] [DOI: 10.1109/tip.2023.3242775]
Abstract
Most recent methods for RGB (red-green-blue)-thermal salient object detection (SOD) involve many floating-point operations and have numerous parameters, resulting in slow inference, especially on common processors, which impedes their deployment on mobile devices for practical applications. To address these problems, we propose a lightweight spatial boosting network (LSNet) for efficient RGB-thermal SOD, with a lightweight MobileNetV2 backbone replacing a conventional backbone (e.g., VGG, ResNet). To improve feature extraction with a lightweight backbone, we propose a boundary boosting algorithm that optimizes the predicted saliency maps and reduces information collapse in low-dimensional features. The algorithm generates boundary maps from the predicted saliency maps without incurring additional calculations or complexity. As multimodality processing is essential for high-performance SOD, we adopt attentive feature distillation and selection and propose semantic and geometric transfer learning to enhance the backbone without increasing complexity during testing. Experimental results demonstrate that the proposed LSNet achieves state-of-the-art performance compared with 14 RGB-thermal SOD methods on three datasets while reducing the number of floating-point operations (1.025G), the number of parameters (5.39M), and the model size (22.1 MB), and improving inference speed (9.95 fps for PyTorch, batch size of 1, and an Intel i5-7500 processor; 93.53 fps for PyTorch, batch size of 1, and an NVIDIA TITAN V graphics processor; 936.68 fps for PyTorch, batch size of 20, and the graphics processor; 538.01 fps for TensorRT and batch size of 1; and 903.01 fps for TensorRT/FP16 and batch size of 1). The code and results are available at https://github.com/zyrant/LSNet.
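The boundary boosting step above derives boundary maps from predicted saliency maps at essentially no extra cost. One parameter-free way to do this, shown as an assumption rather than the paper's exact operator, is a morphological gradient computed with max-pooling:

```python
import torch
import torch.nn.functional as F

def boundary_from_saliency(saliency: torch.Tensor, kernel: int = 3) -> torch.Tensor:
    """Parameter-free boundary map: dilation minus erosion via max-pooling.

    saliency: (B, 1, H, W) probabilities in [0, 1]. Returns values in [0, 1]
    that peak on the transition band around salient-object contours.
    """
    pad = kernel // 2
    dilated = F.max_pool2d(saliency, kernel, stride=1, padding=pad)
    eroded = -F.max_pool2d(-saliency, kernel, stride=1, padding=pad)
    return dilated - eroded

if __name__ == "__main__":
    pred = torch.zeros(1, 1, 16, 16)
    pred[:, :, 4:12, 4:12] = 1.0                      # a crisp square "object"
    edge = boundary_from_saliency(pred)
    print(edge.sum().item() > 0)                      # True: non-zero only near the contour
```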
4. Fan DP, Zhang J, Xu G, Cheng MM, Shao L. Salient Objects in Clutter. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:2344-2366. [PMID: 35404809] [DOI: 10.1109/tpami.2022.3166451]
Abstract
In this paper, we identify and address a serious design bias of existing salient object detection (SOD) datasets, which unrealistically assume that each image should contain at least one clear and uncluttered salient object. This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets. However, these models are still far from satisfactory when applied to real-world scenes. Based on our analyses, we propose a new high-quality dataset and update the previous saliency benchmark. Specifically, our dataset, called Salient Objects in Clutter (SOC), includes images with both salient and non-salient objects from several common object categories. In addition to object category annotations, each salient image is accompanied by attributes that reflect common challenges in real-world scenes, which can provide deeper insight into the SOD problem. Further, given a saliency encoder, e.g., the backbone network, existing saliency models are designed to learn a mapping from the training image set to the training ground-truth set. We therefore argue that improving the dataset can yield higher performance gains than focusing only on the decoder design. With this in mind, we investigate several dataset-enhancement strategies, including label smoothing to implicitly emphasize salient boundaries, random image augmentation to adapt saliency models to various scenarios, and self-supervised learning as a regularization strategy for learning from small datasets. Our extensive results demonstrate the effectiveness of these strategies. We also provide a comprehensive benchmark for SOD, which can be found in our repository: https://github.com/DengPingFan/SODBenchmark.
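One of the dataset-enhancement strategies above is label smoothing that implicitly emphasizes salient boundaries. A sketch of one plausible reading, assuming the binary ground-truth masks are softened around object contours (the benchmark's actual recipe may differ):

```python
import torch
import torch.nn.functional as F

def smooth_boundary_labels(mask: torch.Tensor, kernel: int = 5, eps: float = 0.1) -> torch.Tensor:
    """Soften a binary ground-truth mask around object boundaries.

    mask: (B, 1, H, W) with values in {0, 1}. Interior and background stay
    close to 1/0 while the transition band receives intermediate targets.
    """
    pad = kernel // 2
    blurred = F.avg_pool2d(mask, kernel, stride=1, padding=pad)   # local mixing => soft edges
    return (1.0 - eps) * blurred + eps * 0.5                      # mild global smoothing toward 0.5

if __name__ == "__main__":
    gt = torch.zeros(1, 1, 16, 16)
    gt[:, :, 4:12, 4:12] = 1.0
    soft = smooth_boundary_labels(gt)
    print(float(soft.min()), float(soft.max()))                   # targets stay strictly inside (0, 1)
```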
6. Yang A, Cheng S, Song S, Wang J, Ji Z, Pang Y, Cao J. Saliency detection network with two-stream encoder and interactive decoder. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.08.051]
7. Zhu W, Gao Z, Wang Y. A novel approach for audible acoustic quick response codes. Scientific Reports 2022; 12:6417. [PMID: 35440603] [PMCID: PMC9016685] [DOI: 10.1038/s41598-022-09858-7]
Abstract
Compared to image-based quick response (QR) codes, acoustic QR codes have some advantages. However, an acoustic QR scanner cannot recognize an acoustic QR code at a distance of more than two meters from the acoustic QR announcer. To this end, we propose a new type of acoustic QR code, called an audible acoustic QR code (AAQRC), which employs humanly audible sound to carry users' information directly. First, a user's string of characters is translated into a string of pitches. Then, the related algorithms convert the string of pitches into a playable audio file. As a result, an AAQRC is generated, consisting of the audio itself. AAQRC recognition is the reverse of AAQRC generation. Compared with the existing approach for acoustic QR codes, the new method can recognize acoustic QR codes at a longer distance, even if there are obstacles between the AAQRC announcer and the AAQRC scanner.
Affiliation(s)
- Weijun Zhu: School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China; School of Electronics Engineering and Computer Science, Peking University, Beijing, China; School of Network Engineering, Zhoukou Normal University, Zhoukou, China.
- Ziang Gao: School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China.
- Yiran Wang: School of Network Engineering, Zhoukou Normal University, Zhoukou, China.
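The abstract above translates a character string into a string of pitches and renders it as playable audio. A small illustrative pipeline along those lines; the character-to-pitch mapping, equal-tempered tuning, and WAV synthesis below are assumptions, not the authors' exact encoding:

```python
import math
import struct
import wave

SAMPLE_RATE = 44_100
NOTE_SECONDS = 0.25
BASE_FREQ = 220.0          # A3; each symbol maps to a semitone step above it (illustrative choice)

def char_to_freq(ch: str) -> float:
    """Map a character to an equal-tempered pitch (illustrative encoding)."""
    semitone = ord(ch) % 24                      # keep pitches within two octaves
    return BASE_FREQ * (2.0 ** (semitone / 12.0))

def encode_to_wav(text: str, path: str = "aaqrc_demo.wav") -> None:
    """Render one sine tone per character and write a mono 16-bit WAV file."""
    frames = bytearray()
    samples_per_note = int(SAMPLE_RATE * NOTE_SECONDS)
    for ch in text:
        f = char_to_freq(ch)
        for i in range(samples_per_note):
            sample = int(32767 * 0.5 * math.sin(2 * math.pi * f * i / SAMPLE_RATE))
            frames += struct.pack("<h", sample)
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)                        # 16-bit PCM
        w.setframerate(SAMPLE_RATE)
        w.writeframes(bytes(frames))

if __name__ == "__main__":
    encode_to_wav("HELLO-2025")                  # decoding would invert char_to_freq via pitch tracking
```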
8. Fang X, Zhu J, Shao X, Wang H. LC3Net: Ladder context correlation complementary network for salient object detection. Knowledge-Based Systems 2022. [DOI: 10.1016/j.knosys.2022.108372]
9. Cheng J, Hao F, Liu L, Tao D. Imposing Semantic Consistency of Local Descriptors for Few-Shot Learning. IEEE Transactions on Image Processing 2022; 31:1587-1600. [PMID: 35073265] [DOI: 10.1109/tip.2022.3143692]
Abstract
Few-shot learning suffers from the scarcity of labeled training data. Treating the local descriptors of an image as representations of that image can greatly augment the existing labeled training data. Existing local-descriptor-based few-shot learning methods have taken advantage of this fact but ignore that the semantics exhibited by local descriptors may not be relevant to the image-level semantics. In this paper, we address this issue from a new perspective: imposing semantic consistency on the local descriptors of an image. Our proposed method consists of three modules. The first is a local descriptor extractor module, which extracts a large number of local descriptors in a single forward pass. The second is a local descriptor compensator module, which compensates the local descriptors with the image-level representation in order to align the semantics of the local descriptors with the image-level semantics. The third is a local-descriptor-based contrastive loss function, which supervises the learning of the whole pipeline, with the aim of making the semantics carried by the local descriptors of an image relevant to and consistent with the image-level semantics. Theoretical analysis demonstrates the generalization ability of our proposed method. Comprehensive experiments conducted on benchmark datasets indicate that our proposed method achieves semantic consistency of local descriptors and state-of-the-art performance.
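A compact sketch of the two ideas described above: compensating local descriptors with the image-level representation and supervising them with a contrastive-style loss. The shapes, the additive compensation, and the temperature are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn.functional as F

def compensate(local_desc: torch.Tensor, image_repr: torch.Tensor) -> torch.Tensor:
    """Add the global image representation to every local descriptor.

    local_desc: (B, N, D) descriptors per image; image_repr: (B, D).
    """
    return local_desc + image_repr.unsqueeze(1)

def semantic_consistency_loss(local_desc: torch.Tensor, image_repr: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss: each descriptor should match its own image's representation
    rather than the representations of the other images in the batch."""
    b, n, d = local_desc.shape
    desc = F.normalize(local_desc.reshape(b * n, d), dim=-1)        # (B*N, D)
    glob = F.normalize(image_repr, dim=-1)                          # (B, D)
    logits = desc @ glob.t() / temperature                          # (B*N, B)
    targets = torch.arange(b).repeat_interleave(n)                  # descriptor i*N+j -> image i
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    local = torch.randn(4, 25, 64)          # 4 images, 25 local descriptors of dimension 64
    glob = torch.randn(4, 64)
    loss = semantic_consistency_loss(compensate(local, glob), glob)
    print(float(loss))
```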
10. Selecting Post-Processing Schemes for Accurate Detection of Small Objects in Low-Resolution Wide-Area Aerial Imagery. Remote Sensing 2022. [DOI: 10.3390/rs14020255]
Abstract
In low-resolution wide-area aerial imagery, object detection algorithms fall into feature extraction and machine learning approaches, where the former often requires a post-processing scheme to reduce false detections and the latter demands multi-stage learning followed by post-processing. In this paper, we present an approach for selecting post-processing schemes for aerial object detection. We evaluated combinations of each of ten vehicle detection algorithms with each of seven post-processing schemes, and determined the best three schemes for each algorithm using the average F-score metric. The performance improvement is quantified using basic information retrieval metrics as well as the classification of events, activities and relationships (CLEAR) metrics. We also implemented a two-stage learning algorithm using a hundred-layer densely connected convolutional neural network for small object detection and evaluated its degree of improvement when combined with the various post-processing schemes. The highest average F-scores after post-processing are 0.902, 0.704, and 0.891 for the Tucson, Phoenix, and online VEDAI datasets, respectively. The combined results show that our enhanced three-stage post-processing scheme achieves a mean average precision (mAP) of 63.9% for feature extraction methods and 82.8% for the machine learning approach.
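The schemes above are ranked by average F-score. A minimal version of that metric for detections matched to ground truth by IoU; the greedy matching and the 0.5 threshold are assumptions, not the paper's exact protocol:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def f_score(detections, ground_truth, iou_thr=0.5):
    """Greedy one-to-one matching, then F = 2PR / (P + R)."""
    unmatched_gt = list(ground_truth)
    tp = 0
    for det in detections:
        best = max(unmatched_gt, key=lambda g: iou(det, g), default=None)
        if best is not None and iou(det, best) >= iou_thr:
            tp += 1
            unmatched_gt.remove(best)
    precision = tp / max(len(detections), 1)
    recall = tp / max(len(ground_truth), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

if __name__ == "__main__":
    gt = [(10, 10, 20, 20), (40, 40, 52, 50)]
    dets = [(11, 11, 21, 21), (100, 100, 110, 110)]       # one hit, one false alarm
    print(round(f_score(dets, gt), 3))                    # 0.5
```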
11. Wang H, Jiao L, Liu F, Li L, Liu X, Ji D, Gan W. IPGN: Interactiveness Proposal Graph Network for Human-Object Interaction Detection. IEEE Transactions on Image Processing 2021; 30:6583-6593. [PMID: 34270424] [DOI: 10.1109/tip.2021.3096333]
Abstract
Human-Object Interaction (HOI) detection is an important task for understanding how humans interact with objects. Most existing works treat this task as an exhaustive triplet 〈 human, verb, object 〉 classification problem. In this paper, we decompose it and propose a novel two-stage graph model that learns the knowledge of interactiveness and interaction in one network, namely the Interactiveness Proposal Graph Network (IPGN). In the first stage, we design a fully connected graph for learning interactiveness, which distinguishes whether a pair of human and object is interactive or not. Concretely, it generates interactiveness features that encode high-level semantic interactiveness knowledge for each pair. Class-agnostic interactiveness is a more general and simpler objective, which can be used to provide reasonable proposals for the graph construction in the second stage. In the second stage, a sparsely connected graph is constructed with all interactive pairs selected by the first stage. Specifically, we use the interactiveness knowledge to guide the message passing. In contrast to feature similarity, it explicitly represents the connections between nodes. Benefiting from the valid graph reasoning, the node features are well encoded for interaction learning. Experiments show that the proposed method achieves state-of-the-art performance on both the V-COCO and HICO-DET datasets.
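A toy sketch of the second-stage idea above: message passing between human and object nodes gated by a per-pair interactiveness score. The scoring network, the gating form, and the dimensions are placeholders, not the IPGN architecture:

```python
import torch
import torch.nn as nn

class InteractivenessGuidedPassing(nn.Module):
    """Pass messages between human and object node features, weighted by a
    per-pair interactiveness score (toy stand-in for the first-stage graph)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.msg = nn.Linear(dim, dim)

    def forward(self, humans: torch.Tensor, objects: torch.Tensor):
        # humans: (H, D), objects: (O, D)
        h, o, d = humans.size(0), objects.size(0), humans.size(1)
        pairs = torch.cat([humans.unsqueeze(1).expand(h, o, d),
                           objects.unsqueeze(0).expand(h, o, d)], dim=-1)   # (H, O, 2D)
        gate = torch.sigmoid(self.score(pairs))                             # (H, O, 1) interactiveness
        humans_out = humans + (gate * self.msg(objects).unsqueeze(0)).sum(dim=1)
        objects_out = objects + (gate * self.msg(humans).unsqueeze(1)).sum(dim=0)
        return humans_out, objects_out, gate.squeeze(-1)

if __name__ == "__main__":
    hum, obj = torch.randn(2, 128), torch.randn(3, 128)
    h2, o2, g = InteractivenessGuidedPassing(128)(hum, obj)
    print(h2.shape, o2.shape, g.shape)      # (2, 128) (3, 128) (2, 3)
```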
12. Wu Z, Su L, Huang Q. Decomposition and Completion Network for Salient Object Detection. IEEE Transactions on Image Processing 2021; 30:6226-6239. [PMID: 34242166] [DOI: 10.1109/tip.2021.3093380]
Abstract
Recently, fully convolutional networks (FCNs) have made great progress in the task of salient object detection, and existing state-of-the-art methods mainly focus on how to integrate edge information into deep aggregation models. In this paper, we propose a novel Decomposition and Completion Network (DCN), which integrates edges and skeletons as complementary information and models the integrity of salient objects in two stages. In the decomposition network, we propose a cross multi-branch decoder, which iteratively takes advantage of cross-task aggregation and cross-layer aggregation to integrate multi-level, multi-task features and predict saliency, edge, and skeleton maps simultaneously. In the completion network, the edge and skeleton maps are further utilized to fill flaws and suppress noise in the saliency maps via hierarchical structure-aware feature learning and multi-scale feature completion. By jointly learning with edge and skeleton information to localize the boundaries and interiors of salient objects, respectively, the proposed network generates precise saliency maps with uniformly and completely segmented salient objects. Experiments conducted on five benchmark datasets demonstrate that the proposed model outperforms existing networks. Furthermore, we extend the proposed model to the task of RGB-D salient object detection, where it also achieves state-of-the-art performance. The code is available at https://github.com/wuzhe71/DCN.
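A minimal sketch of the completion idea above, fusing edge and skeleton maps with the saliency map to fill holes and suppress noise. The small convolutional head and residual-style fusion are assumptions, far simpler than the paper's completion network:

```python
import torch
import torch.nn as nn

class SaliencyCompletion(nn.Module):
    """Fuse saliency, edge, and skeleton maps into a refined saliency map.

    Edges localize object boundaries, skeletons localize interiors; a small
    convolutional head lets them correct flaws in the initial saliency map.
    """

    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, kernel_size=1),
        )

    def forward(self, saliency, edge, skeleton):
        # each input: (B, 1, H, W) probability map
        x = torch.cat([saliency, edge, skeleton], dim=1)
        return torch.sigmoid(self.fuse(x) + saliency)    # residual-style refinement (sketch)

if __name__ == "__main__":
    s, e, k = (torch.rand(1, 1, 64, 64) for _ in range(3))
    print(SaliencyCompletion()(s, e, k).shape)           # torch.Size([1, 1, 64, 64])
```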