1
Song X, Tan Y, Li X, Hei X. GDVIFNet: A generated depth and visible image fusion network with edge feature guidance for salient object detection. Neural Netw 2025;188:107445. doi: 10.1016/j.neunet.2025.107445. PMID: 40209304.
Abstract
Despite significant recent advances in salient object detection (SOD), performance in complex interference environments remains suboptimal. To address these challenges, additional modalities such as depth (SOD-D) or thermal imaging (SOD-T) are often introduced; however, existing methods typically rely on specialized depth or thermal devices to capture these modalities, which can be costly and inconvenient. To remove this limitation while using only a single RGB image, we propose GDVIFNet, a novel approach that leverages Depth Anything to generate depth images. Because the generated depth images may contain noise and artifacts, we incorporate self-supervised techniques to extract edge features, and this edge-extraction process effectively suppresses that noise and those artifacts. Our method employs a dual-branch architecture that combines CNN and Transformer branches for feature extraction. We design a step trimodal interaction unit (STIU) to fuse RGB features with depth features in the CNN branch and a self-cross attention fusion (SCF) to integrate RGB features with depth features in the Transformer branch. Finally, guided by edge features from our self-supervised edge guidance module (SEGM), a CNN-Edge-Transformer step fusion (CETSF) fuses the features from both branches. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple datasets. Code is available at https://github.com/typist2001/GDVIFNet.
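As a rough illustration of the pipeline described above (a depth map generated from the RGB input, then fused with RGB features in a two-branch network), the sketch below uses a placeholder depth estimator and a toy fusion module; estimate_depth and ToyDualBranchFusion are illustrative stand-ins, not GDVIFNet's actual components.

import torch
import torch.nn as nn

def estimate_depth(rgb: torch.Tensor) -> torch.Tensor:
    # Placeholder for a monocular depth estimator such as Depth Anything.
    # Here: a fake single-channel "depth" map so the sketch runs end to end.
    return rgb.mean(dim=1, keepdim=True)

class ToyDualBranchFusion(nn.Module):
    # Minimal stand-in for the dual-branch RGB/generated-depth fusion idea.
    def __init__(self, feat: int = 16):
        super().__init__()
        self.rgb_branch = nn.Sequential(nn.Conv2d(3, feat, 3, padding=1), nn.ReLU())
        self.depth_branch = nn.Sequential(nn.Conv2d(1, feat, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(2 * feat, 1, 1)  # fused features -> saliency logits

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        depth = estimate_depth(rgb)                  # generated, not sensed
        f = torch.cat([self.rgb_branch(rgb), self.depth_branch(depth)], dim=1)
        return torch.sigmoid(self.head(f))           # saliency map in [0, 1]

rgb = torch.rand(1, 3, 224, 224)
print(ToyDualBranchFusion()(rgb).shape)  # torch.Size([1, 1, 224, 224])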
Affiliation(s)
- Xiaogang Song
- Xi'an University of Technology, School of Computer Science and Engineering, Xi'an, 710048, China; Engineering Research Center of Human-machine integration intelligent robot, Universities of Shaanxi Province, Xi'an, 710048, China.
- Yuping Tan
- Xi'an University of Technology, School of Computer Science and Engineering, Xi'an, 710048, China.
- Xiaochang Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China.
- Xinhong Hei
- Xi'an University of Technology, School of Computer Science and Engineering, Xi'an, 710048, China; Engineering Research Center of Human-machine integration intelligent robot, Universities of Shaanxi Province, Xi'an, 710048, China.
2
Wu Z, Wang W, Wang L, Li Y, Lv F, Xia Q, Chen C, Hao A, Li S. Pixel is All You Need: Adversarial Spatio-Temporal Ensemble Active Learning for Salient Object Detection. IEEE Trans Pattern Anal Mach Intell 2025;47:858-877. doi: 10.1109/TPAMI.2024.3476683. PMID: 39383082.
Abstract
Although weakly-supervised techniques can reduce the labeling effort, it is unclear whether a saliency model trained with weakly-supervised data (e.g., point annotations) can match the performance of its fully-supervised counterpart. This paper attempts to answer this unexplored question by proving a hypothesis: there exists a point-labeled dataset on which saliency models can achieve performance equivalent to training on a densely annotated dataset. To prove this conjecture, we propose a novel yet effective adversarial spatio-temporal ensemble active learning approach. Our contributions are four-fold: 1) Our adversarial attack for triggering uncertainty overcomes the overconfidence of existing active learning methods and accurately locates uncertain pixels. 2) Our spatio-temporal ensemble strategy not only achieves outstanding performance but also significantly reduces the model's computational cost. 3) Our relationship-aware diversity sampling mitigates oversampling while boosting model performance. 4) We provide a theoretical proof of the existence of such a point-labeled dataset. Experimental results show that our approach can find such a point-labeled dataset, on which a saliency model obtains 98%-99% of the performance of its fully-supervised version with only ten annotated points per image.
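The "adversarial attack triggering uncertainty" idea can be approximated with an FGSM-style sketch: pixels whose predictions change most under a small gradient-sign perturbation are treated as uncertain and prioritized for point annotation. This is a generic, minimal sketch of that principle, not the authors' exact attack; the toy model and eps value are assumptions.

import torch
import torch.nn as nn

def adversarial_uncertainty(model: nn.Module, image: torch.Tensor, eps: float = 2 / 255) -> torch.Tensor:
    # Per-pixel uncertainty = |prediction change| under an FGSM-style perturbation.
    image = image.clone().requires_grad_(True)
    pred = torch.sigmoid(model(image))
    pred.sum().backward()  # gradient of total saliency w.r.t. the input drives the attack
    with torch.no_grad():
        adv = (image + eps * image.grad.sign()).clamp(0, 1)
        pred_adv = torch.sigmoid(model(adv))
        return (pred_adv - pred).abs()  # large values -> uncertain pixels worth annotating

model = nn.Conv2d(3, 1, 3, padding=1)  # toy saliency model so the sketch is runnable
unc = adversarial_uncertainty(model, torch.rand(1, 3, 64, 64))
print(unc.shape, float(unc.max()))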
3
Chen C, Ma G, Song W, Li S, Hao A, Qin H. Saliency-Free and Aesthetic-Aware Panoramic Video Navigation. IEEE Trans Pattern Anal Mach Intell 2024;PP:2037-2054. doi: 10.1109/TPAMI.2024.3516874. PMID: 40030675.
Abstract
Most existing panoramic video navigation approaches are saliency-driven: off-the-shelf saliency detection tools are employed to help the navigation method localize video content that should be incorporated into the navigation path. In view of the dilemma faced by our research community, we rethink whether "saliency clues" are really appropriate for the panoramic video navigation task. Our in-depth investigation suggests that saliency clues alone cannot generate a satisfying navigation path: the path fails to represent the given panoramic video well, and the views along it have low aesthetic quality. In this paper, we present a brand-new navigation paradigm. Although our model is still trained on eye fixations, our methodology additionally enables the trained model to perceive how "meaningful" the given panoramic video content is. Outwardly, the proposed approach is saliency-free; inwardly, it builds on saliency while biasing toward being "meaningful-driven", and it can therefore generate a navigation path with more appropriate content coverage. In addition, this paper is the first attempt to devise an unsupervised learning scheme that ensures all localized meaningful views in the navigation path have high aesthetic quality, so the navigation path generated by our approach can also give users an enjoyable viewing experience. As this is a new topic in its infancy, we have devised a series of quantitative evaluation schemes, including objective verifications and subjective user studies. These attempts have great potential to inspire and promote this research field.
4
Wang G, Chen C, Hao A, Qin H, Fan DP. WinDB: HMD-free and Distortion-free Panoptic Video Fixation Learning. IEEE Trans Pattern Anal Mach Intell 2024;PP:1694-1713. doi: 10.1109/TPAMI.2024.3510793. PMID: 40030507.
Abstract
To date, the widely adopted way to collect fixations in panoptic video is based on a head-mounted display (HMD): users' fixations are collected while they wear an HMD and freely explore the given panoptic scene. However, this widely used data collection method is insufficient for training deep models to accurately predict which regions in a given panoptic scene are most important when it contains intermittent salient events. The main reason is that "blind zooms" always exist when using an HMD to collect fixations, since users cannot keep spinning their heads to explore the entire panoptic scene all the time. Consequently, the collected fixations tend to be trapped in some local views, leaving the remaining areas as blind zooms. Fixation data collected by HMD-based methods that accumulate local views therefore cannot accurately represent the overall global importance - the main purpose of fixations - of complex panoptic scenes. To address this, this paper introduces WinDB, an auxiliary window with dynamic blurring for fixation collection in panoptic video, which needs no HMD and reflects region-wise importance well. Using our WinDB approach, we have released a new PanopticVideo-300 dataset containing 300 panoptic clips covering over 225 categories. Because collecting fixations with WinDB is free of blind zooms, our new set exhibits frequent and intensive "fixation shifting" - a special phenomenon that has long been overlooked by previous research. We therefore present an effective fixation shifting network (FishNet) to handle it. The new fixation collection tool, dataset, and network have great potential to open a new era for fixation-related research and applications in 360° environments.
5
Tang Y, Li M. DMGNet: Depth mask guiding network for RGB-D salient object detection. Neural Netw 2024;180:106751. doi: 10.1016/j.neunet.2024.106751. PMID: 39332209.
Abstract
Although depth images can provide supplementary spatial structural cues for the salient object detection (SOD) task, inappropriate utilization of depth features may introduce noisy or misleading features that severely degrade SOD performance. To address this issue, we propose a depth mask guiding network (DMGNet) for RGB-D SOD. In this network, a depth mask guidance module (DMGM) is designed to pre-segment the salient objects from the depth images and then create masks from the pre-segmented objects to guide the RGB subnetwork to extract more discriminative features. Furthermore, a feature fusion pyramid module (FFPM) acquires more informative fused features using multi-branch convolutional channels with varying receptive fields, further enhancing the fusion of cross-modal features. Extensive experiments on nine benchmark datasets demonstrate the effectiveness of the proposed network.
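A minimal sketch of the depth-mask-guidance idea: a coarse mask is pre-segmented from the depth image and used to gate RGB features. The module below is a toy stand-in under assumed channel sizes, not DMGM/FFPM as implemented in DMGNet.

import torch
import torch.nn as nn

class ToyDepthMaskGuidance(nn.Module):
    # A coarse mask predicted from depth gates the RGB features via element-wise modulation.
    def __init__(self, ch: int = 16):
        super().__init__()
        self.depth_seg = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                       nn.Conv2d(ch, 1, 1))           # coarse depth-based mask
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        mask = torch.sigmoid(self.depth_seg(depth))       # pre-segmented salient region
        feat = self.rgb_enc(rgb)
        guided = feat * (1 + mask)                        # residual gating keeps background context
        return guided, mask

rgb, depth = torch.rand(1, 3, 128, 128), torch.rand(1, 1, 128, 128)
guided, mask = ToyDepthMaskGuidance()(rgb, depth)
print(guided.shape, mask.shape)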
Affiliation(s)
- Yinggan Tang
- School of Electrical Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China; Key Laboratory of Intelligent Rehabilitation and Neuromodulation of Hebei Province, Yanshan University, Qinhuangdao, Hebei 066004, China; Key Laboratory of Industrial Computer Control Engineering of Hebei Province, Yanshan University, Qinhuangdao, Hebei 066004, China.
- Mengyao Li
- School of Electrical Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China.
6
Qiao M, Xu M, Jiang L, Lei P, Wen S, Chen Y, Sigal L. HyperSOR: Context-Aware Graph Hypernetwork for Salient Object Ranking. IEEE Trans Pattern Anal Mach Intell 2024;46:5873-5889. doi: 10.1109/TPAMI.2024.3368158. PMID: 38381637.
Abstract
Salient object ranking (SOR) aims to segment salient objects in an image and simultaneously predict their saliency rankings, according to how human attention shifts across different objects. Existing SOR approaches mainly focus on object-based attention, e.g., the semantics and appearance of each object. However, we find that scene context plays a vital role in SOR: the saliency ranking of the same object varies considerably across scenes. In this paper, we therefore make the first attempt to explicitly learn scene context for SOR. Specifically, we establish a large-scale SOR dataset of 24,373 images with rich context annotations, i.e., scene graphs, segmentation, and saliency rankings. Inspired by the data analysis on our dataset, we propose a novel graph hypernetwork, named HyperSOR, for context-aware SOR. In HyperSOR, an initial graph module segments objects and constructs an initial graph by considering both geometry and semantic information. Then, a scene graph generation module with a multi-path graph attention mechanism learns semantic relationships among objects based on the initial graph. Finally, a saliency ranking prediction module dynamically adopts the learned scene context through a novel graph hypernetwork to infer the saliency rankings. Experimental results show that HyperSOR can significantly improve SOR performance.
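A graph hypernetwork in this sense generates the parameters of a prediction head from a context embedding (e.g., pooled scene-graph features). The sketch below shows that mechanism for a per-image linear ranking head; all names and dimensions are illustrative assumptions, not HyperSOR's architecture.

import torch
import torch.nn as nn

class ToyHyperHead(nn.Module):
    # A hypernetwork maps a scene-context vector to the weights of a ranking head.
    def __init__(self, ctx_dim: int = 32, obj_dim: int = 64):
        super().__init__()
        self.obj_dim = obj_dim
        # Generates weight (obj_dim) + bias (1) of a per-image linear scorer.
        self.hyper = nn.Linear(ctx_dim, obj_dim + 1)

    def forward(self, obj_feats: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # obj_feats: (num_objects, obj_dim) features of detected objects in one image
        # ctx:       (ctx_dim,) pooled scene-context embedding for that image
        params = self.hyper(ctx)
        w, b = params[: self.obj_dim], params[self.obj_dim]
        return obj_feats @ w + b  # one saliency-ranking score per object

scores = ToyHyperHead()(torch.rand(5, 64), torch.rand(32))
print(scores.shape)  # torch.Size([5]) -- rank objects by these scores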
7
Liu X, Wang L. MSRMNet: Multi-scale skip residual and multi-mixed features network for salient object detection. Neural Netw 2024;173:106144. doi: 10.1016/j.neunet.2024.106144. PMID: 38335792.
Abstract
Current models for salient object detection (SOD) have made remarkable progress through multi-scale feature fusion strategies. However, existing models still show large deviations when detecting objects at different scales, and object boundaries in the predicted maps remain blurred. In this paper, we propose a new model that addresses these issues using a transformer backbone to capture multiple feature layers. The model uses multi-scale skip residual connections during encoding to improve the accuracy of the predicted object positions and edge pixel information. Furthermore, to extract richer multi-scale semantic information, we perform multiple mixed-feature operations in the decoding stage. In addition, we add a weighted structural similarity index measure (SSIM) term to the loss function to improve the accuracy of the predicted boundaries. Experiments demonstrate that our algorithm achieves state-of-the-art results on five public datasets and improves the performance metrics of existing SOD tasks. Code and results are available at: https://github.com/xxwudi508/MSRMNet.
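A weighted SSIM term added to a pixel-wise loss can be sketched as below. For brevity the SSIM is computed globally over each map rather than with the usual sliding window, and the weight lam is an illustrative assumption.

import torch
import torch.nn.functional as F

def global_ssim(pred: torch.Tensor, target: torch.Tensor, c1=0.01**2, c2=0.03**2) -> torch.Tensor:
    # Whole-image SSIM (no sliding window), averaged over the batch.
    mu_p, mu_t = pred.mean(dim=(2, 3)), target.mean(dim=(2, 3))
    var_p, var_t = pred.var(dim=(2, 3)), target.var(dim=(2, 3))
    cov = ((pred - mu_p[..., None, None]) * (target - mu_t[..., None, None])).mean(dim=(2, 3))
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / ((mu_p**2 + mu_t**2 + c1) * (var_p + var_t + c2))
    return ssim.mean()

def saliency_loss(logits: torch.Tensor, target: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    pred = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return bce + lam * (1.0 - global_ssim(pred, target))  # SSIM term encourages sharper boundaries

logits, gt = torch.randn(2, 1, 64, 64), torch.randint(0, 2, (2, 1, 64, 64)).float()
print(float(saliency_loss(logits, gt)))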
Affiliation(s)
- Xinlong Liu
- Sun Yat-Sen University, Guangzhou 510275, China.
- Luping Wang
- Sun Yat-Sen University, Guangzhou 510275, China.
8
Peng Y, Zhai Z, Feng M. SLMSF-Net: A Semantic Localization and Multi-Scale Fusion Network for RGB-D Salient Object Detection. Sensors (Basel) 2024;24:1117. doi: 10.3390/s24041117. PMID: 38400274; PMCID: PMC10892948.
Abstract
Salient object detection (SOD) in RGB-D images plays a crucial role in computer vision, its central aim being to identify and segment the most visually striking objects within a scene. However, optimizing the fusion of multi-modal and multi-scale features to enhance detection performance remains a challenge. To address this issue, we propose a network model based on semantic localization and multi-scale fusion (SLMSF-Net), specifically designed for RGB-D SOD. First, we design a Deep Attention Module (DAM), which extracts valuable depth feature information from both channel and spatial perspectives and efficiently merges it with RGB features. Next, a Semantic Localization Module (SLM) enhances the top-level modality-fusion features, enabling precise localization of salient objects. Finally, a Multi-Scale Fusion Module (MSF) performs inverse decoding on the modality-fusion features, restoring the detailed information of the objects and generating high-precision saliency maps. Our approach is validated on six RGB-D salient object detection datasets. The experimental results show improvements of 0.20~1.80% in maxF, 0.09~1.46% in maxE, 0.19~1.05% in S, and 0.0002~0.0062 in MAE over the best competing methods (AFNet, DCMF, and C2DFNet).
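The channel-then-spatial re-weighting of depth features described for the DAM can be sketched as a CBAM-style block; channel counts, kernel sizes, and the residual fusion with RGB features are illustrative assumptions rather than SLMSF-Net's actual configuration.

import torch
import torch.nn as nn

class ToyDepthAttention(nn.Module):
    # CBAM-style channel + spatial attention applied to depth features, then fused with RGB.
    def __init__(self, ch: int = 32, reduction: int = 4):
        super().__init__()
        self.channel_mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(),
                                         nn.Linear(ch // reduction, ch))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, depth_feat: torch.Tensor, rgb_feat: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = depth_feat.shape
        # Channel attention from globally pooled depth statistics.
        ca = torch.sigmoid(self.channel_mlp(depth_feat.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x = depth_feat * ca
        # Spatial attention from channel-wise mean/max maps.
        sa = torch.sigmoid(self.spatial_conv(torch.cat([x.mean(1, keepdim=True),
                                                        x.max(1, keepdim=True).values], dim=1)))
        return rgb_feat + x * sa  # attended depth cues merged into RGB features

out = ToyDepthAttention()(torch.rand(1, 32, 56, 56), torch.rand(1, 32, 56, 56))
print(out.shape)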
Affiliation(s)
- Yanbin Peng
- School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
9
Wu YH, Liu Y, Zhan X, Cheng MM. P2T: Pyramid Pooling Transformer for Scene Understanding. IEEE Trans Pattern Anal Mach Intell 2023;45:12760-12771. doi: 10.1109/TPAMI.2022.3202765. PMID: 36040936.
Abstract
Recently, the vision transformer has achieved great success, pushing the state of the art on various vision tasks. One of the most challenging problems in vision transformers is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, in which the pooled feature extracted by a single pooling operation appears less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful context abstraction, yet it has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to multi-head self-attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Equipped with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared with previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T.
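The central P2T operation, shortening the key/value sequence by concatenating several adaptive-pooling outputs before multi-head self-attention, can be sketched as follows; the pool sizes, embedding dimension, and head count are illustrative assumptions.

import torch
import torch.nn as nn

class ToyPyramidPoolAttention(nn.Module):
    # MHSA where keys/values come from a pyramid of pooled feature maps (much shorter sequence).
    def __init__(self, dim: int = 64, heads: int = 4, pool_sizes=(1, 2, 4, 8)):
        super().__init__()
        self.pools = nn.ModuleList(nn.AdaptiveAvgPool2d(s) for s in pool_sizes)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)                         # (B, H*W, C) full-length queries
        kv = torch.cat([p(x).flatten(2).transpose(1, 2) for p in self.pools], dim=1)
        # kv length = 1 + 4 + 16 + 64 = 85 tokens, regardless of H*W.
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(b, c, h, w)

y = ToyPyramidPoolAttention()(torch.rand(2, 64, 32, 32))
print(y.shape)  # torch.Size([2, 64, 32, 32])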
10
Li H, Feng CM, Xu Y, Zhou T, Yao L, Chang X. Zero-Shot Camouflaged Object Detection. IEEE Trans Image Process 2023;32:5126-5137. doi: 10.1109/TIP.2023.3308295. PMID: 37643103.
Abstract
The goal of camouflaged object detection (COD) is to detect objects that are visually embedded in their surroundings. Existing COD methods focus only on detecting camouflaged objects from seen classes and suffer performance degradation on unseen classes. In real-world scenarios, however, collecting sufficient data for seen classes is extremely difficult, and labeling it requires high professional skill, making these COD methods inapplicable. In this paper, we propose a new zero-shot COD framework (termed ZSCOD), which can effectively detect unseen classes. Specifically, our framework includes a Dynamic Graph Searching Network (DGSNet) and a Camouflaged Visual Reasoning Generator (CVRG). In detail, DGSNet adaptively captures more edge details to boost COD performance, while CVRG produces pseudo-features that are close to the real features of the seen camouflaged objects, transferring knowledge from seen classes to unseen classes to help detect unseen objects. In addition, our graph reasoning is built on a dynamic searching strategy, which pays more attention to object boundaries to reduce the influence of the background. More importantly, we construct the first zero-shot COD benchmark based on the COD10K dataset. Experimental results on public datasets show that our ZSCOD not only detects camouflaged objects of unseen classes but also achieves state-of-the-art performance in detecting seen classes.
11
Zhang Y, Zhang J, Li W, Yin H, He L. Automatic Detection System for Velopharyngeal Insufficiency Based on Acoustic Signals from Nasal and Oral Channels. Diagnostics (Basel) 2023;13:2714. doi: 10.3390/diagnostics13162714. PMID: 37627973; PMCID: PMC10453249.
Abstract
Velopharyngeal insufficiency (VPI) is a pharyngeal dysfunction that causes speech impairment and swallowing disorders. Speech therapists play a key role in the diagnosis and treatment of speech disorders, but there is a worldwide shortage of experienced speech therapists, and artificial intelligence-based computer-aided diagnosis could be a solution. This paper proposes an automatic system for VPI detection at the subject level, a non-invasive and convenient approach to VPI diagnosis. Based on the impaired articulation of VPI patients, nasal- and oral-channel acoustic signals are collected as raw data, and the system integrates the symptom discrimination results at the phoneme level. For consonants, relative prominent frequency description and relative frequency distribution features are proposed to discriminate nasal air emission caused by VPI. For hypernasality-sensitive vowels, a cross-attention residual Siamese network (CARS-Net) performs automatic VPI/non-VPI classification at the phoneme level; CARS-Net embeds a cross-attention module between the two branches to improve the VPI/non-VPI classification model for vowels. We validate the proposed system on a self-built dataset, and the accuracy reaches 98.52%. This provides possibilities for implementing automatic VPI diagnosis.
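The cross-attention coupling between the two Siamese branches can be sketched with a standard multi-head attention layer in which each branch queries the other; the input shapes (e.g., framewise spectral features from two acoustic channels), dimensions, and pooling are assumptions, not CARS-Net's actual design.

import torch
import torch.nn as nn

class ToyCrossAttentionSiamese(nn.Module):
    # Two weight-shared encoders exchange information via cross-attention before classification.
    def __init__(self, dim: int = 64, heads: int = 4, classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(40, dim), nn.ReLU())  # shared (Siamese) encoder
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, classes)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a, x_b: (B, T, 40) feature sequences, e.g. from the nasal and oral channels
        f_a, f_b = self.encoder(x_a), self.encoder(x_b)
        a2b, _ = self.cross(f_a, f_b, f_b)          # branch A attends to branch B
        b2a, _ = self.cross(f_b, f_a, f_a)          # and vice versa, with shared weights
        pooled = torch.cat([a2b.mean(1), b2a.mean(1)], dim=-1)
        return self.head(pooled)                    # VPI / non-VPI logits

logits = ToyCrossAttentionSiamese()(torch.rand(2, 20, 40), torch.rand(2, 20, 40))
print(logits.shape)  # torch.Size([2, 2])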
Affiliation(s)
- Yu Zhang
- College of Biomedical Engineering, Sichuan University, Chengdu 610065, China; (Y.Z.); (J.Z.); (W.L.)
- Jing Zhang
- College of Biomedical Engineering, Sichuan University, Chengdu 610065, China; (Y.Z.); (J.Z.); (W.L.)
- Wen Li
- College of Biomedical Engineering, Sichuan University, Chengdu 610065, China; (Y.Z.); (J.Z.); (W.L.)
- Heng Yin
- West China Hospital of Stomatology, Sichuan University, Chengdu 610041, China;
- Ling He
- College of Biomedical Engineering, Sichuan University, Chengdu 610065, China; (Y.Z.); (J.Z.); (W.L.)
12
Jiao S, Goel V, Navasardyan S, Yang Z, Khachatryan L, Yang Y, Wei Y, Zhao Y, Shi H. Collaborative Content-Dependent Modeling: A Return to the Roots of Salient Object Detection. IEEE Trans Image Process 2023;32:4237-4246. doi: 10.1109/TIP.2023.3293759. PMID: 37440395.
Abstract
Salient object detection (SOD) aims to identify the most visually distinctive object(s) in a given image. Most recent progress focuses on either adding elaborate connections among different convolution blocks or introducing boundary-aware supervision to achieve better segmentation, which actually moves away from the essence of SOD, i.e., distinctiveness/salience. This paper goes back to the roots of SOD and investigates the principles of identifying distinctive object(s) in a more effective and efficient way. Intuitively, the salience of an object should largely depend on its global context within the input image. Based on this, we devise a clean yet effective architecture for SOD, named Collaborative Content-Dependent Networks (CCD-Net). In detail, we propose a collaborative content-dependent head whose parameters are conditioned on the input image's global context information. Within the content-dependent head, a hand-crafted multi-scale (HMS) module and a self-induced (SI) module are carefully designed to collaboratively generate content-aware convolution kernels for prediction. Benefiting from the content-dependent head, CCD-Net can leverage global context to detect distinctive object(s) while keeping a simple encoder-decoder design. Extensive experimental results demonstrate that CCD-Net achieves state-of-the-art results on various benchmarks. Our architecture is simple and intuitive compared with previous solutions, yielding competitive model complexity, operating efficiency, and segmentation accuracy.
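A content-dependent head, whose prediction kernels are generated from the image's global context, can be sketched with per-image dynamic 1x1 convolutions; the toy kernel generator below stands in for the HMS/SI modules and is not CCD-Net's actual design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyContentDependentHead(nn.Module):
    # Generates a per-image 1x1 prediction kernel from globally pooled context.
    def __init__(self, ch: int = 32):
        super().__init__()
        self.ch = ch
        self.kernel_gen = nn.Linear(ch, ch + 1)    # 1x1 kernel weights (ch) + bias (1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        ctx = feat.mean(dim=(2, 3))                # global context, one vector per image
        params = self.kernel_gen(ctx)
        preds = []
        for i in range(b):                         # per-sample kernels; loop kept for clarity
            w_i = params[i, :c].view(1, c, 1, 1)
            b_i = params[i, c:]
            preds.append(F.conv2d(feat[i:i + 1], w_i, b_i))
        return torch.sigmoid(torch.cat(preds, dim=0))  # (B, 1, H, W) saliency maps

print(ToyContentDependentHead()(torch.rand(2, 32, 64, 64)).shape)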
13
Ma M, Xia C, Xie C, Chen X, Li J. Boosting Broader Receptive Fields for Salient Object Detection. IEEE Trans Image Process 2023;32:1026-1038. doi: 10.1109/TIP.2022.3232209. PMID: 37018243.
Abstract
Salient object detection has boomed in recent years and achieved impressive performance on regular-scale targets. However, existing methods hit performance bottlenecks when processing objects with scale variation, especially extremely large- or small-scale objects with asymmetric segmentation requirements, because they are inefficient at obtaining sufficiently comprehensive receptive fields. With this issue in mind, this paper proposes BBRF, a framework for Boosting Broader Receptive Fields, which includes a Bilateral Extreme Stripping (BES) encoder, a Dynamic Complementary Attention Module (DCAM), and a Switch-Path Decoder (SPD) with a new boosting loss under the guidance of a Loop Compensation Strategy (LCS). Specifically, we rethink the characteristics of bilateral networks and construct a BES encoder that separates semantics and details in an extreme way so as to obtain broader receptive fields and the ability to perceive extremely large- or small-scale objects. The bilateral features generated by the BES encoder are then dynamically filtered by the newly proposed DCAM, which interactively provides spatial-wise and channel-wise dynamic attention weights for the semantic and detail branches of the BES encoder. Furthermore, we propose a Loop Compensation Strategy to boost the scale-specific features of multiple decision paths in the SPD. These decision paths form a feature loop chain that creates mutually compensating features under the supervision of the boosting loss. Experiments on five benchmark datasets demonstrate that BBRF copes well with scale variation and reduces the mean absolute error by over 20% compared with state-of-the-art methods.
14
Ullah I, Jian M, Shaheed K, Hussain S, Ma Y, Xu L, Muhammad K. AWANet: Attentive-Aware Wide-Kernels Asymmetrical Network with Blended Contour Information for Salient Object Detection. Sensors (Basel) 2022;22:9667. doi: 10.3390/s22249667. PMID: 36560036; PMCID: PMC9785171.
Abstract
Although deep learning-based techniques for salient object detection have improved considerably in recent years, the estimated saliency maps still exhibit imprecise predictions owing to the internal complexity and indefinite boundaries of salient objects of varying sizes. Existing methods emphasize the design of exemplary structures that integrate multi-level features by employing multi-scale features and attention modules to filter salient regions from cluttered scenarios. We propose a saliency detection network based on three novel contributions. First, we use a dense feature extraction unit (DFEU) that introduces large kernels of asymmetric and group-wise convolutions with channel reshuffling. The DFEU extracts semantically enriched features with large receptive fields while reducing the gridding problem and the parameter count for subsequent operations. Second, we propose a cross-feature integration unit (CFIU) that extracts semantically enriched features at high resolution using dense short connections and sub-samples the integrated information into different attentional branches based on the inputs received at each stage of the backbone. The embedded independent attentional branches can observe the importance of the sub-regions of a salient object. With the constraint-wise growth of the sub-attentional branches at various stages, the CFIU efficiently avoids global and local feature dilution by extracting semantically enriched features via dense short connections from high and low levels. Finally, a contour-aware saliency refinement unit (CSRU) blends contour and contextual features in a progressive densely connected fashion to help the model obtain more accurate saliency maps with precise boundaries in complex and perplexing scenarios. The proposed model is analyzed with ResNet-50 and VGG-16 backbones and outperforms most contemporary techniques with fewer parameters.
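The DFEU's large-kernel asymmetric grouped convolutions with channel reshuffling can be sketched as below; the kernel size, group count, and channel width are illustrative assumptions.

import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Interleave channels across groups (as in ShuffleNet) so grouped convs exchange information.
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class ToyAsymmetricBlock(nn.Module):
    # Large-kernel asymmetric (1xk then kx1) grouped convolutions followed by channel shuffle.
    def __init__(self, ch: int = 32, k: int = 7, groups: int = 4):
        super().__init__()
        self.groups = groups
        self.conv_1xk = nn.Conv2d(ch, ch, (1, k), padding=(0, k // 2), groups=groups)
        self.conv_kx1 = nn.Conv2d(ch, ch, (k, 1), padding=(k // 2, 0), groups=groups)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.conv_kx1(self.conv_1xk(x)))  # large receptive field, few parameters
        return channel_shuffle(x, self.groups)

print(ToyAsymmetricBlock()(torch.rand(1, 32, 64, 64)).shape)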
Affiliation(s)
- Inam Ullah
- School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China
- Muwei Jian
- School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan 250014, China
- Kashif Shaheed
- School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
- Sumaira Hussain
- School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China
- Yuling Ma
- School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China
- Lixian Xu
- School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China
- Khan Muhammad
- Visual Analytics for Knowledge Laboratory (VIS2KNOW Lab), Department of Applied Artificial Intelligence, School of Convergence, College of Computing and Informatics, Sungkyunkwan University, Seoul 03063, Republic of Korea
15
Complementary Segmentation of Primary Video Objects with Reversible Flows. Appl Sci (Basel) 2022. doi: 10.3390/app12157781.
Abstract
Segmenting primary objects in a video is an important yet challenging problem in intelligent video surveillance, as it exhibits various levels of foreground/background ambiguity. To reduce such ambiguity, we propose a novel formulation that exploits foreground and background context as well as their complementary constraint. Under this formulation, a unified objective function is defined to encode each cue. For implementation, we design a complementary segmentation network (CSNet) with two separate branches, which can simultaneously encode the foreground and background information along with joint spatial constraints. CSNet is trained end to end on massive images with manually annotated salient objects. By applying CSNet to each video frame, the spatial foreground and background maps can be initialized. To enforce temporal consistency effectively and efficiently, we divide each frame into superpixels and construct a neighborhood reversible flow that reflects the most reliable temporal correspondences between superpixels in far-apart frames. With such a flow, the initialized foregroundness and backgroundness can be propagated along the temporal dimension so that primary video objects gradually pop out and distractors are well suppressed. Extensive experimental results on three video datasets show that the proposed approach achieves impressive performance in comparison with 22 state-of-the-art models.
16
Abstract
Indoor object detection and tracking using millimeter-wave (mmWave) radar sensors have received much attention recently due to emerging applications in energy assignment, privacy, health, and safety. Increasing the system's valid field of view and its accuracy through multiple sensors is critical to achieving an efficient tracking system. This paper uses two mmWave radar sensors for accurate object detection and tracking, with two noise-reduction stages to suppress noise and distinguish cluster groups. The presented data fusion method effectively estimates the transformation for data alignment and synchronizes the results so that the objects’ information acquired by one radar can be visualized on the other. An efficient density-based clustering algorithm providing high clustering accuracy is presented, and an unscented Kalman filter tracking algorithm with data association tracks multiple objects simultaneously with good accuracy and timing. Furthermore, an indoor object tracking system is developed based on the proposed method. Finally, the proposed method is validated by comparison with our previous system and a commercial system. The experimental results demonstrate that the proposed method better handles the effect of occlusions when much of the data is weak, and detects and tracks each object more accurately.
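The density-based clustering stage on the fused radar point cloud can be illustrated with DBSCAN from scikit-learn on synthetic 2-D points; the eps/min_samples values are illustrative, and the unscented Kalman filter tracking stage is only indicated in a comment.

import numpy as np
from sklearn.cluster import DBSCAN

# Fused 2-D point cloud from the two radars (x, y in metres); synthetic example data.
rng = np.random.default_rng(0)
person_a = rng.normal(loc=[1.0, 2.0], scale=0.1, size=(40, 2))
person_b = rng.normal(loc=[3.0, 1.0], scale=0.1, size=(40, 2))
noise = rng.uniform(low=0.0, high=4.0, size=(10, 2))
points = np.vstack([person_a, person_b, noise])

# Density-based clustering separates object clusters from sparse noise points (label -1).
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(points)

for lab in sorted(set(labels)):
    members = points[labels == lab]
    tag = "noise" if lab == -1 else f"object {lab}"
    print(f"{tag}: {len(members)} points, centroid {members.mean(axis=0).round(2)}")
# Each non-noise centroid would then seed a track, e.g. one unscented Kalman filter per object.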