1. Liu N, Nan K, Zhao W, Yao X, Han J. Learning Complementary Spatial-Temporal Transformer for Video Salient Object Detection. IEEE Transactions on Neural Networks and Learning Systems 2024;35:10663-10673. [PMID: 37027778] [DOI: 10.1109/tnnls.2023.3243246]
Abstract
Besides combining appearance and motion information, another crucial factor for video salient object detection (VSOD) is mining spatial-temporal (ST) knowledge, including complementary long-short temporal cues and global-local spatial context from neighboring frames. However, existing methods explore only part of these cues and ignore their complementarity. In this article, we propose a novel complementary ST transformer (CoSTFormer) for VSOD, which has a short-global branch and a long-local branch to aggregate complementary ST contexts. The former integrates the global context from the two neighboring frames using dense pairwise attention, while the latter fuses long-term temporal information from more consecutive frames with local attention windows. In this way, we decompose the ST context into a short-global part and a long-local part and leverage the powerful transformer to model the context relationship and learn their complementarity. To resolve the contradiction between local window attention and object motion, we propose a novel flow-guided window attention (FGWA) mechanism to align the attention windows with object and camera movements. Furthermore, we deploy CoSTFormer on fused appearance and motion features, thus enabling the effective combination of all three VSOD factors. In addition, we present a pseudo video generation method to synthesize sufficient video clips from static images for training ST saliency models. Extensive experiments verify the effectiveness of our method and show that we achieve new state-of-the-art results on several benchmark datasets.
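The flow-guided alignment idea above (warp a neighboring frame's features along optical flow before local window attention, so windows stay on the moving object) can be sketched roughly as below. This is an illustrative reconstruction under assumed shapes and function names, not the authors' implementation; for brevity, values share the key projection.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(feat, flow):
    """Backward-warp a feature map feat (B,C,H,W) by optical flow (B,2,H,W)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys)).float().to(feat.device)       # (2,H,W), x first
    coords = base.unsqueeze(0) + flow                          # absolute positions
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0                    # normalize to [-1,1]
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

def flow_guided_window_attention(q_feat, kv_feat, flow, win=8):
    """Align kv_feat with q_feat via flow, then attend within local windows.
    H and W are assumed divisible by win."""
    kv = warp_by_flow(kv_feat, flow)
    B, C, H, W = q_feat.shape
    def windows(x):  # (B,C,H,W) -> (B*num_windows, win*win, C)
        x = x.unfold(2, win, win).unfold(3, win, win)
        return x.permute(0, 2, 3, 4, 5, 1).reshape(-1, win * win, C)
    q, k = windows(q_feat), windows(kv)
    attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
    return attn @ k              # values share the key projection for brevity
```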
2. STI-Net: Spatiotemporal Integration Network for Video Saliency Detection. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.01.106]
3. Haller E, Florea AM, Leordeanu M. Iterative Knowledge Exchange Between Deep Learning and Space-Time Spectral Clustering for Unsupervised Segmentation in Videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022;44:7638-7656. [PMID: 34648435] [DOI: 10.1109/tpami.2021.3120228]
Abstract
We propose a dual system for unsupervised object segmentation in video that brings together two modules with complementary properties: a space-time graph that discovers objects in videos and a deep network that learns powerful object features. The system uses an iterative knowledge exchange policy. A novel spectral space-time clustering process on the graph produces unsupervised segmentation masks that are passed to the network as pseudo-labels. The net learns to segment in single frames what the graph discovers in video and passes back to the graph strong image-level features that improve its node-level features in the next iteration. Knowledge is exchanged for several cycles until convergence. The graph has one node per video pixel, yet object discovery remains fast: a novel power iteration algorithm computes the main space-time cluster as the principal eigenvector of a special Feature-Motion matrix, without ever forming the matrix explicitly. A thorough experimental analysis validates our theoretical claims and proves the effectiveness of the cyclical knowledge exchange. We also perform experiments in the supervised scenario, incorporating features pretrained with human supervision. We achieve state-of-the-art results in both the unsupervised and supervised scenarios on four challenging datasets: DAVIS, SegTrack, YouTube-Objects, and DAVSOD. We will make our code publicly available.
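The matrix-free power iteration mentioned above can be illustrated with a small sketch. The actual Feature-Motion matrix is paper-specific; here we assume, purely for illustration, M = F Fᵀ for an n x d node-feature matrix F, so each product M x is computed as F (Fᵀ x) without materializing the n x n matrix.

```python
import numpy as np

def principal_eigvec(feats, iters=100, tol=1e-7):
    """Leading eigenvector of M = feats @ feats.T via matrix-free power iteration."""
    n = feats.shape[0]
    x = np.full(n, 1.0 / np.sqrt(n))              # uniform start
    for _ in range(iters):
        y = feats @ (feats.T @ x)                 # M @ x without forming M
        y = np.maximum(y, 0.0)                    # keep a nonnegative soft mask
        y /= np.linalg.norm(y)
        if np.linalg.norm(y - x) < tol:
            break
        x = y
    return x                                      # per-node soft cluster scores

# Example: 10,000 "pixels" with 16-dim features
rng = np.random.default_rng(0)
scores = principal_eigvec(rng.standard_normal((10_000, 16)))
```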
4. Yue H, Guo J, Yin X, Zhang Y, Zheng S, Zhang Z, Li C. Salient object detection in low-light images via functional optimization-inspired feature polishing. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109938]
5. Huang K, Tian C, Su J, Lin JCW. Transformer-based Cross Reference Network for video salient object detection. Pattern Recognit Lett 2022. [DOI: 10.1016/j.patrec.2022.06.006]
6. Nicora E, Noceti N. On the Use of Efficient Projection Kernels for Motion-Based Visual Saliency Estimation. Frontiers in Computer Science 2022. [DOI: 10.3389/fcomp.2022.867289]
Abstract
In this paper, we investigate the potential of a family of efficient filters, the Gray-Code Kernels (GCKs), for visual saliency estimation with a focus on motion information. Our implementation relies on 3D kernels applied to overlapping blocks of frames and gathers meaningful spatio-temporal information with very light computation. We introduce an attention module based on pooling strategies, combined in an unsupervised way, to derive a saliency map highlighting the presence of motion in the scene; a coarse segmentation map can also be obtained. In the experimental analysis, we evaluate our method on publicly available datasets and show that it effectively and efficiently identifies the portion of the image where motion occurs, tolerating a variety of scene conditions and complexities.
7. Bi H, Zhu H, Yang L, Wu R. Multi-Scale Attention and Encoder-Decoder Network for Video Saliency Object Detection. Pattern Recognition and Image Analysis 2022. [DOI: 10.1134/s1054661822020031]
8. Spatiotemporal context-aware network for video salient object detection. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07330-1]
9. A novel spatiotemporal attention enhanced discriminative network for video salient object detection. Appl Intell 2022. [DOI: 10.1007/s10489-021-02649-z]
10. Wu R, Li S, Chen C, Hao A. Improving video anomaly detection performance by mining useful data from unseen video frames. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.05.112]
11. Jian M, Wang J, Yu H, Wang GG. Integrating object proposal with attention networks for video saliency detection. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.08.069]
14. Ji Y, Zhang H, Jie Z, Ma L, Jonathan Wu QM. CASNet: A Cross-Attention Siamese Network for Video Salient Object Detection. IEEE Transactions on Neural Networks and Learning Systems 2021;32:2676-2690. [PMID: 32692684] [DOI: 10.1109/tnnls.2020.3007534]
Abstract
Recent work on video salient object detection has demonstrated that directly transferring the generalization ability of image-based models to video data without modeling spatial-temporal information remains nontrivial and challenging. Considering both the intraframe accuracy and the interframe consistency of saliency detection, this article presents a novel cross-attention-based encoder-decoder model under the Siamese framework (CASNet) for video salient object detection. A baseline encoder-decoder model trained with the Lovász softmax loss is adopted as the backbone network to guarantee the accuracy of intraframe salient object detection. Self- and cross-attention modules are incorporated into the model to preserve the saliency correlation and improve intraframe saliency detection consistency. Extensive experimental results obtained by ablation analysis and cross-dataset validation demonstrate the effectiveness of the proposed method. Quantitative results indicate that the CASNet model outperforms 19 state-of-the-art image- and video-based methods on six benchmark datasets.
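For intuition, a minimal cross-attention block of the kind CASNet builds on might look like the sketch below; the projection scheme, dimensions, and residual update are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries from one frame attend to keys/values of its sibling frame."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, N, C) flattened features of two frames
        q = self.q(feat_a)                      # queries from frame A
        k, v = self.k(feat_b), self.v(feat_b)   # keys/values from frame B
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return feat_a + attn @ v                # residual cross-frame update
```

In a Siamese setting the same block can be applied symmetrically in both directions, which is one plausible way to propagate saliency correlation across frames.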
15. Towards accurate RGB-D saliency detection with complementary attention and adaptive integration. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.12.125]
16. Ma G, Li S, Chen C, Hao A, Qin H. Rethinking Image Salient Object Detection: Object-Level Semantic Saliency Reranking First, Pixelwise Saliency Refinement Later. IEEE Transactions on Image Processing 2021;30:4238-4252. [PMID: 33819154] [DOI: 10.1109/tip.2021.3068649]
Abstract
Human attention is an interactive activity between our visual system and our brain, using both low-level visual stimuli and high-level semantic information. Previous image salient object detection (SOD) studies make their saliency predictions via a multitask methodology in which pixelwise saliency regression and segmentation-like saliency refinement are performed simultaneously. However, this multitask methodology has one critical limitation: the semantic information embedded in the feature backbones may degenerate during training. Our visual attention is determined mainly by semantic information, as evidenced by our tendency to pay more attention to semantically salient regions even when they are not the most perceptually salient at first glance; this fact clearly contradicts the widely used multitask methodology mentioned above. To address this issue, this paper divides the SOD problem into two sequential steps. First, we devise a lightweight, weakly supervised deep network to coarsely locate the semantically salient regions. Then, as a postprocessing refinement, we selectively fuse multiple off-the-shelf deep models on these regions to formulate a pixelwise saliency map. Compared with the state-of-the-art (SOTA) models that learn pixelwise saliency in single images using only perceptual clues, our method investigates the object-level semantic ranks between multiple images, a methodology more consistent with the human attention mechanism. Our method is simple yet effective, and it is the first attempt to treat salient object detection mainly as an object-level semantic reranking problem.
17. Xu M, Fu P, Liu B, Li J. Multi-Stream Attention-Aware Graph Convolution Network for Video Salient Object Detection. IEEE Transactions on Image Processing 2021;30:4183-4197. [PMID: 33822725] [DOI: 10.1109/tip.2021.3070200]
Abstract
Recent advances in deep convolutional neural networks (CNNs) have boosted the development of video salient object detection (SOD), and many remarkable deep-CNN video SOD models have been proposed. However, many existing models still suffer from coarse boundaries of the salient object, which may be attributed to the loss of high-frequency information. Traditional graph-based video SOD models preserve object boundaries well by conducting superpixel/supervoxel segmentation in advance, but they are weaker than the latest deep-CNN models at highlighting the whole object, limited by heuristic graph clustering algorithms. To tackle this problem, we address the issue under the framework of graph convolution networks (GCNs), taking advantage of both graph models and deep neural networks. Specifically, a superpixel-level spatiotemporal graph is first constructed among multiple frame pairs by exploiting the motion cues implied in the frame pairs. The graph data are then imported into the devised multi-stream attention-aware GCN, where a novel Edge-Gated graph convolution (GC) operation is proposed to boost saliency information aggregation on the graph data. A novel attention module is designed to encode spatiotemporal semantic information via adaptive selection of graph nodes and fusion of the static-specific and motion-specific graph embeddings. Finally, a smoothness-aware regularization term is proposed to enhance the uniformity of the salient object, so that graph nodes (superpixels) inherently belonging to the same class are ideally clustered together in the learned embedding space. Extensive experiments have been conducted on three widely used datasets. Compared with fourteen state-of-the-art video SOD models, the proposed method retains salient object boundaries well and possesses a strong learning ability, showing that this work is a good practice for designing GCNs for video SOD.
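One plausible reading of an edge-gated graph convolution is sketched below: each edge gets a sigmoid gate computed from its endpoint features, which scales the message it carries. The gating form and dimensions are assumptions, not the paper's exact operation.

```python
import torch
import torch.nn as nn

class EdgeGatedGC(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)            # message transform
        self.gate = nn.Linear(2 * dim, 1)         # per-edge scalar gate

    def forward(self, x, edge_index):
        # x: (N, C) superpixel features; edge_index: (2, E) long tensor of src/dst
        src, dst = edge_index
        g = torch.sigmoid(self.gate(torch.cat([x[src], x[dst]], dim=-1)))  # (E, 1)
        m = g * self.msg(x[src])                          # gated messages
        agg = torch.zeros_like(x).index_add_(0, dst, m)   # sum messages per node
        return x + agg                                    # residual node update
```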
18. Puttagunta M, Ravi S. Medical image analysis based on deep learning approach. Multimedia Tools and Applications 2021;80:24365-24398. [PMID: 33841033] [PMCID: PMC8023554] [DOI: 10.1007/s11042-021-10707-4]
Abstract
Medical imaging plays a significant role in different clinical applications, such as procedures for early detection, monitoring, diagnosis, and treatment evaluation of various medical conditions. Basics of the principles and implementations of artificial neural networks and deep learning are essential for understanding medical image analysis in computer vision. The deep learning approach (DLA) to medical image analysis is emerging as a fast-growing research field and has been widely used in medical imaging to detect the presence or absence of disease. This paper presents the development of artificial neural networks and a comprehensive analysis of DLA as it delivers promising medical imaging applications. Most DLA implementations concentrate on X-ray, computerized tomography, mammography, and digital histopathology images. The paper provides a systematic review of articles on the classification, detection, and segmentation of medical images based on DLA, guiding researchers toward appropriate directions in DLA-based medical image analysis.
Affiliations
- Muralikrishna Puttagunta: Department of Computer Science, School of Engineering and Technology, Pondicherry University, Pondicherry, India
- S. Ravi: Department of Computer Science, School of Engineering and Technology, Pondicherry University, Pondicherry, India
19. Chen C, Wang G, Peng C, Fang Y, Zhang D, Qin H. Exploring Rich and Efficient Spatial Temporal Interactions for Real-Time Video Salient Object Detection. IEEE Transactions on Image Processing 2021;30:3995-4007. [PMID: 33784620] [DOI: 10.1109/tip.2021.3068644]
Abstract
We have witnessed growing interest in video salient object detection (VSOD) techniques in today's computer vision applications. In contrast with temporal information, which is still considered a rather unstable source, spatial information is more stable and ubiquitous, and thus it influences our vision system more. As a result, the current mainstream VSOD approaches infer saliency primarily from the spatial perspective, treating temporal information as subordinate. Although focusing on the spatial aspect is effective for achieving a numeric performance gain, it has two critical limitations. First, to ensure the dominance of the spatial information, its temporal counterpart remains inadequately used, even though in some complex video scenes the temporal information may be the only reliable data source for deriving the correct VSOD. Second, spatial and temporal saliency cues are often computed independently in advance and integrated later, while the interactions between them are omitted completely, resulting in saliency cues of limited quality. To combat these challenges, this paper advocates a novel spatiotemporal network whose key innovation is the design of its temporal unit. Compared with existing competitors (e.g., convLSTM), the proposed temporal unit has an extremely lightweight design that does not degrade its strong ability to sense temporal information. Furthermore, it fully enables the computation of temporal saliency cues that interact with their spatial counterparts, ultimately boosting the overall VSOD performance by letting each cue improve the other. The proposed method is easy to implement yet effective, achieving high-quality VSOD at 50 FPS in real-time applications.
20. Wang X, Li S, Chen C, Hao A, Qin H. Depth quality-aware selective saliency fusion for RGB-D image salient object detection. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.12.071]
21. Yan X, Chen Z, Wu QMJ, Lu M, Sun L. 3MNet: Multi-task, multi-level and multi-channel feature aggregation network for salient object detection. Machine Vision and Applications 2021;32:45. [PMID: 33623184] [PMCID: PMC7891124] [DOI: 10.1007/s00138-021-01172-y]
Abstract
Salient object detection is a hot topic in current computer vision, and the emergence of the convolutional neural network (CNN) has greatly improved existing detection methods. In this paper, we present 3MNet, a CNN-based network that makes the most of the various features of an image: it utilizes the contour detection task of the salient object to explicitly model multi-level, multi-task and multi-channel features and fuses them to obtain the final saliency map. Specifically, we first use the contour detection task for auxiliary detection and then employ a multi-layer network structure to extract multi-scale image information. Finally, we introduce a dedicated module into the network to model the channel information of the image. Our network produces good results on five widely used datasets. We also conduct a series of ablation experiments to verify the effectiveness of individual components of the network.
Affiliations
- Xinghe Yan: School of Control Science and Engineering, Shandong University, Jinan 250061, China
- Zhenxue Chen: School of Control Science and Engineering, Shandong University, Jinan 250061, China; Shenzhen Research Institute of Shandong University, Shenzhen 518057, China
- Q. M. Jonathan Wu: Department of Electrical and Computer Engineering, University of Windsor, Windsor N9B 3P4, Canada
- Mengxu Lu: School of Control Science and Engineering, Shandong University, Jinan 250061, China
- Luna Sun: School of Control Science and Engineering, Shandong University, Jinan 250061, China
22. CNN-based encoder-decoder networks for salient object detection: A comprehensive review and recent advances. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2020.09.003]
24. Chen C, Wei J, Peng C, Qin H. Depth-Quality-Aware Salient Object Detection. IEEE Transactions on Image Processing 2021;30:2350-2363. [PMID: 33481710] [DOI: 10.1109/tip.2021.3052069]
Abstract
Existing fusion-based RGB-D salient object detection methods usually adopt a bistream structure to strike a balance in the fusion trade-off between RGB and depth (D). However, depth quality varies across scenes, and the state-of-the-art bistream approaches are depth-quality-unaware, which makes it substantially harder to achieve a complementary fusion status between RGB and D and leads to poor fusion results for low-quality D. Thus, this paper integrates a novel depth-quality-aware subnet into the classic bistream structure to assess the depth quality before conducting selective RGB-D fusion. Compared with the SOTA bistream methods, the major advantage of our method is its ability to lessen the importance of low-quality, no-contribution, or even negative-contribution D regions during RGB-D fusion, achieving a much improved complementary status between RGB and D. Our source code and data are available online at https://github.com/qdu1995/DQSD.
25. Bi H, Yang L, Zhu H, Lu D, Jiang J. STEG-Net: Spatio-Temporal Edge Guidance Network for Video Salient Object Detection. IEEE Trans Cogn Dev Syst 2021. [DOI: 10.1109/tcds.2021.3078824]
26. Li C, Cong R, Kwong S, Hou J, Fu H, Zhu G, Zhang D, Huang Q. ASIF-Net: Attention Steered Interweave Fusion Network for RGB-D Salient Object Detection. IEEE Transactions on Cybernetics 2021;51:88-100. [PMID: 32078571] [DOI: 10.1109/tcyb.2020.2969255]
Abstract
Salient object detection from RGB-D images is an important yet challenging vision task that aims at detecting the most distinctive objects in a scene by combining color information and depth constraints. Unlike prior fusion schemes, we propose an attention steered interweave fusion network (ASIF-Net) that progressively integrates cross-modal and cross-level complementarity from the RGB image and the corresponding depth map under the steering of an attention mechanism. Specifically, complementary features from RGB-D images are jointly extracted and hierarchically fused in a dense and interweaved manner. This manner breaks down the inconsistency barriers in the cross-modal data and sufficiently captures the complementarity. Meanwhile, an attention mechanism locates the potential salient regions in an attention-weighted fashion, highlighting the salient objects and suppressing the cluttered background regions. Instead of focusing only on pixelwise saliency, we also ensure that the detected salient objects have objectness characteristics (e.g., complete structure and sharp boundary) by incorporating adversarial learning, which provides a global semantic constraint for RGB-D salient object detection. Quantitative and qualitative experiments demonstrate that the proposed method performs favorably against 17 state-of-the-art saliency detectors on four publicly available RGB-D salient object detection datasets. The code and results of our method are available at https://github.com/Li-Chongyi/ASIF-Net.
27. Ma G, Li S, Chen C, Hao A, Qin H. Stage-wise Salient Object Detection in 360° Omnidirectional Image via Object-level Semantical Saliency Ranking. IEEE Transactions on Visualization and Computer Graphics 2020;26:3535-3545. [PMID: 32941153] [DOI: 10.1109/tvcg.2020.3023636]
Abstract
2D image based salient object detection (SOD) has been extensively explored, while 360° omnidirectional image based SOD has received less research attention, and three major bottlenecks limit its performance. First, the currently available training data are insufficient for training a 360° SOD deep model. Second, the visual distortions in 360° omnidirectional images usually result in a large feature gap between 360° and 2D images; consequently, stage-wise training, a widely used remedy for training data shortage, becomes infeasible when conducting SOD in 360° omnidirectional images. Third, the existing 360° SOD approach follows a multi-task methodology that performs salient object localization and segmentation-like saliency refinement simultaneously, facing an extremely large problem domain and making the training data shortage dilemma even worse. To tackle all these issues, this paper divides 360° SOD into a multi-stage task, the key rationale of which is to decompose the original complex problem domain into sequential, easier subproblems that demand only small-scale training data. Meanwhile, we learn to rank the object-level semantic saliency, aiming to locate salient viewpoints and objects accurately. Specifically, to alleviate the training data shortage, we have released a novel dataset named 360-SSOD, containing 1,105 360° omnidirectional images with manually annotated object-level saliency ground truth, whose semantic distribution is more balanced than that of the existing dataset. We compare the proposed method with 13 SOTA methods, and all quantitative results demonstrate its superior performance.
28. Wang X, Li S, Chen C, Fang Y, Hao A, Qin H. Data-Level Recombination and Lightweight Fusion Scheme for RGB-D Salient Object Detection. IEEE Transactions on Image Processing 2020;30:458-471. [PMID: 33201813] [DOI: 10.1109/tip.2020.3037470]
Abstract
Existing RGB-D salient object detection methods treat depth information as an independent component complementing RGB and widely follow the bistream parallel network architecture. To selectively fuse the CNN features extracted from both RGB and depth into a final result, the state-of-the-art (SOTA) bistream networks usually consist of two independent subbranches: one for RGB saliency and the other for depth saliency. However, depth saliency is persistently inferior to RGB saliency because the RGB component is intrinsically more informative than the depth component. The bistream architecture easily biases the subsequent fusion procedure toward the RGB subbranch, leading to a performance bottleneck. In this paper, we propose a novel data-level recombination strategy that fuses RGB with D (depth) before deep feature extraction, cyclically converting the original 4-dimensional RGB-D into DGB, RDB and RGD. A newly designed lightweight triple-stream network is then applied over these reformulated data to reach an optimal channel-wise complementary fusion status between RGB and D, achieving a new SOTA performance.
29. Bi HB, Lu D, Zhu HH, Yang LN, Guan HP. STA-Net: spatial-temporal attention network for video salient object detection. Appl Intell 2020. [DOI: 10.1007/s10489-020-01961-4]
30. Zong M, Wang R, Chen Z, Wang M, Wang X, Potgieter J. Multi-cue based 3D residual network for action recognition. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-05313-8]
31. Motion Saliency Detection for Surveillance Systems Using Streaming Dynamic Mode Decomposition. Symmetry (Basel) 2020. [DOI: 10.3390/sym12091397]
Abstract
Intelligent surveillance systems enable secured visibility features in the smart city era. One of the major pre-processing models in intelligent surveillance systems is saliency detection, which supports multiple tasks such as object detection, object segmentation, video coding, image retargeting, image-quality assessment, and image compression. Traditional models focus on improving detection accuracy at the cost of high complexity, making them computationally expensive for real-world systems. To cope with this issue, we propose a fast motion saliency method for surveillance systems under various background conditions. Our method is derived from streaming dynamic mode decomposition (s-DMD), a powerful tool in data science. First, DMD computes a set of modes in a streaming manner to derive spatial-temporal features, and a raw saliency map is generated from the sparse reconstruction process. Second, the final saliency map is refined with a difference-of-Gaussians filter in the frequency domain. The effectiveness of the proposed method is validated on a standard benchmark dataset: the experimental results show that it achieves competitive accuracy with lower complexity than state-of-the-art methods, satisfying the requirements of real-time applications.
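The frequency-domain difference-of-Gaussians refinement mentioned above can be sketched as follows; this is an illustrative reconstruction, and the sigma values are assumptions.

```python
import numpy as np

def dog_refine(raw_map, sigma_lo=2.0, sigma_hi=10.0):
    """Band-pass a raw saliency map with a DoG filter applied via the FFT."""
    H, W = raw_map.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    f2 = fx ** 2 + fy ** 2
    gauss = lambda s: np.exp(-2.0 * (np.pi * s) ** 2 * f2)   # FT of a Gaussian
    dog = gauss(sigma_lo) - gauss(sigma_hi)                  # band-pass transfer
    out = np.real(np.fft.ifft2(np.fft.fft2(raw_map) * dog))
    out = np.clip(out, 0.0, None)
    return out / (out.max() + 1e-8)                          # normalize to [0, 1]
```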
32. Huang K, Li G, Liu S. Learning channel-wise spatio-temporal representations for video salient object detection. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.04.015]
33. Recent Advances in Saliency Estimation for Omnidirectional Images, Image Groups, and Video Sequences. Applied Sciences-Basel 2020. [DOI: 10.3390/app10155143]
Abstract
We present a review of methods for the automatic estimation of visual saliency: the perceptual property that makes specific elements in a scene stand out and grab the attention of the viewer. We focus on domains that are especially recent and relevant, as they make saliency estimation particularly useful and/or effective: omnidirectional images, image groups for co-saliency, and video sequences. For each domain, we select recent methods, highlight their commonalities and differences, and describe their unique approaches. We also report and analyze the datasets involved in the development of such methods, in order to reveal additional peculiarities of each domain, such as the representation used for the ground-truth saliency information (scanpaths, saliency maps, or salient object regions). We define domain-specific evaluation measures and provide quantitative comparisons on the basis of common datasets and evaluation criteria, highlighting the different impact of existing approaches on each domain. We conclude by synthesizing the emerging directions for research in the specialized literature, which include novel representations for omnidirectional images, inter- and intra-image saliency decomposition for co-saliency, and saliency shift for video saliency estimation.
34. Peng C, Chen Y, Kang Z, Chen C, Cheng Q. Robust principal component analysis: A factorization-based approach with linear complexity. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2019.09.074]
35. Tang Y, Zou W, Hua Y, Jin Z, Li X. Video salient object detection via spatiotemporal attention neural networks. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.09.064]
36. Chen C, Wei J, Peng C, Zhang W, Qin H. Improved Saliency Detection in RGB-D Images Using Two-phase Depth Estimation and Selective Deep Fusion. IEEE Transactions on Image Processing 2020;29:4296-4307. [PMID: 32012011] [DOI: 10.1109/tip.2020.2968250]
Abstract
To solve the saliency detection problem in RGB-D images, depth information plays a critical role in distinguishing salient objects or foregrounds from cluttered backgrounds. As the complementary component to color information, depth quality directly dictates the subsequent detection performance. However, due to artifacts and the limitations of depth acquisition devices, the quality of the obtained depth varies tremendously across different scenarios. Consequently, conventional selective fusion-based RGB-D saliency detection methods may degrade in cases containing salient objects with low color contrast coupled with low depth quality. To solve this problem, we make an initial attempt to estimate additional high-quality depth information, denoted Depth+. Serving as a complement to the original depth, Depth+ is fed into our newly designed selective fusion network to boost the detection performance. To achieve this aim, we first retrieve a small group of images similar to the given input and build inter-image, nonlocal correspondences accordingly. Using these correspondences, the overall depth can be coarsely estimated with our newly designed depth-transferring strategy. Next, we build fine-grained, object-level correspondences coupled with a saliency prior to further improve the depth quality of the previous estimate. Compared with the original depth, the newly estimated Depth+ is potentially more informative for detection improvement. Finally, we feed both the original depth and Depth+ into our selective deep fusion network, whose key novelty is achieving an optimal complementary balance to make better decisions toward improving saliency boundaries.
37. Ding X, Lin W, Chen Z, Zhang X. Point Cloud Saliency Detection by Local and Global Feature Fusion. IEEE Transactions on Image Processing 2019;28:5379-5393. [PMID: 31170071] [DOI: 10.1109/tip.2019.2918735]
Abstract
Inspired by the characteristics of the human visual system, a novel method is proposed for detecting visually salient regions on 3D point clouds. First, the local distinctness of each point is evaluated based on its difference with its local surroundings. Then, the point cloud is decomposed into small clusters and an initial global rarity value of each cluster is calculated; a random-walk ranking method then propagates cluster-level global rarity refinement to each point in all clusters. Finally, an optimization framework integrates both the local distinctness and the global rarity values to obtain the final saliency detection result for the point cloud. We compare the proposed method with several relevant algorithms and apply it to computer graphics applications such as interest point detection, viewpoint selection, and mesh simplification. The experimental results demonstrate the superior performance of the proposed method.
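The first stage described above, local distinctness, can be approximated as a point's mean feature difference from its spatial neighbors. The sketch below is illustrative; the neighborhood size and the per-point descriptors are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_distinctness(points, feats, k=16):
    # points: (N, 3) coordinates; feats: (N, d) per-point descriptors
    _, idx = cKDTree(points).query(points, k=k + 1)  # first hit is the point itself
    diffs = feats[idx[:, 1:]] - feats[:, None, :]    # (N, k, d) neighbor differences
    return np.linalg.norm(diffs, axis=-1).mean(axis=1)  # mean distance per point
```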
38. Fang Y, Zhang C, Huang H, Lei J. Visual Attention Prediction for Stereoscopic Video by Multi-Module Fully Convolutional Network. IEEE Transactions on Image Processing 2019;28:5253-5265. [PMID: 31107651] [DOI: 10.1109/tip.2019.2916766]
Abstract
Visual attention is an important mechanism in the human visual system (HVS), and numerous saliency detection algorithms have recently been designed for 2D images and video. However, research on fixation detection for stereoscopic video is still limited and challenging due to the complicated depth and motion information. In this paper, we design a novel multi-module fully convolutional network (MM-FCN) for fixation detection in stereoscopic video. Specifically, we design a fully convolutional network for spatial saliency prediction (S-FCN), where the initial spatial saliency map of stereoscopic video is learned from an object detection image database. Furthermore, the fully convolutional network for temporal saliency prediction (T-FCN) is constructed by combining saliency results from S-FCN with motion information from video frames. Finally, the fully convolutional network for depth fixation prediction (D-FCN) computes the final fixation map of stereoscopic video by learning depth features together with spatiotemporal features from T-FCN. The experimental results show that the proposed MM-FCN predicts fixation results for stereoscopic video more effectively and efficiently than other related fixation prediction methods.
39. Cong R, Lei J, Fu H, Porikli F, Huang Q, Hou C. Video Saliency Detection via Sparsity-Based Reconstruction and Propagation. IEEE Transactions on Image Processing 2019;28:4819-4831. [PMID: 31059438] [DOI: 10.1109/tip.2019.2910377]
Abstract
Video saliency detection aims to continuously discover motion-related salient objects from video sequences. Since it must consider spatial and temporal constraints jointly, video saliency detection is more challenging than image saliency detection. In this paper, we propose a new method to detect salient objects in video based on sparse reconstruction and propagation. With the assistance of novel static and motion priors, a single-frame saliency model is first designed to represent the spatial saliency of each individual frame via sparsity-based reconstruction. Then, through progressive sparsity-based propagation, the sequential correspondence in the temporal space is captured to produce an inter-frame saliency map. Finally, these two maps are incorporated into a global optimization model to achieve spatio-temporal smoothness and global consistency of the salient object across the whole video. Experiments on three large-scale video saliency datasets demonstrate that the proposed method outperforms state-of-the-art algorithms both qualitatively and quantitatively.
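The single-frame reconstruction idea above (regions that a background dictionary reconstructs poorly are likely salient) can be illustrated with a small sketch. For brevity it uses ridge-regularized least squares rather than a true sparse solver, and the background region indices are assumed given (e.g., border superpixels).

```python
import numpy as np

def reconstruction_saliency(feats, bg_idx, lam=0.1):
    # feats: (N, d) region features; bg_idx: indices of assumed-background regions
    D = feats[bg_idx].T                               # (d, K) background dictionary
    A = D.T @ D + lam * np.eye(D.shape[1])            # regularized normal equations
    codes = np.linalg.solve(A, D.T @ feats.T)         # (K, N) codes for all regions
    err = np.linalg.norm(feats.T - D @ codes, axis=0) # per-region reconstruction error
    return err / (err.max() + 1e-8)                   # normalized saliency scores
```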
40. Chen C, Wang G, Peng C, Zhang X, Qin H. Improved Robust Video Saliency Detection based on Long-term Spatial-temporal Information. IEEE Transactions on Image Processing 2019;29:1090-1100. [PMID: 31449017] [DOI: 10.1109/tip.2019.2934350]
Abstract
This paper proposes to utilize supervised deep convolutional neural networks to take full advantage of long-term spatial-temporal information in order to improve video saliency detection performance. Conventional methods, which rely solely on temporally neighboring frames, can easily encounter transient failure cases when the spatial-temporal saliency clues are untrustworthy for a long period. To tackle this limitation, we first identify those beyond-scope frames with trustworthy long-term saliency clues and then align them with the current problem domain for improved video saliency detection.
41. Fang Y, Ding G, Li J, Fang Z. Deep3DSaliency: Deep Stereoscopic Video Saliency Detection Model by 3D Convolutional Networks. IEEE Transactions on Image Processing 2018;28:2305-2318. [PMID: 30530363] [DOI: 10.1109/tip.2018.2885229]
Abstract
Stereoscopic saliency detection plays an important role in various stereoscopic video processing applications. However, conventional stereoscopic video saliency detection methods mainly use independent low-level features instead of extracting them automatically, and thus ignore the intrinsic relationship between spatial and temporal information. In this paper, we propose a novel stereoscopic video saliency detection method based on 3D convolutional neural networks, named Deep 3D Video Saliency (Deep3DSaliency). The proposed network consists of two sub-models: a Spatiotemporal Saliency Model (STSM) and a Stereoscopic Saliency Aware Model (SSAM). STSM directly takes three consecutive video frames as input to extract visual spatiotemporal features, while SSAM further infers depth and semantic features from the left and right video frames via parameters shared with STSM. The visual spatiotemporal features from STSM and the depth and semantic features from SSAM are learned by an alternating optimization scheme. Finally, all these saliency-related features are combined for the final stereoscopic saliency detection via 3D deconvolution. Experimental results show the superior performance of the proposed model over existing ones in saliency estimation for 3D video sequences.
43. Qiu W, Gao X, Han B. Eye Fixation Assisted Video Saliency Detection via Total Variation-based Pairwise Interaction. IEEE Transactions on Image Processing 2018;27:4724-4739. [PMID: 29993549] [DOI: 10.1109/tip.2018.2843680]
Abstract
As human visual attention is naturally biased toward foreground objects in a scene, it can be used to extract salient objects in video clips. In this work, we propose a weakly supervised video saliency detection algorithm utilizing eye fixation information from multiple subjects. Our main idea is to extend eye fixations to saliency regions step by step. First, visual seeds are collected using multiple color space geodesic distance based seed region mapping with filtered and extended eye fixations. This operation helps raw fixation points spread to the most likely salient regions, namely visual seed regions. Second, in order to seize the essential scene structure from video sequences, we introduce a total variation-based pairwise interaction model to learn the potential pairwise relationship between foreground and background within a frame or across video frames. In this vein, visual seed regions eventually grow into salient regions. Compared with previous approaches, the generated saliency maps have two outstanding properties, integrity and purity, which are conducive to segmenting the foreground and beneficial to follow-up tasks. Extensive quantitative and qualitative experiments on various video sequences demonstrate that the proposed method outperforms state-of-the-art image and video saliency detection algorithms.