1. Bruckert A, Christie M, Le Meur O. Where to look at the movies: Analyzing visual attention to understand movie editing. Behav Res Methods 2023; 55:2940-2959. [PMID: 36002630] [DOI: 10.3758/s13428-022-01949-7]
Abstract
In the process of making a movie, directors constantly care about where the spectator will look on the screen. Shot composition, framing, camera movements, and editing are tools commonly used to direct attention. In order to provide a quantitative analysis of the relationship between those tools and gaze patterns, we propose a new eye-tracking database, containing gaze-pattern information on movie sequences, as well as editing annotations, and we show how state-of-the-art computational saliency techniques behave on this dataset. In this work, we expose strong links between movie editing and spectators' gaze distributions, and open several leads on how knowledge of editing information could improve human visual attention modeling for cinematic content. The dataset generated and analyzed for this study is available at https://github.com/abruckert/eye_tracking_filmmaking.
2. Kumari S, Shobha Amala VY, Nivethithan M, Chakravarthy VS. BIAS-3D: Brain inspired attentional search model fashioned after what and where/how pathways for target search in 3D environment. Front Comput Neurosci 2022; 16:1012559. [DOI: 10.3389/fncom.2022.1012559]
Abstract
We propose a brain-inspired attentional search model for target search in a 3D environment, which has two separate channels: one for object classification, analogous to the “what” pathway in the human visual system, and the other for predicting the next location of the camera, analogous to the “where” pathway. To evaluate the proposed model, we generated 3D Cluttered Cube datasets that consist of an image on one vertical face and clutter or background images on the other faces. The camera goes around each cube on a circular orbit and determines the identity of the image pasted on the face. The images pasted on the cube faces were drawn from the MNIST handwritten digit, QuickDraw, and RGB MNIST handwritten digit datasets. The attentional input, three concentric cropped windows resembling the high-resolution central fovea and low-resolution periphery of the retina, flows through a Classifier Network and a Camera Motion Network. The Classifier Network classifies the current view into one of the target classes or the clutter. The Camera Motion Network predicts the camera's next position on the orbit (varying the azimuthal angle θ). Here the camera performs one of three actions: move right, move left, or do not move. The Camera-Position Network adds the camera's current position (θ) into the higher feature levels of the Classifier Network and the Camera Motion Network. The Camera Motion Network is trained using Q-learning, where the reward is 1 if the Classifier Network gives the correct classification and 0 otherwise. The total loss is computed by adding the mean squared temporal-difference loss and the cross-entropy loss, and the model is trained end-to-end by backpropagating the total loss with the Adam optimizer. Results on two grayscale image datasets and one RGB image dataset show that the proposed model successfully discovers the desired search pattern to find the target face on the cube, and also classifies the target face accurately.
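The combined objective described in this abstract (a temporal-difference term for the Camera Motion Network plus a cross-entropy term for the Classifier Network) can be illustrated with a minimal sketch. The tensor shapes, the three-action layout, and the function name total_loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def total_loss(q_pred, actions, q_target, class_logits, labels):
    """Sum of the TD regression loss and the view-classification loss.

    q_pred:       Q-values for the 3 camera actions (left, stay, right), shape (B, 3)
    actions:      indices of the actions actually taken, shape (B,)
    q_target:     bootstrapped TD targets r + gamma * max_a' Q(s', a'), shape (B,)
                  (reward is 1 for a correct classification, 0 otherwise)
    class_logits: classifier outputs over target classes + clutter, shape (B, n_classes)
    labels:       ground-truth class indices, shape (B,)
    """
    q_taken = q_pred.gather(1, actions.unsqueeze(1)).squeeze(1)  # Q of chosen action
    td_loss = F.mse_loss(q_taken, q_target)                      # temporal-difference term
    cls_loss = F.cross_entropy(class_logits, labels)             # classification term
    return td_loss + cls_loss

# Toy usage: the summed loss is backpropagated end-to-end (e.g., with Adam).
q_pred = torch.randn(4, 3, requires_grad=True)
actions = torch.randint(0, 3, (4,))
q_target = torch.rand(4)
logits = torch.randn(4, 11, requires_grad=True)
labels = torch.randint(0, 11, (4,))
loss = total_loss(q_pred, actions, q_target, logits, labels)
loss.backward()
```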
3. Biologically inspired image classifier based on saccadic eye movement design for convolutional neural networks. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.09.027]
4. Zhang Q, Shi Y, Zhang X, Zhang L. Residual attentive feature learning network for salient object detection. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.06.052]
5. Pandey S, Harit G. Handwritten Annotation Spotting in Printed Documents Using Top-Down Visual Saliency Models. ACM Trans Asian Low-Resour Lang Inf Process 2022. [DOI: 10.1145/3485468]
Abstract
In this article, we address the problem of localizing text and symbolic annotations on the scanned image of a printed document. Previous approaches have considered the task of annotation extraction as binary classification into printed and handwritten text. In this work, we further subcategorize the annotations as underlines, encirclements, inline text, and marginal text. We have collected a new dataset of 300 documents constituting all classes of annotations marked around or in between printed text. Using the dataset as a benchmark, we report the results of two saliency formulations, CRF Saliency and Discriminant Saliency, for predicting salient patches, which can correspond to different types of annotations. We also compare our work with recent semantic segmentation techniques using deep models. Our analysis shows that Discriminant Saliency can be considered the preferred approach for fast localization of patches containing different types of annotations. The saliency models were learned on a small dataset but still give performance comparable to the deep networks for pixel-level semantic segmentation. We show that saliency-based methods give better outcomes with limited annotated data compared to more sophisticated segmentation techniques that require a large training set to learn the model.
Affiliation(s)
- Shilpa Pandey: Adani Institute of Infrastructure Engineering, Ahmedabad, Gujarat, India
- Gaurav Harit: Indian Institute of Technology Jodhpur, Jodhpur, Rajasthan, India
6. A Saliency Prediction Model Based on Re-Parameterization and Channel Attention Mechanism. Electronics 2022. [DOI: 10.3390/electronics11081180]
Abstract
Deep saliency models can effectively imitate the attention mechanism of human vision, and they perform considerably better than classical models that rely on handcrafted features. However, deep models also require higher-level information, such as context or emotional content, to further approach human performance. Therefore, this study proposes a multilevel saliency prediction network that aims to use a combination of spatial and channel information to find possible high-level features, further improving the performance of a saliency model. First, we use a VGG-style network with an identity block as the primary network architecture; with the help of re-parameterization, we can obtain rich features similar to those of multiscale networks while effectively reducing computational cost. Second, a subnetwork with a channel attention mechanism is designed to find potential saliency regions and possible high-level semantic information in an image. Finally, image spatial features and a channel enhancement vector are combined after quantization to improve the overall performance of the model. Compared with classical models and other deep models, our model exhibits superior overall performance.
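The re-parameterization step mentioned above (folding a parallel identity shortcut into a single convolution at inference time, in the RepVGG spirit) can be sketched as follows. This is a simplified illustration assuming a stride-1, 3x3 convolution with equal input and output channels and no batch normalization; it is not the paper's exact architecture.

```python
import numpy as np

def merge_identity_into_conv(weight, bias):
    """Fold an identity shortcut (y = conv(x) + x) into one equivalent 3x3 convolution.

    weight: (C_out, C_in, 3, 3) kernel with C_out == C_in, stride 1, padding 1.
    bias:   (C_out,) bias vector.
    """
    c_out, c_in, kh, kw = weight.shape
    assert c_out == c_in and (kh, kw) == (3, 3)
    merged = weight.copy()
    # The identity branch is a 3x3 kernel whose centre tap is 1 on the matching channel.
    for c in range(c_out):
        merged[c, c, 1, 1] += 1.0
    return merged, bias.copy()

# Example: merge a random training-time kernel with its identity branch.
w = np.random.randn(8, 8, 3, 3).astype(np.float32)
b = np.zeros(8, dtype=np.float32)
w_merged, b_merged = merge_identity_into_conv(w, b)
```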
7. Ji W, Yan G, Li J, Piao Y, Yao S, Zhang M, Cheng L, Lu H. DMRA: Depth-Induced Multi-Scale Recurrent Attention Network for RGB-D Saliency Detection. IEEE Trans Image Process 2022; 31:2321-2336. [PMID: 35245195] [DOI: 10.1109/tip.2022.3154931]
Abstract
In this work, we propose a novel depth-induced multi-scale recurrent attention network for RGB-D saliency detection, named DMRA. It achieves strong performance, especially in complex scenarios. There are four main contributions of our network that are experimentally demonstrated to have significant practical merits. First, we design an effective depth refinement block using residual connections to fully extract and fuse cross-modal complementary cues from the RGB and depth streams. Second, depth cues with abundant spatial information are innovatively combined with multi-scale contextual features for accurately locating salient objects. Third, a novel recurrent attention module inspired by the Internal Generative Mechanism of the human brain is designed to generate more accurate saliency results via comprehensively learning the internal semantic relations of the fused features and progressively optimizing local details with memory-oriented scene understanding. Finally, a cascaded hierarchical feature fusion strategy is designed to promote efficient information interaction of multi-level contextual features and further improve the contextual representability of the model. In addition, we introduce a new real-life RGB-D saliency dataset containing a variety of complex scenarios, which has been widely used as a benchmark in recent RGB-D saliency detection research. Extensive experiments demonstrate that our method can accurately identify salient objects and achieves appealing performance against 18 state-of-the-art RGB-D saliency models on nine benchmark datasets.
8. Diab MS, Elhosseini MA, El-Sayed MS, Ali HA. Brain Strategy Algorithm for Multiple Object Tracking Based on Merging Semantic Attributes and Appearance Features. Sensors (Basel) 2021; 21(22):7604. [PMID: 34833680] [PMCID: PMC8625767] [DOI: 10.3390/s21227604]
Abstract
The human brain can effortlessly perform vision processes using the visual system, which helps solve multi-object tracking (MOT) problems. However, few algorithms simulate human strategies for solving MOT. Therefore, devising a method that simulates human activity in vision has become a good choice for improving MOT results, especially under occlusion. Eight brain strategies have been studied from a cognitive perspective and imitated to build a novel algorithm. Two of these strategies, rescue saccades and stimulus attributes, gave our algorithm novel and outstanding results. First, rescue saccades were imitated by detecting the occlusion state in each frame, representing the critical situation that the human brain saccades toward. Then, stimulus attributes were mimicked by using semantic attributes to re-identify the person in these occlusion states. Our algorithm performs favourably on the MOT17 dataset compared to state-of-the-art trackers. In addition, we created a new dataset of 40,000 images, 190,000 annotations, and 4 classes to train the detection model to detect occlusion and semantic attributes. The experimental results demonstrate that our new dataset achieves outstanding performance with the scaled YOLOv4 detection model, reaching 0.89 mAP@0.5.
Affiliation(s)
- Mai S. Diab (corresponding author): Faculty of Computer & Artificial Intelligence, Benha University, Benha 13511, Egypt; Intoolab Ltd., London WC2H 9JQ, UK
- Mostafa A. Elhosseini: Computers Engineering and Control System, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt; College of Computer Science and Engineering in Yanbu, Taibah University, Madinah 46421, Saudi Arabia
- Mohamed S. El-Sayed: Faculty of Computer & Artificial Intelligence, Benha University, Benha 13511, Egypt
- Hesham A. Ali: Computers Engineering and Control System, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt; Faculty of Artificial Intelligence, Delta University for Science and Technology, Mansoura 35511, Egypt
9. Song D, Dong Y, Li X. Hierarchical Edge Refinement Network for Saliency Detection. IEEE Trans Image Process 2021; 30:7567-7577. [PMID: 34464260] [DOI: 10.1109/tip.2021.3106798]
Abstract
At present, most saliency detection methods are based on fully convolutional neural networks (FCNs). However, FCNs usually blur the edges of salient objects, because their repeated convolution and pooling operations limit the spatial resolution of the feature maps. To alleviate this issue and obtain accurate edges, we propose a hierarchical edge refinement network (HERNet) for accurate saliency detection. In detail, HERNet is mainly composed of a saliency prediction network and an edge preserving network. First, the saliency prediction network is used to roughly detect the regions of salient objects and is based on a modified U-Net structure. Then, the edge preserving network is used to accurately detect the edges of salient objects; this network is mainly composed of the atrous spatial pyramid pooling (ASPP) module. Different from the previous indiscriminate supervision strategy, we adopt a new one-to-one hierarchical supervision strategy to supervise the different outputs of the entire network. Experimental results on five traditional benchmark datasets demonstrate that the proposed HERNet performs well when compared with the state-of-the-art methods.
10. Pal SK, Pramanik A, Maiti J, Mitra P. Deep learning in multi-object detection and tracking: state of the art. Appl Intell 2021. [DOI: 10.1007/s10489-021-02293-7]
11. Morales A, Costela FM, Woods RL. Saccade Landing Point Prediction Based on Fine-Grained Learning Method. IEEE Access 2021; 9:52474-52484. [PMID: 33981520] [PMCID: PMC8112574] [DOI: 10.1109/access.2021.3070511]
Abstract
The landing point of a saccade defines the new fixation region, the new region of interest. We asked whether it was possible to predict the saccade landing point early in this very fast eye movement. This work proposes a new algorithm based on LSTM networks and a fine-grained loss function for saccade landing point prediction in real-world scenarios. Predicting the landing point is a critical milestone toward reducing the problems caused by display-update latency in gaze-contingent systems that make real-time changes in the display based on eye tracking. Saccadic eye movements are among the fastest human neuro-motor activities, with angular velocities of up to 1,000°/s. We present a comprehensive analysis of the performance of our method using a database of almost 220,000 saccades from 75 participants captured during natural viewing of videos, and include a comparison with state-of-the-art saccade landing point prediction algorithms. Our proposed method outperformed existing approaches, reducing error by up to 50%. Finally, we analyzed some factors that affect prediction error, including saccade duration and length, age, and user-intrinsic characteristics.
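A bare-bones version of the regression set-up described above (an LSTM that consumes the first gaze samples of an in-flight saccade and regresses its landing point) might look like the sketch below. The layer sizes, the two-dimensional input features, and the plain MSE loss standing in for the fine-grained loss are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class LandingPointPredictor(nn.Module):
    """Predict the (x, y) landing point from the first few gaze samples of a saccade."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)

    def forward(self, gaze):            # gaze: (batch, time, 2) early (x, y) samples
        _, (h_n, _) = self.lstm(gaze)   # h_n: (1, batch, hidden) last hidden state
        return self.head(h_n[-1])       # (batch, 2) predicted landing point

# Toy usage: 16 saccades, 10 early samples each.
model = LandingPointPredictor()
pred = model(torch.randn(16, 10, 2))
loss = nn.functional.mse_loss(pred, torch.randn(16, 2))  # MSE stands in for the fine-grained loss
```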
Affiliation(s)
- Aythami Morales: BiDA-Lab, Department of Electrical Engineering, Universidad Autonoma de Madrid, 28049 Madrid, Spain; Schepens Eye Research Institute, Massachusetts Eye and Ear, Boston, MA 02114, USA
- Francisco M Costela: Schepens Eye Research Institute, Massachusetts Eye and Ear, Boston, MA 02114, USA; Department of Ophthalmology, Harvard Medical School, Boston, MA 02115, USA
- Russell L Woods: Schepens Eye Research Institute, Massachusetts Eye and Ear, Boston, MA 02114, USA; Department of Ophthalmology, Harvard Medical School, Boston, MA 02115, USA
12. Wang W, Shen J, Xie J, Cheng MM, Ling H, Borji A. Revisiting Video Saliency Prediction in the Deep Learning Era. IEEE Trans Pattern Anal Mach Intell 2021; 43:220-237. [PMID: 31247542] [DOI: 10.1109/tpami.2019.2924417]
Abstract
Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, relatively less effort has been spent on understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during dynamic scene free-viewing, which addresses a long-standing need in this field. DHF1K consists of 1K high-quality, elaborately selected video sequences annotated by 17 observers using an eye-tracker device. The videos span a wide range of scenes, motions, object types, and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), that augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, thus allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF Sports. An attribute-based analysis of previous saliency models and cross-dataset generalization are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms other contenders and has a fast processing speed (40 fps using a single GPU). Our code and all the results are available at https://github.com/wenguanwang/DHF1K.
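The attention mechanism described above (a static-saliency branch whose output re-weights CNN features before the recurrent stage) can be illustrated with a small sketch. The layer sizes, the residual-style modulation, and the class name SupervisedAttention are assumptions for illustration and do not reproduce ACLNet itself.

```python
import torch
import torch.nn as nn

class SupervisedAttention(nn.Module):
    """Predict a static attention map from frame features and use it to re-weight them."""
    def __init__(self, channels):
        super().__init__()
        self.att = nn.Conv2d(channels, 1, kernel_size=1)  # 1x1 conv -> single-channel map

    def forward(self, feat):                              # feat: (B, C, H, W) frame features
        att_map = torch.sigmoid(self.att(feat))           # supervised with static fixation maps
        modulated = feat * (1.0 + att_map)                 # residual re-weighting keeps original features
        return modulated, att_map

# The modulated features of successive frames would then feed a recurrent (ConvLSTM) stage
# that learns the temporal saliency representation.
feat = torch.randn(2, 64, 28, 28)
modulated, att = SupervisedAttention(64)(feat)
```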
13. Zhang Q, Cui W, Shi Y, Zhang X, Liu Y. Attentive feature integration network for detecting salient objects in images. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.05.083]
14. Recent Advances in Saliency Estimation for Omnidirectional Images, Image Groups, and Video Sequences. Appl Sci (Basel) 2020. [DOI: 10.3390/app10155143]
Abstract
We present a review of methods for automatic estimation of visual saliency: the perceptual property that makes specific elements in a scene stand out and grab the attention of the viewer. We focus on domains that are especially recent and relevant, as they make saliency estimation particularly useful and/or effective: omnidirectional images, image groups for co-saliency, and video sequences. For each domain, we perform a selection of recent methods, we highlight their commonalities and differences, and describe their unique approaches. We also report and analyze the datasets involved in the development of such methods, in order to reveal additional peculiarities of each domain, such as the representation used for the ground truth saliency information (scanpaths, saliency maps, or salient object regions). We define domain-specific evaluation measures, and provide quantitative comparisons on the basis of common datasets and evaluation criteria, highlighting the different impact of existing approaches on each domain. We conclude by synthesizing the emerging directions for research in the specialized literature, which include novel representations for omnidirectional images, inter- and intra-image saliency decomposition for co-saliency, and saliency shift for video saliency estimation.
15. Ghariba BM, Shehata MS, McGuire P. A novel fully convolutional network for visual saliency prediction. PeerJ Comput Sci 2020; 6:e280. [PMID: 33816931] [PMCID: PMC7924520] [DOI: 10.7717/peerj-cs.280]
Abstract
The human visual system (HVS) has the ability to direct visual attention, which is one of its many functions. Despite the many advancements being made in visual saliency prediction, there continues to be room for improvement. Deep learning has recently been used to deal with this task. This study proposes a novel deep learning model based on a Fully Convolutional Network (FCN) architecture. The proposed model is trained in an end-to-end style and designed to predict visual saliency; the entire model is trained from scratch to extract distinguishing features. The proposed model is evaluated using several benchmark datasets, such as MIT300, MIT1003, TORONTO, and DUT-OMRON. The quantitative and qualitative experimental analyses demonstrate that the proposed model achieves superior performance for predicting visual saliency.
Affiliation(s)
- Bashir Muftah Ghariba: Faculty of Engineering & Applied Science, Memorial University of Newfoundland, St. John’s, NL, Canada; Department of Electrical and Computer Engineering, Faculty of Engineering, Elmergib University, Khoms, Libya
- Mohamed S. Shehata: Department of Computer Science, Mathematics, Physics and Statistics, University of British Columbia, Kelowna, BC, Canada
16.
17. Jia F, Wang X, Guan J, Liao Q, Zhang J, Li H, Qi S. Bi-Connect Net for salient object detection. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.12.020]
18. Zhang Q, Huang N, Yao L, Zhang D, Shan C, Han J. RGB-T Salient Object Detection via Fusing Multi-level CNN Features. IEEE Trans Image Process 2019; 29:3321-3335. [PMID: 31869791] [DOI: 10.1109/tip.2019.2959253]
Abstract
RGB-induced salient object detection has recently witnessed substantial progress, which is attributed to the superior feature learning capability of deep convolutional neural networks (CNNs). However, such detection suffers in challenging scenarios characterized by cluttered backgrounds, low-light conditions, and variations in illumination. Instead of improving RGB-based saliency detection alone, this paper takes advantage of the complementary benefits of RGB and thermal infrared images. Specifically, we propose a novel end-to-end network for multi-modal salient object detection, which turns the challenge of RGB-T saliency detection into a CNN feature fusion problem. To this end, a backbone network (e.g., VGG-16) is first adopted to extract coarse features from each RGB or thermal infrared image individually, and then several adjacent-depth feature combination (ADFC) modules are designed to extract multi-level refined features for each single-modal input image, considering that features captured at different depths differ in semantic information and visual details. Subsequently, a multi-branch group fusion (MGF) module is employed to capture the cross-modal features by fusing the features from the ADFC modules for an RGB-T image pair at each level. Finally, a joint attention guided bi-directional message passing (JABMP) module undertakes the task of saliency prediction via integrating the multi-level fused features from the MGF modules. Experimental results on several public RGB-T salient object detection datasets demonstrate the superiority of our proposed algorithm over the state-of-the-art approaches, especially under challenging conditions such as poor illumination, complex background, and low contrast.
19. van der Merwe JR, Rügamer A, Felber W. Blind Spoofing GNSS Constellation Detection Using a Multi-Antenna Snapshot Receiver. Sensors (Basel) 2019; 19(24):5439. [PMID: 31835503] [PMCID: PMC6960917] [DOI: 10.3390/s19245439]
Abstract
Spoofing of global navigation satellite system (GNSS) signals threatens positioning systems. A counter-measure is to detect the presence of spoofed signals and warn the user. In this paper, a multi-antenna snapshot receiver is presented to detect the presence of a spoofing attack. The spatial similarities of the array steering vectors are analyzed, and different metrics are used to establish possible detector functions. These include subset methods, Eigen-decomposition, and clustering algorithms. The results generated under controlled spoofing conditions show that a spoofed constellation of GNSS satellites can be successfully detected. The derived system-level detectors increase performance in comparison to pair-wise methods. A controlled test setup achieved perfect detection; however, in real-world cases the performance would not be as ideal. Some detection metrics and features for blind spoofing detection with an array of antennas are identified, which opens the field for future advanced multi-detector developments.
20. Assessment of feature fusion strategies in visual attention mechanism for saliency detection. Pattern Recognit Lett 2019. [DOI: 10.1016/j.patrec.2018.08.022]
21. Zhao ZQ, Zheng P, Xu ST, Wu X. Object Detection With Deep Learning: A Review. IEEE Trans Neural Netw Learn Syst 2019; 30:3212-3232. [PMID: 30703038] [DOI: 10.1109/tnnls.2018.2876865]
Abstract
Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures; their performance easily stagnates, even when complex ensembles are constructed that combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development of deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, have been introduced to address the problems of traditional architectures. These models differ in network architecture, training strategy, and optimization function. In this paper, we provide a review of deep learning-based object detection frameworks. Our review begins with a brief introduction to the history of deep learning and its representative tool, the convolutional neural network. Then, we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection, and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network-based learning systems.
22. Piao Y, Li X, Zhang M, Yu J, Lu H. Saliency Detection via Depth-induced Cellular Automata on Light Field. IEEE Trans Image Process 2019; 29:1879-1889. [PMID: 31613755] [DOI: 10.1109/tip.2019.2942434]
Abstract
Incorrect saliency detections, such as false alarms and missed detections, may lead to potentially severe consequences in various application areas. Effective separation of salient objects in complex scenes is a major challenge in saliency detection. In this paper, we propose a new method for saliency detection on light fields to improve performance in challenging scenes. Using the abundant cues available in a light field, we construct an object-guided depth map, which acts as an inducer to efficiently incorporate the relations among those cues. Furthermore, we enforce spatial consistency by constructing an optimization model, named Depth-induced Cellular Automata (DCA), in which the saliency value of each superpixel is updated by exploiting the intrinsic relevance of its similar regions. Additionally, the proposed DCA model enables inaccurate saliency maps to achieve a high level of accuracy. We analyze our approach on one publicly available dataset. Experiments show the proposed method is robust to a wide range of challenging scenes and outperforms the state-of-the-art 2D/3D/4D (light-field) saliency detection approaches.
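The spatial-consistency update described above can be illustrated with a generic single-layer cellular-automaton step over superpixels, in which each saliency value is pulled toward the values of similar regions. The affinity construction, the descriptors, and the balance weights below are placeholders for illustration, not the exact Depth-induced Cellular Automata formulation.

```python
import numpy as np

def cellular_automaton_step(saliency, features, sigma=0.1, coherence=0.6):
    """One synchronous update: each superpixel mixes its saliency with that of similar regions.

    saliency: (N,) current saliency of N superpixels.
    features: (N, D) descriptors (e.g., colour and depth statistics) used to measure similarity.
    """
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    affinity = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)                                     # no self-influence
    affinity /= affinity.sum(axis=1, keepdims=True)                     # row-normalise
    return coherence * saliency + (1 - coherence) * affinity @ saliency

# Toy usage: iterate a few steps to propagate saliency among similar superpixels.
s = np.random.rand(50)
f = np.random.rand(50, 4)
for _ in range(5):
    s = cellular_automaton_step(s, f)
```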
23. Liu Y, Han J, Zhang Q, Shan C. Deep Salient Object Detection with Contextual Information Guidance. IEEE Trans Image Process 2019; 29:360-374. [PMID: 31380760] [DOI: 10.1109/tip.2019.2930906]
Abstract
Integration of multi-level contextual information, such as feature maps and side outputs, is crucial for convolutional neural network (CNN)-based salient object detection. However, most existing methods either simply concatenate multi-level feature maps or calculate the element-wise addition of multi-level side outputs, thus failing to take full advantage of them. In this work, we propose a new strategy for guiding multi-level contextual information integration, where feature maps and side outputs across layers are fully engaged. Specifically, shallower-level feature maps are guided by the deeper-level side outputs to learn more accurate properties of the salient object. In turn, the deeper-level side outputs can be propagated to high-resolution versions with spatial details complemented by means of shallower-level feature maps. Moreover, a group convolution module is proposed with the aim of achieving highly discriminative feature maps, in which the backbone feature maps are divided into a number of groups and the convolution is applied to the channels of the backbone feature maps within each group. Eventually, the group convolution module is incorporated in the guidance module to further promote the guidance role. Experiments on three public benchmark datasets verify the effectiveness and superiority of the proposed method over the state-of-the-art methods.
24. Zhang D, Zakir A. Top-Down Saliency Detection Based on Deep-Learned Features. Int J Comput Intell Appl 2019. [DOI: 10.1142/s1469026819500093]
Abstract
How to localize objects in images accurately and efficiently is a challenging problem in computer vision. In this paper, a novel top-down fine-grained salient object detection method based on deep-learned features is proposed, which can detect in the input image the same object as in the query image. The query image and its three subsampled images are used as top-down cues to guide saliency detection. We build on a convolutional neural network (CNN), the fast VGG network (VGG-f), pre-trained on ImageNet and re-trained on the Pascal VOC 2012 dataset. Experiments on the FiFA dataset demonstrate that the proposed method can localize the saliency region and find the specific object (e.g., a human face) given as the query. Experiments on the David1 and Face1 sequences conclusively prove that the proposed algorithm is able to effectively deal with many challenging factors, including illumination change, shape deformation, scale change, and partial occlusion.
Affiliation(s)
- Duzhen Zhang: School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 221116, Jiangsu, P. R. China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, Jiangsu, P. R. China
- Ali Zakir: School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, Jiangsu, P. R. China
25. Xia C, Han J, Qi F, Shi G. Predicting Human Saccadic Scanpaths Based on Iterative Representation Learning. IEEE Trans Image Process 2019; 28:3502-3515. [PMID: 30735998] [DOI: 10.1109/tip.2019.2897966]
Abstract
Visual attention is a dynamic process of scene exploration and information acquisition. However, existing research on attention modeling has concentrated on estimating static salient locations, while the dynamic attributes presented by saccades have not been well explored in previous attention models. In this paper, we address the problem of saccadic scanpath prediction by introducing an iterative representation learning framework. Within the framework, a saccade can be interpreted as an iterative process of predicting one fixation according to the current representation and updating the representation based on the gaze shift. In the predicting phase, we propose a Bayesian definition of saccades to combine the influence of perceptual residual and spatial location on the selection of fixations. In implementation, we compute the representation error of an autoencoder-based network to measure the perceptual residual of each area. Simultaneously, we integrate saccade amplitude and a center-weighted mechanism to model the influence of spatial location. Based on these two estimates, the final fixation is defined as the point with the largest posterior probability of gaze shift. In the updating phase, we update the representation pattern for the subsequent calculation by retraining the network with samples extracted around the current fixation. In the experiments, the proposed model replicates fundamental properties of visual search psychophysics and achieves superior performance on several benchmark eye-tracking datasets.
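The fixation-selection rule described above (combine a perceptual-residual term with a spatial prior built from saccade amplitude and a centre bias, then take the location with the largest posterior) can be sketched as follows. The Gaussian forms of the priors, the bandwidths, and the use of a raw reconstruction-error map are illustrative assumptions.

```python
import numpy as np

def next_fixation(residual_map, current_xy, amp_sigma=80.0, center_sigma=200.0):
    """Pick the next fixation as the argmax of residual likelihood x spatial prior."""
    h, w = residual_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Spatial prior: prefer moderate saccade amplitudes and central locations.
    dist2 = (xs - current_xy[0]) ** 2 + (ys - current_xy[1]) ** 2
    amplitude_prior = np.exp(-dist2 / (2 * amp_sigma ** 2))
    center_prior = np.exp(-((xs - w / 2) ** 2 + (ys - h / 2) ** 2) / (2 * center_sigma ** 2))
    posterior = residual_map * amplitude_prior * center_prior
    y, x = np.unravel_index(posterior.argmax(), posterior.shape)
    return x, y

# Toy usage with a random map standing in for the autoencoder reconstruction error.
fix = next_fixation(np.random.rand(240, 320), current_xy=(160, 120))
```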
26. Zhu C, Zhang W, Li TH, Liu S, Li G. Exploiting the Value of the Center-dark Channel Prior for Salient Object Detection. ACM Trans Intell Syst Technol 2019. [DOI: 10.1145/3319368]
Abstract
Saliency detection aims to detect the most attractive objects in images and is widely used as a foundation for various applications. In this article, we propose a novel salient object detection algorithm for RGB-D images using center-dark channel priors. First, we generate an initial saliency map based on a color saliency map and a depth saliency map of a given RGB-D image. Then, we generate a center-dark channel map based on center saliency and dark channel priors. Finally, we fuse the initial saliency map with the center-dark channel map to generate the final saliency map. Extensive evaluations over four benchmark datasets demonstrate that our proposed method performs favorably against most of the state-of-the-art approaches. Besides, we further discuss the application of the proposed algorithm in small target detection and demonstrate the universal value of center-dark channel priors in the field of object detection.
Affiliation(s)
- Chunbiao Zhu: Shenzhen Graduate School, Peking University, Shenzhen, Guangdong, China
- Wenhao Zhang: Shenzhen Graduate School, Peking University, Shenzhen, Guangdong, China
- Thomas H. Li: Shenzhen Graduate School, China and Advanced Institute of Information Technology, Peking University, China
- Ge Li: Shenzhen Graduate School, Peking University, Shenzhen, Guangdong, China
27. Visual Saliency Detection Using a Rule-Based Aggregation Approach. Appl Sci (Basel) 2019. [DOI: 10.3390/app9102015]
Abstract
In this paper, we propose an approach for salient pixel detection using a rule-based system. In our proposal, rules are automatically learned by combining four saliency models. The learned rules are utilized for the detection of pixels of the salient object in a visual scene. The proposed methodology consists of two main stages. Firstly, in the training stage, the knowledge extracted from outputs of four state-of-the-art saliency models is used to induce an ensemble of rough-set-based rules. Secondly, the induced rules are utilized by our system to determine, in a binary manner, the pixels corresponding to the salient object within a scene. Being independent of any threshold value, such a method eliminates any midway uncertainty and exempts us from performing a post-processing step as is required in most approaches to saliency detection. The experimental results on three datasets show that our method obtains stable and better results than state-of-the-art models. Moreover, it can be used as a pre-processing stage in computer vision-based applications in diverse areas such as robotics, image segmentation, marketing, and image compression.
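As a simplified stand-in for the rough-set rule induction described above, the aggregation idea (combine the binary votes of several saliency models into one threshold-free per-pixel decision) can be illustrated with a majority vote. This is only a sketch for intuition; the actual method learns its rules from the outputs of the four models rather than hard-coding a vote.

```python
import numpy as np

def majority_vote_saliency(maps, votes_needed=3):
    """Binarise each model's map at its own mean and keep pixels selected by most models.

    maps: list of 2-D saliency maps in [0, 1] from different models (same shape).
    """
    votes = sum((m > m.mean()).astype(np.uint8) for m in maps)
    return (votes >= votes_needed).astype(np.uint8)   # 1 = salient pixel, 0 = background

# Toy usage with four random maps standing in for the four saliency models.
maps = [np.random.rand(120, 160) for _ in range(4)]
binary_mask = majority_vote_saliency(maps)
```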
28. Guo F, Wang W, Shen J, Shao L, Yang J, Tao D, Tang YY. Video Saliency Detection Using Object Proposals. IEEE Trans Cybern 2018; 48:3159-3170. [PMID: 29990032] [DOI: 10.1109/tcyb.2017.2761361]
Abstract
In this paper, we introduce a novel approach to identify salient object regions in videos via object proposals. The core idea is to solve the saliency detection problem by ranking and selecting the salient proposals based on object-level saliency cues. Object proposals offer a more complete and high-level representation, which naturally caters to the needs of salient object detection. As well as introducing this novel solution for video salient object detection, we reorganize various discriminative saliency cues and traditional saliency assumptions on object proposals. With object candidates, a proposal ranking and voting scheme, based on various object-level saliency cues, is designed to screen out nonsalient parts, select salient object regions, and infer an initial saliency estimate. Then a saliency optimization process that considers temporal consistency and appearance differences between salient and nonsalient regions is used to refine the initial saliency estimates. Our experiments on public datasets (SegTrackV2, Freiburg-Berkeley Motion Segmentation Dataset, and Densely Annotated Video Segmentation) validate the effectiveness of the approach; the proposed method produces significant improvements over state-of-the-art algorithms.
29. Cholakkal H, Johnson J, Rajan D. Backtracking Spatial Pyramid Pooling (SPP)-based Image Classifier for Weakly Supervised Top-down Salient Object Detection. IEEE Trans Image Process 2018; 27:6064-6078. [PMID: 30106724] [DOI: 10.1109/tip.2018.2864891]
Abstract
Top-down saliency models produce a probability map that peaks at target locations specified by a task/goal such as object detection. They are usually trained in a fully supervised setting involving pixel-level annotations of objects. We propose a weakly supervised top-down saliency framework using only binary labels that indicate the presence/absence of an object in an image. First, the probabilistic contribution of each image region to the confidence of a CNN-based image classifier is computed through a backtracking strategy to produce top-down saliency. From a set of saliency maps of an image produced by fast bottom-up saliency approaches, we select the best saliency map suitable for the top-down task. The selected bottom-up saliency map is combined with the top-down saliency map. Features having high combined saliency are used to train a linear SVM classifier to estimate feature saliency. This is integrated with the combined saliency and further refined through multi-scale superpixel averaging of the saliency map. We evaluate the performance of the proposed weakly supervised top-down saliency and achieve performance comparable with fully supervised approaches. Experiments are carried out on seven challenging datasets, and quantitative results are compared with 40 closely related approaches across 4 different applications. Code will be made publicly available.
30. Wang W, Shen J. Deep Visual Attention Prediction. IEEE Trans Image Process 2018; 27:2368-2378. [PMID: 29990140] [DOI: 10.1109/tip.2017.2787612]
Abstract
In this paper, we aim to predict human eye fixations in free-viewing of scenes based on an end-to-end deep learning architecture. Although convolutional neural networks (CNNs) have made substantial improvements in human attention prediction, CNN-based attention models still need to be improved by efficiently leveraging multi-scale features. Our visual attention network is proposed to capture hierarchical saliency information, from deep, coarse layers with global saliency information to shallow, fine layers with local saliency responses. Our model is based on a skip-layer network structure, which predicts human attention from multiple convolutional layers with various receptive fields. The final saliency prediction is achieved via the cooperation of these global and local predictions. Our model is learned in a deep supervision manner, where supervision is directly fed into multi-level layers, instead of previous approaches of providing supervision only at the output layer and propagating this supervision back to earlier layers. Our model thus incorporates multi-level saliency predictions within a single network, which significantly decreases the redundancy of previous approaches that learn multiple network streams with different input scales. Extensive experimental analysis on various challenging benchmark datasets demonstrates that our method yields state-of-the-art performance with competitive inference time.
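The deep-supervision idea described above (a loss applied to every side output, not only to the final prediction) can be sketched as below. The number of side outputs, the bilinear upsampling, and the use of binary cross-entropy are assumptions for illustration rather than the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def deeply_supervised_loss(side_outputs, fixation_map):
    """Sum a BCE loss over all side outputs, each resized to the ground-truth resolution.

    side_outputs: list of raw (pre-sigmoid) saliency logits with shape (B, 1, h_i, w_i).
    fixation_map: ground-truth map with shape (B, 1, H, W), values in [0, 1].
    """
    total = 0.0
    for logits in side_outputs:
        logits = F.interpolate(logits, size=fixation_map.shape[-2:],
                               mode='bilinear', align_corners=False)
        total = total + F.binary_cross_entropy_with_logits(logits, fixation_map)
    return total

# Toy usage: three side outputs taken from layers with different resolutions.
sides = [torch.randn(2, 1, s, s, requires_grad=True) for s in (28, 56, 112)]
gt = torch.rand(2, 1, 224, 224)
loss = deeply_supervised_loss(sides, gt)
loss.backward()
```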
31. Rahman IMH, Hollitt C, Zhang M. Feature Map Quality Score Estimation Through Regression. IEEE Trans Image Process 2018; 27:1793-1808. [PMID: 29346095] [DOI: 10.1109/tip.2017.2785623]
Abstract
Understanding the visual quality of a feature map plays a significant role in many active vision applications. Previous works mostly rely on object-level features, such as compactness, to estimate the quality score of a feature map. However, compactness has been leveraged on feature maps produced by salient object detection techniques, where the maps tend to be compact. As a result, the compactness feature fails when the feature maps are blurry (e.g., fixation maps). In this paper, we regard the process of estimating the quality score of feature maps, specifically fixation maps, as a regression problem. After extracting several local, global, geometric, and positional characteristic features from a feature map, a model is learned using a random forest regressor to estimate the quality score of any unseen feature map. Our model is specifically tailored to estimate the quality of three types of maps: bottom-up, target, and contextual feature maps. These maps are produced for a large benchmark fixation dataset of more than 900 challenging outdoor images. We demonstrate that our approach provides an accurate estimate of the quality of the abovementioned feature maps compared to the ground-truth data. In addition, we show that our proposed approach is useful in feature map integration for predicting human fixation. Instead of naively integrating all three feature maps when predicting human fixation, our proposed approach dynamically selects the best feature map with the highest estimated quality score on an individual image basis, thereby improving the fixation prediction accuracy.
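The regression formulation described above (hand-crafted characteristics of a feature map mapped to a quality score by a random forest) can be sketched as below. The feature set here is a small illustrative subset of simple global statistics, not the full local/global/geometric/positional feature vector of the paper, and the training data are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def map_features(m):
    """A few simple global statistics of a feature map (illustrative subset only)."""
    ys, xs = np.nonzero(m > m.mean())
    spread = np.std(xs) + np.std(ys) if len(xs) else 0.0     # rough compactness proxy
    return [m.mean(), m.std(), m.max(), spread]

# train_maps: fixation/feature maps; train_scores: their ground-truth quality scores.
train_maps = [np.random.rand(60, 80) for _ in range(100)]
train_scores = np.random.rand(100)

reg = RandomForestRegressor(n_estimators=200, random_state=0)
reg.fit([map_features(m) for m in train_maps], train_scores)

# Estimate the quality of an unseen feature map.
quality = reg.predict([map_features(np.random.rand(60, 80))])[0]
```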
32. Ishikura K, Kurita N, Chandler DM, Ohashi G. Saliency Detection Based on Multiscale Extrema of Local Perceptual Color Differences. IEEE Trans Image Process 2018; 27:703-717. [PMID: 29185988] [DOI: 10.1109/tip.2017.2767288]
Abstract
Visual saliency detection is a useful technique for predicting which regions humans will tend to gaze upon in any given image. Over the last several decades, numerous algorithms for automatic saliency detection have been proposed and shown to work well on both synthetic and natural images. However, two key challenges remain largely unaddressed: 1) how to improve the relatively low predictive performance for images that contain large objects, and 2) how to perform saliency detection on a wider variety of images from various categories without training. In this paper, we propose a new saliency detection algorithm that addresses these challenges. Our model first detects potentially salient regions based on multiscale extrema of local perceived color differences measured in the CIELAB color space. These extrema are highly effective for estimating the locations, sizes, and saliency levels of candidate regions. The local saliency candidates are further refined via two global extrema-based features, and then a Gaussian mixture is used to generate the final saliency map. Experimental validation on the extensive CAT2000 dataset demonstrates that our proposed method either outperforms or is highly competitive with prior approaches, and can perform well across different categories and object sizes, while remaining training-free.
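The first stage described above (perceived colour differences between each pixel and its local surround, computed at several scales in CIELAB space) can be approximated with a short sketch. The scale choices and the simple averaging across scales are assumptions; the extrema detection, the global refinement, and the Gaussian-mixture stage are omitted.

```python
import numpy as np
from skimage.color import rgb2lab
from scipy.ndimage import uniform_filter

def multiscale_lab_difference(rgb, scales=(9, 25, 61)):
    """Local perceptual colour difference (CIELAB distance to the local mean) at several scales."""
    lab = rgb2lab(rgb)                       # rgb: float image in [0, 1], shape (H, W, 3)
    maps = []
    for size in scales:
        local_mean = np.stack([uniform_filter(lab[..., c], size=size) for c in range(3)], axis=-1)
        diff = np.sqrt(((lab - local_mean) ** 2).sum(axis=-1))   # Delta-E-like difference per pixel
        maps.append(diff / (diff.max() + 1e-8))
    return np.mean(maps, axis=0)             # simple average across scales (illustrative)

# Toy usage on a random image.
saliency = multiscale_lab_difference(np.random.rand(120, 160, 3))
```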
33. Chen Z, Zhou W, Li W. Blind Stereoscopic Video Quality Assessment: From Depth Perception to Overall Experience. IEEE Trans Image Process 2018; 27:721-734. [PMID: 29185989] [DOI: 10.1109/tip.2017.2766780]
Abstract
Stereoscopic video quality assessment (SVQA) is a challenging problem. How to measure depth perception quality independently under different distortion categories and degrees has not been well investigated, and in particular how to exploit depth perception to assist the overall quality assessment of 3D videos. In this paper, we propose a new depth perception quality metric (DPQM) and verify that it outperforms existing metrics on our published 3D video extension of the High Efficiency Video Coding (3D-HEVC) video database. Furthermore, we validate its effectiveness by applying the crucial part of the DPQM to a novel blind stereoscopic video quality evaluator (BSVQE) for overall 3D video quality assessment. In the DPQM, we introduce the feature of auto-regressive prediction-based disparity entropy (ARDE) measurement and the feature of energy-weighted video content measurement, which are inspired by the free-energy principle and the binocular vision mechanism. In the BSVQE, the binocular summation and difference operations are integrated together with the fusion natural scene statistic measurement and the ARDE measurement to reveal the key influence of texture and disparity. Experimental results on three stereoscopic video databases demonstrate that our method outperforms state-of-the-art SVQA algorithms for both symmetrically and asymmetrically distorted stereoscopic video pairs of various distortion types.
34. Zhu Q, Triesch J, Shi BE. Joint Learning of Binocularly Driven Saccades and Vergence by Active Efficient Coding. Front Neurorobot 2017; 11:58. [PMID: 29163121] [PMCID: PMC5675843] [DOI: 10.3389/fnbot.2017.00058]
Abstract
This paper investigates two types of eye movements: vergence and saccades. Vergence eye movements are responsible for bringing the images of the two eyes into correspondence, whereas saccades drive gaze to interesting regions in the scene. Control of both vergence and saccades develops during early infancy. To date, these two types of eye movements have been studied separately. Here, we propose a computational model of an active vision system that integrates these two types of eye movements. We hypothesize that incorporating a saccade strategy driven by bottom-up attention will benefit the development of vergence control. The integrated system is based on the active efficient coding framework, which describes the joint development of sensory-processing and eye movement control to jointly optimize the coding efficiency of the sensory system. In the integrated system, we propose a binocular saliency model to drive saccades based on learned binocular feature extractors, which simultaneously encode both depth and texture information. Saliency in our model also depends on the current fixation point. This extends prior work, which focused on monocular images and saliency measures that are independent of the current fixation. Our results show that the proposed saliency-driven saccades lead to better vergence performance and faster learning in the overall system than random saccades. Faster learning is significant because it indicates that the system actively selects inputs for the most effective learning. This work suggests that saliency-driven saccades provide a scaffold for the development of vergence control during infancy.
Affiliation(s)
- Qingpeng Zhu: Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong, Hong Kong
- Jochen Triesch: Frankfurt Institute for Advanced Studies, Frankfurt am Main, Germany
- Bertram E Shi: Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong, Hong Kong
35.
36. Zhang YY, Yang C, Zhang P. Reprint of “Two-stage sparse coding of region covariance via Log-Euclidean kernels to detect saliency”. Neural Netw 2017; 92:47-59. [DOI: 10.1016/j.neunet.2017.06.001]
37. Li N, Ye J, Ji Y, Ling H, Yu J. Saliency Detection on Light Field. IEEE Trans Pattern Anal Mach Intell 2017; 39:1605-1616. [PMID: 27654139] [DOI: 10.1109/tpami.2016.2610425]
Abstract
Existing saliency detection approaches use images as inputs and are sensitive to foreground/background similarities, complex background textures, and occlusions. We explore the problem of using light fields as input for saliency detection. Our technique is enabled by the availability of commercial plenoptic cameras that capture the light field of a scene in a single shot. We show that the unique refocusing capability of light fields provides useful focusness, depths, and objectness cues. We further develop a new saliency detection algorithm tailored for light fields. To validate our approach, we acquire a light field database of a range of indoor and outdoor scenes and generate the ground truth saliency map. Experiments show that our saliency detection scheme can robustly handle challenging scenarios such as similar foreground and background, cluttered background, complex occlusions, etc., and achieve high accuracy and robustness.
38. Amor TA, Luković M, Herrmann HJ, Andrade JS. Influence of scene structure and content on visual search strategies. J R Soc Interface 2017; 14:20170406. [PMID: 28747401] [DOI: 10.1098/rsif.2017.0406]
Abstract
When searching for a target within an image, our brain can adopt different strategies, but which one does it choose? This question can be answered by tracking the motion of the eye while it executes the task. Following many individuals performing various search tasks, we distinguish between two competing strategies. Motivated by these findings, we introduce a model that captures the interplay of the search strategies and allows us to create artificial eye-tracking trajectories, which could be compared with the experimental ones. Identifying the model parameters allows us to quantify the strategy employed in terms of ensemble averages, characterizing each experimental cohort. In this way, we can discern with high sensitivity the relation between the visual landscape and the average strategy, disclosing how small variations in the image induce changes in the strategy.
Collapse
Affiliation(s)
- Tatiana A Amor
- Computational Physics IfB, ETH Zurich, Stefano-Franscini-Platz 3, 8093, Zurich, Switzerland; Departamento de Física, Universidade Federal do Ceará, 60451-970, Fortaleza, Ceará, Brazil
- Mirko Luković
- Computational Physics IfB, ETH Zurich, Stefano-Franscini-Platz 3, 8093, Zurich, Switzerland
- Hans J Herrmann
- Computational Physics IfB, ETH Zurich, Stefano-Franscini-Platz 3, 8093, Zurich, Switzerland; Departamento de Física, Universidade Federal do Ceará, 60451-970, Fortaleza, Ceará, Brazil
- José S Andrade
- Departamento de Física, Universidade Federal do Ceará, 60451-970, Fortaleza, Ceará, Brazil
|
39
|
Tavakoli HR, Borji A, Laaksonen J, Rahtu E. Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features. Neurocomputing 2017. [DOI: 10.1016/j.neucom.2017.03.018] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
40
|
Zhang YY, Yang C, Zhang P. Two-stage sparse coding of region covariance via Log-Euclidean kernels to detect saliency. Neural Netw 2017; 89:84-96. [PMID: 28365298 DOI: 10.1016/j.neunet.2017.02.012] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2016] [Revised: 02/05/2017] [Accepted: 02/13/2017] [Indexed: 11/30/2022]
Abstract
In this paper, we present a novel bottom-up saliency detection algorithm from the perspective of covariance matrices on a Riemannian manifold. Each superpixel is described by a region covariance matrix on the Riemannian manifold. We carry out a two-stage sparse coding scheme via Log-Euclidean kernels to extract salient objects efficiently. In the first stage, given a background dictionary built from image borders, sparse coding of each region covariance via Log-Euclidean kernels is performed; the reconstruction error on the background dictionary is regarded as the initial saliency of each superpixel. In the second stage, the initial result is improved by calculating reconstruction errors of the superpixels on a foreground dictionary, which is extracted from the first-stage saliency map. The sparse coding in the second stage is similar to that of the first stage, but is able to highlight the salient objects uniformly against the background. Finally, three post-processing steps (a highlight-inhibition function, context-based saliency weighting, and graph cut) are adopted to further refine the saliency map. Experiments on four public benchmark datasets show that the proposed algorithm outperforms state-of-the-art methods in terms of precision, recall, and mean absolute error, and demonstrate the robustness and efficiency of the proposed method.
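A rough Python sketch of the two ingredients named here, a region covariance descriptor and a Log-Euclidean comparison against a background dictionary, is given below; plain least-squares reconstruction stands in for the paper's kernelized two-stage sparse coding, and the per-pixel features are assumptions.

# Sketch of region-covariance descriptors compared in the Log-Euclidean
# domain (illustration only; plain least squares replaces the paper's
# kernelized two-stage sparse coding).
import numpy as np
from scipy.linalg import logm

def region_covariance(region, eps=1e-6):
    """Covariance of simple per-pixel features (intensity, x, y, gradients)."""
    h, w = region.shape
    yy, xx = np.mgrid[0:h, 0:w]
    gy, gx = np.gradient(region.astype(np.float64))
    feats = np.stack([region.ravel(), xx.ravel(), yy.ravel(),
                      gx.ravel(), gy.ravel()], axis=1)
    cov = np.cov(feats, rowvar=False)
    return cov + eps * np.eye(cov.shape[0])   # keep it positive definite

def log_euclidean_vec(cov):
    """Map an SPD matrix to its flattened matrix logarithm."""
    return np.real(logm(cov)).ravel()

def reconstruction_error(query_cov, background_covs):
    """Least-squares reconstruction error on the background 'dictionary'."""
    D = np.stack([log_euclidean_vec(c) for c in background_covs], axis=1)
    y = log_euclidean_vec(query_cov)
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    return float(np.linalg.norm(y - D @ coef))   # large error -> more salient

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    background = [region_covariance(rng.random((16, 16))) for _ in range(8)]
    query = region_covariance(rng.random((16, 16)) + 0.5)
    print(f"saliency score: {reconstruction_error(query, background):.3f}")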
Affiliation(s)
- Ying-Ying Zhang
- Physics & Electronic Engineering College, Nanyang Normal University, Nanyang 473061, People's Republic of China.
- Cai Yang
- Computer & Information Technology College, Nanyang Normal University, Nanyang 473061, People's Republic of China
- Ping Zhang
- Physics & Electronic Engineering College, Nanyang Normal University, Nanyang 473061, People's Republic of China
|
41
|
Yang J, Yang MH. Top-Down Visual Saliency via Joint CRF and Dictionary Learning. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2017; 39:576-588. [PMID: 28113265 DOI: 10.1109/tpami.2016.2547384] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Top-down visual saliency is an important module of visual attention. In this work, we propose a novel top-down saliency model that jointly learns a Conditional Random Field (CRF) and a visual dictionary. The proposed model incorporates a layered structure from top to bottom: CRF, sparse coding and image patches. With sparse coding as an intermediate layer, CRF is learned in a feature-adaptive manner; meanwhile with CRF as the output layer, the dictionary is learned under structured supervision. For efficient and effective joint learning, we develop a max-margin approach via a stochastic gradient descent algorithm. Experimental results on the Graz-02 and PASCAL VOC datasets show that our model performs favorably against state-of-the-art top-down saliency methods for target object localization. In addition, the dictionary update significantly improves the performance of our model. We demonstrate the merits of the proposed top-down saliency model by applying it to prioritizing object proposals for detection and predicting human fixations.
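The layered idea (image patches, sparse codes, structured classifier) can be caricatured with off-the-shelf components; in the hedged sketch below a small patch dictionary is learned separately and a logistic-regression unary term replaces the jointly learned CRF, so it is not the paper's max-margin joint training, and the patches and labels are synthetic.

# Caricature of the patches -> sparse codes -> classifier pipeline
# (illustration only; logistic regression replaces the jointly learned CRF
# and the dictionary is learned separately, unlike the paper).
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical 8x8 grayscale patches with synthetic "target object" labels.
patches = rng.random((500, 64))
labels = (patches[:, :32].mean(axis=1) > patches[:, 32:].mean(axis=1)).astype(int)

# Learn a small dictionary and encode every patch sparsely with OMP.
dico = MiniBatchDictionaryLearning(n_components=32, alpha=1.0,
                                   transform_algorithm="omp",
                                   transform_n_nonzero_coefs=5,
                                   random_state=0)
codes = dico.fit(patches).transform(patches)

# A linear unary classifier on sparse codes stands in for the CRF output layer.
clf = LogisticRegression(max_iter=1000).fit(codes, labels)
saliency_scores = clf.predict_proba(codes)[:, 1]   # per-patch top-down saliency
print(f"training accuracy: {clf.score(codes, labels):.2f}")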
|
42
|
Xu M, Jiang L, Sun X, Ye Z, Wang Z. Learning to Detect Video Saliency With HEVC Features. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2017; 26:369-385. [PMID: 28113934 DOI: 10.1109/tip.2016.2628583] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Saliency detection has been widely studied to predict human fixations, with various applications in computer vision and image processing. For saliency detection, we argue in this paper that the state-of-the-art High Efficiency Video Coding (HEVC) standard can be used to generate useful features in the compressed domain. Therefore, this paper proposes to learn a video saliency model with regard to HEVC features. First, we establish an eye-tracking database for video saliency detection, which can be downloaded from https://github.com/remega/video_database. Through statistical analysis of our eye-tracking database, we find that human fixations tend to fall into regions with large-valued HEVC features on splitting depth, bit allocation, and motion vectors (MVs). Three further observations are obtained from additional analysis of the database. Accordingly, several features in the HEVC domain are proposed on the basis of splitting depth, bit allocation, and MV. Next, a support vector machine is trained to integrate those HEVC features for video saliency detection. Since almost all video data are stored in compressed form, our method avoids both the computational cost of decoding and the storage cost of raw data. More importantly, experimental results show that the proposed method is superior to other state-of-the-art saliency detection methods, in either the compressed or the uncompressed domain.
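To picture the learning step only (not the paper's exact feature extraction or its SVM variant), the sketch below trains a standard SVM on three synthetic per-block features loosely named after those in the abstract: splitting depth, allocated bits, and motion-vector magnitude; the labels marking "fixated" blocks are also synthetic.

# Illustration of learning a fixation classifier from compressed-domain
# block features (synthetic data; feature values and labels are assumptions,
# not extracted from an HEVC bitstream).
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_blocks = 2000

# Per-block features: splitting depth (0-3), allocated bits, |motion vector|.
depth = rng.integers(0, 4, n_blocks)
bits = rng.gamma(shape=2.0, scale=200.0, size=n_blocks)
mv_mag = rng.exponential(scale=4.0, size=n_blocks)
X = np.column_stack([depth, bits, mv_mag])

# Synthetic ground truth: blocks with large-valued features are "fixated".
score = 0.5 * depth / 3 + 0.3 * bits / bits.max() + 0.2 * mv_mag / mv_mag.max()
y = (score + 0.1 * rng.normal(size=n_blocks) > np.median(score)).astype(int)

# probability=True lets predict_proba serve as a per-block saliency value.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
model.fit(X, y)
print(f"training accuracy: {model.score(X, y):.2f}")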
|
43
|
Wang Z, Xu G, Wang Z, Zhu C. Saliency detection integrating both background and foreground information. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.07.051] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
|
44
|
Liu D, Chang F, Liu C. Salient object detection fusing global and local information based on nonsubsampled contourlet transform. JOURNAL OF THE OPTICAL SOCIETY OF AMERICA. A, OPTICS, IMAGE SCIENCE, AND VISION 2016; 33:1430-1441. [PMID: 27505640 DOI: 10.1364/josaa.33.001430] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
The nonsubsampled contourlet transform (NSCT) has properties of multiresolution, localization, directionality, and anisotropy. The directionality property permits it to resolve intrinsic directional features that characterize the analyzed image. In this paper, we present a bottom-up salient object detection approach fusing global and local information based on NSCT. Images are first decomposed by applying NSCT. The coefficients of bandpass subbands are categorized and optimized accordingly to get better representation. Then feature maps are obtained by performing the inverse NSCT on these optimized coefficients. The global and local saliency maps are generated from these feature maps. Global saliency is obtained by utilizing the likelihood of features, and local saliency is measured by calculating the local self-information. In the end, the final saliency map is computed by fusing the global and local saliency maps together. Experimental results on MSRA 10K demonstrate the effectiveness and promising performance of our proposed method.
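Since the NSCT has no standard Python implementation, the sketch below substitutes a difference-of-Gaussians decomposition to illustrate the fusion idea only: global saliency as the Mahalanobis unlikeliness of each pixel's multiscale features under one global Gaussian, local saliency as self-information from a local intensity histogram, and a pixel-wise product as the fusion rule; all of these substitutions are assumptions.

# Sketch of fusing a global (feature-likelihood) and a local (self-information)
# saliency map. A difference-of-Gaussians decomposition stands in for the NSCT,
# so this illustrates the fusion idea only, not the paper's method.
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def multiscale_features(img, sigmas=(1, 2, 4, 8)):
    """Band-pass (DoG) responses at several scales, stacked as channels."""
    return np.stack([gaussian_filter(img, s) - gaussian_filter(img, 2 * s)
                     for s in sigmas], axis=-1)

def global_saliency(feats):
    """Unlikeliness of each pixel's feature vector under a global Gaussian fit."""
    h, w, c = feats.shape
    X = feats.reshape(-1, c)
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(c)
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return maha.reshape(h, w)

def local_self_information(img, n_bins=16, window=15):
    """-log of the local frequency of each pixel's quantized intensity."""
    bins = np.clip((img * n_bins).astype(int), 0, n_bins - 1)
    prob = np.zeros_like(img, dtype=np.float64)
    for b in range(n_bins):
        freq = uniform_filter((bins == b).astype(np.float64), size=window)
        prob[bins == b] = freq[bins == b]
    return -np.log(prob + 1e-8)

def normalize(m):
    return (m - m.min()) / (m.max() - m.min() + 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((96, 96))
    img[30:60, 30:60] += 1.0          # a hypothetical salient patch
    img = normalize(img)
    fused = (normalize(global_saliency(multiscale_features(img)))
             * normalize(local_self_information(img)))
    print(fused.shape, round(float(fused.max()), 3))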
|
45
|
|
46
|
|
47
|
Liu H, Xu M, Wang J, Rao T, Burnett I. Improving Visual Saliency Computing With Emotion Intensity. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2016; 27:1201-1213. [PMID: 27214350 DOI: 10.1109/tnnls.2016.2553579] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Saliency maps that integrate individual feature maps into a global measure of visual attention are widely used to estimate human gaze density. Most existing methods consider low-level visual features and object locations, and/or emphasize spatial position with a center prior. Recent psychology research suggests that emotions strongly influence human visual attention. In this paper, we explore the influence of emotional content on visual attention. On top of traditional bottom-up saliency map generation, our saliency map is generated in cooperation with three emotion factors: general emotional content, facial expression intensity, and emotional object locations. Experiments carried out on the National University of Singapore Eye Fixation dataset (a public eye-tracking dataset) demonstrate that incorporating emotion does improve the quality of visual saliency maps computed by bottom-up approaches for gaze density estimation. Our method increases the area under the ROC curve by about 0.1 on average, compared with four baseline bottom-up approaches (Itti's model, attention based on information maximization, saliency using natural statistics, and graph-based visual saliency).
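One plausible reading of the combination step (not the authors' formulation) is a pixel-wise modulation of the bottom-up map; in the sketch below the emotion-intensity map is built from hypothetical face detections with expression-intensity scores, and the multiplicative fusion rule is an assumption.

# Sketch of modulating a bottom-up saliency map with an emotion-intensity map
# (illustration only; face locations and intensity scores are hypothetical
# inputs, and pixel-wise multiplication is just one plausible fusion rule).
import numpy as np

def emotion_intensity_map(shape, faces):
    """Sum of Gaussian bumps at face centers, scaled by expression intensity."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    emo = np.zeros(shape, dtype=np.float64)
    for (cy, cx, radius, intensity) in faces:
        emo += intensity * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2)
                                  / (2.0 * radius ** 2))
    return emo

def emotion_weighted_saliency(bottom_up, faces, weight=1.0):
    emo = emotion_intensity_map(bottom_up.shape, faces)
    fused = bottom_up * (1.0 + weight * emo)   # boost emotionally salient regions
    return fused / (fused.max() + 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bottom_up = rng.random((120, 160))
    # (row, col, radius_px, expression_intensity) for two hypothetical faces.
    faces = [(40, 50, 12, 0.9), (80, 120, 10, 0.3)]
    print(emotion_weighted_saliency(bottom_up, faces).shape)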
|
48
|
Lu H, Li X, Zhang L, Ruan X, Yang MH. Dense and Sparse Reconstruction Error Based Saliency Descriptor. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2016; 25:1592-1603. [PMID: 26915102 DOI: 10.1109/tip.2016.2524198] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
In this paper, we propose a visual saliency detection algorithm from the perspective of reconstruction error. The image boundaries are first extracted via superpixels as likely cues for background templates, from which dense and sparse appearance models are constructed. First, we compute dense and sparse reconstruction errors on the background templates for each image region. Second, the reconstruction errors are propagated based on the contexts obtained from K -means clustering. Third, the pixel-level reconstruction error is computed by the integration of multi-scale reconstruction errors. Both the pixel-level dense and sparse reconstruction errors are then weighted by image compactness, which could more accurately detect saliency. In addition, we introduce a novel Bayesian integration method to combine saliency maps, which is applied to integrate the two saliency measures based on dense and sparse reconstruction errors. Experimental results show that the proposed algorithm performs favorably against 24 state-of-the-art methods in terms of precision, recall, and F-measure on three public standard salient object detection databases.
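A compact sketch of the two error terms is given below, with border patches playing the role of background templates, a PCA basis for the dense error, and orthogonal matching pursuit for the sparse one; the context propagation, multi-scale integration, and Bayesian fusion described above are omitted, so this is an illustration rather than the full method.

# Sketch of dense and sparse reconstruction errors against border "background
# templates" (illustration only; propagation, multi-scale integration and the
# Bayesian fusion described in the abstract are omitted).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import orthogonal_mp

def dense_error(region_feats, background_feats, n_components=5):
    """Reconstruction error under a PCA appearance model of the background."""
    pca = PCA(n_components=n_components).fit(background_feats)
    recon = pca.inverse_transform(pca.transform(region_feats))
    return np.linalg.norm(region_feats - recon, axis=1)

def sparse_error(region_feats, background_feats, n_nonzero=5):
    """Reconstruction error of OMP sparse codes on the background dictionary."""
    D = background_feats.T                        # columns are dictionary atoms
    D = D / np.linalg.norm(D, axis=0, keepdims=True)   # OMP expects unit-norm atoms
    codes = orthogonal_mp(D, region_feats.T, n_nonzero_coefs=n_nonzero)
    recon = (D @ codes).T
    return np.linalg.norm(region_feats - recon, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    background = rng.random((40, 16))             # border-region features
    regions = np.vstack([rng.random((10, 16)),    # background-like regions
                         rng.random((5, 16)) + 1.0])  # hypothetical salient ones
    saliency = dense_error(regions, background) * sparse_error(regions, background)
    print(np.round(saliency, 2))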
|
49
|
Measuring Visual Surprise Jointly from Intrinsic and Extrinsic Contexts for Image Saliency Estimation. Int J Comput Vis 2016. [DOI: 10.1007/s11263-016-0892-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
50
|
Visual Saliency Using Binary Spectrum of Walsh–Hadamard Transform and Its Applications to Ship Detection in Multispectral Imagery. Neural Process Lett 2016. [DOI: 10.1007/s11063-016-9507-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|