1. Makram AW, Salem NM, El-Wakad MT, Al-Atabany W. Robust detection and refinement of saliency identification. Sci Rep 2024; 14:11076. [PMID: 38744990] [DOI: 10.1038/s41598-024-61105-3]
Abstract
Salient object detection is an increasingly popular topic in computer vision, particularly for images with complex backgrounds and diverse object parts. Background information is an essential factor in detecting salient objects. This paper proposes a robust and effective methodology for salient object detection that involves two main stages. The first stage produces a saliency detection map based on dense and sparse reconstruction of image regions using a refined background dictionary; the dictionary is refined with a boundary conductivity measurement that excludes salient object regions near the image boundary. In the second stage, the CascadePSP network is integrated to refine and correct the local boundaries of the saliency mask so that salient objects are highlighted more uniformly. Experiments conducted on three datasets using six evaluation indices show that the proposed approach performs effectively compared with state-of-the-art salient object detection methods, particularly in identifying challenging salient objects located near the image boundary. These results demonstrate the potential of the proposed framework for various computer vision applications.
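The reconstruction-based first stage lends itself to a compact illustration. Below is a minimal sketch (not the authors' code) in which saliency is scored as the error of reconstructing each region's feature vector from a background dictionary; the region descriptors, the boundary-based dictionary refinement, and the CascadePSP refinement stage are all assumptions or omissions.

```python
# Illustrative sketch: saliency of image regions as the error of reconstructing each
# region's feature vector from a "background dictionary" built from boundary regions.
# Region features and the boundary filtering step are placeholders.
import numpy as np

def reconstruction_saliency(features, background_idx):
    """features: (n_regions, d) array of region descriptors (e.g., mean Lab color).
    background_idx: indices of regions used as background dictionary atoms."""
    D = features[background_idx].T                      # dictionary, shape (d, n_atoms)
    saliency = np.zeros(len(features))
    for i, f in enumerate(features):
        # dense reconstruction: least-squares coefficients over the dictionary
        coef, *_ = np.linalg.lstsq(D, f, rcond=None)
        saliency[i] = np.linalg.norm(f - D @ coef)       # reconstruction error
    rng = saliency.max() - saliency.min()
    return (saliency - saliency.min()) / (rng + 1e-8)    # normalize to [0, 1]

# toy usage: 6 regions with 3-D features, regions 0-2 assumed to lie on the boundary
feats = np.random.rand(6, 3)
print(reconstruction_saliency(feats, background_idx=[0, 1, 2]))
```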
Affiliation(s)
- Abram W Makram: Biomedical Engineering Department, Faculty of Engineering, Helwan University, Helwan, Egypt
- Nancy M Salem: Biomedical Engineering Department, Faculty of Engineering, Helwan University, Helwan, Egypt
- Walid Al-Atabany: Biomedical Engineering Department, Faculty of Engineering, Helwan University, Helwan, Egypt; Information Technology and Computer Science School, Nile University, Giza, Egypt
2. Huo F, Liu Z, Guo J, Xu W, Guo S. UTDNet: A unified triplet decoder network for multimodal salient object detection. Neural Netw 2024; 170:521-534. [PMID: 38043372] [DOI: 10.1016/j.neunet.2023.11.051]
Abstract
Image Salient Object Detection (SOD) is a fundamental research topic in the area of computer vision. Recently, the multimodal information in RGB, Depth (D), and Thermal (T) modalities has been proven to be beneficial to SOD. However, existing methods are designed only for RGB-D or RGB-T SOD, which limits their use across modalities, or are fine-tuned on specific datasets, which brings extra computation overhead. These defects can hinder the practical deployment of SOD in real-world applications. In this paper, we propose an end-to-end Unified Triplet Decoder Network, dubbed UTDNet, for both RGB-T and RGB-D SOD tasks. The intractable challenges for unified multimodal SOD are mainly two-fold: (1) accurately detecting and segmenting salient objects, and (2) doing so preferably with a single network that fits both RGB-T and RGB-D SOD. First, to address the former challenge, we propose a multi-scale feature extraction unit to enrich discriminative contextual information and an efficient fusion module to explore cross-modality complementary information. The multimodal features are then fed to the triplet decoder, where a hierarchical deep supervision loss further enables the network to capture distinctive saliency cues. Second, for the latter challenge, we propose a simple yet effective continual learning method to unify multimodal SOD. Concretely, we sequentially train the multimodal SOD tasks by applying Elastic Weight Consolidation (EWC) regularization with the hierarchical loss function to avoid catastrophic forgetting without introducing additional parameters. Critically, the triplet decoder separates task-specific and task-invariant information, making the network easily adaptable to multimodal SOD tasks. Extensive comparisons with 26 recently proposed RGB-T and RGB-D SOD methods demonstrate the superiority of the proposed UTDNet.
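The continual-learning step rests on Elastic Weight Consolidation, which can be summarized in a few lines. The sketch below shows only the generic EWC quadratic penalty under assumed precomputed Fisher estimates; UTDNet's triplet decoder, fusion modules, and hierarchical loss are not reproduced.

```python
# Minimal sketch of the Elastic Weight Consolidation (EWC) penalty used when training
# tasks sequentially to limit catastrophic forgetting. Generic illustration only.
import torch

def ewc_penalty(model, old_params, fisher, lam=1.0):
    """Quadratic penalty anchoring parameters to their values after the previous task.
    old_params / fisher: dicts mapping parameter names to tensors (the Fisher information
    is typically estimated from squared gradients on the previous task's data)."""
    loss = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss

# usage inside a training step (task_loss computed on the current task):
#   total_loss = task_loss + ewc_penalty(model, old_params, fisher, lam=100.0)
#   total_loss.backward()
```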
Affiliation(s)
- Fushuo Huo: Department of Computing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region of China
- Ziming Liu: Department of Computing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region of China
- Jingcai Guo: Department of Computing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region of China
- Wenchao Xu: Department of Computing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region of China
- Song Guo: Department of Computing, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region of China
3. Lai B, Liu M, Ryan F, Rehg JM. In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation and Beyond. Int J Comput Vis 2023; 132:854-871. [PMID: 38371492] [PMCID: PMC10873248] [DOI: 10.1007/s11263-023-01879-7]
Abstract
Predicting human gaze from egocentric videos plays a critical role in understanding human intention in daily activities. In this paper, we present the first transformer-based model to address the challenging problem of egocentric gaze estimation. We observe that the connection between the global scene context and local visual information is vital for localizing gaze fixation in egocentric video frames. To this end, we design the transformer encoder to embed the global context as an additional visual token and further propose a novel global-local correlation module to explicitly model the correlation between the global token and each local token. We validate our model on two egocentric video datasets, EGTEA Gaze+ and Ego4D. Our detailed ablation studies demonstrate the benefits of our method. In addition, our approach exceeds the previous state-of-the-art model by a large margin. We also apply our model to a novel gaze saccade/fixation prediction task and the traditional action recognition problem. The consistent gains suggest the strong generalization capability of our model. We also provide additional visualizations to support our claim that global-local correlation serves as a key representation for predicting gaze fixation from egocentric videos. More details can be found on our website (https://bolinlai.github.io/GLC-EgoGazeEst).
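To make the global-local correlation idea concrete, here is a hedged sketch in which a single global context token is correlated with every local token and used to re-weight the local features; the shapes, the cosine-similarity choice, and the way the global token is formed are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch of "global-local correlation": a global token is correlated with each
# local patch token, and the correlation modulates the local features.
import torch
import torch.nn.functional as F

def global_local_correlation(local_tokens, global_token):
    """local_tokens: (B, N, C) patch embeddings; global_token: (B, 1, C)."""
    # cosine correlation between the global token and each local token
    corr = F.cosine_similarity(local_tokens, global_token, dim=-1)   # (B, N)
    weights = torch.softmax(corr, dim=-1).unsqueeze(-1)              # (B, N, 1)
    # modulate local features by their agreement with the global context
    return local_tokens * (1.0 + weights)

tokens = torch.randn(2, 196, 256)
g = tokens.mean(dim=1, keepdim=True)      # crude stand-in for a learned global token
out = global_local_correlation(tokens, g)
print(out.shape)                           # torch.Size([2, 196, 256])
```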
Affiliation(s)
- Bolin Lai: Georgia Institute of Technology, Atlanta, GA 30308, USA
- Miao Liu: Georgia Institute of Technology, Atlanta, GA 30308, USA; Meta AI, Menlo Park, CA 94025, USA
- Fiona Ryan: Georgia Institute of Technology, Atlanta, GA 30308, USA
- James M. Rehg: Georgia Institute of Technology, Atlanta, GA 30308, USA
4. Bruckert A, Christie M, Le Meur O. Where to look at the movies: Analyzing visual attention to understand movie editing. Behav Res Methods 2023; 55:2940-2959. [PMID: 36002630] [DOI: 10.3758/s13428-022-01949-7]
Abstract
In the process of making a movie, directors constantly care about where the spectator will look on the screen. Shot composition, framing, camera movements, and editing are tools commonly used to direct attention. In order to provide a quantitative analysis of the relationship between those tools and gaze patterns, we propose a new eye-tracking database containing gaze-pattern information on movie sequences, as well as editing annotations, and we show how state-of-the-art computational saliency techniques behave on this dataset. In this work, we expose strong links between movie editing and spectators' gaze distributions, and we open several leads on how knowledge of editing information could improve human visual attention modeling for cinematic content. The dataset generated and analyzed for this study is available at https://github.com/abruckert/eye_tracking_filmmaking.
5. Malladi SPK, Mukherjee J, Larabi MC, Chaudhury S. Towards explainable deep visual saliency models. Computer Vision and Image Understanding 2023:103782. [DOI: 10.1016/j.cviu.2023.103782]
6. Novin S, Fallah A, Rashidi S, Daliri MR. An improved saliency model of visual attention dependent on image content. Front Hum Neurosci 2023; 16:862588. [PMID: 36926377] [PMCID: PMC10011177] [DOI: 10.3389/fnhum.2022.862588]
Abstract
Many visual attention models have been presented to obtain the saliency of a scene, i.e., its visually significant parts. However, some mechanisms are still not taken into account in these models, and the models do not fit human data accurately. These mechanisms include which visual features are informative enough to be incorporated into the model, how the conspicuities of different features and scales of an image may be integrated to obtain the saliency map, and how the structure of an image affects the strategy of our attention system. We integrate such mechanisms into the presented model more efficiently than previous models. First, besides the low-level features commonly employed in state-of-the-art models, we also apply medium-level features, defined as combinations of orientations and colors, based on visual system behavior. Second, we use a variable number of center-surround difference maps instead of the fixed number used in other models, suggesting that human visual attention operates differently for images with different structures. Third, we integrate the information of different scales and different features by a weighted sum, defining the weights according to each component's contribution and presenting both the local and global saliency of the image. To test the model's performance in fitting human data, we compared it with other models using the CAT2000 dataset and the Area Under the Curve (AUC) metric. Our results show that the model performs well compared with the other models (AUC = 0.79 and sAUC = 0.58) and suggest that the proposed mechanisms can be applied to existing models to improve them.
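The weighted integration of per-feature conspicuity maps can be illustrated as follows. This is a generic sketch under the assumption that each map's weight is a simple peak-versus-mean uniqueness score; the paper's actual features, scales, and weighting rule differ.

```python
# Generic sketch of combining per-feature conspicuity maps into one saliency map,
# weighting each map by a crude "contribution" score. Illustration only.
import numpy as np

def combine_conspicuity(maps):
    """maps: list of 2-D conspicuity maps of the same shape (e.g., color, orientation)."""
    norm = [(m - m.min()) / (m.max() - m.min() + 1e-8) for m in maps]
    weights = np.array([m.max() - m.mean() for m in norm])   # maps with unique peaks weigh more
    weights = weights / (weights.sum() + 1e-8)
    saliency = sum(w * m for w, m in zip(weights, norm))
    return saliency / (saliency.max() + 1e-8)

maps = [np.random.rand(64, 64) for _ in range(3)]
print(combine_conspicuity(maps).shape)                        # (64, 64)
```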
Affiliation(s)
- Shabnam Novin: Faculty of Biomedical Engineering, Amirkabir University of Technology (AUT), Tehran, Iran
- Ali Fallah: Faculty of Biomedical Engineering, Amirkabir University of Technology (AUT), Tehran, Iran
- Saeid Rashidi: Faculty of Medical Sciences and Technologies, Science and Research Branch, Islamic Azad University, Tehran, Iran
- Mohammad Reza Daliri: Neuroscience and Neuroengineering Research Laboratory, Biomedical Engineering Department, School of Electrical Engineering, Iran University of Science and Technology, Tehran, Iran; School of Cognitive Sciences (SCS), Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
7. Fan S, Shen Z, Jiang M, Koenig BL, Kankanhalli MS, Zhao Q. Emotional Attention: From Eye Tracking to Computational Modeling. IEEE Trans Pattern Anal Mach Intell 2023; 45:1682-1699. [PMID: 35446761] [DOI: 10.1109/tpami.2022.3169234]
Abstract
Attending selectively to emotion-eliciting stimuli is intrinsic to human vision. In this research, we investigate how the emotion-eliciting features of images relate to human selective attention. We create the EMOtional attention dataset (EMOd), a set of diverse emotion-eliciting images, each with (1) eye-tracking data from 16 subjects and (2) image context labels at both the object and scene level. Based on analyses of human perceptions of EMOd, we report an emotion prioritization effect: emotion-eliciting content draws stronger and earlier human attention than neutral content, but this advantage diminishes dramatically after the initial fixation. We find that human attention is more focused on awe-eliciting and aesthetic vehicle and animal scenes in EMOd. Aiming to model the above human attention behavior computationally, we design a deep neural network (CASNet II), which includes a channel-weighting subnetwork that prioritizes emotion-eliciting objects and an Atrous Spatial Pyramid Pooling (ASPP) structure that learns the relative importance of image regions at multiple scales. Visualizations and quantitative analyses demonstrate the model's ability to simulate human attention behavior, especially on emotion-eliciting content.
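The multi-scale component mentioned above, Atrous Spatial Pyramid Pooling, is a standard block. The sketch below shows a generic ASPP module with arbitrary channel sizes and dilation rates; it is not the CASNet II configuration.

```python
# Rough sketch of an Atrous Spatial Pyramid Pooling (ASPP) block: parallel dilated
# convolutions capture context at multiple scales and are fused by a 1x1 convolution.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)  # concat multi-rate features
        return self.project(feats)

x = torch.randn(1, 256, 32, 32)
print(ASPP(256, 64)(x).shape)   # torch.Size([1, 64, 32, 32])
```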
8. Audio–visual collaborative representation learning for Dynamic Saliency Prediction. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109675]
9. Liu N, Li L, Zhao W, Han J, Shao L. Instance-Level Relative Saliency Ranking With Graph Reasoning. IEEE Trans Pattern Anal Mach Intell 2022; 44:8321-8337. [PMID: 34437057] [DOI: 10.1109/tpami.2021.3107872]
Abstract
Conventional salient object detection models cannot differentiate the importance of different salient objects. Recently, two works were proposed to detect saliency ranking by assigning different degrees of saliency to different objects. However, one of these models cannot differentiate object instances and the other focuses more on sequential attention shift order inference. In this paper, we investigate a practical problem setting that requires simultaneously segmenting salient instances and inferring their relative saliency rank order. We present a novel unified model as the first end-to-end solution, in which an improved Mask R-CNN is first used to segment salient instances and a saliency ranking branch is then added to infer the relative saliency. For relative saliency ranking, we build a new graph reasoning module by combining four graphs to incorporate the instance interaction relation, local contrast, global contrast, and a high-level semantic prior, respectively. A novel loss function is also proposed to effectively train the saliency ranking branch. In addition, a new dataset and an evaluation metric are proposed for this task, aiming to push this field of research forward. Finally, experimental results demonstrate that our proposed model is more effective than previous methods. We also show an example of its practical usage in adaptive image retargeting.
10. Multi-task visual discomfort prediction model for stereoscopic images based on multi-view feature representation. Appl Intell 2022. [DOI: 10.1007/s10489-022-04156-1]
11. Zeng L, Li T, Wang X, Chen L, Zeng P, Herrin JS. UNetGE: A U-Net-Based Software at Automatic Grain Extraction for Image Analysis of the Grain Size and Shape Characteristics. Sensors (Basel) 2022; 22:5565. [PMID: 35898069] [PMCID: PMC9330053] [DOI: 10.3390/s22155565]
Abstract
The shape and size of grains in sediments and soils have a significant influence on their engineering properties. Image analysis of grain shape and size has been increasingly applied in geotechnical engineering to provide a quantitative statistical description of grain morphologies. Statistical robustness and the era of big data in geotechnical engineering require the quick and efficient acquisition of large datasets of grain morphologies. In past publications, semi-automated algorithms for extracting grains from images could take tens of minutes. With the rapid development of deep learning networks applied to the earth sciences, we developed the UNetGE software, based on the U-Net architecture (a fully convolutional network), to recognize and segregate grains from the matrix in electron and optical microphotographs of rock and soil thin sections or in photographs of hand specimens and outcrops. The results show that UNetGE can extract approximately 300 to 1300 grains in a few seconds to a few minutes and provide their morphologic parameters, which will assist analyses of the engineering properties of sediments and soils (e.g., permeability, strength, and expansivity) and their hydraulic characteristics.
Affiliation(s)
- Ling Zeng: Geomathematics Key Laboratory of Sichuan Province, Chengdu University of Technology, Chengdu 610059, China
- Tianbin Li: State Key Laboratory of Geohazard Prevention and Geoenvironment Protection, Chengdu University of Technology, Chengdu 610059, China
- Xiekang Wang: State Key Laboratory of Hydraulics and Mountain River Engineering, Sichuan University, Chengdu 610065, China
- Lei Chen: Geomathematics Key Laboratory of Sichuan Province, Chengdu University of Technology, Chengdu 610059, China
- Peng Zeng: State Key Laboratory of Geohazard Prevention and Geoenvironment Protection, Chengdu University of Technology, Chengdu 610059, China
- Jason Scott Herrin: Facility for Analysis Characterization Testing Simulation, Nanyang Technological University, Singapore 639798, Singapore
12. Pei J, Zhou T, Tang H, Liu C, Chen C. FGO-Net: Feature and Gaussian Optimization Network for visual saliency prediction. Appl Intell 2022. [DOI: 10.1007/s10489-022-03647-5]
14. DeepRare: Generic Unsupervised Visual Attention Models. Electronics 2022. [DOI: 10.3390/electronics11111696]
Abstract
Visual attention selects data considered “interesting” by humans, and it is modeled in the field of engineering by feature-engineered methods that find contrasted, surprising, or unusual image data. Deep learning has drastically improved model efficiency on the main benchmark datasets. However, Deep Neural Network-based (DNN-based) models are counterintuitive: surprising or unusual data are by definition difficult to learn because of their low occurrence probability. In practice, DNN-based models mainly learn top-down features such as faces, text, people, or animals, which usually attract human attention, but they have low efficiency in extracting surprising or unusual data from images. In this article, we propose a new family of visual attention models called DeepRare, and especially DeepRare2021 (DR21), which uses the power of DNN feature extraction and the genericity of feature-engineered algorithms. This algorithm is an evolution of a previous version, DeepRare2019 (DR19), based on the same common framework. DR21 (1) does not need any additional training other than the default ImageNet training, (2) is fast even on CPU, and (3) is tested on four very different eye-tracking datasets, showing that DR21 is generic and always within the top models on all datasets and metrics, while no other model exhibits such regularity and genericity. Finally, DR21 (4) is tested with several network architectures such as VGG16 (V16), VGG19 (V19), and MobileNetV2 (MN2), and (5) provides explanation and transparency on which parts of the image are the most surprising at different levels despite the use of a DNN-based feature extractor.
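The rarity principle behind DeepRare can be illustrated with a self-information score over one feature map. The following sketch shows only the general idea under an assumed histogram binning; the published model applies this across many layers of a pretrained CNN and fuses the results.

```python
# Hedged sketch of rarity-based saliency: activation values that occur rarely in a deep
# feature map receive a high self-information score. Single-channel illustration only.
import numpy as np

def rarity_map(feature_map, bins=16):
    """feature_map: 2-D array of activations from one CNN channel."""
    hist, edges = np.histogram(feature_map, bins=bins)
    prob = hist / hist.sum()                               # probability of each bin
    idx = np.clip(np.digitize(feature_map, edges[1:-1]), 0, bins - 1)
    return -np.log(prob[idx] + 1e-8)                       # rare values -> high score

fmap = np.random.randn(56, 56)
print(rarity_map(fmap).shape)                              # (56, 56)
```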
15. Hayes TR, Henderson JM. Meaning maps detect the removal of local semantic scene content but deep saliency models do not. Atten Percept Psychophys 2022; 84:647-654. [PMID: 35138579] [PMCID: PMC11128357] [DOI: 10.3758/s13414-021-02395-x]
Abstract
Meaning mapping uses human raters to estimate different semantic features in scenes and has been a useful tool in demonstrating the important role semantics play in guiding attention. However, recent work has argued that meaning maps do not capture semantic content but, like deep learning models of scene attention, represent only semantically neutral image features. In the present study, we directly tested this hypothesis using a diffeomorphic image transformation that is designed to remove the meaning of an image region while preserving its image features. Specifically, we tested whether meaning maps and three state-of-the-art deep learning models were sensitive to the loss of semantic content in this critical diffeomorphed scene region. The results were clear: meaning maps generated by human raters showed a large decrease in the diffeomorphed scene regions, while all three deep saliency models showed a moderate increase in those regions. These results demonstrate that meaning maps reflect local semantic content in scenes while deep saliency models do something else. We conclude that the meaning mapping approach is an effective tool for estimating semantic content in scenes.
Affiliation(s)
- Taylor R Hayes: Center for Mind and Brain, University of California, Davis, CA, USA
- John M Henderson: Center for Mind and Brain, University of California, Davis, CA, USA; Department of Psychology, University of California, Davis, CA, USA
16. Futagami T, Hayasaka N. Improvement in automatic food region extraction based on saliency detection. International Journal of Food Properties 2022. [DOI: 10.1080/10942912.2022.2055056]
Affiliation(s)
- Takuya Futagami: Department of Engineering Informatics, Osaka Electro-Communication University, Neyagawa, Osaka, Japan
- Noboru Hayasaka: Department of Engineering Informatics, Osaka Electro-Communication University, Neyagawa, Osaka, Japan
17. Object Categorization Capability of Psychological Potential Field in Perceptual Assessment Using Line-Drawing Images. J Imaging 2022; 8:jimaging8040090. [PMID: 35448217] [PMCID: PMC9026922] [DOI: 10.3390/jimaging8040090]
Abstract
Affective/cognitive engineering investigations typically require the quantitative assessment of object perception. Recent research has suggested that certain perceptions of object categorization can be derived from human eye fixations and that color images and line drawings induce similar neural activities. Line drawings contain less information than color images; therefore, line drawings are expected to simplify investigations of object perception. The psychological potential field (PPF) is a psychologically grounded image feature of line drawings. It has been reported that, on the basis of the PPF, general human perception of object categorization may be assessed from the similarity of the PPF to fixation maps (FMs) generated from human eye fixations. However, this may be due to chance, because image features other than the PPF have not been compared with FMs. This study examines the potential and effectiveness of the PPF by comparing its performance with that of other image features in terms of similarity to FMs. The results show that the PPF performs best for assessing the perception of object categorization. In particular, the PPF effectively distinguishes between animal and non-animal targets; however, real-time assessment is difficult.
18. Where Is My Mind (Looking at)? A Study of the EEG–Visual Attention Relationship. Informatics 2022. [DOI: 10.3390/informatics9010026]
Abstract
Visual attention estimation is an active field of research at the crossroads of different disciplines: computer vision, deep learning, and medicine. One of the most common approaches to estimating a saliency map representing attention is based on observed images. In this paper, we show that visual attention can be retrieved from EEG acquisition. The results are comparable to traditional predictions from observed images, which is of great interest. Whereas image-based saliency estimation is participant independent, estimation from EEG could take subject specificity into account. For this purpose, a set of signals was recorded, and different models were developed to study the relationship between visual attention and brain activity. The results are encouraging and comparable with other approaches that estimate attention from other modalities. Being able to predict a visual saliency map from EEG could help research studying the relationship between brain activity and visual attention. It could also help in various applications: vigilance assessment during driving, neuromarketing, and the diagnosis and treatment of diseases related to visual attention. For the sake of reproducibility, the code and dataset considered in this paper have been made publicly available to promote research in the field.
19. Amunts K, DeFelipe J, Pennartz C, Destexhe A, Migliore M, Ryvlin P, Furber S, Knoll A, Bitsch L, Bjaalie JG, Ioannidis Y, Lippert T, Sanchez-Vives MV, Goebel R, Jirsa V. Linking Brain Structure, Activity, and Cognitive Function through Computation. eNeuro 2022; 9:ENEURO.0316-21.2022. [PMID: 35217544] [PMCID: PMC8925650] [DOI: 10.1523/eneuro.0316-21.2022]
Abstract
Understanding the human brain is a "Grand Challenge" for 21st century research. Computational approaches enable large and complex datasets to be addressed efficiently, supported by artificial neural networks, modeling and simulation. Dynamic generative multiscale models, which enable the investigation of causation across scales and are guided by principles and theories of brain function, are instrumental for linking brain structure and function. An example of a resource enabling such an integrated approach to neuroscientific discovery is the BigBrain, which spatially anchors tissue models and data across different scales and ensures that multiscale models are supported by the data, making the bridge to both basic neuroscience and medicine. Research at the intersection of neuroscience, computing and robotics has the potential to advance neuro-inspired technologies by taking advantage of a growing body of insights into perception, plasticity and learning. To render data, tools and methods, theories, basic principles and concepts interoperable, the Human Brain Project (HBP) has launched EBRAINS, a digital neuroscience research infrastructure, which brings together a transdisciplinary community of researchers united by the quest to understand the brain, with fascinating insights and perspectives for societal benefits.
Affiliation(s)
- Katrin Amunts: Institute of Neurosciences and Medicine (INM-1), Research Centre Jülich, Jülich 52425, Germany; C. & O. Vogt Institute for Brain Research, University Hospital Düsseldorf, Heinrich-Heine University Düsseldorf, Düsseldorf 40225, Germany
- Javier DeFelipe: Laboratorio Cajal de Circuitos Corticales, Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Madrid 28223, Spain; Instituto Cajal, Consejo Superior de Investigaciones Científicas (CSIC), Madrid 28002, Spain
- Cyriel Pennartz: Cognitive and Systems Neuroscience Group, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam 1098 XH, The Netherlands
- Alain Destexhe: Centre National de la Recherche Scientifique, Institute of Neuroscience (NeuroPSI), Paris-Saclay University, Gif-sur-Yvette 91400, France
- Michele Migliore: Institute of Biophysics, National Research Council, Palermo 90146, Italy
- Philippe Ryvlin: Department of Clinical Neurosciences, Centre Hospitalier Universitaire Vaudois, Lausanne CH-1011, Switzerland
- Steve Furber: Department of Computer Science, The University of Manchester, Manchester M13 9PL, United Kingdom
- Alois Knoll: Department of Informatics, Technical University of Munich, Garching 85748, Germany
- Lise Bitsch: The Danish Board of Technology Foundation, Copenhagen, 2650 Hvidovre, Denmark
- Jan G Bjaalie: Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway
- Yannis Ioannidis: ATHENA Research & Innovation Center, Athens 12125, Greece; Department of Informatics & Telecom, National and Kapodistrian University of Athens, 157 84 Athens, Greece
- Thomas Lippert: Institute for Advanced Simulation (IAS), Jülich Supercomputing Centre (JSC), Research Centre Jülich, Jülich 52425, Germany
- Maria V Sanchez-Vives: ICREA and Systems Neuroscience, Institute of Biomedical Investigations August Pi i Sunyer, Barcelona 08036, Spain
- Rainer Goebel: Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht 6229 EV, The Netherlands
- Viktor Jirsa: Aix Marseille Université, Institut National de la Santé et de la Recherche Médicale, Institut de Neurosciences des Systèmes (INS) UMR1106, Marseille 13005, France
20. Pedziwiatr MA, Kümmerer M, Wallis TSA, Bethge M, Teufel C. Semantic object-scene inconsistencies affect eye movements, but not in the way predicted by contextualized meaning maps. J Vis 2022; 22:9. [PMID: 35171232] [PMCID: PMC8857618] [DOI: 10.1167/jov.22.2.9]
Abstract
Semantic information is important in eye movement control. An important semantic influence on gaze guidance relates to object-scene relationships: objects that are semantically inconsistent with the scene attract more fixations than consistent objects. One interpretation of this effect is that fixations are driven toward inconsistent objects because they are semantically more informative. We tested this explanation using contextualized meaning maps, a method based on crowd-sourced ratings to quantify the spatial distribution of context-sensitive “meaning” in images. In Experiment 1, we compared gaze data and contextualized meaning maps for images in which object-scene consistency was manipulated. Observers fixated more on inconsistent than on consistent objects. However, contextualized meaning maps did not assign higher meaning to image regions that contained semantic inconsistencies. In Experiment 2, a large number of raters evaluated image regions that were deliberately selected for their content and expected meaningfulness. The results suggest that the same scene locations were experienced as slightly less meaningful when they contained inconsistent rather than consistent objects. In summary, we demonstrated that, in the context of our rating task, semantically inconsistent objects are experienced as less meaningful than their consistent counterparts and that contextualized meaning maps do not capture prototypical influences of image meaning on gaze guidance.
Affiliation(s)
- Marek A Pedziwiatr: Cardiff University, Cardiff University Brain Research Imaging Centre (CUBRIC), School of Psychology, Cardiff, UK; Queen Mary University of London, Department of Biological and Experimental Psychology, London, UK
- Thomas S A Wallis: Technical University of Darmstadt, Institute for Psychology and Centre for Cognitive Science, Darmstadt, Germany
- Christoph Teufel: Cardiff University, Cardiff University Brain Research Imaging Centre (CUBRIC), School of Psychology, Cardiff, UK
21. Review of Visual Saliency Prediction: Development Process from Neurobiological Basis to Deep Models. Applied Sciences (Basel) 2021. [DOI: 10.3390/app12010309]
Abstract
The human attention mechanism can be understood and simulated by closely associating the saliency prediction task with neuroscience and psychology. Furthermore, saliency prediction is widely used in computer vision and interdisciplinary subjects. In recent years, with the rapid development of deep learning, deep models have made remarkable achievements in saliency prediction. Deep learning models can automatically learn features, thus overcoming many drawbacks of classic models, such as handcrafted features and fixed task settings. Nevertheless, deep models still have some limitations, for example in tasks involving multi-modality and semantic understanding. This study summarizes the relevant achievements in the field of saliency prediction, including the early neurological and psychological mechanisms and the guiding role of classic models, followed by the development process and data comparison of classic and deep saliency prediction models. This study also discusses the relationship between models and human vision, the factors that cause semantic gaps, the influence of attention in cognitive research, the limitations of saliency models, and emerging applications, in order to provide direction, help, and advice for follow-up work on saliency prediction.
22. Luna-Jiménez C, Griol D, Callejas Z, Kleinlein R, Montero JM, Fernández-Martínez F. Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning. Sensors 2021; 21:s21227665. [PMID: 34833739] [PMCID: PMC8618559] [DOI: 10.3390/s21227665]
Abstract
Emotion Recognition is attracting the attention of the research community due to the multiple areas where it can be applied, such as in healthcare or in road safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, more specifically, embedding extraction and Fine-Tuning. The best accuracy results were achieved when we fine-tuned the CNN-14 of the PANNs framework, confirming that the training was more robust when it did not start from scratch and the tasks were similar. Regarding the facial emotion recognizers, we propose a framework that consists of a pre-trained Spatial Transformer Network on saliency maps and facial images followed by a bi-LSTM with an attention mechanism. The error analysis reported that the frame-based systems could present some problems when they were used directly to solve a video-based task despite the domain adaptation, which opens a new line of research to discover new ways to correct this mismatch and take advantage of the embedded knowledge of these pre-trained models. Finally, from the combination of these two modalities with a late fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset on a subject-wise 5-CV evaluation, classifying eight emotions. The results revealed that these modalities carry relevant information to detect users’ emotional state and their combination enables improvement of system performance.
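The late-fusion step at the end of this pipeline is straightforward to sketch. The snippet below combines per-class posteriors from the two modalities with an assumed fixed weight; the actual fusion rule and the upstream speech and facial models are not reproduced.

```python
# Simple sketch of late fusion: per-class scores from a speech model and a facial model
# are combined after each has made its own prediction. Weighting is illustrative.
import numpy as np

def late_fusion(speech_probs, face_probs, w_speech=0.5):
    """speech_probs, face_probs: (n_classes,) posterior probabilities from each modality."""
    fused = w_speech * speech_probs + (1.0 - w_speech) * face_probs
    return int(np.argmax(fused)), fused

speech = np.array([0.1, 0.7, 0.2])   # e.g., probabilities over 3 emotion classes
face = np.array([0.2, 0.5, 0.3])
print(late_fusion(speech, face))     # predicted class index and fused scores
```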
Affiliation(s)
- Cristina Luna-Jiménez: Grupo de Tecnología del Habla y Aprendizaje Automático (THAU Group), Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid, Avda. Complutense 30, 28040 Madrid, Spain
- David Griol: Department of Software Engineering, CITIC-UGR, University of Granada, Periodista Daniel Saucedo Aranda S/N, 18071 Granada, Spain
- Zoraida Callejas: Department of Software Engineering, CITIC-UGR, University of Granada, Periodista Daniel Saucedo Aranda S/N, 18071 Granada, Spain
- Ricardo Kleinlein: Grupo de Tecnología del Habla y Aprendizaje Automático (THAU Group), Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid, Avda. Complutense 30, 28040 Madrid, Spain
- Juan M. Montero: Grupo de Tecnología del Habla y Aprendizaje Automático (THAU Group), Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid, Avda. Complutense 30, 28040 Madrid, Spain
- Fernando Fernández-Martínez: Grupo de Tecnología del Habla y Aprendizaje Automático (THAU Group), Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid, Avda. Complutense 30, 28040 Madrid, Spain
23. Deng X, Zhang Z. Sparsity-control ternary weight networks. Neural Netw 2021; 145:221-232. [PMID: 34773898] [DOI: 10.1016/j.neunet.2021.10.018]
Abstract
Deep neural networks (DNNs) have been widely and successfully applied to various applications, but they require large amounts of memory and computational power, which severely restricts their deployment on resource-limited devices. To address this issue, many efforts have been made on training low-bit-weight DNNs. In this paper, we focus on training ternary weight {-1, 0, +1} networks, which can avoid multiplications and dramatically reduce memory and computation requirements. A ternary weight network can be considered a sparser version of its binary weight counterpart, obtained by replacing some -1s or 1s in the binary weights with 0s, thus leading to more efficient inference but more memory cost. However, existing approaches to training ternary weight networks cannot control the sparsity (i.e., the percentage of 0s) of the ternary weights, which undermines the advantage of ternary weights. In this paper, we propose, to the best of our knowledge, the first sparsity-control approach (SCA) for training ternary weight networks, which is achieved simply by a weight discretization regularizer (WDR). SCA differs from all existing regularizer-based approaches in that it can control the sparsity of the ternary weights through a controller α and does not rely on gradient estimators. We theoretically and empirically show that the sparsity of the trained ternary weights is positively related to α. SCA is extremely simple, easy to implement, and is shown to consistently outperform state-of-the-art approaches significantly over several benchmark datasets and even matches the performance of full-precision weight counterparts.
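As a rough illustration of the regularizer family involved, the sketch below penalizes weights that are far from {-1, 0, +1}. It is explicitly not the paper's WDR: the exact form of the penalty and the mechanism by which the controller α sets the sparsity level are not reproduced here.

```python
# Generic weight-discretization penalty that is zero exactly when every weight lies in
# {-1, 0, +1}. Illustration of the regularizer family only; NOT the paper's WDR, and the
# alpha-based sparsity control described in the abstract is not implemented.
import torch

def ternary_discretization_penalty(model):
    reg = 0.0
    for p in model.parameters():
        if p.dim() > 1:                      # typically applied to weight matrices only
            w = p.flatten()
            reg = reg + (w.pow(2) * (w - 1).pow(2) * (w + 1).pow(2)).sum()
    return reg

# usage: total_loss = task_loss + lam * ternary_discretization_penalty(model)
```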
Affiliation(s)
- Xiang Deng: State University of New York at Binghamton, Binghamton, NY, United States
- Zhongfei Zhang: State University of New York at Binghamton, Binghamton, NY, United States
24.
Abstract
In this work, we propose a 3D fully convolutional architecture for video saliency prediction that employs hierarchical supervision on intermediate maps (referred to as conspicuity maps) generated using features extracted at different abstraction levels. We provide the base hierarchical learning mechanism with two techniques for domain adaptation and domain-specific learning. For the former, we encourage the model to unsupervisedly learn hierarchical general features using gradient reversal at multiple scales, to enhance generalization capabilities on datasets for which no annotations are provided during training. As for domain specialization, we employ domain-specific operations (namely, priors, smoothing and batch normalization) by specializing the learned features on individual datasets in order to maximize performance. The results of our experiments show that the proposed model yields state-of-the-art accuracy on supervised saliency prediction. When the base hierarchical model is empowered with domain-specific modules, performance improves, outperforming state-of-the-art models on three out of five metrics on the DHF1K benchmark and reaching the second-best results on the other two. When, instead, we test it in an unsupervised domain adaptation setting, by enabling hierarchical gradient reversal layers, we obtain performance comparable to supervised state-of-the-art. Source code, trained models and example outputs are publicly available at https://github.com/perceivelab/hd2s.
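The unsupervised domain-adaptation ingredient, gradient reversal, is easy to sketch in isolation. The layer below is a standard gradient reversal layer; its placement at multiple scales, the conspicuity decoders, and the domain classifier itself are not shown.

```python
# Minimal sketch of a gradient reversal layer (GRL): identity in the forward pass,
# gradient flipped (and scaled) in the backward pass so features become
# domain-indistinguishable when a domain classifier is attached.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# usage: domain_logits = domain_classifier(grad_reverse(features, lam=0.5))
```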
25. Malladi SPK, Mukhopadhyay J, Larabi C, Chaudhury S. Lighter and Faster Cross-Concatenated Multi-Scale Residual Block Based Network for Visual Saliency Prediction. 2021 IEEE International Conference on Image Processing (ICIP) 2021. [DOI: 10.1109/icip42928.2021.9506710]
Affiliation(s)
- Jayanta Mukhopadhyay: IIT Kharagpur, Visual Information Processing Lab, Dept. of Computer Science & Engg., India
26. Hayes TR, Henderson JM. Deep saliency models learn low-, mid-, and high-level features to predict scene attention. Sci Rep 2021; 11:18434. [PMID: 34531484] [PMCID: PMC8445969] [DOI: 10.1038/s41598-021-97879-z]
Abstract
Deep saliency models represent the current state-of-the-art for predicting where humans look in real-world scenes. However, for deep saliency models to inform cognitive theories of attention, we need to know how deep saliency models prioritize different scene features to predict where people look. Here we open the black box of three prominent deep saliency models (MSI-Net, DeepGaze II, and SAM-ResNet) using an approach that models the association between attention, deep saliency model output, and low-, mid-, and high-level scene features. Specifically, we measured the association between each deep saliency model and low-level image saliency, mid-level contour symmetry and junctions, and high-level meaning by applying a mixed effects modeling approach to a large eye movement dataset. We found that all three deep saliency models were most strongly associated with high-level and low-level features, but exhibited qualitatively different feature weightings and interaction patterns. These findings suggest that prominent deep saliency models are primarily learning image features associated with high-level scene meaning and low-level image saliency and highlight the importance of moving beyond simply benchmarking performance.
Affiliation(s)
- Taylor R Hayes: Center for Mind and Brain, University of California, Davis, 95618, USA
- John M Henderson: Center for Mind and Brain, University of California, Davis, 95618, USA; Department of Psychology, University of California, Davis, 95616, USA
28. Predicting atypical visual saliency for autism spectrum disorder via scale-adaptive inception module and discriminative region enhancement loss. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.06.125]
29.
Abstract
Spatial Transformer Networks are considered a powerful algorithm for learning the main areas of an image, but they could still be more efficient if they received images with embedded expert knowledge. This paper aims to improve the performance of conventional Spatial Transformers when applied to Facial Expression Recognition. Based on the Spatial Transformers' capacity for spatial manipulation within networks, we propose different extensions to these models in which effective attentional regions are captured using facial landmarks or facial visual saliency maps. This specific attentional information is then hardcoded to guide the Spatial Transformers to learn the spatial transformations that best fit the proposed regions for better recognition results. For this study, we use two datasets: AffectNet and FER-2013. For AffectNet, we achieve a 0.35 percentage point absolute improvement relative to the traditional Spatial Transformer, whereas for FER-2013 our solution achieves an increase of 1.49% when models are fine-tuned with the AffectNet pre-trained weights.
30. Svanera M, Morgan AT, Petro LS, Muckli L. A self-supervised deep neural network for image completion resembles early visual cortex fMRI activity patterns for occluded scenes. J Vis 2021; 21:5. [PMID: 34259828] [PMCID: PMC8288063] [DOI: 10.1167/jov.21.7.5]
Abstract
The promise of artificial intelligence in understanding biological vision relies on the comparison of computational models with brain data with the goal of capturing functional principles of visual information processing. Convolutional neural networks (CNN) have successfully matched the transformations in hierarchical processing occurring along the brain's feedforward visual pathway, extending into ventral temporal cortex. However, we are still to learn if CNNs can successfully describe feedback processes in early visual cortex. Here, we investigated similarities between human early visual cortex and a CNN with encoder/decoder architecture, trained with self-supervised learning to fill occlusions and reconstruct an unseen image. Using representational similarity analysis (RSA), we compared 3T functional magnetic resonance imaging (fMRI) data from a nonstimulated patch of early visual cortex in human participants viewing partially occluded images, with the different CNN layer activations from the same images. Results show that our self-supervised image-completion network outperforms a classical object-recognition supervised network (VGG16) in terms of similarity to fMRI data. This work provides additional evidence that optimal models of the visual system might come from less feedforward architectures trained with less supervision. We also find that CNN decoder pathway activations are more similar to brain processing compared to encoder activations, suggesting an integration of mid- and low/middle-level features in early visual cortex. Challenging an artificial intelligence model to learn natural image representations via self-supervised learning and comparing them with brain data can help us to constrain our understanding of information processing, such as neuronal predictive coding.
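The representational similarity analysis used for the comparison can be sketched with generic placeholders: build a representational dissimilarity matrix (RDM) for the brain responses and for one network layer over the same images, then correlate their upper triangles. The data below are random stand-ins, not the study's fMRI or CNN activations.

```python
# Sketch of representational similarity analysis (RSA): compare the pattern of
# pairwise dissimilarities across images in brain data and in one CNN layer.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def rdm(patterns):
    """patterns: (n_images, n_units) response matrix -> (n_images, n_images) RDM."""
    return squareform(pdist(patterns, metric="correlation"))

def rsa_score(brain_patterns, layer_patterns):
    a, b = rdm(brain_patterns), rdm(layer_patterns)
    iu = np.triu_indices_from(a, k=1)               # compare upper triangles only
    return spearmanr(a[iu], b[iu]).correlation

brain = np.random.rand(24, 500)      # 24 images x 500 voxels (placeholder data)
layer = np.random.rand(24, 2048)     # 24 images x 2048 CNN units (placeholder data)
print(rsa_score(brain, layer))
```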
Affiliation(s)
- Michele Svanera: Centre for Cognitive Neuroimaging, Institute of Neuroscience and Psychology, University of Glasgow, UK
- Andrew T Morgan: Centre for Cognitive Neuroimaging, Institute of Neuroscience and Psychology, University of Glasgow, UK
- Lucy S Petro: Centre for Cognitive Neuroimaging, Institute of Neuroscience and Psychology, University of Glasgow, UK
- Lars Muckli: Centre for Cognitive Neuroimaging, Institute of Neuroscience and Psychology, University of Glasgow, UK
31. Hierarchical Multimodal Adaptive Fusion (HMAF) Network for Prediction of RGB-D Saliency. Comput Intell Neurosci 2020; 2020:8841681. [PMID: 33293945] [PMCID: PMC7700038] [DOI: 10.1155/2020/8841681]
Abstract
Visual saliency prediction for RGB-D images is more challenging than for their RGB counterparts, and very few investigations have addressed RGB-D saliency prediction. This study presents a method based on a hierarchical multimodal adaptive fusion (HMAF) network for end-to-end prediction of RGB-D saliency. In the proposed method, hierarchical (multilevel) multimodal features are first extracted from an RGB image and a depth map using a VGG-16-based two-stream network. Subsequently, the most significant hierarchical features of the RGB image and depth map are predicted using three two-input attention modules. Furthermore, adaptive fusion of the resulting fused saliency features from different levels (hierarchical fusion saliency features) is accomplished using a three-input attention module to facilitate high-accuracy RGB-D visual saliency prediction. Comparisons of the proposed HMAF-based approach against other state-of-the-art techniques on two challenging RGB-D datasets demonstrate that the proposed method consistently outperforms competing approaches by a considerable margin.