1
Vogelsang M, Vogelsang L, Gupta P, Gandhi TK, Shah P, Swami P, Gilad-Gutnick S, Ben-Ami S, Diamond S, Ganesh S, Sinha P. Impact of early visual experience on later usage of color cues. Science 2024; 384:907-912. PMID: 38781366; DOI: 10.1126/science.adk9587.
Abstract
Human visual recognition is remarkably robust to chromatic changes. In this work, we provide a potential account of the roots of this resilience based on observations with 10 congenitally blind children who gained sight late in life. Several months or years following their sight-restoring surgeries, the removal of color cues markedly reduced their recognition performance, whereas age-matched normally sighted children showed no such decrement. This finding may be explained by the greater-than-neonatal maturity of the late-sighted children's color system at sight onset, inducing overly strong reliance on chromatic cues. Simulations with deep neural networks corroborate this hypothesis. These findings highlight the adaptive significance of typical developmental trajectories and provide guidelines for enhancing machine vision systems.
Affiliation(s)
- Marin Vogelsang
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Institute of Cognitive Science, University of Osnabrueck, 49090 Osnabrueck, Germany
- Lukas Vogelsang
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Brain Mind Institute, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
- Priti Gupta
- Amarnath and Shashi Khosla School of Information Technology, Indian Institute of Technology, New Delhi 110016, India
- Project Prakash, Dr. Shroff's Charity Eye Hospital, New Delhi 110002, India
- Cognitive Science Programme, Dayalbagh Educational Institute, Agra 282005, India
- Tapan K Gandhi
- Department of Electrical Engineering, Indian Institute of Technology, New Delhi 110016, India
- Pragya Shah
- Project Prakash, Dr. Shroff's Charity Eye Hospital, New Delhi 110002, India
- Piyush Swami
- Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
- Danish Research Centre for Magnetic Resonance, Centre for Functional and Diagnostic Imaging and Research, Copenhagen University Hospital - Amager and Hvidovre, 2650 Hvidovre, Denmark
- Sharon Gilad-Gutnick
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Shlomit Ben-Ami
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Sidney Diamond
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Suma Ganesh
- Department of Pediatric Ophthalmology, Dr. Shroff's Charity Eye Hospital, New Delhi 110002, India
- Pawan Sinha
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
2
Zhu L, Wang JZ, Lee W, Wyble B. Incorporating simulated spatial context information improves the effectiveness of contrastive learning models. Patterns 2024; 5:100964. PMID: 38800363; PMCID: PMC11117056; DOI: 10.1016/j.patter.2024.100964.
Abstract
Visual learning often occurs in a specific context, where an agent acquires skills through exploration and tracking of its location in a consistent environment. The historical spatial context of the agent provides a similarity signal for self-supervised contrastive learning. We present a unique approach, termed environmental spatial similarity (ESS), that complements existing contrastive learning methods. Using images from simulated, photorealistic environments as an experimental setting, we demonstrate that ESS outperforms traditional instance discrimination approaches. Moreover, sampling additional data from the same environment substantially improves accuracy and provides new augmentations. ESS allows remarkable proficiency in room classification and spatial prediction tasks, especially in unfamiliar environments. This learning paradigm has the potential to enable rapid visual learning in agents operating in new environments with unique visual characteristics. Potentially transformative applications span from robotics to space exploration. Our proof of concept demonstrates improved efficiency over methods that rely on extensive, disconnected datasets.
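The core idea in this abstract, using the agent's location history as a similarity signal, can be illustrated with a minimal sketch. This is not the paper's ESS implementation; the function name and the distance threshold are made up for illustration. Views recorded at nearby positions in the same environment are treated as positive pairs for contrastive learning.

```python
import math

def spatial_positive_pairs(samples, radius):
    """Pair up observations taken from nearby agent positions.

    samples: list of (image_id, (x, y)) tuples recorded as the agent
    explores one environment; radius: distance below which two views
    are treated as a positive pair for contrastive learning.
    (Illustrative sketch only, not the paper's ESS method.)
    """
    pairs = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            (id_a, pos_a), (id_b, pos_b) = samples[i], samples[j]
            if math.dist(pos_a, pos_b) <= radius:
                pairs.append((id_a, id_b))
    return pairs

views = [("img0", (0.0, 0.0)), ("img1", (0.5, 0.0)), ("img2", (5.0, 5.0))]
print(spatial_positive_pairs(views, radius=1.0))  # [('img0', 'img1')]
```

A contrastive loss would then pull the embeddings of each listed pair together while pushing apart views recorded far from one another.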
Affiliation(s)
- Lizhen Zhu
- Data Science and Artificial Intelligence Area, College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA, USA
- James Z. Wang
- Data Science and Artificial Intelligence Area, College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA, USA
- Human-Computer Interaction Area, College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA, USA
- Department of Communication and Media, School of Social Sciences and Humanities, Loughborough University, Loughborough, Leicestershire, UK
- Wonseuk Lee
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
- Brad Wyble
- Department of Psychology, The Pennsylvania State University, University Park, PA, USA
3
Hosseini EA, Schrimpf M, Zhang Y, Bowman S, Zaslavsky N, Fedorenko E. Artificial Neural Network Language Models Predict Human Brain Responses to Language Even After a Developmentally Realistic Amount of Training. Neurobiology of Language 2024; 5:43-63. PMID: 38645622; PMCID: PMC11025646; DOI: 10.1162/nol_a_00137.
Abstract
Artificial neural networks have emerged as computationally plausible models of human language processing. A major criticism of these models is that the amount of training data they receive far exceeds that of humans during language learning. Here, we use two complementary approaches to ask how the models' ability to capture human fMRI responses to sentences is affected by the amount of training data. First, we evaluate GPT-2 models trained on 1 million, 10 million, 100 million, or 1 billion words against an fMRI benchmark. We consider the 100-million-word model to be developmentally plausible in terms of the amount of training data given that this amount is similar to what children are estimated to be exposed to during the first 10 years of life. Second, we test the performance of a GPT-2 model trained on a 9-billion-token dataset to reach state-of-the-art next-word prediction performance on the human benchmark at different stages during training. Across both approaches, we find that (i) the models trained on a developmentally plausible amount of data already achieve near-maximal performance in capturing fMRI responses to sentences. Further, (ii) lower perplexity, a measure of next-word prediction performance, is associated with stronger alignment with human data, suggesting that models that have received enough training to achieve sufficiently high next-word prediction performance also acquire representations of sentences that are predictive of human fMRI responses. In tandem, these findings establish that although some training is necessary for the models' predictive ability, a developmentally realistic amount of training (∼100 million words) may suffice.
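The perplexity measure referenced in this abstract has a simple definition: the exponential of the mean negative log-likelihood per token. A minimal sketch (illustrative only, not the paper's evaluation code):

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities:
    exp of the mean negative log-likelihood. Lower is better."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model assigning every token probability 0.25 has perplexity 4,
# i.e., it is as uncertain as a uniform choice among 4 tokens.
print(perplexity([math.log(0.25)] * 10))  # ≈ 4.0
```

Under this definition, the paper's finding is that models whose perplexity on held-out text is lower tend to align more strongly with the fMRI benchmark.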
Affiliation(s)
- Eghbal A. Hosseini
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- Martin Schrimpf
- The MIT Quest for Intelligence Initiative, Cambridge, MA, USA
- Swiss Federal Institute of Technology, Lausanne, Switzerland
- Yian Zhang
- Computer Science Department, Stanford University, Stanford, CA, USA
- Samuel Bowman
- Center for Data Science, New York University, New York, NY, USA
- Department of Linguistics, New York University, New York, NY, USA
- Department of Computer Science, New York University, New York, NY, USA
- Noga Zaslavsky
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- K. Lisa Yang Integrative Computational Neuroscience (ICoN) Center, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Language Science, University of California, Irvine, CA, USA
- Evelina Fedorenko
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- The MIT Quest for Intelligence Initiative, Cambridge, MA, USA
- Speech and Hearing Bioscience and Technology Program, Harvard University, Boston, MA, USA
4
Brady TF, Störmer VS. Comparing memory capacity across stimuli requires maximally dissimilar foils: Using deep convolutional neural networks to understand visual working memory capacity for real-world objects. Mem Cognit 2024; 52:595-609. PMID: 37973770; DOI: 10.3758/s13421-023-01485-5.
Abstract
The capacity of visual working and visual long-term memory plays a critical role in theories of cognitive architecture and the relationship between memory and other cognitive systems. Here, we argue that before asking the question of how capacity varies across different stimuli or what the upper bound of capacity is for a given memory system, it is necessary to establish a methodology that allows a fair comparison between distinct stimulus sets and conditions. One of the most important factors determining performance in a memory task is target/foil dissimilarity. We argue that only by maximizing the dissimilarity of the target and foil in each stimulus set can we provide a fair basis for memory comparisons between stimuli. In the current work we focus on a way to pick such foils objectively for complex, meaningful real-world objects by using deep convolutional neural networks, and we validate this using both memory tests and similarity metrics. Using this method, we then provide evidence that there is a greater capacity for real-world objects relative to simple colors in visual working memory; critically, we also show that this difference can be reduced or eliminated when non-comparable foils are used, potentially explaining why previous work has not always found such a difference. Our study thus demonstrates that working memory capacity depends on the type of information that is remembered and that assessing capacity depends critically on foil dissimilarity, especially when comparing memory performance and other cognitive systems across different stimulus sets.
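The foil-selection principle described here, choosing the candidate most dissimilar from the target in an embedding space, can be sketched as follows. Toy vectors stand in for the DCNN-layer embeddings the paper uses, and the function name is illustrative:

```python
def most_dissimilar_foil(target_vec, candidate_vecs):
    """Return the index of the candidate embedding farthest (squared
    Euclidean distance) from the target embedding. In the paper's
    spirit, embeddings would come from a DCNN layer; here they are
    toy feature vectors."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return max(range(len(candidate_vecs)),
               key=lambda i: sqdist(target_vec, candidate_vecs[i]))

target = [1.0, 0.0, 0.0]
foils = [[0.9, 0.1, 0.0],   # near-duplicate of the target: a poor foil
         [0.0, 1.0, 0.0],
         [-1.0, 0.0, 1.0]]  # farthest from the target: the fair foil
print(most_dissimilar_foil(target, foils))  # 2
```

Applying the same maximization rule to every stimulus set equates target/foil dissimilarity across sets, which is the precondition the authors argue any capacity comparison requires.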
Affiliation(s)
- Timothy F Brady
- Department of Psychology, University of California San Diego, La Jolla, CA, 92093, USA.
- Viola S Störmer
- Department of Psychological and Brain Sciences, Dartmouth College, Hanover, NH, USA
5
Loke J, Seijdel N, Snoek L, Sörensen LKA, van de Klundert R, van der Meer M, Quispel E, Cappaert N, Scholte HS. Human Visual Cortex and Deep Convolutional Neural Network Care Deeply about Object Background. J Cogn Neurosci 2024; 36:551-566. PMID: 38165735; DOI: 10.1162/jocn_a_02098.
Abstract
Deep convolutional neural networks (DCNNs) are able to partially predict brain activity during object categorization tasks, but factors contributing to this predictive power are not fully understood. Our study aimed to investigate the factors contributing to the predictive power of DCNNs in object categorization tasks. We compared the activity of four DCNN architectures with EEG recordings obtained from 62 human participants during an object categorization task. Previous physiological studies on object categorization have highlighted the importance of figure-ground segregation-the ability to distinguish objects from their backgrounds. Therefore, we investigated whether figure-ground segregation could explain the predictive power of DCNNs. Using a stimulus set consisting of identical target objects embedded in different backgrounds, we examined the influence of object background versus object category within both EEG and DCNN activity. Crucially, the recombination of naturalistic objects and experimentally controlled backgrounds creates a challenging and naturalistic task, while retaining experimental control. Our results showed that early EEG activity (< 100 msec) and early DCNN layers represent object background rather than object category. We also found that the ability of DCNNs to predict EEG activity is primarily influenced by how both systems process object backgrounds, rather than object categories. We demonstrated the role of figure-ground segregation as a potential prerequisite for recognition of object features, by contrasting the activations of trained and untrained (i.e., random weights) DCNNs. These findings suggest that both human visual cortex and DCNNs prioritize the segregation of object backgrounds and target objects to perform object categorization. Altogether, our study provides new insights into the mechanisms underlying object categorization as we demonstrated that both human visual cortex and DCNNs care deeply about object background.
6
Tuckute G, Feather J, Boebinger D, McDermott JH. Many but not all deep neural network audio models capture brain responses and exhibit correspondence between model stages and brain regions. PLoS Biol 2023; 21:e3002366. PMID: 38091351; PMCID: PMC10718467; DOI: 10.1371/journal.pbio.3002366.
Abstract
Models that predict brain responses to stimuli provide one measure of understanding of a sensory system and have many potential applications in science and engineering. Deep artificial neural networks have emerged as the leading such predictive models of the visual system but are less explored in audition. Prior work provided examples of audio-trained neural networks that produced good predictions of auditory cortical fMRI responses and exhibited correspondence between model stages and brain regions, but left it unclear whether these results generalize to other neural network models and, thus, how to further improve models in this domain. We evaluated model-brain correspondence for publicly available audio neural network models along with in-house models trained on 4 different tasks. Most tested models outpredicted standard spectrotemporal filter-bank models of auditory cortex and exhibited systematic model-brain correspondence: Middle stages best predicted primary auditory cortex, while deep stages best predicted non-primary cortex. However, some state-of-the-art models produced substantially worse brain predictions. Models trained to recognize speech in background noise produced better brain predictions than models trained to recognize speech in quiet, potentially because hearing in noise imposes constraints on biological auditory representations. The training task influenced the prediction quality for specific cortical tuning properties, with best overall predictions resulting from models trained on multiple tasks. The results generally support the promise of deep neural networks as models of audition, though they also indicate that current models do not explain auditory cortical responses in their entirety.
Affiliation(s)
- Greta Tuckute
- Department of Brain and Cognitive Sciences, McGovern Institute for Brain Research, MIT, Cambridge, Massachusetts, United States of America
- Center for Brains, Minds, and Machines, MIT, Cambridge, Massachusetts, United States of America
- Jenelle Feather
- Department of Brain and Cognitive Sciences, McGovern Institute for Brain Research, MIT, Cambridge, Massachusetts, United States of America
- Center for Brains, Minds, and Machines, MIT, Cambridge, Massachusetts, United States of America
- Dana Boebinger
- Department of Brain and Cognitive Sciences, McGovern Institute for Brain Research, MIT, Cambridge, Massachusetts, United States of America
- Center for Brains, Minds, and Machines, MIT, Cambridge, Massachusetts, United States of America
- Program in Speech and Hearing Biosciences and Technology, Harvard, Cambridge, Massachusetts, United States of America
- University of Rochester Medical Center, Rochester, New York, United States of America
- Josh H. McDermott
- Department of Brain and Cognitive Sciences, McGovern Institute for Brain Research, MIT, Cambridge, Massachusetts, United States of America
- Center for Brains, Minds, and Machines, MIT, Cambridge, Massachusetts, United States of America
- Program in Speech and Hearing Biosciences and Technology, Harvard, Cambridge, Massachusetts, United States of America
7
Singer Y, Taylor L, Willmore BDB, King AJ, Harper NS. Hierarchical temporal prediction captures motion processing along the visual pathway. eLife 2023; 12:e52599. PMID: 37844199; PMCID: PMC10629830; DOI: 10.7554/eLife.52599.
Abstract
Visual neurons respond selectively to features that become increasingly complex from the eyes to the cortex. Retinal neurons prefer flashing spots of light, primary visual cortical (V1) neurons prefer moving bars, and those in higher cortical areas favor complex features like moving textures. Previously, we showed that V1 simple cell tuning can be accounted for by a basic model implementing temporal prediction - representing features that predict future sensory input from past input (Singer et al., 2018). Here, we show that hierarchical application of temporal prediction can capture how tuning properties change across at least two levels of the visual system. This suggests that the brain does not efficiently represent all incoming information; instead, it selectively represents sensory inputs that help in predicting the future. When applied hierarchically, temporal prediction extracts time-varying features that depend on increasingly high-level statistics of the sensory input.
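The temporal-prediction objective described here, representing features that predict future sensory input from past input, reduces in its simplest one-unit form to fitting a single predictive weight by least squares. The sketch below is illustrative only and far simpler than the paper's hierarchical model:

```python
def fit_temporal_predictor(series):
    """Least-squares weight w minimizing sum over t of
    (x[t+1] - w * x[t])**2: the simplest instance of learning to
    predict future input from past input."""
    num = sum(series[t] * series[t + 1] for t in range(len(series) - 1))
    den = sum(series[t] ** 2 for t in range(len(series) - 1))
    return num / den

# Data generated by x[t+1] = 0.8 * x[t]; the fitted weight recovers 0.8.
xs = [1.0]
for _ in range(20):
    xs.append(0.8 * xs[-1])
print(fit_temporal_predictor(xs))  # ≈ 0.8
```

Stacking such predictive stages, each trained to predict the future of the stage below, is the hierarchical application the abstract refers to.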
Affiliation(s)
- Yosef Singer
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
- Luke Taylor
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
- Ben DB Willmore
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
- Andrew J King
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
- Nicol S Harper
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
8
Taylor J, Kriegeskorte N. Extracting and visualizing hidden activations and computational graphs of PyTorch models with TorchLens. Sci Rep 2023; 13:14375. PMID: 37658079; PMCID: PMC10474256; DOI: 10.1038/s41598-023-40807-0.
Abstract
Deep neural network models (DNNs) are essential to modern AI and provide powerful models of information processing in biological neural networks. Researchers in both neuroscience and engineering are pursuing a better understanding of the internal representations and operations that undergird the successes and failures of DNNs. Neuroscientists additionally evaluate DNNs as models of brain computation by comparing their internal representations to those found in brains. It is therefore essential to have a method to easily and exhaustively extract and characterize the results of the internal operations of any DNN. Many models are implemented in PyTorch, the leading framework for building DNN models. Here we introduce TorchLens, a new open-source Python package for extracting and characterizing hidden-layer activations in PyTorch models. Uniquely among existing approaches to this problem, TorchLens has the following features: (1) it exhaustively extracts the results of all intermediate operations, not just those associated with PyTorch module objects, yielding a full record of every step in the model's computational graph, (2) it provides an intuitive visualization of the model's complete computational graph along with metadata about each computational step in a model's forward pass for further analysis, (3) it contains a built-in validation procedure to algorithmically verify the accuracy of all saved hidden-layer activations, and (4) the approach it uses can be automatically applied to any PyTorch model with no modifications, including models with conditional (if-then) logic in their forward pass, recurrent models, branching models where layer outputs are fed into multiple subsequent layers in parallel, and models with internally generated tensors (e.g., injections of noise). Furthermore, using TorchLens requires minimal additional code, making it easy to incorporate into existing pipelines for model development and analysis, and useful as a pedagogical aid when teaching deep learning concepts. We hope this contribution will help researchers in AI and neuroscience understand the internal representations of DNNs.
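The core idea, exhaustively logging every intermediate result of a forward pass, can be mimicked in plain Python. The sketch below is illustrative only and does not use the TorchLens API; here a toy "model" is simply a sequence of named operations:

```python
def log_forward_pass(steps, x):
    """Record the output of every intermediate operation in a toy
    'forward pass'. This mimics the idea behind TorchLens (a full
    record of each step in the computational graph), but is a
    plain-Python sketch, not the TorchLens API.

    steps: list of (name, function) pairs applied in sequence.
    Returns (final_output, {name: intermediate_value}).
    """
    activations = {}
    for name, fn in steps:
        x = fn(x)
        activations[name] = x
    return x, activations

toy_model = [("double", lambda v: 2 * v),
             ("add_one", lambda v: v + 1),
             ("square", lambda v: v * v)]
out, acts = log_forward_pass(toy_model, 3)
print(out, acts)  # 49 {'double': 6, 'add_one': 7, 'square': 49}
```

The real package must additionally handle branching, recurrence, and operations not tied to module objects, which is what distinguishes it from simple sequential logging like this.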
Affiliation(s)
- JohnMark Taylor
- Zuckerman Mind Brain Behavior Institute, Columbia University, 3227 Broadway, New York, NY, 10027, USA
- Nikolaus Kriegeskorte
- Zuckerman Mind Brain Behavior Institute, Columbia University, 3227 Broadway, New York, NY, 10027, USA
9
Schütt HH, Kipnis AD, Diedrichsen J, Kriegeskorte N. Statistical inference on representational geometries. eLife 2023; 12:e82566. PMID: 37610302; PMCID: PMC10446828; DOI: 10.7554/eLife.82566.
Abstract
Neuroscience has recently made much progress, expanding the complexity of both neural activity measurements and brain-computational models. However, we lack robust methods for connecting theory and experiment by evaluating our new big models with our new big data. Here, we introduce new inference methods enabling researchers to evaluate and compare models based on the accuracy of their predictions of representational geometries: A good model should accurately predict the distances among the neural population representations (e.g. of a set of stimuli). Our inference methods combine novel 2-factor extensions of cross-validation (to prevent overfitting to either subjects or conditions from inflating our estimates of model accuracy) and bootstrapping (to enable inferential model comparison with simultaneous generalization to both new subjects and new conditions). We validate the inference methods on data where the ground-truth model is known, by simulating data with deep neural networks and by resampling of calcium-imaging and functional MRI data. Results demonstrate that the methods are valid and conclusions generalize correctly. These data analysis methods are available in an open-source Python toolbox (rsatoolbox.readthedocs.io).
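The kind of model evaluation described here, scoring a model by how well it predicts the distances among neural population representations, can be illustrated with a bare-bones sketch. This is plain Python, not the rsatoolbox API, and it omits the paper's cross-validated distances and bootstrap inference: build each system's representational dissimilarity vector and correlate them.

```python
def rdm(responses):
    """Upper-triangle representational dissimilarity vector: squared
    Euclidean distances between all pairs of condition responses."""
    n = len(responses)
    return [sum((a - b) ** 2 for a, b in zip(responses[i], responses[j]))
            for i in range(n) for j in range(i + 1, n)]

def pearson(u, v):
    """Pearson correlation between two dissimilarity vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

# Model and 'brain' responses to three conditions; the brain geometry
# is a rescaled copy of the model's, so the RDM correlation is perfect.
model = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
brain = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
print(pearson(rdm(model), rdm(brain)))  # ≈ 1.0
```

The paper's contribution is the inferential machinery around this comparison: 2-factor cross-validation and bootstrapping so that conclusions generalize to new subjects and new conditions.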
Affiliation(s)
- Heiko H Schütt
- Zuckerman Institute, Columbia University, New York, United States
10
Farzmahdi A, Zarco W, Freiwald W, Kriegeskorte N, Golan T. Emergence of brain-like mirror-symmetric viewpoint tuning in convolutional neural networks. bioRxiv 2023; 2023.01.05.522909. PMID: 36711779; PMCID: PMC9881894; DOI: 10.1101/2023.01.05.522909.
Abstract
Primates can recognize objects despite 3D geometric variations such as in-depth rotations. The computational mechanisms that give rise to such invariances are yet to be fully understood. A curious case of partial invariance occurs in the macaque face-patch AL and in fully connected layers of deep convolutional networks in which neurons respond similarly to mirror-symmetric views (e.g., left and right profiles). Why does this tuning develop? Here, we propose a simple learning-driven explanation for mirror-symmetric viewpoint tuning. We show that mirror-symmetric viewpoint tuning for faces emerges in the fully connected layers of convolutional deep neural networks trained on object recognition tasks, even when the training dataset does not include faces. First, using 3D objects rendered from multiple views as test stimuli, we demonstrate that mirror-symmetric viewpoint tuning in convolutional neural network models is not unique to faces: it emerges for multiple object categories with bilateral symmetry. Second, we show why this invariance emerges in the models. Learning to discriminate among bilaterally symmetric object categories induces reflection-equivariant intermediate representations. AL-like mirror-symmetric tuning is achieved when such equivariant responses are spatially pooled by downstream units with sufficiently large receptive fields. These results explain how mirror-symmetric viewpoint tuning can emerge in neural networks, providing a theory of how they might emerge in the primate brain. Our theory predicts that mirror-symmetric viewpoint tuning can emerge as a consequence of exposure to bilaterally symmetric objects beyond the category of faces, and that it can generalize beyond previously experienced object categories.
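The mechanism proposed here, reflection-equivariant features that become mirror-symmetric once spatially pooled, can be demonstrated in one dimension with a toy symmetric filter. This is an illustrative sketch; all names are made up:

```python
def equivariant_feature_map(signal):
    """A 'convolutional' feature map from the symmetric local filter
    [1, 2, 1]: reflecting the input reflects the feature map
    (reflection equivariance)."""
    k = [1, 2, 1]
    return [sum(k[j] * signal[i + j] for j in range(3))
            for i in range(len(signal) - 2)]

def pooled_response(signal):
    """Global sum pooling over the feature map: a downstream unit
    whose receptive field spans the whole map."""
    return sum(equivariant_feature_map(signal))

profile = [0, 3, 1, 4, 1, 5]
mirrored = profile[::-1]
# The feature maps are mirror images of each other...
print(equivariant_feature_map(profile), equivariant_feature_map(mirrored))
# ...so spatial pooling yields identical responses to mirrored views.
print(pooled_response(profile) == pooled_response(mirrored))  # True
```

This is the one-dimensional analogue of the paper's account: equivariant intermediate responses plus pooling by units with sufficiently large receptive fields produce AL-like mirror-symmetric tuning.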
11
Kabulska Z, Lingnau A. The cognitive structure underlying the organization of observed actions. Behav Res Methods 2023; 55:1890-1906. PMID: 35788973; PMCID: PMC10250259; DOI: 10.3758/s13428-022-01894-5.
Abstract
In daily life, we frequently encounter actions performed by other people. Here we aimed to examine the key categories and features underlying the organization of a wide range of actions in three behavioral experiments (N = 378 participants). In Experiment 1, we used a multi-arrangement task of 100 different actions. Inverse multidimensional scaling and hierarchical clustering revealed 11 action categories, including Locomotion, Communication, and Aggressive actions. In Experiment 2, we used a feature-listing paradigm to obtain a wide range of action features that were subsequently reduced to 59 key features and used in a rating study (Experiment 3). A direct comparison of the feature ratings obtained in Experiment 3 between actions belonging to the categories identified in Experiment 1 revealed a number of features that appear to be critical for the distinction between these categories, e.g., the features Harm and Noise for the category Aggressive actions, and the features Targeting a person and Contact with others for the category Interaction. Finally, we found that a part of the category-based organization is explained by a combination of weighted features, whereas a significant proportion of variability remained unexplained, suggesting that there are additional sources of information that contribute to the categorization of observed actions. The characterization of action categories and their associated features serves as an important extension of previous studies examining the cognitive structure of actions. Moreover, our results may serve as the basis for future behavioral, neuroimaging and computational modeling studies.
Affiliation(s)
- Zuzanna Kabulska
- Department of Psychology, Faculty of Human Sciences, University of Regensburg, Universitätsstraße 31, 93053, Regensburg, Germany
- Angelika Lingnau
- Department of Psychology, Faculty of Human Sciences, University of Regensburg, Universitätsstraße 31, 93053, Regensburg, Germany
12
Sandbrink KJ, Mamidanna P, Michaelis C, Bethge M, Mathis MW, Mathis A. Contrasting action and posture coding with hierarchical deep neural network models of proprioception. eLife 2023; 12:e81499. PMID: 37254843; PMCID: PMC10361732; DOI: 10.7554/eLife.81499.
Abstract
Biological motor control is versatile, efficient, and depends on proprioceptive feedback. Muscles are flexible and undergo continuous changes, requiring distributed adaptive control mechanisms that continuously account for the body's state. The canonical role of proprioception is representing the body state. We hypothesize that the proprioceptive system could also be critical for high-level tasks such as action recognition. To test this theory, we pursued a task-driven modeling approach, which allowed us to isolate the study of proprioception. We generated a large synthetic dataset of human arm trajectories tracing characters of the Latin alphabet in 3D space, together with muscle activities obtained from a musculoskeletal model and model-based muscle spindle activity. Next, we compared two classes of tasks: trajectory decoding and action recognition, which allowed us to train hierarchical models to decode either the position and velocity of the end-effector of one's posture or the character (action) identity from the spindle firing patterns. We found that artificial neural networks could robustly solve both tasks, and the networks' units show tuning properties similar to neurons in the primate somatosensory cortex and the brainstem. Remarkably, we found uniformly distributed directional selective units only with the action-recognition-trained models and not the trajectory-decoding-trained models. This suggests that proprioceptive encoding is additionally associated with higher-level functions such as action recognition and therefore provides new, experimentally testable hypotheses of how proprioception aids in adaptive motor control.
Collapse
Affiliation(s)
- Kai J Sandbrink
- The Rowland Institute at Harvard, Harvard University, Cambridge, United States
| | - Pranav Mamidanna
- Tübingen AI Center, Eberhard Karls Universität Tübingen & Institute for Theoretical Physics, Tübingen, Germany
| | - Claudio Michaelis
- Tübingen AI Center, Eberhard Karls Universität Tübingen & Institute for Theoretical Physics, Tübingen, Germany
| | - Matthias Bethge
- Tübingen AI Center, Eberhard Karls Universität Tübingen & Institute for Theoretical Physics, Tübingen, Germany
| | - Mackenzie Weygandt Mathis
- The Rowland Institute at Harvard, Harvard University, Cambridge, United States
- Brain Mind Institute, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Genève, Switzerland
| | - Alexander Mathis
- The Rowland Institute at Harvard, Harvard University, Cambridge, United States
- Brain Mind Institute, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Genève, Switzerland
| |
Collapse
|
13
|
The Spatiotemporal Neural Dynamics of Object Recognition for Natural Images and Line Drawings. J Neurosci 2023; 43:484-500. [PMID: 36535769 PMCID: PMC9864561 DOI: 10.1523/jneurosci.1546-22.2022] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2022] [Revised: 11/18/2022] [Accepted: 11/30/2022] [Indexed: 12/24/2022] Open
Abstract
Drawings offer a simple and efficient way to communicate meaning. While line drawings capture only coarsely how objects look in reality, we still perceive them as resembling real-world objects. Previous work has shown that this perceived similarity is mirrored by shared neural representations for drawings and natural images, which suggests that similar mechanisms underlie the recognition of both. However, other work has proposed that representations of drawings and natural images become similar only after substantial processing has taken place, suggesting distinct mechanisms. To arbitrate between those alternatives, we measured brain responses resolved in space and time using fMRI and MEG, respectively, while human participants (female and male) viewed images of objects depicted as photographs, line drawings, or sketch-like drawings. Using multivariate decoding, we demonstrate that object category information emerged similarly fast and across overlapping regions in occipital, ventral-temporal, and posterior parietal cortex for all types of depiction, yet with smaller effects at higher levels of visual abstraction. In addition, cross-decoding between depiction types revealed strong generalization of object category information from early processing stages on. Finally, by combining fMRI and MEG data using representational similarity analysis, we found that visual information traversed similar processing stages for all types of depiction, yet with an overall stronger representation for photographs. Together, our results demonstrate broad commonalities in the neural dynamics of object recognition across types of depiction, thus providing clear evidence for shared neural mechanisms underlying recognition of natural object images and abstract drawings.
SIGNIFICANCE STATEMENT: When we see a line drawing, we effortlessly recognize it as an object in the world despite its simple and abstract style. Here we asked to what extent this correspondence in perception is reflected in the brain. To answer this question, we measured how neural processing of objects depicted as photographs and line drawings with varying levels of detail (from natural images to abstract line drawings) evolves over space and time. We find broad commonalities in the spatiotemporal dynamics and the neural representations underlying the perception of photographs and even abstract drawings. These results indicate a shared basic mechanism supporting recognition of drawings and natural images.
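The cross-decoding logic used in this study (train a classifier on responses to one depiction type, test on another) can be illustrated with a minimal sketch. The data here are random stand-ins, the classifier is a simple nearest-centroid decoder rather than the authors' actual analysis, and all variable names are illustrative:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one mean activity pattern (centroid) per category."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(X, classes, centroids):
    """Assign each pattern to the closest centroid (Euclidean distance)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

rng = np.random.default_rng(0)
n_trials, n_channels = 40, 30
category = np.repeat([0, 1], n_trials // 2)
# Simulated category signal shared across depiction types, so a decoder
# trained on "photograph" trials should generalize to "drawing" trials.
signal = rng.normal(size=(2, n_channels))
photos = signal[category] + 0.5 * rng.normal(size=(n_trials, n_channels))
drawings = signal[category] + 0.5 * rng.normal(size=(n_trials, n_channels))

classes, centroids = nearest_centroid_fit(photos, category)
pred = nearest_centroid_predict(drawings, classes, centroids)
cross_decoding_accuracy = (pred == category).mean()
```

Above-chance accuracy on the held-out depiction type is what indicates a shared, format-tolerant category representation.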
Collapse
|
14
|
Face dissimilarity judgments are predicted by representational distance in morphable and image-computable models. Proc Natl Acad Sci U S A 2022; 119:e2115047119. [PMID: 35767642 PMCID: PMC9271164 DOI: 10.1073/pnas.2115047119] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
Abstract
Discerning the subtle differences between individuals’ faces is crucial for social functioning. It requires us not only to solve general challenges of object recognition (e.g., invariant recognition over changes in view or lighting) but also to be attuned to the specific ways in which face structure varies. Three-dimensional morphable models based on principal component analyses of real faces provide descriptions of statistical differences between faces, as well as tools to generate novel faces. We rendered large sets of realistic face pairs from such a model and collected similarity and same/different identity judgments. The statistical model predicted human perception as well as state-of-the-art image-computable neural networks. Results underscore the statistical tuning of face encoding. Human vision is attuned to the subtle differences between individual faces. Yet we lack a quantitative way of predicting how similar two face images look and whether they appear to show the same person. Principal component–based three-dimensional (3D) morphable models are widely used to generate stimuli in face perception research. These models capture the distribution of real human faces in terms of dimensions of physical shape and texture. How well does a “face space” based on these dimensions capture the similarity relationships humans perceive among faces? To answer this, we designed a behavioral task to collect dissimilarity and same/different identity judgments for 232 pairs of realistic faces. Stimuli sampled geometric relationships in a face space derived from principal components of 3D shape and texture (Basel face model [BFM]). We then compared a wide range of models in their ability to predict the data, including the BFM from which faces were generated, an active appearance model derived from face photographs, and image-computable models of visual perception. Euclidean distance in the BFM explained both dissimilarity and identity judgments surprisingly well. 
In a comparison against 16 diverse models, BFM distance was competitive with representational distances in state-of-the-art deep neural networks (DNNs), including novel DNNs trained on BFM synthetic identities or BFM latents. Models capturing the distribution of face shape and texture across individuals are not only useful tools for stimulus generation. They also capture important information about how faces are perceived, suggesting that human face representations are tuned to the statistical distribution of faces.
Collapse
|
15
|
Arun SP. Trailblazers in Neuroscience: Using compositionality to understand how parts combine in whole objects. Eur J Neurosci 2022; 56:4378-4392. [PMID: 35760552 PMCID: PMC10084036 DOI: 10.1111/ejn.15746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2022] [Revised: 06/09/2022] [Accepted: 06/16/2022] [Indexed: 11/27/2022]
Abstract
A fundamental question for any visual system is whether its image representation can be understood in terms of its components. Decomposing any image into components is challenging because there are many possible decompositions with no common dictionary, and enumerating them leads to a combinatorial explosion. Even in perception, many objects are readily seen as containing parts, but there are many exceptions. These exceptions include objects that are not perceived as containing parts, properties like symmetry that cannot be localized to any single part, and also special categories like words and faces whose perception is widely believed to be holistic. Here, I describe a novel approach we have used to address these issues and evaluate compositionality at the behavioral and neural levels. The key design principle is to create a large number of objects by combining a small number of pre-defined components in all possible ways. This allows for building component-based models that explain whole objects using a combination of these components. Importantly, any systematic error in model fits can be used to detect the presence of emergent or holistic properties. Using this approach, we have found that whole object representations are surprisingly predictable from their components, that some components are preferred to others in perception, and that emergent properties can be discovered or explained using compositional models. Thus, compositionality is a powerful approach for understanding how whole objects relate to their parts.
Collapse
Affiliation(s)
- Arun Sp
- Centre for Neuroscience, Indian Institute of Science, Bangalore, India
| |
Collapse
|
16
|
Kiat JE, Luck SJ, Beckner AG, Hayes TR, Pomaranski KI, Henderson JM, Oakes LM. Linking patterns of infant eye movements to a neural network model of the ventral stream using representational similarity analysis. Dev Sci 2022; 25:e13155. [PMID: 34240787 PMCID: PMC8639751 DOI: 10.1111/desc.13155] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2020] [Revised: 06/23/2021] [Accepted: 07/01/2021] [Indexed: 01/03/2023]
Abstract
Little is known about the development of higher-level areas of visual cortex during infancy, and even less is known about how the development of visually guided behavior is related to the different levels of the cortical processing hierarchy. As a first step toward filling these gaps, we used representational similarity analysis (RSA) to assess links between gaze patterns and a neural network model that captures key properties of the ventral visual processing stream. We recorded the eye movements of 4- to 12-month-old infants (N = 54) as they viewed photographs of scenes. For each infant, we calculated the similarity of the gaze patterns for each pair of photographs. We also analyzed the images using a convolutional neural network model in which the successive layers correspond approximately to the sequence of areas along the ventral stream. For each layer of the network, we calculated the similarity of the activation patterns for each pair of photographs, which was then compared with the infant gaze data. We found that the network layers corresponding to lower-level areas of visual cortex accounted for gaze patterns better in younger infants than in older infants, whereas the network layers corresponding to higher-level areas of visual cortex accounted for gaze patterns better in older infants than in younger infants. Thus, between 4 and 12 months, gaze becomes increasingly controlled by more abstract, higher-level representations. These results also demonstrate the feasibility of using RSA to link infant gaze behavior to neural network models. A video abstract of this article can be viewed at https://youtu.be/K5mF2Rw98Is.
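The representational similarity analysis (RSA) pipeline described above can be sketched in a few lines: build a dissimilarity matrix over image pairs for each measure (gaze patterns, network-layer activations), then rank-correlate their upper triangles. The data here are random stand-ins and all names are illustrative:

```python
import numpy as np

def rdm(patterns):
    """Representational dissimilarity matrix: 1 - Pearson r between item pairs."""
    return 1.0 - np.corrcoef(patterns)

def upper(m):
    """Vectorize the upper triangle of a square matrix (diagonal excluded)."""
    return m[np.triu_indices_from(m, k=1)]

def spearman(a, b):
    """Spearman correlation via rank transform (avoids a scipy dependency)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(1)
n_images = 20
# Stand-ins: flattened gaze maps per photograph, and a toy "layer" whose
# activations are partly driven by the same structure as the gaze data.
gaze = rng.normal(size=(n_images, 50))
layer_act = gaze + rng.normal(size=(n_images, 50))

similarity = spearman(upper(rdm(gaze)), upper(rdm(layer_act)))
```

In the study, one such similarity value per network layer and per age group is what reveals the shift from lower-level to higher-level layers across infancy.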
Collapse
|
17
|
Nonaka S, Majima K, Aoki SC, Kamitani Y. Brain hierarchy score: Which deep neural networks are hierarchically brain-like? iScience 2021; 24:103013. [PMID: 34522856 PMCID: PMC8426272 DOI: 10.1016/j.isci.2021.103013] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2020] [Revised: 12/31/2020] [Accepted: 08/18/2021] [Indexed: 11/16/2022] Open
Abstract
Achievement of human-level image recognition by deep neural networks (DNNs) has spurred interest in whether and how DNNs are brain-like. Both DNNs and the visual cortex perform hierarchical processing, and correspondence has been shown between hierarchical visual areas and DNN layers in representing visual features. Here, we propose the brain hierarchy (BH) score as a metric to quantify the degree of hierarchical correspondence based on neural decoding and encoding analyses where DNN unit activations and human brain activity are predicted from each other. We find that BH scores for 29 pre-trained DNNs with various architectures are negatively correlated with image recognition performance, thus indicating that recently developed high-performance DNNs are not necessarily brain-like. Experimental manipulations of DNN models suggest that single-path sequential feedforward architecture with broad spatial integration is critical to brain-like hierarchy. Our method may provide new ways to design DNNs in light of their representational homology to the brain. Highlights: a measure for brain-like hierarchy is proposed to characterize DNNs; encoding/decoding with human fMRI quantifies the hierarchical correspondence; among representative DNN models, high-performance models are not brain-like; critical factors for brain-like hierarchy are explored.
Collapse
Affiliation(s)
- Soma Nonaka
- Faculty of Integrated Human Studies, Kyoto University, Kyoto 606-8501, Japan
| | - Kei Majima
- Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
| | - Shuntaro C Aoki
- Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
| | - Yukiyasu Kamitani
- Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
- ATR Computational Neuroscience Laboratories, Seika, Kyoto 619-0288, Japan
| |
Collapse
|
18
|
Muttenthaler L, Hebart MN. THINGSvision: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks. Front Neuroinform 2021; 15:679838. [PMID: 34630062 PMCID: PMC8494008 DOI: 10.3389/fninf.2021.679838] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2021] [Accepted: 08/10/2021] [Indexed: 11/25/2022] Open
Abstract
Over the past decade, deep neural network (DNN) models have received a lot of attention due to their near-human object classification performance and their excellent prediction of signals recorded from biological visual systems. To better understand the function of these networks and relate them to hypotheses about brain activity and behavior, researchers need to extract the activations to images across different DNN layers. The abundance of different DNN variants, however, can often be unwieldy, and the task of extracting DNN activations from different layers may be non-trivial and error-prone for someone without a strong computational background. Thus, researchers in the fields of cognitive science and computational neuroscience would benefit from a library or package that supports a user in the extraction task. THINGSvision is a new Python module that aims at closing this gap by providing a simple and unified tool for extracting layer activations for a wide range of pretrained and randomly initialized neural network architectures, even for users with little to no programming experience. We demonstrate the general utility of THINGSvision by relating extracted DNN activations to a number of functional MRI and behavioral datasets using representational similarity analysis, which can be performed as an integral part of the toolbox. Together, THINGSvision enables researchers across diverse fields to extract features in a streamlined manner for their custom image dataset, thereby improving the ease of relating DNNs, brain activity, and behavior, and improving the reproducibility of findings in these research fields.
Collapse
Affiliation(s)
- Lukas Muttenthaler
- Vision and Computational Cognition Group, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Machine Learning Group, Technical University of Berlin, Berlin, Germany
| | - Martin N. Hebart
- Vision and Computational Cognition Group, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
| |
Collapse
|
19
|
Lonnqvist B, Bornet A, Doerig A, Herzog MH. A comparative biology approach to DNN modeling of vision: A focus on differences, not similarities. J Vis 2021; 21:17. [PMID: 34551062 PMCID: PMC8475290 DOI: 10.1167/jov.21.10.17] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2021] [Accepted: 08/26/2021] [Indexed: 11/24/2022] Open
Abstract
Deep neural networks (DNNs) have revolutionized computer science and are now widely used for neuroscientific research. A hot debate has ensued about the usefulness of DNNs as neuroscientific models of the human visual system; the debate centers on to what extent certain shortcomings of DNNs are real failures and to what extent they are redeemable. Here, we argue that the main problem is that we often do not understand which human functions need to be modeled and, thus, what counts as a falsification. Hence, not only is there a problem on the DNN side, but there is also one on the brain side (i.e., with the explanandum, the thing to be explained). For example, should DNNs reproduce illusions? We posit that we can make better use of DNNs by adopting an approach of comparative biology by focusing on the differences, rather than the similarities, between DNNs and humans to improve our understanding of visual information processing in general.
Collapse
Affiliation(s)
- Ben Lonnqvist
- Laboratory of Psychophysics, Brain Mind Institute, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
| | - Alban Bornet
- Laboratory of Psychophysics, Brain Mind Institute, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
| | - Adrien Doerig
- Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands
| | - Michael H Herzog
- Laboratory of Psychophysics, Brain Mind Institute, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
| |
Collapse
|
20
|
A Visual Encoding Model Based on Contrastive Self-Supervised Learning for Human Brain Activity along the Ventral Visual Stream. Brain Sci 2021; 11:brainsci11081004. [PMID: 34439623 PMCID: PMC8391143 DOI: 10.3390/brainsci11081004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2021] [Revised: 07/23/2021] [Accepted: 07/26/2021] [Indexed: 11/30/2022] Open
Abstract
Visual encoding models are important computational models for understanding how information is processed along the visual stream. Many improved visual encoding models have been developed from the perspective of the model architecture and the learning objective, but these are limited to the supervised learning method. From the view of unsupervised learning mechanisms, this paper utilized a pre-trained neural network to construct a visual encoding model based on contrastive self-supervised learning for the ventral visual stream measured by functional magnetic resonance imaging (fMRI). We first extracted features using the ResNet50 model pre-trained in contrastive self-supervised learning (ResNet50-CSL model), trained a linear regression model for each voxel, and finally calculated the prediction accuracy of different voxels. Compared with the ResNet50 model pre-trained in a supervised classification task, the ResNet50-CSL model achieved an equal or even relatively better encoding performance in multiple visual cortical areas. Moreover, the ResNet50-CSL model performs hierarchical representation of input visual stimuli, which is similar to the human visual cortex in its hierarchical information processing. Our experimental results suggest that the encoding model based on contrastive self-supervised learning is a strong computational model to compete with supervised models, and contrastive self-supervised learning proves to be an effective method for extracting human brain-like representations.
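The voxelwise encoding step described above (a linear regression per voxel from network features to fMRI responses, scored by per-voxel prediction accuracy) can be sketched with synthetic data. This uses closed-form ridge regression rather than the authors' exact fitting procedure, and all names and parameters are illustrative:

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression mapping features X to voxel responses Y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ Y)

rng = np.random.default_rng(2)
n_train, n_test, n_feat, n_vox = 200, 50, 40, 10
# Synthetic ground-truth linear mapping from DNN features to voxel responses
W_true = rng.normal(size=(n_feat, n_vox))
X_train = rng.normal(size=(n_train, n_feat))
X_test = rng.normal(size=(n_test, n_feat))
Y_train = X_train @ W_true + 0.1 * rng.normal(size=(n_train, n_vox))
Y_test = X_test @ W_true + 0.1 * rng.normal(size=(n_test, n_vox))

W = fit_ridge(X_train, Y_train)
Y_pred = X_test @ W
# Per-voxel prediction accuracy: Pearson r between predicted and measured responses
acc = np.array([np.corrcoef(Y_pred[:, v], Y_test[:, v])[0, 1] for v in range(n_vox)])
```

Comparing such per-voxel accuracy maps between feature sets (e.g., supervised vs. contrastive self-supervised ResNet50 features) is what underlies the encoding comparisons reported here.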
Collapse
|