1
Stewart EEM, Fleming RW, Schütz AC. A simple optical flow model explains why certain object viewpoints are special. Proc Biol Sci 2024; 291:20240577. PMID: 38981528; PMCID: PMC11334996; DOI: 10.1098/rspb.2024.0577.
Abstract
A core challenge in perception is recognizing objects across the highly variable retinal input that occurs when objects are viewed from different directions (e.g. front versus side views). It has long been known that certain views are of particular importance, but it remains unclear why. We reasoned that characterizing the computations underlying visual comparisons between objects could explain the privileged status of certain qualitatively special views. We measured pose discrimination for a wide range of objects, finding large variations in performance depending on the object and the viewing angle, with front and back views yielding particularly good discrimination. Strikingly, a simple and biologically plausible computational model based on measuring the projected three-dimensional optical flow between views of objects accurately predicted both successes and failures of discrimination performance. This provides a computational account of why certain views have a privileged status.
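As a rough illustration of the kind of computation the abstract describes (not the authors' implementation), the model's core quantity can be sketched as: rotate a set of 3D surface points by a small pose change, project them to the image plane, and take the mean 2D displacement as a proxy for how discriminable the two views are. The point set, orthographic projection, and rotation axis below are illustrative assumptions; the published model also handles self-occlusion and full object meshes.

```python
import math

def rotate_y(p, angle):
    """Rotate a 3-D point about the vertical (y) axis."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project(p):
    """Orthographic projection onto the image plane (drop depth)."""
    x, y, z = p
    return (x, y)

def mean_flow(points, angle):
    """Mean projected 2-D displacement when the object rotates by `angle`.
    Larger mean flow -> a pose change that is easier to discriminate."""
    total = 0.0
    for p in points:
        x0, y0 = project(p)
        x1, y1 = project(rotate_y(p, angle))
        total += math.hypot(x1 - x0, y1 - y0)
    return total / len(points)

# Toy 'object': a handful of surface points (illustrative only)
points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.5), (-0.5, -0.5, 1.0)]
print(mean_flow(points, math.radians(5)))
```

On this account, views where a small rotation produces little projected flow (e.g. near front and back views for many objects) yield distinctive discrimination behavior, which is what the psychophysics tested.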
Affiliation(s)
- Emma E. M. Stewart
- School of Biological and Behavioural Sciences, Queen Mary University of London, London E1 4NS, UK
- Department of Experimental and Biological Psychology, Queen Mary University of London, London E1 4NS, UK
- Centre for Brain and Behaviour, Queen Mary University of London, London E1 4NS, UK
- Roland W. Fleming
- Department of Experimental Psychology, Justus Liebig University Giessen, Giessen 35394, Germany
- Centre for Mind, Brain, and Behaviour (CMBB), University of Marburg and Justus Liebig University Giessen, Giessen 35032, Germany
- Alexander C. Schütz
- Centre for Mind, Brain, and Behaviour (CMBB), University of Marburg and Justus Liebig University Giessen, Giessen 35032, Germany
- General and Experimental Psychology, University of Marburg, Marburg 35032, Germany
2
Liao C, Sawayama M, Xiao B. Probing the Link Between Vision and Language in Material Perception Using Psychophysics and Unsupervised Learning. bioRxiv 2024:2024.01.25.577219. PMID: 38328102; PMCID: PMC10849714; DOI: 10.1101/2024.01.25.577219.
Abstract
We can visually discriminate and recognize a wide range of materials. Meanwhile, we use language to express our subjective understanding of visual input and communicate relevant information about the materials. Here, we investigate the relationship between visual judgment and language expression in material perception to understand how visual features relate to semantic representations. We use deep generative networks to construct an expandable image space to systematically create materials of well-defined and ambiguous categories. From such a space, we sampled diverse stimuli and compared the representations of materials from two behavioral tasks: visual material similarity judgments and free-form verbal descriptions. Our findings reveal a moderate but significant correlation between vision and language on a categorical level. However, analyzing the representations with an unsupervised alignment method, we discover structural differences that arise at the image-to-image level, especially among materials morphed between known categories. Moreover, visual judgments exhibit more individual differences compared to verbal descriptions. Our results show that while verbal descriptions capture material qualities on the coarse level, they may not fully convey the visual features that characterize the material's optical properties. Analyzing the image representations of materials obtained from various pre-trained data-rich deep neural networks, we find that the similarity structures of human visual judgments align more closely with those of the text-guided visual-semantic model than with purely vision-based models. Our findings suggest that while semantic representations facilitate material categorization, non-semantic visual features also play a significant role in discriminating materials at a finer level. This work illustrates the need to consider the vision-language relationship in building a comprehensive model for material perception. Moreover, we propose a novel framework for quantitatively evaluating the alignment and misalignment between representations from different modalities, leveraging information from human behaviors and computational models.
Affiliation(s)
- Chenxi Liao
- American University, Department of Neuroscience, Washington, DC 20016, USA
- Masataka Sawayama
- The University of Tokyo, Graduate School of Information Science and Technology, Tokyo, 113-0033, Japan
- Bei Xiao
- American University, Department of Computer Science, Washington, DC 20016, USA
3
Abstract
Deep neural networks (DNNs) are machine learning algorithms that have revolutionized computer vision due to their remarkable successes in tasks like object classification and segmentation. The success of DNNs as computer vision algorithms has led to the suggestion that DNNs may also be good models of human visual perception. In this article, we review evidence regarding current DNNs as adequate behavioral models of human core object recognition. To this end, we argue that it is important to distinguish between statistical tools and computational models and to understand model quality as a multidimensional concept in which clarity about modeling goals is key. Reviewing a large number of psychophysical and computational explorations of core object recognition performance in humans and DNNs, we argue that DNNs are highly valuable scientific tools but that, as of today, DNNs should only be regarded as promising, but not yet adequate, computational models of human core object recognition behavior. On the way, we dispel several myths surrounding DNNs in vision science.
Affiliation(s)
- Felix A Wichmann
- Neural Information Processing Group, University of Tübingen, Tübingen, Germany
4
Schmid AC, Barla P, Doerschner K. Material category of visual objects computed from specular image structure. Nat Hum Behav 2023. PMID: 37386108; PMCID: PMC10365995; DOI: 10.1038/s41562-023-01601-0.
Abstract
Recognizing materials and their properties visually is vital for successful interactions with our environment, from avoiding slippery floors to handling fragile objects. Yet there is no simple mapping of retinal image intensities to physical properties. Here, we investigated what image information drives material perception by collecting human psychophysical judgements about complex glossy objects. Variations in specular image structure, produced either by manipulating reflectance properties or visual features directly, caused categorical shifts in material appearance, suggesting that specular reflections provide diagnostic information about a wide range of material classes. Perceived material category appeared to mediate cues for surface gloss, providing evidence against a purely feedforward view of neural processing. Our results suggest that the image structure that triggers our perception of surface gloss plays a direct role in visual categorization, and that the perception and neural processing of stimulus properties should be studied in the context of recognition, not in isolation.
Affiliation(s)
- Alexandra C Schmid
- Department of Psychology, Justus Liebig University Giessen, Giessen, Germany.
- Katja Doerschner
- Department of Psychology, Justus Liebig University Giessen, Giessen, Germany
5
Domini F. The case against probabilistic inference: a new deterministic theory of 3D visual processing. Philos Trans R Soc Lond B Biol Sci 2023; 378:20210458. PMID: 36511407; PMCID: PMC9745883; DOI: 10.1098/rstb.2021.0458.
Abstract
How the brain derives 3D information from inherently ambiguous visual input remains the fundamental question of human vision. The past two decades of research have addressed this question as a problem of probabilistic inference, the dominant model being maximum-likelihood estimation (MLE). This model assumes that independent depth-cue modules derive noisy but statistically accurate estimates of 3D scene parameters that are combined through a weighted average. Cue weights are adjusted based on the system's representation of each module's output variability. Here I demonstrate that the MLE model fails to account for important psychophysical findings and, importantly, misinterprets the just noticeable difference, a hallmark measure of stimulus discriminability, to be an estimate of perceptual uncertainty. I propose a new theory, termed Intrinsic Constraint, which postulates that the visual system does not derive the most probable interpretation of the visual input, but rather, the most stable interpretation amid variations in viewing conditions. This goal is achieved with the Vector Sum model, which represents individual cue estimates as components of a multi-dimensional vector whose norm determines the combined output. This model accounts for the psychophysical findings cited in support of MLE, while predicting existing and new findings that contradict the MLE model. This article is part of a discussion meeting issue 'New approaches to 3D vision'.
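The contrast between the two combination rules can be sketched in a few lines. These are schematic forms only; the cue values and variances below are illustrative, not taken from the paper. MLE combines cue estimates by an inverse-variance weighted average, so its output always lies between the individual estimates, whereas the Vector Sum model takes the norm of the vector of cue estimates, so its output can exceed every individual estimate.

```python
import math

def mle_combine(estimates, variances):
    """MLE cue combination: inverse-variance weighted average of cue estimates."""
    weights = [1.0 / v for v in variances]
    return sum(w * x for w, x in zip(weights, estimates)) / sum(weights)

def vector_sum(estimates):
    """Vector Sum model: norm of the vector whose components are the cue estimates."""
    return math.sqrt(sum(x * x for x in estimates))

# Two depth cues (e.g. stereo and motion) signalling the same surface property:
stereo, motion = 10.0, 14.0
print(mle_combine([stereo, motion], [1.0, 4.0]))  # → 10.8, pulled toward the reliable cue
print(vector_sum([stereo, motion]))               # ≈ 17.20, larger than either cue alone
```

The qualitative signature is visible even in this toy case: adding a second cue always increases the Vector Sum output, while MLE only re-weights within the range of the inputs.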
Affiliation(s)
- Fulvio Domini
- CLPS, Brown University, 190 Thayer Street, Providence, Rhode Island 02912-9067, USA
6
Tamura H, Prokott KE, Fleming RW. Distinguishing mirror from glass: A "big data" approach to material perception. J Vis 2022; 22:4. PMID: 35266961; PMCID: PMC8934559; DOI: 10.1167/jov.22.4.4.
Abstract
Distinguishing mirror from glass is a challenging visual inference, because both materials derive their appearance from their surroundings, yet we rarely experience difficulties in telling them apart. Very few studies have investigated how the visual system distinguishes reflections from refractions, and to date there is no image-computable model that emulates human judgments. Here we sought to develop a deep neural network that reproduces the patterns of visual judgments human observers make. To do this, we trained thousands of convolutional neural networks on more than 750,000 simulated mirror and glass objects, and compared their performance with human judgments, as well as alternative classifiers based on "hand-engineered" image features. For randomly chosen images, all classifiers and humans performed with high accuracy, and therefore correlated highly with one another. However, to assess how similar models are to humans, it is not sufficient to compare accuracy or correlation on random images. A good model should also predict the characteristic errors that humans make. We therefore painstakingly assembled a diagnostic image set for which humans make systematic errors, allowing us to isolate signatures of human-like performance. A large-scale, systematic search through feedforward neural architectures revealed that relatively shallow (three-layer) networks predicted human judgments better than any other models we tested. This is the first image-computable model that emulates human errors and succeeds in distinguishing mirror from glass, and hints that mid-level visual processing might be particularly important for the task.
Affiliation(s)
- Hideki Tamura
- Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi, Aichi, Japan
- Konrad Eugen Prokott
- Department of Experimental Psychology, Justus Liebig University Giessen, Giessen, Germany
- Roland W Fleming
- Department of Experimental Psychology, Justus Liebig University Giessen, Giessen, Germany
- Center for Mind, Brain and Behavior (CMBB), University of Marburg and Justus Liebig University Giessen, Germany
7
Kunsberg B, Zucker SW. From boundaries to bumps: When closed (extremal) contours are critical. J Vis 2021; 21:7. PMID: 34913951; PMCID: PMC8684304; DOI: 10.1167/jov.21.13.7.
Abstract
Invariants underlying shape inference are elusive: A variety of shapes can give rise to the same image, and a variety of images can be rendered from the same shape. The occluding contour is a rare exception: It has both image salience, in terms of isophotes, and surface meaning, in terms of surface normal. We relax the notion of occluding contour and, more accurately, the rim on the object that projects to it, to define closed extremal curves. This new shape descriptor is invariant over different renderings. It exists at the topological level, which guarantees an image-based counterpart. It surrounds bumps and dents, as well as common interior shape components, and formalizes the qualitative nature of bump perception. The invariants are biologically computable, unify shape inferences from shading and specular materials, and predict new phenomena in bump and dent perception. Most important, working at the topological level allows us to capture the elusive aspect of bump boundaries.
Affiliation(s)
- Steven W Zucker
- Computer Science, Biomedical Engineering, Yale University, New Haven, CT, USA