1. Webb TW, Frankland SM, Altabaa A, Segert S, Krishnamurthy K, Campbell D, Russin J, Giallanza T, O'Reilly R, Lafferty J, Cohen JD. The relational bottleneck as an inductive bias for efficient abstraction. Trends Cogn Sci 2024:S1364-6613(24)00080-9. PMID: 38729852. DOI: 10.1016/j.tics.2024.04.001.
Abstract
A central challenge for cognitive science is to explain how abstract concepts are acquired from limited experience. This has often been framed in terms of a dichotomy between connectionist and symbolic cognitive models. Here, we highlight a recently emerging line of work that suggests a novel reconciliation of these approaches, by exploiting an inductive bias that we term the relational bottleneck. In that approach, neural networks are constrained via their architecture to focus on relations between perceptual inputs, rather than the attributes of individual inputs. We review a family of models that employ this approach to induce abstractions in a data-efficient manner, emphasizing their potential as candidate models for the acquisition of abstract concepts in the human mind and brain.
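The architectural idea can be made concrete in a few lines: later layers see only the pairwise relations between input embeddings, never the embeddings themselves. The following NumPy sketch is our own illustration of that constraint (the function name, the choice of cosine similarity as the relation, and the toy inputs are not from the paper):

```python
import numpy as np

def relational_bottleneck(embeddings):
    """Expose only pairwise relations (here: cosine similarities)
    between input embeddings, discarding the attributes of the
    individual inputs themselves."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return z @ z.T  # (n, n) relation matrix; downstream layers see only this

# Two "same" pairs built from disjoint perceptual features:
a = np.array([[1.0, 0.0], [1.0, 0.0]])  # same pair, feature set 1
b = np.array([[0.0, 3.0], [0.0, 3.0]])  # same pair, feature set 2

# Their relation matrices are identical, so an abstraction learned over
# one feature set transfers for free to the other.
assert np.allclose(relational_bottleneck(a), relational_bottleneck(b))
```

The assertion at the end is the point: because individual attributes are filtered out, the "same" relation looks identical regardless of what the inputs actually are, which is one intuition for the data efficiency the authors describe.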
2. Depeweg S, Rothkopf CA, Jäkel F. Solving Bongard Problems With a Visual Language and Pragmatic Constraints. Cogn Sci 2024; 48:e13432. PMID: 38700123. DOI: 10.1111/cogs.13432.
Abstract
More than 50 years ago, Bongard introduced 100 visual concept learning problems as a challenge for artificial vision systems. These problems are now known as Bongard problems. Although they are well known in cognitive science and artificial intelligence, only very little progress has been made toward building systems that can solve a substantial subset of them. In the system presented here, visual features are extracted through image processing and then translated into a symbolic visual vocabulary. We introduce a formal language that allows representing compositional visual concepts based on this vocabulary. Using this language and Bayesian inference, concepts can be induced from the examples that are provided in each problem. We find a reasonable agreement between the concepts with high posterior probability and the solutions formulated by Bongard himself for a subset of 35 problems. While this approach is far from solving Bongard problems like humans, it does considerably better than previous approaches. We discuss the issues we encountered while developing this system and their continuing relevance for understanding visual cognition. For instance, contrary to other concept learning problems, the examples are not random in Bongard problems; instead they are carefully chosen to ensure that the concept can be induced, and we found it helpful to take the resulting pragmatic constraints into account.
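The induction step described above can be sketched with a deliberately tiny symbolic vocabulary. Everything below (the feature dictionaries, the three candidate rules, the noise-free likelihood) is our own toy stand-in, far simpler than the paper's formal language, but it shows how Bayesian inference over a hypothesis space picks out the concept separating the left examples from the right ones:

```python
from fractions import Fraction

# Candidate rules over symbolic scene descriptions (illustrative only).
hypotheses = {
    "triangle": lambda s: s["shape"] == "triangle",
    "large":    lambda s: s["size"] > 5,
    "dark":     lambda s: s["shade"] == "dark",
}

left = [{"shape": "triangle", "size": 2, "shade": "dark"},
        {"shape": "triangle", "size": 7, "shade": "light"}]
right = [{"shape": "circle", "size": 6, "shade": "dark"},
         {"shape": "square", "size": 1, "shade": "light"}]

def posterior(hypotheses, left, right):
    """Uniform prior; likelihood 1 if the rule holds for every left
    scene and for no right scene, else 0 (noise-free Bayes filtering)."""
    scores = {name: Fraction(1, len(hypotheses))
              for name, rule in hypotheses.items()
              if all(rule(s) for s in left) and not any(rule(s) for s in right)}
    total = sum(scores.values())
    return {h: p / total for h, p in scores.items()}

print(posterior(hypotheses, left, right))  # {'triangle': Fraction(1, 1)}
```

Note how the pragmatic point from the abstract shows up even here: the examples were chosen so that exactly one hypothesis survives; with carelessly chosen examples several rules would remain tied.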
Affiliation(s)
- Constantin A. Rothkopf
- Centre for Cognitive Science & Institute of Psychology, Technische Universität Darmstadt
- Frankfurt Institute for Advanced Studies, Frankfurt am Main
- Frank Jäkel
- Centre for Cognitive Science & Institute of Psychology, Technische Universität Darmstadt
3. Linsley D, Serre T. Fixing the problems of deep neural networks will require better training data and learning algorithms. Behav Brain Sci 2023; 46:e400. PMID: 38054333. DOI: 10.1017/s0140525x23001589.
Abstract
Bowers et al. argue that deep neural networks (DNNs) are poor models of biological vision because they often learn to rival human accuracy by relying on strategies that differ markedly from those of humans. We show that this problem is worsening as DNNs are becoming larger-scale and increasingly more accurate, and prescribe methods for building DNNs that can reliably model biological vision.
Affiliation(s)
- Drew Linsley
- Department of Cognitive, Linguistic & Psychological Sciences, Carney Institute for Brain Science, Brown University, Providence, RI, USA. https://sites.brown.edu/drewlinsley; https://serre-lab.clps.brown.edu
- Thomas Serre
- Department of Cognitive, Linguistic & Psychological Sciences, Carney Institute for Brain Science, Brown University, Providence, RI, USA. https://sites.brown.edu/drewlinsley; https://serre-lab.clps.brown.edu
4. Holzinger A, Saranti A, Angerschmid A, Finzel B, Schmid U, Mueller H. Toward human-level concept learning: Pattern benchmarking for AI algorithms. Patterns (N Y) 2023; 4:100788. PMID: 37602217. PMCID: PMC10435961. DOI: 10.1016/j.patter.2023.100788.
Abstract
Artificial intelligence (AI) today is very successful at standard pattern-recognition tasks due to the availability of large amounts of data and advances in statistical data-driven machine learning. However, there is still a large gap between AI pattern recognition and human-level concept learning. Humans can learn amazingly well even under uncertainty from just a few examples and are capable of generalizing these concepts to solve new conceptual problems. The growing interest in explainable machine intelligence requires experimental environments and diagnostic/benchmark datasets to analyze existing approaches and drive progress in pattern analysis and machine intelligence. In this paper, we provide an overview of current AI solutions for benchmarking concept learning, reasoning, and generalization; discuss the state-of-the-art of existing diagnostic/benchmark datasets (such as CLEVR, CLEVRER, CLOSURE, CURI, Bongard-LOGO, V-PROM, RAVEN, Kandinsky Patterns, CLEVR-Humans, CLEVRER-Humans, and their extension containing human language); and provide an outlook of some future research directions in this exciting research domain.
Affiliation(s)
- Andreas Holzinger
- Human-Centered AI Lab, University of Natural Resources & Life Sciences Vienna, Vienna, Austria
- Medical University Graz, Graz, Austria
- Anna Saranti
- Human-Centered AI Lab, University of Natural Resources & Life Sciences Vienna, Vienna, Austria
- Medical University Graz, Graz, Austria
- Alessa Angerschmid
- Human-Centered AI Lab, University of Natural Resources & Life Sciences Vienna, Vienna, Austria
- Medical University Graz, Graz, Austria
5
|
$$\alpha$$ILP: thinking visual scenes as differentiable logic programs. Mach Learn 2023. [DOI: 10.1007/s10994-023-06320-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/16/2023]
Abstract
Deep neural learning has shown remarkable performance at learning representations for visual object categorization. However, deep neural networks such as CNNs do not explicitly encode objects and relations among them. This limits their success on tasks that require a deep logical understanding of visual scenes, such as Kandinsky patterns and Bongard problems. To overcome these limitations, we introduce αILP, a novel differentiable inductive logic programming framework that learns to represent scenes as logic programs: intuitively, logical atoms correspond to objects, attributes, and relations, and clauses encode high-level scene information. αILP has an end-to-end reasoning architecture from visual inputs. Using it, αILP performs differentiable inductive logic programming on complex visual scenes, i.e., the logical rules are learned by gradient descent. Our extensive experiments on Kandinsky patterns and CLEVR-Hans benchmarks demonstrate the accuracy and efficiency of αILP in learning complex visual-logical concepts.
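The core trick behind making logic differentiable can be shown in miniature: ground atoms carry probabilities (e.g. from a perception module), and logical connectives are replaced by smooth operations, so a rule's truth score has gradients with respect to those probabilities. The scene, predicates, and the product t-norm / noisy-or choices below are our own toy sketch, not αILP's actual machinery:

```python
import numpy as np

# Soft attribute detections for two objects in one scene.
p_red = np.array([0.9, 0.2])      # P(red(obj_i))
p_circle = np.array([0.8, 0.95])  # P(circle(obj_i))

def soft_exists_red_circle(p_red, p_circle):
    """Soft truth of 'exists X: red(X) and circle(X)', using the
    product t-norm for conjunction and noisy-or for the existential."""
    conj = p_red * p_circle           # soft 'and' per object
    return 1.0 - np.prod(1.0 - conj)  # soft 'exists' over objects

score = soft_exists_red_circle(p_red, p_circle)
# conj = [0.72, 0.19]; score = 1 - 0.28 * 0.81 = 0.7732
assert abs(score - 0.7732) < 1e-9
```

Because every operation here is differentiable, a rule's score can be pushed up or down by gradient descent, which is the sense in which "logical rules are learned by gradient descent" in the abstract.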
6. Kirubeswaran OR, Storrs KR. Inconsistent illusory motion in predictive coding deep neural networks. Vision Res 2023; 206:108195. PMID: 36801664. DOI: 10.1016/j.visres.2023.108195.
Abstract
Why do we perceive illusory motion in some static images? Several accounts point to eye movements, response latencies to different image elements, or interactions between image patterns and motion energy detectors. Recently, PredNet, a recurrent deep neural network (DNN) based on predictive coding principles, was reported to reproduce the "Rotating Snakes" illusion, suggesting a role for predictive coding. We begin by replicating this finding, then use a series of "in silico" psychophysics and electrophysiology experiments to examine whether PredNet behaves consistently with human observers and non-human primate neural data. A pretrained PredNet predicted illusory motion for all subcomponents of the Rotating Snakes pattern, consistent with human observers. However, we found no simple response delays in internal units, unlike evidence from electrophysiological data. PredNet's detection of motion in gradients appeared to depend on contrast, whereas in humans it depends predominantly on luminance. Finally, we examined the robustness of the illusion across ten PredNets of identical architecture, retrained on the same video data. There was large variation across network instances in whether they reproduced the Rotating Snakes illusion, and what motion, if any, they predicted for simplified variants. Unlike human observers, no network predicted motion for greyscale variants of the Rotating Snakes pattern. Our results sound a cautionary note: even when a DNN successfully reproduces some idiosyncrasy of human vision, more detailed investigation can reveal inconsistencies between humans and the network, and between different instances of the same network. These inconsistencies suggest that predictive coding does not reliably give rise to human-like illusory motion.
Affiliation(s)
- Katherine R Storrs
- Department of Experimental Psychology, Justus Liebig University Giessen, Germany; Centre for Mind, Brain and Behaviour (CMBB), University of Marburg and Justus Liebig University Giessen, Germany; School of Psychology, University of Auckland, New Zealand
7. Zerroug A, Vaishnav M, Colin J, Musslick S, Serre T. A Benchmark for Compositional Visual Reasoning. Adv Neural Inf Process Syst 2022; 35:29776-29788. PMID: 37534101. PMCID: PMC10396074.
Abstract
A fundamental component of human vision is our ability to parse complex visual scenes and judge the relations between their constituent objects. AI benchmarks for visual reasoning have driven rapid progress in recent years with state-of-the-art systems now reaching human accuracy on some of these benchmarks. Yet, there remains a major gap between humans and AI systems in terms of the sample efficiency with which they learn new visual reasoning tasks. Humans' remarkable efficiency at learning has been at least partially attributed to their ability to harness compositionality - allowing them to efficiently take advantage of previously gained knowledge when learning new tasks. Here, we introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress towards the development of more data-efficient learning algorithms. We take inspiration from fluid intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abstract rules and generating image datasets corresponding to these rules at scale. Our proposed benchmark includes measures of sample efficiency, generalization, compositionality, and transfer across task rules. We systematically evaluate modern neural architectures and find that convolutional architectures surpass transformer-based architectures across all performance measures in most data regimes. However, all computational models are much less data efficient than humans, even after learning informative visual representations using self-supervision. Overall, we hope our challenge will spur interest in developing neural architectures that can learn to harness compositionality for more efficient learning.
Affiliation(s)
- Aimen Zerroug
- Artificial and Natural Intelligence Toulouse Institute, Université de Toulouse, France
- Carney Institute for Brain Science, Dept. of Cognitive Linguistic & Psychological Sciences, Brown University, Providence, RI 02912
- Centre de Recherche Cerveau et Cognition, CNRS, Université de Toulouse, France
- Mohit Vaishnav
- Artificial and Natural Intelligence Toulouse Institute, Université de Toulouse, France
- Carney Institute for Brain Science, Dept. of Cognitive Linguistic & Psychological Sciences, Brown University, Providence, RI 02912
- Centre de Recherche Cerveau et Cognition, CNRS, Université de Toulouse, France
- Julien Colin
- Carney Institute for Brain Science, Dept. of Cognitive Linguistic & Psychological Sciences, Brown University, Providence, RI 02912
- Sebastian Musslick
- Carney Institute for Brain Science, Dept. of Cognitive Linguistic & Psychological Sciences, Brown University, Providence, RI 02912
- Thomas Serre
- Artificial and Natural Intelligence Toulouse Institute, Université de Toulouse, France
- Carney Institute for Brain Science, Dept. of Cognitive Linguistic & Psychological Sciences, Brown University, Providence, RI 02912
8. Puebla G, Bowers JS. Can deep convolutional neural networks support relational reasoning in the same-different task? J Vis 2022; 22:11. PMID: 36094524. PMCID: PMC9482325. DOI: 10.1167/jov.22.10.11.
Abstract
Same-different visual reasoning is a basic skill central to abstract combinatorial thought. This fact has led neural network researchers to test same-different classification on deep convolutional neural networks (DCNNs), which has resulted in a controversy regarding whether this skill is within the capacity of these models. However, most tests of same-different classification rely on testing on images that come from the same pixel-level distribution as the training images, rendering the results inconclusive. In this study, we tested relational same-different reasoning in DCNNs. In a series of simulations we show that models based on the ResNet architecture are capable of visual same-different classification, but only when the test images are similar to the training images at the pixel level. In contrast, when there is a shift in the testing distribution that does not change the relation between the objects in the image, the performance of DCNNs decreases substantially. This finding holds even when the DCNNs' training regime is expanded to include images taken from a wide range of different pixel-level distributions or when the model is trained on the testing distribution but on a different task in a multitask learning context. Furthermore, we show that the relation network, a deep learning architecture specifically designed to tackle visual relational reasoning problems, suffers the same kind of limitations. Overall, the results of this study suggest that learning same-different relations is beyond the scope of current DCNNs.
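The distribution-shift test at the heart of this design can be sketched abstractly: the same-different relation is defined independently of the "pixel" values it is instantiated over, so training and test sets can use disjoint value distributions while the relation stays intact. The generator below is our own simplified stand-in (tuples of values rather than images):

```python
import random

def make_trial(same, palette):
    """Generate a toy same-different trial: two 'objects', each a tuple
    of four values drawn from the given palette. The label depends only
    on the relation, not on the palette, so we can train on one
    value distribution and test on another."""
    a = tuple(random.choice(palette) for _ in range(4))
    if same:
        b = a
    else:
        while True:
            b = tuple(random.choice(palette) for _ in range(4))
            if b != a:
                break
    return a, b, same

random.seed(0)
train = [make_trial(i % 2 == 0, palette=[0, 1, 2]) for i in range(100)]
test = [make_trial(i % 2 == 0, palette=[7, 8, 9]) for i in range(100)]

# The abstract relation is preserved under the distribution shift:
assert all((a == b) == label for a, b, label in train + test)
```

A model that latches onto palette-specific pixel statistics can fit `train` yet fail on `test`, which is the failure mode the abstract reports for ResNet-based DCNNs.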
9
|
Methods for Facial Expression Recognition with Applications in Challenging Situations. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:9261438. [PMID: 35665283 PMCID: PMC9159845 DOI: 10.1155/2022/9261438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/21/2022] [Revised: 04/12/2022] [Accepted: 04/18/2022] [Indexed: 11/17/2022]
Abstract
In the last few years, a great deal of interesting research has been conducted on automatic facial emotion recognition (FER). FER has been used in a number of ways to improve human-machine interaction, including human-centered computing and the emerging trend of emotional artificial intelligence (EAI). Researchers in the EAI field aim to make computers better at predicting and analyzing the facial expressions and behavior of humans under different scenarios and cases. Deep learning has had the greatest influence on this field, since neural networks have evolved significantly in recent years and, accordingly, different architectures are being developed to solve increasingly difficult problems. This article addresses the latest advances in computational-intelligence-based automated emotion recognition using recent deep learning models. We show that deep learning-based FER models and architecture-related resources, such as databases, can work together to deliver highly accurate results.
10. Neri P. Deep networks may capture biological behavior for shallow, but not deep, empirical characterizations. Neural Netw 2022; 152:244-266. PMID: 35567948. DOI: 10.1016/j.neunet.2022.04.023.
Abstract
We assess whether deep convolutional networks (DCN) can account for a most fundamental property of human vision: detection/discrimination of elementary image elements (bars) at different contrast levels. The human visual process can be characterized to varying degrees of "depth," ranging from percentage of correct detection to detailed tuning and operating characteristics of the underlying perceptual mechanism. We challenge deep networks with the same stimuli/tasks used with human observers and apply equivalent characterization of the stimulus-response coupling. In general, we find that popular DCN architectures do not account for signature properties of the human process. For shallow depth of characterization, some variants of network-architecture/training-protocol produce human-like trends; however, more articulate empirical descriptors expose glaring discrepancies. Networks can be coaxed into learning those richer descriptors by shadowing a human surrogate in the form of a tailored circuit perturbed by unstructured input, thus ruling out the possibility that human-model misalignment in standard protocols may be attributable to insufficient representational power. These results urge caution in assessing whether neural networks do or do not capture human behavior: ultimately, our ability to assess "success" in this area can only be as good as afforded by the depth of behavioral characterization against which the network is evaluated. We propose a novel set of metrics/protocols that impose stringent constraints on the evaluation of DCN behavior as an adequate approximation to biological processes.
Affiliation(s)
- Peter Neri
- Laboratoire des Systèmes Perceptifs (UMR8248), École normale supérieure, PSL Research University, Paris, France.
11. Vaishnav M, Cadene R, Alamia A, Linsley D, VanRullen R, Serre T. Understanding the Computational Demands Underlying Visual Reasoning. Neural Comput 2022; 34:1075-1099. PMID: 35231926. DOI: 10.1162/neco_a_01485.
Abstract
Visual understanding requires comprehending complex visual relations between objects within a scene. Here, we seek to characterize the computational demands for abstract visual reasoning. We do this by systematically assessing the ability of modern deep convolutional neural networks (CNNs) to learn to solve the synthetic visual reasoning test (SVRT) challenge, a collection of 23 visual reasoning problems. Our analysis reveals a novel taxonomy of visual reasoning tasks, which can be primarily explained by both the type of relations (same-different versus spatial-relation judgments) and the number of relations used to compose the underlying rules. Prior cognitive neuroscience work suggests that attention plays a key role in humans' visual reasoning ability. To test this hypothesis, we extended the CNNs with spatial and feature-based attention mechanisms. In a second series of experiments, we evaluated the ability of these attention networks to learn to solve the SVRT challenge and found the resulting architectures to be much more efficient at solving the hardest of these visual reasoning tasks. Most important, the corresponding improvements on individual tasks partially explained our novel taxonomy. Overall, this work provides a granular computational account of visual reasoning and yields testable neuroscience predictions regarding the differential need for feature-based versus spatial attention depending on the type of visual reasoning problem.
Affiliation(s)
- Mohit Vaishnav
- Artificial and Natural Intelligence Toulouse Institute, Université de Toulouse, 31052 Toulouse, France; Carney Institute for Brain Science, Department of Cognitive Linguistic and Psychological Sciences, Brown University, Providence, RI 02912, U.S.A.
- Remi Cadene
- Carney Institute for Brain Science, Department of Cognitive Linguistic and Psychological Sciences, Brown University, Providence, RI 02912, U.S.A.
- Andrea Alamia
- Centre de Recherche Cerveau et Cognition, CNRS, Université de Toulouse, 31052 Toulouse, France
- Drew Linsley
- Carney Institute for Brain Science, Department of Cognitive Linguistic and Psychological Sciences, Brown University, Providence, RI 02912, U.S.A.
- Rufin VanRullen
- Artificial and Natural Intelligence Toulouse Institute, Université de Toulouse, and Centre de Recherche Cerveau et Cognition, CNRS, Université de Toulouse, 31052 Toulouse, France
- Thomas Serre
- Artificial and Natural Intelligence Toulouse Institute, Université de Toulouse, 31052 Toulouse, France; Carney Institute for Brain Science, Department of Cognitive Linguistic and Psychological Sciences, Brown University, Providence, RI 02912, U.S.A.
12. Teodorescu L, Hofmann K, Oudeyer PY. SpatialSim: Recognizing Spatial Configurations of Objects With Graph Neural Networks. Front Artif Intell 2022; 4:782081. PMID: 35156011. PMCID: PMC8826049. DOI: 10.3389/frai.2021.782081.
Abstract
An embodied, autonomous agent able to set its own goals has to possess geometrical reasoning abilities for judging whether its goals have been achieved, namely it should be able to identify and discriminate classes of configurations of objects, irrespective of its point of view on the scene. However, this problem has received little attention so far in the deep learning literature. In this paper we make two key contributions. First, we propose SpatialSim (Spatial Similarity), a novel geometrical reasoning diagnostic dataset, and argue that progress on this benchmark would allow for diagnosing more principled approaches to this problem. This benchmark is composed of two tasks: “Identification” and “Discrimination,” each one instantiated in increasing levels of difficulty. Secondly, we validate that relational inductive biases—exhibited by fully-connected message-passing Graph Neural Networks (MPGNNs)—are instrumental to solve those tasks, and show their advantages over less relational baselines such as Deep Sets and unstructured models such as Multi-Layer Perceptrons. We additionally showcase the failure of high-capacity CNNs on the hard Discrimination task. Finally, we highlight the current limits of GNNs in both tasks.
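The relational inductive bias of a fully-connected MPGNN can be shown in a few lines of NumPy: every node aggregates messages computed from (its own state, every other node's state), so pairwise spatial relations are available by construction and the output respects object permutations. Weights, sizes, and the sum aggregation below are our own illustrative choices, not SpatialSim's model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d = 5, 3
x = rng.normal(size=(n_nodes, d))  # node features, e.g. object positions
W = rng.normal(size=(2 * d, d))    # message function: one linear layer + tanh

def mp_layer(x, W):
    """One fully-connected message-passing step."""
    n = len(x)
    self_rep = np.repeat(x, n, axis=0)   # receiver state for each pair (i, j)
    other_rep = np.tile(x, (n, 1))       # sender state for each pair (i, j)
    msgs = np.tanh(np.concatenate([self_rep, other_rep], axis=1) @ W)
    return msgs.reshape(n, n, -1).sum(axis=1)  # node i sums messages over all j

h = mp_layer(x, W)
assert h.shape == (n_nodes, d)

# Permutation equivariance: relabeling the objects permutes the output
# the same way, so the representation depends on the configuration,
# not on an arbitrary object ordering.
perm = rng.permutation(n_nodes)
assert np.allclose(mp_layer(x[perm], W), h[perm])
```

The second assertion is the property that separates this family of models from the unstructured MLP baselines mentioned in the abstract, which must learn any such invariance from data.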
Affiliation(s)
- Laetitia Teodorescu
- Flowers Team, Inria Bordeaux, Talence, France
- *Correspondence: Laetitia Teodorescu
13. Baker N, Garrigan P, Phillips A, Kellman PJ. Configural relations in humans and deep convolutional neural networks. Front Artif Intell 2022; 5:961595. PMID: 36937367. PMCID: PMC10014814. DOI: 10.3389/frai.2022.961595.
Abstract
Deep convolutional neural networks (DCNNs) have attracted considerable interest as useful devices and as possible windows into understanding perception and cognition in biological systems. In earlier work, we showed that DCNNs differ dramatically from human perceivers in that they have no sensitivity to global object shape. Here, we investigated whether those findings are symptomatic of broader limitations of DCNNs regarding the use of relations. We tested learning and generalization of DCNNs (AlexNet and ResNet-50) for several relations involving objects. One involved classifying two shapes in an otherwise empty field as same or different. Another involved enclosure. Every display contained a closed figure among contour noise fragments and one dot; correct responding depended on whether the dot was inside or outside the figure. The third relation we tested involved a classification that depended on which of two polygons had more sides. One polygon always contained a dot, and correct classification of each display depended on whether the polygon with the dot had a greater number of sides. We used DCNNs that had been trained on the ImageNet database, and we used both restricted and unrestricted transfer learning (connection weights at all layers could change with training). For the same-different experiment, there was little restricted transfer learning (82.2%). Generalization tests showed near chance performance for new shapes. Results for enclosure were at chance for restricted transfer learning and somewhat better for unrestricted (74%). Generalization with two new kinds of shapes showed reduced but above-chance performance (≈66%). Follow-up studies indicated that the networks did not access the enclosure relation in their responses. 
For the relation of more or fewer sides of polygons, DCNNs showed successful learning with polygons having 3-5 sides under unrestricted transfer learning, but showed chance performance in generalization tests with polygons having 6-10 sides. Experiments with human observers showed learning from relatively few examples of all of the relations tested and complete generalization of relational learning to new stimuli. These results using several different relations suggest that DCNNs have crucial limitations that derive from their lack of computations involving abstraction and relational processing of the sort that are fundamental in human perception.
Affiliation(s)
- Nicholas Baker
- Department of Psychology, Loyola University Chicago, Chicago, IL, United States
- Patrick Garrigan
- Department of Psychology, Saint Joseph's University, Philadelphia, PA, United States
- Austin Phillips
- UCLA Human Perception Laboratory, Department of Psychology, University of California, Los Angeles, Los Angeles, CA, United States
- Philip J. Kellman
- UCLA Human Perception Laboratory, Department of Psychology, University of California, Los Angeles, Los Angeles, CA, United States
- *Correspondence: Philip J. Kellman
14. Stabinger S, Peer D, Piater J, Rodríguez-Sánchez A. Evaluating the progress of deep learning for visual relational concepts. J Vis 2021; 21:8. PMID: 34636844. PMCID: PMC8525837. DOI: 10.1167/jov.21.11.8.
Abstract
Convolutional neural networks have become the state-of-the-art method for image classification in the last 10 years. Despite the fact that they achieve superhuman classification accuracy on many popular datasets, they often perform much worse on more abstract image classification tasks. We will show that these difficult tasks are linked to relational concepts from cognitive psychology and that despite progress over the last few years, such relational reasoning tasks still remain difficult for current neural network architectures. We will review deep learning research that is linked to relational concept learning, even if it was not originally presented from this angle. Reviewing the current literature, we will argue that some form of attention will be an important component of future systems to solve relational tasks. In addition, we will point out the shortcomings of currently used datasets, and we will recommend steps to make future datasets more relevant for testing systems on relational reasoning.
Affiliation(s)
- David Peer
- Universität Innsbruck, Innsbruck, Austria
- https://iis.uibk.ac.at
- Justus Piater
- Universität Innsbruck, Innsbruck, Austria
- https://iis.uibk.ac.at
15. Villalobos K, Štih V, Ahmadinejad A, Sundaram S, Dozier J, Francl A, Azevedo F, Sasaki T, Boix X. Do Neural Networks for Segmentation Understand Insideness? Neural Comput 2021; 33:2511-2549. PMID: 34412113. DOI: 10.1162/neco_a_01413.
Abstract
The insideness problem is an aspect of image segmentation that consists of determining which pixels are inside and outside a region. Deep neural networks (DNNs) excel in segmentation benchmarks, but it is unclear if they have the ability to solve the insideness problem as it requires evaluating long-range spatial dependencies. In this letter, we analyze the insideness problem in isolation, without texture or semantic cues, such that other aspects of segmentation do not interfere in the analysis. We demonstrate that DNNs for segmentation with few units have sufficient complexity to solve the insideness for any curve. Yet such DNNs have severe problems with learning general solutions. Only recurrent networks trained with small images learn solutions that generalize well to almost any curve. Recurrent networks can decompose the evaluation of long-range dependencies into a sequence of local operations, and learning with small images alleviates the common difficulties of training recurrent networks with a large number of unrolling steps.
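The claim that recurrence can decompose this long-range dependency into a sequence of local operations has a classic concrete form: iterated flood-fill of an "outside" label from the image border, blocked by the curve. The sketch below is our own illustration of that decomposition, not the paper's network, but each iteration is a local update a recurrent unit could implement:

```python
import numpy as np

def insideness(curve):
    """curve: binary int array, 1 on a closed curve. Returns 1 where a
    pixel is inside the curve, 0 where it is outside (curve pixels
    count as inside)."""
    outside = np.zeros_like(curve)
    outside[0, :] = outside[-1, :] = 1  # seed: the image border is outside
    outside[:, 0] = outside[:, -1] = 1
    outside &= 1 - curve
    while True:  # iterate purely local updates to a fixpoint
        grown = outside.copy()
        grown[1:, :] |= outside[:-1, :]  # propagate from the 4 neighbors
        grown[:-1, :] |= outside[1:, :]
        grown[:, 1:] |= outside[:, :-1]
        grown[:, :-1] |= outside[:, 1:]
        grown &= 1 - curve               # labels never cross the curve
        if np.array_equal(grown, outside):
            break
        outside = grown
    return 1 - outside

curve = np.zeros((7, 7), dtype=int)
curve[1:6, 1] = curve[1:6, 5] = 1  # a closed 5x5 square outline
curve[1, 1:6] = curve[5, 1:6] = 1
inside = insideness(curve)
assert inside[3, 3] == 1 and inside[0, 0] == 0
```

The number of iterations needed grows with image size, which echoes the paper's point about training recurrent networks on small images to limit the number of unrolling steps.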
Affiliation(s)
- Vilim Štih: Center for Brains, Minds and Machines, MIT, Cambridge, MA 02139, U.S.A., and Max Planck Institute of Neurobiology, 82152 Martinsried, Germany
- Shobhita Sundaram: Center for Brains, Minds and Machines, MIT, Cambridge, MA 02139, U.S.A.
- Jamell Dozier: Center for Brains, Minds and Machines, MIT, Cambridge, MA 02139, U.S.A.
- Andrew Francl: Center for Brains, Minds and Machines, MIT, Cambridge, MA 02139, U.S.A.
- Frederico Azevedo: Center for Brains, Minds and Machines, MIT, Cambridge, MA 02139, U.S.A.
- Tomotake Sasaki: Fujitsu Laboratories, Kawasaki 211-8588, Japan, and Center for Brains, Minds and Machines, MIT, Cambridge, MA 02139, U.S.A.
- Xavier Boix: Center for Brains, Minds and Machines, MIT, Cambridge, MA 02139, U.S.A.
16
Arguments for the unsuitability of convolutional neural networks for non-local tasks. Neural Netw 2021; 142:171-179. [PMID: 34000564 DOI: 10.1016/j.neunet.2021.05.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2020] [Revised: 02/12/2021] [Accepted: 05/03/2021] [Indexed: 11/22/2022]
Abstract
Convolutional neural networks have established themselves over the past years as the state-of-the-art method for image classification, and for many datasets they even surpass humans in categorizing images. Unfortunately, the same architectures perform much worse when they have to compare parts of an image to each other to correctly classify that image. Until now, no well-formed theoretical argument has been presented to explain this deficiency. In this paper, we will argue that convolutional layers are of little use for such problems, since comparison tasks are global by nature, but convolutional layers are local by design. We will use this insight to reformulate a comparison task into a sorting task and use findings on sorting networks to propose a lower bound for the number of parameters a neural network needs to solve comparison tasks in a generalizable way. We will use this lower bound to argue that attention, as well as iterative/recurrent processing, is needed to prevent a combinatorial explosion.
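The core argument, that comparison tasks are global while convolutional features are local, can be illustrated directly. In the hypothetical construction below (ours, not the paper's), two images share exactly the same multiset of local strips yet receive opposite labels on the comparison task "does the left half equal the right half?". Any representation that pools local features over position, as a convolutional feature map followed by global pooling does, is therefore blind to this task.

```python
# Sketch of why purely local features cannot solve global comparison tasks.
import numpy as np

# A 4x4 binary patch with distinct rows.
P = np.array([[0, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 1, 0]])
same = np.concatenate([P, P], axis=1)        # left half equals right half
diff = np.concatenate([P, P[::-1]], axis=1)  # right half is P with rows reversed

def bag_of_strips(img):
    """A toy 'local' representation: the multiset of non-overlapping 1x4
    strips, with position discarded (sorted)."""
    strips = [tuple(int(v) for v in img[y, x:x + 4])
              for y in range(img.shape[0]) for x in (0, 4)]
    return tuple(sorted(strips))

# The local summaries are identical, yet the global labels differ:
local_summaries_equal = bag_of_strips(same) == bag_of_strips(diff)
label_same = np.array_equal(same[:, :4], same[:, 4:])   # True
label_diff = np.array_equal(diff[:, :4], diff[:, 4:])   # False
```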
17
Hafri A, Firestone C. The Perception of Relations. Trends Cogn Sci 2021; 25:475-492. [PMID: 33812770 DOI: 10.1016/j.tics.2021.01.006] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2020] [Revised: 01/05/2021] [Accepted: 01/18/2021] [Indexed: 11/16/2022]
Abstract
The world contains not only objects and features (red apples, glass bowls, wooden tables), but also relations holding between them (apples contained in bowls, bowls supported by tables). Representations of these relations are often developmentally precocious and linguistically privileged; but how does the mind extract them in the first place? Although relations themselves cast no light onto our eyes, a growing body of work suggests that even very sophisticated relations display key signatures of automatic visual processing. Across physical, eventive, and social domains, relations such as support, fit, cause, chase, and even socially interact are extracted rapidly, are impossible to ignore, and influence other perceptual processes. Sophisticated and structured relations are not only judged and understood, but also seen - revealing surprisingly rich content in visual perception itself.
Affiliation(s)
- Alon Hafri: Department of Psychological and Brain Sciences, Johns Hopkins University, Baltimore, MD 21218, USA; Department of Cognitive Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Chaz Firestone: Department of Psychological and Brain Sciences, Johns Hopkins University, Baltimore, MD 21218, USA; Department of Cognitive Science, Johns Hopkins University, Baltimore, MD 21218, USA; Department of Philosophy, Johns Hopkins University, Baltimore, MD 21218, USA
18
Funke CM, Borowski J, Stosio K, Brendel W, Wallis TSA, Bethge M. Five points to check when comparing visual perception in humans and machines. J Vis 2021; 21:16. [PMID: 33724362 PMCID: PMC7980041 DOI: 10.1167/jov.21.3.16] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2020] [Accepted: 12/02/2020] [Indexed: 11/24/2022] Open
Abstract
With the rise of machines to human-level performance in complex recognition tasks, a growing amount of work is directed toward comparing information processing in humans and machines. These studies are an exciting chance to learn about one system by studying the other. Here, we propose ideas on how to design, conduct, and interpret experiments such that they adequately support the investigation of mechanisms when comparing human and machine perception. We demonstrate and apply these ideas through three case studies. The first case study shows how human bias can affect the interpretation of results and that several analytic tools can help to overcome this human reference point. In the second case study, we highlight the difference between necessary and sufficient mechanisms in visual reasoning tasks. Thereby, we show that contrary to previous suggestions, feedback mechanisms might not be necessary for the tasks in question. The third case study highlights the importance of aligning experimental conditions. We find that a previously observed difference in object recognition does not hold when adapting the experiment to make conditions more equitable between humans and machines. In presenting a checklist for comparative studies of visual reasoning in humans and machines, we hope to highlight how to overcome potential pitfalls in design and inference.
Affiliation(s)
- Karolina Stosio: University of Tübingen, Tübingen, Germany; Bernstein Center for Computational Neuroscience, Tübingen and Berlin, Germany; Volkswagen Group Machine Learning Research Lab, Munich, Germany
- Wieland Brendel: University of Tübingen, Tübingen, Germany; Bernstein Center for Computational Neuroscience, Tübingen and Berlin, Germany; Werner Reichardt Centre for Integrative Neuroscience, Tübingen, Germany
- Thomas S A Wallis: University of Tübingen, Tübingen, Germany; present address: Amazon.com, Tübingen
- Matthias Bethge: University of Tübingen, Tübingen, Germany; Bernstein Center for Computational Neuroscience, Tübingen and Berlin, Germany; Werner Reichardt Centre for Integrative Neuroscience, Tübingen, Germany
19
Differential Involvement of EEG Oscillatory Components in Sameness versus Spatial-Relation Visual Reasoning Tasks. eNeuro 2021; 8:ENEURO.0267-20.2020. [PMID: 33239271 PMCID: PMC7877474 DOI: 10.1523/eneuro.0267-20.2020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2020] [Revised: 10/20/2020] [Accepted: 10/21/2020] [Indexed: 11/21/2022] Open
Abstract
The development of deep convolutional neural networks (CNNs) has recently led to great successes in computer vision, and CNNs have become de facto computational models of vision. However, a growing body of work suggests that they exhibit critical limitations on tasks beyond image categorization. Here, we study one such fundamental limitation, concerning the judgment of whether two simultaneously presented items are the same or different (SD) compared with a baseline assessment of their spatial relationship (SR). In both human subjects and artificial neural networks, we test the prediction that SD tasks recruit additional cortical mechanisms which underlie critical aspects of visual cognition that are not explained by current computational models. We thus recorded electroencephalography (EEG) signals from human participants engaged in the same tasks as the computational models. Importantly, in humans the two tasks were matched in terms of difficulty by an adaptive psychometric procedure; yet, on top of a modulation of evoked potentials (EPs), our results revealed higher activity in the low β (16–24 Hz) band in the SD compared with the SR conditions. We surmise that these oscillations reflect the crucial involvement of additional mechanisms, such as working memory and attention, which are missing in current feed-forward CNNs.
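A minimal generator for the two task families makes the contrast explicit. This is our construction for illustration, not the study's actual stimuli or adaptive procedure: the same display of two items yields one label under the same-different (SD) judgement and an independent label under the spatial-relation (SR) judgement.

```python
# Toy stimulus generator contrasting SD and SR judgements (illustrative only).
import random

SHAPES = ["circle", "square", "triangle", "star"]

def make_trial(rng):
    """Place two shapes at distinct random x positions and label both tasks."""
    a, b = rng.choice(SHAPES), rng.choice(SHAPES)
    xa, xb = rng.sample(range(10), 2)  # two distinct horizontal positions
    return {
        "items": [(a, xa), (b, xb)],
        "sd_label": a == b,            # same-different: identical items?
        "sr_label": xa < xb,           # spatial relation: item 1 left of item 2?
    }

rng = random.Random(0)
trials = [make_trial(rng) for _ in range(100)]
```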
20
Abstract
Does the human mind resemble the machines that can behave like it? Biologically inspired machine-learning systems approach "human-level" accuracy in an astounding variety of domains, and even predict human brain activity-raising the exciting possibility that such systems represent the world like we do. However, even seemingly intelligent machines fail in strange and "unhumanlike" ways, threatening their status as models of our minds. How can we know when human-machine behavioral differences reflect deep disparities in their underlying capacities, vs. when such failures are only superficial or peripheral? This article draws on a foundational insight from cognitive science-the distinction between performance and competence-to encourage "species-fair" comparisons between humans and machines. The performance/competence distinction urges us to consider whether the failure of a system to behave as ideally hypothesized, or the failure of one creature to behave like another, arises not because the system lacks the relevant knowledge or internal capacities ("competence"), but instead because of superficial constraints on demonstrating that knowledge ("performance"). I argue that this distinction has been neglected by research comparing human and machine behavior, and that it should be essential to any such comparison. Focusing on the domain of image classification, I identify three factors contributing to the species-fairness of human-machine comparisons, extracted from recent work that equates such constraints. Species-fair comparisons level the playing field between natural and artificial intelligence, so that we can separate more superficial differences from those that may be deep and enduring.
Affiliation(s)
- Chaz Firestone: Department of Psychological and Brain Sciences, Johns Hopkins University, Baltimore, MD 21218
21
Kreiman G, Serre T. Beyond the feedforward sweep: feedback computations in the visual cortex. Ann N Y Acad Sci 2020; 1464:222-241. [PMID: 32112444 DOI: 10.1111/nyas.14320] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2019] [Revised: 01/24/2020] [Accepted: 01/30/2020] [Indexed: 11/28/2022]
Abstract
Visual perception involves the rapid formation of a coarse image representation at the onset of visual processing, which is iteratively refined by late computational processes. These early versus late time windows approximately map onto feedforward and feedback processes, respectively. State-of-the-art convolutional neural networks, the main engine behind recent machine vision successes, are feedforward architectures. Their successes and limitations provide critical information regarding which visual tasks can be solved by purely feedforward processes and which require feedback mechanisms. We provide an overview of recent work in cognitive neuroscience and machine vision that highlights the possible role of feedback processes for both visual recognition and beyond. We conclude by discussing important open questions for future research.
Affiliation(s)
- Gabriel Kreiman: Children's Hospital, Harvard Medical School and Center for Brains, Minds, and Machines, Boston, Massachusetts
- Thomas Serre: Cognitive Linguistic and Psychological Sciences, Carney Institute for Brain Science, Brown University, Providence, Rhode Island
22
Mao J, Yao Y, Heinrich S, Hinz T, Weber C, Wermter S, Liu Z, Sun M. Bootstrapping Knowledge Graphs From Images and Text. Front Neurorobot 2019; 13:93. [PMID: 31798437 PMCID: PMC6861514 DOI: 10.3389/fnbot.2019.00093] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2019] [Accepted: 10/28/2019] [Indexed: 11/17/2022] Open
Abstract
The problem of generating structured Knowledge Graphs (KGs) is difficult and open but relevant to a range of tasks related to decision making and information augmentation. A promising approach is to study generating KGs as a relational representation of inputs (e.g., textual paragraphs or natural images), where nodes represent the entities and edges represent the relations. This procedure is naturally a mixture of two phases: extracting primary relations from input, and completing the KG with reasoning. In this paper, we propose a hybrid KG builder that combines these two phases in a unified framework and generates KGs from scratch. Specifically, we employ a neural relation extractor resolving primary relations from input and a differentiable inductive logic programming (ILP) model that iteratively completes the KG. We evaluate our framework in both textual and visual domains and achieve comparable performance on relation extraction datasets based on Wikidata and the Visual Genome. The framework surpasses neural baselines by a noticeable gap in reasoning out dense KGs and overall performs particularly well for rare relations.
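The completion phase the abstract describes can be illustrated with ordinary forward chaining in place of the differentiable ILP model the authors actually use. The sketch below is a simplified stand-in with a hypothetical `part_of` transitivity rule; the relation names and facts are our assumptions, not from the paper.

```python
# Forward-chaining sketch of iterative knowledge-graph completion.
# Triples are (head, relation, tail); one transitivity rule is applied to a
# fixpoint, mimicking the "complete the KG with reasoning" phase.
def complete(triples, max_iters=10):
    kg = set(triples)
    for _ in range(max_iters):
        # Rule: (a, part_of, b) and (b, part_of, c)  =>  (a, part_of, c)
        new = {(a, "part_of", d)
               for (a, r1, b) in kg if r1 == "part_of"
               for (c, r2, d) in kg if r2 == "part_of" and b == c}
        if new <= kg:          # fixpoint reached: no new facts derived
            break
        kg |= new
    return kg

# Primary relations, as a relation extractor might produce from an image/text.
facts = {("wheel", "part_of", "car"), ("car", "part_of", "traffic_scene")}
kg = complete(facts)
# The rule infers the missing edge ("wheel", "part_of", "traffic_scene").
```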
Affiliation(s)
- Jiayuan Mao: Natural Language Processing Lab, Department of Computer Science and Technology, Tsinghua University, Beijing, China
- Yuan Yao: Natural Language Processing Lab, Department of Computer Science and Technology, Tsinghua University, Beijing, China
- Stefan Heinrich: Knowledge Technology Group, Department of Informatics, University of Hamburg, Hamburg, Germany
- Tobias Hinz: Knowledge Technology Group, Department of Informatics, University of Hamburg, Hamburg, Germany
- Cornelius Weber: Knowledge Technology Group, Department of Informatics, University of Hamburg, Hamburg, Germany
- Stefan Wermter: Knowledge Technology Group, Department of Informatics, University of Hamburg, Hamburg, Germany
- Zhiyuan Liu: Natural Language Processing Lab, Department of Computer Science and Technology, Tsinghua University, Beijing, China
- Maosong Sun: Natural Language Processing Lab, Department of Computer Science and Technology, Tsinghua University, Beijing, China
23
Papadimitriou A, Passalis N, Tefas A. Visual representation decoding from human brain activity using machine learning: A baseline study. Pattern Recognit Lett 2019. [DOI: 10.1016/j.patrec.2019.08.007] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]
24
Abstract
Artificial vision has often been described as one of the key remaining challenges to be solved before machines can act intelligently. Recent developments in a branch of machine learning known as deep learning have catalyzed impressive gains in machine vision—giving a sense that the problem of vision is getting closer to being solved. The goal of this review is to provide a comprehensive overview of recent deep learning developments and to critically assess actual progress toward achieving human-level visual intelligence. I discuss the implications of the successes and limitations of modern machine vision algorithms for biological vision and the prospect for neuroscience to inform the design of future artificial vision systems.
Affiliation(s)
- Thomas Serre: Department of Cognitive Linguistic and Psychological Sciences, Carney Institute for Brain Science, Brown University, Providence, Rhode Island 02818, USA
25
Schofield AJ, Gilchrist ID, Bloj M, Leonardis A, Bellotto N. Understanding images in biological and computer vision. Interface Focus 2018. [DOI: 10.1098/rsfs.2018.0027] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023] Open
Affiliation(s)
- Andrew J. Schofield: School of Psychology, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
- Iain D. Gilchrist: School of Experimental Psychology, University of Bristol, 12A Priory Road, Bristol, BS8 1TU, UK
- Marina Bloj: School of Optometry and Vision Sciences, University of Bradford, Bradford, BD7 1DP, UK
- Ales Leonardis: School of Computer Science, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
- Nicola Bellotto: School of Computer Science, University of Lincoln, Brayford Pool, Lincoln, LN6 7TS, UK