1
Davidson G, Orhan AE, Lake BM. Spatial relation categorization in infants and deep neural networks. Cognition 2024; 245:105690. [PMID: 38330851 DOI: 10.1016/j.cognition.2023.105690] [Received: 04/22/2023] [Revised: 12/06/2023] [Accepted: 12/07/2023]
Abstract
Spatial relations, such as above, below, between, and containment, are important mediators in children's understanding of the world (Piaget, 1954). The development of these relational categories in infancy has been extensively studied (Quinn, 2003), yet little is known about their computational underpinnings. Using developmental tests, we examine the extent to which deep neural networks, pretrained on a standard vision benchmark or on egocentric video captured from one baby's perspective, form categorical representations for visual stimuli depicting relations. Notably, the networks did not receive any explicit training on relations. We then analyze whether these networks recover patterns similar to ones identified in development, such as reproducing the relative difficulty of categorizing different spatial relations and different stimulus abstractions. We find that the networks we evaluate tend to recover many of the patterns observed with the simpler relations of "above versus below" or "between versus outside", but struggle to match developmental findings related to "containment". We identify factors in the choice of model architecture, pretraining data, and experimental design that contribute to the extent to which the networks match developmental patterns, and highlight experimental predictions made by our modeling results. Our results open the door to modeling infants' earliest categorization abilities with modern machine learning tools and demonstrate the utility and productivity of this approach.
Affiliation(s)
- Guy Davidson
- Center for Data Science, New York University, United States of America
- A Emin Orhan
- Center for Data Science, New York University, United States of America
- Brenden M Lake
- Center for Data Science, New York University, United States of America; Department of Psychology, New York University, United States of America
2
Zhou Y, Feinman R, Lake BM. Compositional diversity in visual concept learning. Cognition 2024; 244:105711. [PMID: 38224649 DOI: 10.1016/j.cognition.2023.105711] [Received: 05/27/2023] [Revised: 12/20/2023] [Accepted: 12/21/2023]
Abstract
Humans leverage compositionality to efficiently learn new concepts, understanding how familiar parts can combine to form novel objects. In contrast, popular computer vision models struggle to make the same types of inferences, requiring more data and generalizing less flexibly than people do. Here, we study these distinctively human abilities across a range of different types of visual composition, examining how people classify and generate "alien figures" with rich relational structure. We also develop a Bayesian program induction model which searches for the best programs for generating the candidate visual figures, utilizing a large program space containing different compositional mechanisms and abstractions. In few-shot classification tasks, we find that people and the program induction model can make a range of meaningful compositional generalizations, with the model providing a strong account of the experimental data as well as interpretable parameters that reveal human assumptions about the factors invariant to category membership (here, to rotation and changing part attachment). In few-shot generation tasks, both people and the models are able to construct compelling novel examples, with people behaving in additional structured ways beyond the model capabilities, e.g., making choices that complete a set or reconfigure existing parts in new ways. To capture these additional behavioral patterns, we develop an alternative model based on neuro-symbolic program induction: this model also composes new concepts from existing parts yet, distinctively, it utilizes neural network modules to capture residual statistical structure. Together, our behavioral and computational findings show how people and models can produce a variety of compositional behavior when classifying and generating visual objects.
Affiliation(s)
- Yanli Zhou
- Center for Data Science, New York University, United States of America
- Reuben Feinman
- Center for Neural Science, New York University, United States of America
- Brenden M Lake
- Center for Data Science, New York University, United States of America; Department of Psychology, New York University, United States of America
3
Vong WK, Wang W, Orhan AE, Lake BM. Grounded language acquisition through the eyes and ears of a single child. Science 2024; 383:504-511. [PMID: 38300999 DOI: 10.1126/science.adi1374] [Received: 04/26/2023] [Accepted: 12/31/2023]
Abstract
Starting around 6 to 9 months of age, children begin acquiring their first words, linking spoken words to their visual counterparts. How much of this knowledge is learnable from sensory input with relatively generic learning mechanisms, and how much requires stronger inductive biases? Using longitudinal head-mounted camera recordings from one child aged 6 to 25 months, we trained a relatively generic neural network on 61 hours of correlated visual-linguistic data streams, learning feature-based representations and cross-modal associations. Our model acquires many word-referent mappings present in the child's everyday experience, enables zero-shot generalization to new visual referents, and aligns its visual and linguistic conceptual systems. These results show how critical aspects of grounded word meaning are learnable through joint representation and associative learning from one child's input.
Affiliation(s)
- Wai Keen Vong
- Center for Data Science, New York University, New York, NY, USA
- Wentao Wang
- Center for Data Science, New York University, New York, NY, USA
- A Emin Orhan
- Center for Data Science, New York University, New York, NY, USA
- Brenden M Lake
- Center for Data Science, New York University, New York, NY, USA; Department of Psychology, New York University, New York, NY, USA
4
Lake BM, Baroni M. Human-like systematic generalization through a meta-learning neural network. Nature 2023; 623:115-121. [PMID: 37880371 PMCID: PMC10620072 DOI: 10.1038/s41586-023-06668-3] [Received: 01/04/2023] [Accepted: 09/21/2023]
Abstract
The power of human language and thought arises from systematic compositionality: the algebraic ability to understand and produce novel combinations from known components. Fodor and Pylyshyn famously argued that artificial neural networks lack this capacity and are therefore not viable models of the mind. Neural networks have advanced considerably in the years since, yet the systematicity challenge persists. Here we successfully address Fodor and Pylyshyn's challenge by providing evidence that neural networks can achieve human-like systematicity when optimized for their compositional skills. To do so, we introduce the meta-learning for compositionality (MLC) approach for guiding training through a dynamic stream of compositional tasks. To compare humans and machines, we conducted human behavioural experiments using an instruction learning paradigm. After considering seven different models, we found that, in contrast to perfectly systematic but rigid probabilistic symbolic models, and perfectly flexible but unsystematic neural networks, only MLC achieves both the systematicity and flexibility needed for human-like generalization. MLC also advances the compositional skills of machine learning systems in several systematic generalization benchmarks. Our results show how a standard neural network architecture, optimized for its compositional skills, can mimic human systematic generalization in a head-to-head comparison.
Affiliation(s)
- Brenden M Lake
- Department of Psychology and Center for Data Science, New York University, New York, NY, USA
- Marco Baroni
- Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain; Department of Translation and Language Sciences, Universitat Pompeu Fabra, Barcelona, Spain
5
Wang W, Vong WK, Kim N, Lake BM. Finding Structure in One Child's Linguistic Experience. Cogn Sci 2023; 47:e13305. [PMID: 37358026 DOI: 10.1111/cogs.13305] [Received: 12/05/2022] [Revised: 04/16/2023] [Accepted: 05/22/2023]
Abstract
Neural network models have recently made striking progress in natural language processing, but they are typically trained on orders of magnitude more language input than children receive. What can these neural networks, which are primarily distributional learners, learn from a naturalistic subset of a single child's experience? We examine this question using a recent longitudinal dataset collected from a single child, consisting of egocentric visual data paired with text transcripts. We train both language-only and vision-and-language neural networks and analyze the linguistic knowledge they acquire. In parallel with findings from Jeffrey Elman's seminal work, the neural networks form emergent clusters of words corresponding to syntactic (nouns, transitive and intransitive verbs) and semantic categories (e.g., animals and clothing), based solely on one child's linguistic input. The networks also acquire sensitivity to acceptability contrasts from linguistic phenomena, such as determiner-noun agreement and argument structure. We find that incorporating visual information produces an incremental gain in predicting words in context, especially for syntactic categories that are comparatively more easily grounded, such as nouns and verbs, but the underlying linguistic representations are not fundamentally altered. Our findings demonstrate which kinds of linguistic knowledge are learnable from a snapshot of a single child's real developmental experience.
Affiliation(s)
- Wentao Wang
- Center for Data Science, New York University
- Najoung Kim
- Center for Data Science, New York University; Department of Linguistics, Boston University
- Brenden M Lake
- Center for Data Science, New York University; Department of Psychology, New York University
6
Stojnić G, Gandhi K, Yasuda S, Lake BM, Dillon MR. Commonsense psychology in human infants and machines. Cognition 2023; 235:105406. [PMID: 36801603 DOI: 10.1016/j.cognition.2023.105406] [Received: 09/08/2022] [Revised: 02/08/2023] [Accepted: 02/09/2023]
Abstract
Human infants are fascinated by other people. They bring to this fascination a constellation of rich and flexible expectations about the intentions motivating people's actions. Here we test 11-month-old infants and state-of-the-art learning-driven neural-network models on the "Baby Intuitions Benchmark (BIB)," a suite of tasks challenging both infants and machines to make high-level predictions about the underlying causes of agents' actions. Infants expected agents' actions to be directed towards objects, not locations, and infants demonstrated default expectations about agents' rationally efficient actions towards goals. The neural-network models failed to capture infants' knowledge. Our work provides a comprehensive framework in which to characterize infants' commonsense psychology and takes the first step in testing whether human knowledge and human-like artificial intelligence can be built from the foundations that cognitive and developmental theories postulate.
Affiliation(s)
- Gala Stojnić
- Department of Psychology, New York University, New York, NY, USA
- Kanishk Gandhi
- Department of Computer Science, Stanford University, Palo Alto, CA, USA
- Shannon Yasuda
- Department of Psychology, New York University, New York, NY, USA
- Brenden M Lake
- Department of Psychology, New York University, New York, NY, USA; Center for Data Science, New York University, New York, NY, USA
- Moira R Dillon
- Department of Psychology, New York University, New York, NY, USA
7
Vong WK, Lake BM. Cross-Situational Word Learning With Multimodal Neural Networks. Cogn Sci 2022; 46:e13122. [PMID: 35377475 DOI: 10.1111/cogs.13122] [Received: 06/17/2021] [Revised: 12/02/2021] [Accepted: 01/21/2022]
Abstract
To learn the mappings from words to referents, children must integrate co-occurrence information across individually ambiguous pairs of scenes and utterances, a challenge known as cross-situational word learning. In machine learning, recent multimodal neural networks have been shown to learn meaningful visual-linguistic mappings from cross-situational data, as needed to solve problems such as image captioning and visual question answering. These networks are potentially appealing as cognitive models because they can learn from raw visual and linguistic stimuli, something previous cognitive models have not addressed. In this paper, we examine whether recent machine learning approaches can help explain various behavioral phenomena from the psychological literature on cross-situational word learning. We consider two variants of a multimodal neural network architecture and look at seven different phenomena associated with cross-situational word learning and word learning more generally. Our results show that these networks can learn word-referent mappings from a single epoch of training, mimicking the amount of training commonly found in cross-situational word learning experiments. Additionally, these networks capture some, but not all, of the phenomena we studied, with all of the failures related to reasoning via mutual exclusivity. These results provide insight into the kinds of phenomena that arise naturally from relatively generic neural network learning algorithms, and into which word learning phenomena require additional inductive biases.
Affiliation(s)
- Brenden M Lake
- Center for Data Science, New York University; Department of Psychology, New York University
8
Abstract
Machines have achieved a broad and growing set of linguistic competencies, thanks to recent progress in Natural Language Processing (NLP). Psychologists have shown increasing interest in such models, comparing their output to psychological judgments such as similarity, association, priming, and comprehension, raising the question of whether the models could serve as psychological theories. In this article, we compare how humans and machines represent the meaning of words. We argue that contemporary NLP systems are fairly successful models of human word similarity, but they fall short in many other respects. Current models are too strongly linked to the text-based patterns in large corpora, and too weakly linked to the desires, goals, and beliefs that people express through words. Word meanings must also be grounded in perception and action and be capable of flexible combination in ways that current systems are not. We discuss promising approaches to grounding NLP systems and argue that doing so will make them more successful, with a more human-like, conceptual basis for word meaning.
9
Lewis M, Cristiano V, Lake BM, Kwan T, Frank MC. The role of developmental change and linguistic experience in the mutual exclusivity effect. Cognition 2020; 198:104191. [PMID: 32143015 DOI: 10.1016/j.cognition.2020.104191] [Received: 01/30/2019] [Revised: 01/06/2020] [Accepted: 01/14/2020]
Abstract
Given a novel word and a familiar and a novel referent, children have a bias to assume the novel word refers to the novel referent. This bias, often referred to as "Mutual Exclusivity" (ME), is thought to be a potentially powerful route through which children might learn new word meanings, and, consequently, has been the focus of a large amount of empirical study and theorizing. Here, we focus on two aspects of the bias that have received relatively little attention in the literature: development and experience. A successful theory of ME will need to provide an account for why the strength of the effect changes with the age of the child. We provide a quantitative description of the change in the strength of the bias across development, and investigate the role that linguistic experience plays in this developmental change. We first summarize the current body of empirical findings via a meta-analysis, and then present two experiments that examine the relationship between a child's amount of linguistic experience and the strength of the ME bias. We conclude that the strength of the bias varies dramatically across development and that linguistic experience is likely one causal factor contributing to this change. In the General Discussion, we describe how existing theories of ME can account for our findings, and highlight the value of computational modeling for future theorizing.
Affiliation(s)
- Molly Lewis
- Carnegie Mellon University, United States of America
- Brenden M Lake
- New York University, United States of America; Cognitive ToyBox, Inc., United States of America
- Tammy Kwan
- New York University, United States of America; Cognitive ToyBox, Inc., United States of America
11
Lake BM, Lawrence ND, Tenenbaum JB. The Emergence of Organizing Structure in Conceptual Representation. Cogn Sci 2018; 42 Suppl 3:809-832. [PMID: 29315735 DOI: 10.1111/cogs.12580] [Received: 11/28/2016] [Revised: 09/20/2017] [Accepted: 11/06/2017]
Abstract
Both scientists and children make important structural discoveries, yet their computational underpinnings are not well understood. Structure discovery has previously been formalized as probabilistic inference about the right structural form, where the form could be a tree, ring, chain, grid, etc. (Kemp & Tenenbaum, 2008). Although this approach can learn intuitive organizations, including a tree for animals and a ring for the color circle, it assumes a strong inductive bias that considers only these particular forms, and each form is explicitly provided as initial knowledge. Here we introduce a new computational model of how organizing structure can be discovered, utilizing a broad hypothesis space with a preference for sparse connectivity. Given that the inductive bias is more general, the model's initial knowledge shows little qualitative resemblance to some of the discoveries it supports. As a consequence, the model can also learn complex structures for domains that lack intuitive description, as well as predict human property induction judgments without explicit structural forms. By allowing form to emerge from sparsity, our approach clarifies how both the richness and flexibility of human conceptual organization can coexist.
Affiliation(s)
- Brenden M Lake
- Center for Data Science, New York University; Department of Psychology, New York University
- Joshua B Tenenbaum
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology; Center for Brains, Minds and Machines