1
Vong WK, Wang W, Orhan AE, Lake BM. Grounded language acquisition through the eyes and ears of a single child. Science 2024; 383:504-511. [PMID: 38300999 DOI: 10.1126/science.adi1374]
Abstract
Starting around 6 to 9 months of age, children begin acquiring their first words, linking spoken words to their visual counterparts. How much of this knowledge is learnable from sensory input with relatively generic learning mechanisms, and how much requires stronger inductive biases? Using longitudinal head-mounted camera recordings from one child aged 6 to 25 months, we trained a relatively generic neural network on 61 hours of correlated visual-linguistic data streams, learning feature-based representations and cross-modal associations. Our model acquires many word-referent mappings present in the child's everyday experience, enables zero-shot generalization to new visual referents, and aligns its visual and linguistic conceptual systems. These results show how critical aspects of grounded word meaning are learnable through joint representation and associative learning from one child's input.
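The cross-modal association idea in this abstract can be illustrated with a toy contrastive objective: paired image and word embeddings are pulled together while in-batch mismatches serve as negatives. This is a hypothetical, minimal sketch, not the paper's actual model, which trains neural encoders on raw video frames and transcribed speech.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(image_embs, word_embs, temperature=0.1):
    """InfoNCE-style loss: each image should match its paired word
    against all other words in the batch (in-batch negatives)."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        sims = [cosine(image_embs[i], word_embs[j]) / temperature for j in range(n)]
        m = max(sims)  # max-shift for numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        total += -(sims[i] - log_denom)
    return total / n

# Correctly aligned image-word pairs yield a lower loss than shuffled pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
assert contrastive_loss(aligned, aligned) < contrastive_loss(aligned, shuffled)
```

Minimizing such a loss over many co-occurring frames and utterances is one generic mechanism by which word-referent mappings could emerge without word-specific inductive biases.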
Affiliation(s)
- Wai Keen Vong
- Center for Data Science, New York University, New York, NY, USA
- Wentao Wang
- Center for Data Science, New York University, New York, NY, USA
- A Emin Orhan
- Center for Data Science, New York University, New York, NY, USA
- Brenden M Lake
- Center for Data Science, New York University, New York, NY, USA
- Department of Psychology, New York University, New York, NY, USA
2
Wang W, Vong WK, Kim N, Lake BM. Finding Structure in One Child's Linguistic Experience. Cogn Sci 2023; 47:e13305. [PMID: 37358026 DOI: 10.1111/cogs.13305]
Abstract
Neural network models have recently made striking progress in natural language processing, but they are typically trained on orders of magnitude more language input than children receive. What can these neural networks, which are primarily distributional learners, learn from a naturalistic subset of a single child's experience? We examine this question using a recent longitudinal dataset collected from a single child, consisting of egocentric visual data paired with text transcripts. We train both language-only and vision-and-language neural networks and analyze the linguistic knowledge they acquire. In parallel with findings from Jeffrey Elman's seminal work, the neural networks form emergent clusters of words corresponding to syntactic (nouns, transitive and intransitive verbs) and semantic categories (e.g., animals and clothing), based solely on one child's linguistic input. The networks also acquire sensitivity to acceptability contrasts from linguistic phenomena, such as determiner-noun agreement and argument structure. We find that incorporating visual information produces an incremental gain in predicting words in context, especially for syntactic categories that are comparatively more easily grounded, such as nouns and verbs, but the underlying linguistic representations are not fundamentally altered. Our findings demonstrate which kinds of linguistic knowledge are learnable from a snapshot of a single child's real developmental experience.
Affiliation(s)
- Wentao Wang
- Center for Data Science, New York University
- Najoung Kim
- Center for Data Science, New York University
- Department of Linguistics, Boston University
- Brenden M Lake
- Center for Data Science, New York University
- Department of Psychology, New York University
3
Vong WK, Lake BM. Cross-Situational Word Learning With Multimodal Neural Networks. Cogn Sci 2022; 46:e13122. [PMID: 35377475 DOI: 10.1111/cogs.13122]
Abstract
In order to learn the mappings from words to referents, children must integrate co-occurrence information across individually ambiguous pairs of scenes and utterances, a challenge known as cross-situational word learning. In machine learning, recent multimodal neural networks have been shown to learn meaningful visual-linguistic mappings from cross-situational data, as needed to solve problems such as image captioning and visual question answering. These networks are potentially appealing as cognitive models because they can learn from raw visual and linguistic stimuli, something previous cognitive models have not addressed. In this paper, we examine whether recent machine learning approaches can help explain various behavioral phenomena from the psychological literature on cross-situational word learning. We consider two variants of a multimodal neural network architecture and look at seven different phenomena associated with cross-situational word learning and word learning more generally. Our results show that these networks can learn word-referent mappings from a single epoch of training, mimicking the amount of training commonly found in cross-situational word learning experiments. Additionally, these networks capture some, but not all of the phenomena we studied, with all of the failures related to reasoning via mutual exclusivity. These results provide insight into the kinds of phenomena that arise naturally from relatively generic neural network learning algorithms, and which word learning phenomena require additional inductive biases.
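The cross-situational learning problem described here can be made concrete with a toy associative tally: each trial pairs an ambiguous utterance with an ambiguous scene, and aggregating co-occurrences across trials resolves word-referent mappings. This is a hypothetical count-based sketch for illustration only; the paper's models are multimodal neural networks trained on raw stimuli, not counts.

```python
from collections import defaultdict

def learn(trials):
    """Tally word-referent co-occurrences across trials, then map each
    word to the referent it co-occurred with most often."""
    counts = defaultdict(lambda: defaultdict(int))
    for words, referents in trials:
        for w in words:
            for r in referents:
                counts[w][r] += 1
    return {w: max(refs, key=refs.get) for w, refs in counts.items()}

trials = [
    ({"ball", "dog"}, {"BALL", "DOG"}),  # any single trial is ambiguous,
    ({"ball", "cup"}, {"BALL", "CUP"}),  # but overlap across trials
    ({"dog", "cup"}, {"DOG", "CUP"}),    # disambiguates each word
]
mapping = learn(trials)
assert mapping == {"ball": "BALL", "dog": "DOG", "cup": "CUP"}
```

The behavioral phenomena the paper examines, such as mutual exclusivity, are exactly those that a bare associative scheme like this does not capture without additional inductive biases.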
Affiliation(s)
- Brenden M Lake
- Center for Data Science, New York University
- Department of Psychology, New York University
4
Bass I, Bonawitz E, Hawthorne-Madell D, Vong WK, Goodman ND, Gweon H. The effects of information utility and teachers' knowledge on evaluations of under-informative pedagogy across development. Cognition 2022; 222:104999. [PMID: 35032868 DOI: 10.1016/j.cognition.2021.104999]
Abstract
Teaching is a powerful way to transmit knowledge, but with this power comes a hazard: When teachers fail to select the best set of evidence for the learner, learners can be misled to draw inaccurate inferences. Evaluating others' failures as teachers, however, is a nontrivial problem; people may fail to be informative for different reasons, and not all failures are equally blameworthy. How do learners evaluate the quality of teachers, and what factors influence such evaluations? Here, we present a Bayesian model of teacher evaluation that considers the utility of a teacher's pedagogical sampling given their prior knowledge. In Experiment 1 (N=1168), we test the model predictions against adults' evaluations of a teacher who demonstrated all or a subset of the functions on a novel device. Consistent with the model predictions, participants' ratings integrated information about the number of functions taught, their values, as well as how much the teacher knew. Using a modified paradigm for children, Experiments 2 (N=48) and 3 (N=40) found that preschool-aged children (2a, 3) and adults (2b) make nuanced judgments of teacher quality that are well predicted by the model. However, after an unsuccessful attempt to replicate the results with preschoolers (Experiment 4, N=24), in Experiment 5 (N=24) we further investigate the development of teacher evaluation in a sample of seven- and eight-year-olds. These older children successfully distinguished teachers based on the amount and value of what was demonstrated, and their ability to evaluate omissions relative to the teacher's knowledge state was related to their tendency to spontaneously reference the teacher's knowledge when explaining their evaluations. In sum, our work illustrates how the human ability to learn from others supports not just learning about the world but also learning about the teachers themselves. 
By reasoning about others' informativeness, learners can evaluate others' teaching and make better learning decisions.
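The core intuition of the model described above can be sketched as a utility ratio: a teacher is judged by how much of the value they knew about they actually demonstrated. This toy function is a hypothetical simplification (function names and values are invented); the paper's model is a full Bayesian analysis of pedagogical sampling.

```python
def teacher_score(taught, known, values):
    """Fraction of the teacher's known value that was demonstrated.
    Omitting a high-value function the teacher knew is penalized most."""
    value_known = sum(values[f] for f in known)
    value_taught = sum(values[f] for f in taught if f in known)
    return value_taught / value_known if value_known else 1.0

values = {"light": 3, "music": 2, "spin": 1}
# A teacher who knew all three functions but omitted the high-value
# "light" scores worse than one who only knew (and taught) the other two.
assert teacher_score({"music", "spin"}, {"light", "music", "spin"}, values) \
     < teacher_score({"music", "spin"}, {"music", "spin"}, values)
```

Conditioning the score on the teacher's knowledge state captures why identical demonstrations can warrant different evaluations, the key manipulation in the experiments.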
Affiliation(s)
- Ilona Bass
- Department of Psychology, Harvard University, Cambridge, MA 02138, United States
- Elizabeth Bonawitz
- Graduate School of Education, Harvard University, Cambridge, MA 02138, United States
- Wai Keen Vong
- Center for Data Science, New York University, New York, NY 10011, United States
- Noah D Goodman
- Department of Psychology, Stanford University, Stanford, CA 94305, United States
- Hyowon Gweon
- Department of Psychology, Stanford University, Stanford, CA 94305, United States
5
Yang SCH, Vong WK, Sojitra RB, Folke T, Shafto P. Mitigating belief projection in explainable artificial intelligence via Bayesian teaching. Sci Rep 2021; 11:9863. [PMID: 33972625 PMCID: PMC8110978 DOI: 10.1038/s41598-021-89267-4]
Abstract
State-of-the-art deep-learning systems use decision rules that are challenging for humans to model. Explainable AI (XAI) attempts to improve human understanding but rarely accounts for how people typically reason about unfamiliar agents. We propose explicitly modelling the human explainee via Bayesian teaching, which evaluates explanations by how much they shift explainees' inferences toward a desired goal. We assess Bayesian teaching in a binary image classification task across a variety of contexts. Absent intervention, participants predict that the AI's classifications will match their own, but explanations generated by Bayesian teaching improve their ability to predict the AI's judgements by moving them away from this prior belief. Bayesian teaching further allows each case to be broken down into sub-examples (here saliency maps). These sub-examples complement whole examples by improving error detection for familiar categories, whereas whole examples help predict correct AI judgements of unfamiliar cases.
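The selection principle behind Bayesian teaching can be illustrated with a toy Bayesian learner: among candidate examples, choose the one that moves the learner's posterior furthest toward the target belief. This is a hypothetical two-hypothesis sketch (the labels and likelihoods are invented); the paper applies the idea to image classifiers and saliency maps.

```python
def posterior_after(prior, likelihoods, example):
    # Bayes' rule: posterior over hypotheses after observing one example.
    unnorm = {h: prior[h] * likelihoods[h][example] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def best_example(prior, likelihoods, examples, target_h):
    # Pick the example that maximizes the learner's posterior on the target.
    return max(examples,
               key=lambda e: posterior_after(prior, likelihoods, e)[target_h])

prior = {"cat": 0.5, "dog": 0.5}
likelihoods = {"cat": {"whiskers": 0.9, "fur": 0.5},
               "dog": {"whiskers": 0.1, "fur": 0.5}}
# "whiskers" shifts the learner toward "cat"; "fur" is uninformative.
assert best_example(prior, likelihoods, ["whiskers", "fur"], "cat") == "whiskers"
```

Evaluating explanations by the belief shift they induce, rather than by fidelity to the model alone, is what lets the method counteract the explainee's prior assumption that the AI reasons as they do.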
Affiliation(s)
- Scott Cheng-Hsin Yang
- Department of Mathematics and Computer Science, Rutgers University, 101 Warren Street, Newark, NJ 07102, USA
- Wai Keen Vong
- Center for Data Science, New York University, 60 5th Ave, New York, NY 10011, USA
- Ravi B Sojitra
- Department of Management Science and Engineering, Stanford University, Stanford, USA
- Tomas Folke
- Department of Mathematics and Computer Science, Rutgers University, 101 Warren Street, Newark, NJ 07102, USA
- Patrick Shafto
- Department of Mathematics and Computer Science, Rutgers University, 101 Warren Street, Newark, NJ 07102, USA
6
Vong WK, Hendrickson AT, Navarro DJ, Perfors A. Do Additional Features Help or Hurt Category Learning? The Curse of Dimensionality in Human Learners. Cogn Sci 2019; 43:e12724. [PMID: 30900291 DOI: 10.1111/cogs.12724]
Abstract
The curse of dimensionality, which has been widely studied in statistics and machine learning, occurs when additional features cause the size of the feature space to grow so quickly that learning classification rules becomes increasingly difficult. How do people overcome the curse of dimensionality when acquiring real-world categories that have many different features? Here we investigate the possibility that the structure of categories can help. We show that when categories follow a family resemblance structure, people are unaffected by the presence of additional features in learning. However, when categories are based on a single feature, they fall prey to the curse, and having additional irrelevant features hurts performance. We compare and contrast these results to three different computational models to show that a model with limited computational capacity best captures human performance across almost all of the conditions in both experiments.
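The dilution effect described in this abstract can be illustrated with a toy similarity-based learner: when only one feature is diagnostic, adding irrelevant features shrinks the similarity signal to the category prototype. This is a hypothetical sketch (the stimuli are invented), not the models compared in the paper.

```python
def prototype_similarity(item, prototype):
    # Proportion of feature values shared with the category prototype.
    matches = sum(a == b for a, b in zip(item, prototype))
    return matches / len(item)

prototype = [1, 1, 1, 1]
# Only position 0 is diagnostic; the remaining features are irrelevant.
item = [1, 0, 0, 0]
sim_4d = prototype_similarity(item, prototype)                      # 1 match in 4
sim_8d = prototype_similarity(item + [0] * 4, prototype + [1] * 4)  # 1 match in 8
assert sim_8d < sim_4d  # the diagnostic signal is diluted in higher dimensions
```

Under a family-resemblance structure, by contrast, most added features also carry category information, so the similarity signal is preserved as dimensionality grows, matching the behavioral dissociation reported here.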
Affiliation(s)
- Wai Keen Vong
- Department of Mathematics and Computer Science, Rutgers University-Newark
- Amy Perfors
- School of Psychological Sciences, University of Melbourne
7
Yang SC, Vong WK, Yu Y, Shafto P. A Unifying Computational Framework for Teaching and Active Learning. Top Cogn Sci 2019; 11:316-337. [DOI: 10.1111/tops.12405]
Affiliation(s)
- Wai Keen Vong
- Department of Mathematics & Computer Science, Rutgers University-Newark
- Yue Yu
- Centre for Research in Child Development, National Institute of Education, Singapore
- Patrick Shafto
- Department of Mathematics & Computer Science, Rutgers University-Newark