1. Kelley MC, Perry SJ, Tucker BV. The Mason-Alberta Phonetic Segmenter: a forced alignment system based on deep neural networks and interpolation. Phonetica 2024:phon-2024-0015. [PMID: 39248125] [DOI: 10.1515/phon-2024-0015]
Abstract
Given an orthographic transcription, forced alignment systems automatically determine boundaries between segments in speech, facilitating the use of large corpora. In the present paper, we introduce a neural network-based forced alignment system, the Mason-Alberta Phonetic Segmenter (MAPS). MAPS serves as a testbed for two possible improvements we pursue for forced alignment systems. The first is treating the acoustic model as a tagger, rather than a classifier, motivated by the common understanding that segments are not truly discrete and often overlap. The second is an interpolation technique to allow more precise boundaries than the typical 10 ms limit in modern systems. During testing, all system configurations we trained significantly outperformed the state-of-the-art Montreal Forced Aligner in the 10 ms boundary placement tolerance threshold. The greatest difference achieved was a 28.13 % relative performance increase. The Montreal Forced Aligner began to slightly outperform our models at around a 30 ms tolerance. We also reflect on the training process for acoustic modeling in forced alignment, highlighting how the output targets for these models do not match phoneticians' conception of similarity between phones and that reconciling this tension may require rethinking the task and output targets or how speech itself should be segmented.
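The interpolation idea described in this abstract can be sketched as follows: with frame-level scores at a 10 ms hop, a boundary can be placed between frames by linearly interpolating to the point where the left segment's score crosses the right segment's. This is a hypothetical illustration of the general technique, not the MAPS implementation; the function name and scoring scheme are invented for the example.

```python
import numpy as np

def interpolated_boundary(post_a, post_b, hop_s=0.010):
    """Estimate a segment boundary at sub-frame precision.

    post_a, post_b: per-frame posterior-like scores for the left and
    right segments (one value per 10 ms frame). The boundary is placed
    where the two score curves cross, found by linear interpolation
    between the frames on either side of the sign change.
    """
    diff = np.asarray(post_a, dtype=float) - np.asarray(post_b, dtype=float)
    # last frame where the left segment still dominates before the right takes over
    sign_change = np.where((diff[:-1] > 0) & (diff[1:] <= 0))[0]
    if len(sign_change) == 0:
        return None  # no crossing found
    i = int(sign_change[0])
    # fraction of a frame past i where the score difference hits zero
    frac = diff[i] / (diff[i] - diff[i + 1])
    return (i + frac) * hop_s
```

With scores crossing a quarter of the way between frames 2 and 3, the boundary lands at 22.5 ms rather than snapping to a 10 ms grid point.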
Affiliation(s)
- Matthew C Kelley
- Department of English, Linguistics Program, George Mason University, Fairfax, VA, USA
- Scott James Perry
- Department of Linguistics, University of Alberta, Edmonton, AB, Canada
- Benjamin V Tucker
- Department of Linguistics, University of Alberta, Edmonton, AB, Canada
- Department of Communication Sciences and Disorders, Northern Arizona University, Flagstaff, AZ, USA
2. Shahmohammadi H, Heitmeier M, Shafaei-Bajestan E, Lensch HPA, Baayen RH. Language with vision: A study on grounded word and sentence embeddings. Behav Res Methods 2024; 56:5622-5646. [PMID: 38114881] [PMCID: PMC11335852] [DOI: 10.3758/s13428-023-02294-z]
Abstract
Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations by incorporating perceptual knowledge from vision into text-based representations. Despite many attempts at language grounding, achieving an optimal equilibrium between textual representations of the language and our embodied experiences remains an open problem. Some common concerns are the following. Is visual grounding advantageous for abstract words, or is its effectiveness restricted to concrete words? What is the optimal way of bridging the gap between text and vision? To what extent is perceptual knowledge from images advantageous for acquiring high-quality embeddings? Leveraging the current advances in machine learning and natural language processing, the present study addresses these questions by proposing a simple yet very effective computational grounding model for pre-trained word embeddings. Our model effectively balances the interplay between language and vision by aligning textual embeddings with visual information while simultaneously preserving the distributional statistics that characterize word usage in text corpora. By applying a learned alignment, we are able to indirectly ground unseen words, including abstract words. A series of evaluations on a range of behavioral datasets shows that visual grounding is beneficial not only for concrete words but also for abstract words, lending support to the indirect theory of abstract concepts. Moreover, our approach offers advantages for contextualized embeddings, such as those generated by BERT (Devlin et al., 2018), but only when trained on corpora of modest, cognitively plausible sizes. Code and grounded embeddings for English are available at https://github.com/Hazel1994/Visually_Grounded_Word_Embeddings_2.
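The "learned alignment" strategy described in this abstract, fitting a mapping between textual and visual vectors on words that have images and then applying it to words that do not, can be sketched with synthetic data. This is a toy least-squares sketch under invented dimensions; the authors' actual model additionally preserves distributional statistics through further constraints.

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_img, n_seen = 8, 4, 100

# synthetic stand-ins: text embeddings of "seen" words and their paired
# visual vectors (here generated by a hidden linear map for illustration)
T = rng.normal(size=(n_seen, d_text))
true_map = rng.normal(size=(d_text, d_img))
V = T @ true_map

# learn the text-to-vision alignment on seen words by least squares
M, *_ = np.linalg.lstsq(T, V, rcond=None)

# indirectly "ground" an unseen (e.g., abstract) word: project its
# text embedding through the learned alignment
t_unseen = rng.normal(size=d_text)
v_grounded = t_unseen @ M
```

Because the alignment is learned once over the seen vocabulary, any word with a text embedding can be projected into visual space, which is how grounding extends to words that never co-occur with images.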
3. Hsiao JHW. Understanding Human Cognition Through Computational Modeling. Top Cogn Sci 2024; 16:349-376. [PMID: 38781432] [DOI: 10.1111/tops.12737]
Abstract
One important goal of cognitive science is to understand the mind in terms of its representational and computational capacities, where computational modeling plays an essential role in providing theoretical explanations and predictions of human behavior and mental phenomena. In my research, I have been using computational modeling, together with behavioral experiments and cognitive neuroscience methods, to investigate the information processing mechanisms underlying learning and visual cognition in terms of perceptual representation and attention strategy. In perceptual representation, I have used neural network models to understand how the split architecture in the human visual system influences visual cognition, and to examine perceptual representation development as a result of expertise. In attention strategy, I have developed the Eye Movement analysis with Hidden Markov Models method for quantifying eye movement patterns and consistency using both spatial and temporal information, which has led to novel findings across disciplines not discoverable using traditional methods. By integrating it with deep neural networks (DNN), I have developed DNN+HMM to account for eye movement strategy learning in human visual cognition. The understanding of the human mind through computational modeling also facilitates research on artificial intelligence's (AI) comparability with human cognition, which can in turn help explainable AI systems infer humans' beliefs about AI's operations and provide human-centered explanations to enhance human-AI interaction and mutual understanding. Together, these demonstrate the essential role of computational modeling methods in providing theoretical accounts of the human mind as well as its interaction with its environment and AI systems.
4. Fitz H, Hagoort P, Petersson KM. Neurobiological Causal Models of Language Processing. Neurobiology of Language 2024; 5:225-247. [PMID: 38645618] [PMCID: PMC11025648] [DOI: 10.1162/nol_a_00133]
Abstract
The language faculty is physically realized in the neurobiological infrastructure of the human brain. Despite significant efforts, an integrated understanding of this system remains a formidable challenge. What is missing from most theoretical accounts is a specification of the neural mechanisms that implement language function. Computational models that have been put forward generally lack an explicit neurobiological foundation. We propose a neurobiologically informed causal modeling approach which offers a framework for how to bridge this gap. A neurobiological causal model is a mechanistic description of language processing that is grounded in, and constrained by, the characteristics of the neurobiological substrate. It intends to model the generators of language behavior at the level of implementational causality. We describe key features and neurobiological component parts from which causal models can be built and provide guidelines on how to implement them in model simulations. Then we outline how this approach can shed new light on the core computational machinery for language, the long-term storage of words in the mental lexicon and combinatorial processing in sentence comprehension. In contrast to cognitive theories of behavior, causal models are formulated in the "machine language" of neurobiology which is universal to human cognition. We argue that neurobiological causal modeling should be pursued in addition to existing approaches. Eventually, this approach will allow us to develop an explicit computational neurobiology of language.
Affiliation(s)
- Hartmut Fitz
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
- Neurobiology of Language Department, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Peter Hagoort
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
- Neurobiology of Language Department, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Karl Magnus Petersson
- Neurobiology of Language Department, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Faculty of Medicine and Biomedical Sciences, University of Algarve, Faro, Portugal
5. Luthra S. Why are listeners hindered by talker variability? Psychon Bull Rev 2024; 31:104-121. [PMID: 37580454] [PMCID: PMC10864679] [DOI: 10.3758/s13423-023-02355-6]
Abstract
Though listeners readily recognize speech from a variety of talkers, accommodating talker variability comes at a cost: Myriad studies have shown that listeners are slower to recognize a spoken word when there is talker variability compared with when talker is held constant. This review focuses on two possible theoretical mechanisms for the emergence of these processing penalties. One view is that multitalker processing costs arise through a resource-demanding talker accommodation process, wherein listeners compare sensory representations against hypothesized perceptual candidates and error signals are used to adjust the acoustic-to-phonetic mapping (an active control process known as contextual tuning). An alternative proposal is that these processing costs arise because talker changes involve salient stimulus-level discontinuities that disrupt auditory attention. Some recent data suggest that multitalker processing costs may be driven by both mechanisms operating over different time scales. Fully evaluating this claim requires a foundational understanding of both talker accommodation and auditory streaming; this article provides a primer on each literature and also reviews several studies that have observed multitalker processing costs. The review closes by underscoring a need for comprehensive theories of speech perception that better integrate auditory attention and by highlighting important considerations for future research in this area.
Affiliation(s)
- Sahil Luthra
- Department of Psychology, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, 15213, USA.
6. Magnuson JS, Crinnion AM, Luthra S, Gaston P, Grubb S. Contra assertions, feedback improves word recognition: How feedback and lateral inhibition sharpen signals over noise. Cognition 2024; 242:105661. [PMID: 37944313] [PMCID: PMC11238470] [DOI: 10.1016/j.cognition.2023.105661]
Abstract
Whether top-down feedback modulates perception has deep implications for cognitive theories. Debate has been vigorous in the domain of spoken word recognition, where competing computational models and agreement on at least one diagnostic experimental paradigm suggest that the debate may eventually be resolvable. Norris and Cutler (2021) revisit arguments against lexical feedback in spoken word recognition models. They also incorrectly claim that recent computational demonstrations that feedback promotes accuracy and speed under noise (Magnuson et al., 2018) were due to the use of the Luce choice rule rather than adding noise to inputs (noise was in fact added directly to inputs). They also claim that feedback cannot improve word recognition because feedback cannot distinguish signal from noise. We have two goals in this paper. First, we correct the record about the simulations of Magnuson et al. (2018). Second, we explain how interactive activation models selectively sharpen signals via joint effects of feedback and lateral inhibition that boost lexically-coherent sublexical patterns over noise. We also review a growing body of behavioral and neural results consistent with feedback and inconsistent with autonomous (non-feedback) architectures, and conclude that parsimony supports feedback. We close by discussing the potential for synergy between autonomous and interactive approaches.
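The mechanism this abstract argues for, lexical feedback combined with lateral inhibition boosting lexically coherent sublexical patterns over noise, can be caricatured in a tiny interactive-activation loop. This is a deliberately minimal sketch with invented units and parameters, not TRACE or the Magnuson et al. simulations:

```python
import numpy as np

# Two word units share the sublexical units /k/ and /ae/ and differ in
# their final unit: "cat" -> /k, ae, t/ and "cap" -> /k, ae, p/.
W = np.array([[1.0, 1.0, 1.0, 0.0],   # cat
              [1.0, 1.0, 0.0, 1.0]])  # cap

def run(inp, fb, steps=300, inhib=0.5, rate=0.1):
    """Relax a toy interactive-activation network on a fixed input.

    fb scales word-to-sublexical feedback; inhib scales lateral
    inhibition between word units.
    """
    phon = np.zeros(4)
    words = np.zeros(2)
    for _ in range(steps):
        # sublexical units track bottom-up input plus lexical feedback
        phon += rate * (inp + fb * (W.T @ words) - phon)
        # word units get bottom-up support minus lateral inhibition
        net = W @ phon - inhib * (words.sum() - words)
        words += rate * (net - words)
        words = words.clip(0.0, 1.0)
    return words

# degraded input: the final /t/ is only weakly favored over /p/
noisy = np.array([0.3, 0.3, 0.14, 0.10])
with_fb = run(noisy, fb=0.1)
no_fb = run(noisy, fb=0.0)
```

With feedback on, the more active word reinforces exactly the sublexical pattern that supports it, and inhibition converts that extra support into a larger separation between the competitors than bottom-up evidence alone produces.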
Affiliation(s)
- James S Magnuson
- University of Connecticut, Storrs, CT, USA; BCBL, Basque Center on Cognition, Brain and Language, Donostia-San Sebastián, Spain; Ikerbasque, Basque Foundation for Science, Bilbao, Spain
7. Xie X, Jaeger TF, Kurumada C. What we do (not) know about the mechanisms underlying adaptive speech perception: A computational framework and review. Cortex 2023; 166:377-424. [PMID: 37506665] [DOI: 10.1016/j.cortex.2023.05.003]
Abstract
Speech from unfamiliar talkers can be difficult to comprehend initially. These difficulties tend to dissipate with exposure, sometimes within minutes or less. Adaptivity in response to unfamiliar input is now considered a fundamental property of speech perception, and research over the past two decades has made substantial progress in identifying its characteristics. The mechanisms underlying adaptive speech perception, however, remain unknown. Past work has attributed facilitatory effects of exposure to any one of three qualitatively different hypothesized mechanisms: (1) low-level, pre-linguistic, signal normalization, (2) changes in/selection of linguistic representations, or (3) changes in post-perceptual decision-making. Direct comparisons of these hypotheses, or combinations thereof, have been lacking. We describe a general computational framework for adaptive speech perception (ASP) that, for the first time, implements all three mechanisms. We demonstrate how the framework can be used to derive predictions for experiments on perception from the acoustic properties of the stimuli. Using this approach, we find that, at the level of data analysis presently employed by most studies in the field, the signature results of influential experimental paradigms do not distinguish between the three mechanisms. This highlights the need for a change in research practices, so that future experiments provide more informative results. We recommend specific changes to experimental paradigms and data analysis. All data and code for this study are shared via OSF, including the R markdown document that this article is generated from, and an R library that implements the models we present.
Affiliation(s)
- Xin Xie
- Language Science, University of California, Irvine, USA.
- T Florian Jaeger
- Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA; Computer Science, University of Rochester, Rochester, NY, USA
- Chigusa Kurumada
- Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA
8. Fernandez-Duque M, Hayakawa S, Marian V. Speakers of different languages remember visual scenes differently. Science Advances 2023; 9:eadh0064. [PMID: 37585537] [PMCID: PMC10431704] [DOI: 10.1126/sciadv.adh0064]
Abstract
Language can have a powerful effect on how people experience events. Here, we examine how the languages people speak guide attention and influence what they remember from a visual scene. When hearing a word, listeners activate other similar-sounding words before settling on the correct target. We tested whether this linguistic coactivation during a visual search task changes memory for objects. Bilinguals and monolinguals remembered English competitor words that overlapped phonologically with a spoken English target better than control objects without name overlap. High Spanish proficiency also enhanced memory for Spanish competitors that overlapped across languages. We conclude that linguistic diversity partly accounts for differences in higher cognitive functions such as memory, with multilinguals providing a fertile ground for studying the interaction between language and cognition.
Affiliation(s)
- Matias Fernandez-Duque
- Department of Communication Sciences and Disorders, Northwestern University, Evanston, IL 60208, USA
- Sayuri Hayakawa
- Department of Communication Sciences and Disorders, Northwestern University, Evanston, IL 60208, USA
- Department of Psychology, Oklahoma State University, Stillwater, OK 74078, USA
- Viorica Marian
- Department of Communication Sciences and Disorders, Northwestern University, Evanston, IL 60208, USA
9. Hintz F, Voeten CC, Scharenborg O. Recognizing non-native spoken words in background noise increases interference from the native language. Psychon Bull Rev 2023; 30:1549-1563. [PMID: 36544064] [PMCID: PMC10482792] [DOI: 10.3758/s13423-022-02233-7]
Abstract
Listeners frequently recognize spoken words in the presence of background noise. Previous research has shown that noise reduces phoneme intelligibility and hampers spoken-word recognition, especially for non-native listeners. In the present study, we investigated how noise influences lexical competition in both the non-native and the native language, reflecting the degree to which both languages are co-activated. We recorded the eye movements of native Dutch participants as they listened to English sentences containing a target word while looking at displays containing four objects. On target-present trials, the visual referent depicting the target word was present, along with three unrelated distractors. On target-absent trials, the target object (e.g., wizard) was absent. Instead, the display contained an English competitor, overlapping with the English target in phonological onset (e.g., window), a Dutch competitor, overlapping with the English target in phonological onset (e.g., wimpel, pennant), and two unrelated distractors. Half of the sentences were masked by speech-shaped noise; the other half were presented in quiet. Compared to speech in quiet, noise delayed fixations to the target objects on target-present trials. For target-absent trials, we observed that the likelihood of fixation biases towards the English and Dutch onset competitors (over the unrelated distractors) was larger in noise than in quiet. Our data thus show that the presence of background noise increases lexical competition in the task-relevant non-native (English) and in the task-irrelevant native (Dutch) language. The latter reflects stronger interference of one's native language during non-native spoken-word recognition under adverse conditions.
Affiliation(s)
- Florian Hintz
- Max Planck Institute for Psycholinguistics, P.O. Box 310, 6500 AH, Nijmegen, The Netherlands.
- Odette Scharenborg
- Multimedia Computing Group, Delft University of Technology, Delft, Netherlands
10. Persson A, Jaeger TF. Evaluating normalization accounts against the dense vowel space of Central Swedish. Front Psychol 2023; 14:1165742. [PMID: 37416548] [PMCID: PMC10322199] [DOI: 10.3389/fpsyg.2023.1165742]
Abstract
Talkers vary in the phonetic realization of their vowels. One influential hypothesis holds that listeners overcome this inter-talker variability through pre-linguistic auditory mechanisms that normalize the acoustic or phonetic cues that form the input to speech recognition. Dozens of competing normalization accounts exist, including both accounts specific to vowel perception and general-purpose accounts that can be applied to any type of cue. We add to the cross-linguistic literature on this matter by comparing normalization accounts against a new phonetically annotated vowel database of Swedish, a language with a particularly dense vowel inventory of 21 vowels differing in quality and quantity. We evaluate normalization accounts on how they differ in predicted consequences for perception. The results indicate that the best performing accounts either center or standardize formants by talker. The study also suggests that general-purpose accounts perform as well as vowel-specific accounts, and that vowel normalization operates in both temporal and spectral domains.
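One of the best-performing families of accounts mentioned here, standardizing formants by talker, is essentially Lobanov (z-score) normalization. A minimal sketch, assuming formant measurements have already been extracted (the example talkers and values are invented for illustration):

```python
import numpy as np

def standardize_by_talker(formants):
    """Lobanov-style normalization: z-score each formant (column)
    within one talker's tokens (rows)."""
    f = np.asarray(formants, dtype=float)
    return (f - f.mean(axis=0)) / f.std(axis=0)

# two talkers whose F1/F2 spaces (Hz) are shifted and scaled versions
# of the same underlying vowel configuration
talker_a = np.array([[300.0, 2300.0],
                     [500.0, 1500.0],
                     [700.0, 1100.0]])
talker_b = 1.2 * talker_a + 150.0
```

Because z-scoring is invariant to a talker-specific shift and scale, the two talkers' tokens land on identical normalized coordinates, which is exactly the sense in which such accounts remove inter-talker variability.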
Affiliation(s)
- Anna Persson
- Department of Swedish Language and Multilingualism, Stockholm University, Stockholm, Sweden
- T. Florian Jaeger
- Brain and Cognitive Sciences, University of Rochester, Rochester, NY, United States
- Computer Science, University of Rochester, Rochester, NY, United States
11. Ishikawa K, Pietrowicz M, Charney S, Orbelo D. Landmark-based analysis of speech differentiates conversational from clear speech in speakers with muscle tension dysphonia. JASA Express Letters 2023; 3:2888596. [PMID: 37140265] [DOI: 10.1121/10.0019354]
Abstract
This study evaluated the feasibility of differentiating conversational and clear speech produced by individuals with muscle tension dysphonia (MTD) using landmark-based analysis of speech (LMBAS). Thirty-four adult speakers with MTD recorded conversational and clear speech, with 27 of them able to produce clear speech. The recordings of these individuals were analyzed with the open-source LMBAS program SpeechMark® (MATLAB Toolbox, version 1.1.2). The results indicated that glottal landmarks, burst onset landmarks, and the duration between glottal landmarks differentiated conversational speech from clear speech. LMBAS shows potential as an approach for detecting the difference between conversational and clear speech in dysphonic individuals.
Affiliation(s)
- Keiko Ishikawa
- Department of Communication Sciences and Disorders, University of Kentucky, 900 South Limestone, Lexington, Kentucky 40536-0200, USA
- Mary Pietrowicz
- Applied Research Institute, University of Illinois at Urbana-Champaign, 2100 South Oak Street, Suite 206, Champaign, Illinois 61820, USA
- Sara Charney
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic Arizona, 5777 East Mayo Boulevard, Phoenix, Arizona 85054, USA
- Diana Orbelo
- Department of Otolaryngology-Head and Neck Surgery, Mayo Medical School, 200 1st Street Southwest, Rochester, Minnesota 55905, USA
12. Beguš G, Zhou A, Zhao TC. Encoding of speech in convolutional layers and the brain stem based on language experience. Sci Rep 2023; 13:6480. [PMID: 37081119] [PMCID: PMC10119295] [DOI: 10.1038/s41598-023-33384-9]
Abstract
Comparing artificial neural networks with outputs of neuroimaging techniques has recently seen substantial advances in (computer) vision and text-based language models. Here, we propose a framework to compare biological and artificial neural computations of spoken language representations and propose several new challenges to this paradigm. The proposed technique is based on a similar principle that underlies electroencephalography (EEG): averaging of neural (artificial or biological) activity across neurons in the time domain, and it allows comparison of the encoding of any acoustic property in the brain and in intermediate convolutional layers of an artificial neural network. Our approach allows a direct comparison of responses to a phonetic property in the brain and in deep neural networks that requires no linear transformations between the signals. We argue that the brain stem response (cABR) and the response in intermediate convolutional layers to the exact same stimulus are highly similar without applying any transformations, and we quantify this observation. The proposed technique not only reveals similarities, but also allows for analysis of the encoding of actual acoustic properties in the two signals: we compare peak latency (i) in cABR relative to the stimulus in the brain stem and (ii) in intermediate convolutional layers relative to the input/output in deep convolutional networks. We also examine and compare the effect of prior language exposure on the peak latency in cABR and in intermediate convolutional layers. Substantial similarities in peak latency encoding between the human brain and intermediate convolutional networks emerge based on results from eight trained networks (including a replication experiment). The proposed technique can be used to compare encoding between the human brain and intermediate convolutional layers for any acoustic property and for other neuroimaging techniques.
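The core averaging-and-latency idea can be sketched as follows: average a convolutional layer's activations across channels, the way EEG-style measures average activity across neurons, and read off the latency of the resulting peak. This is a schematic illustration with synthetic activations, not the authors' cABR pipeline; shapes and the sample rate are invented for the example.

```python
import numpy as np

def layer_waveform(acts):
    """Average unit activity across channels to get a single EEG-like
    time series. acts: (n_channels, n_samples) activations of one
    convolutional layer in response to one stimulus."""
    return np.asarray(acts, dtype=float).mean(axis=0)

def peak_latency(wave, sample_rate):
    """Latency in seconds of the largest-magnitude peak."""
    wave = np.asarray(wave, dtype=float)
    return int(np.argmax(np.abs(wave))) / sample_rate

# synthetic activations: 4 channels, each with a response peak at sample 50
acts = np.zeros((4, 100))
acts[:, 50] = [0.8, 1.0, 0.9, 1.1]
latency = peak_latency(layer_waveform(acts), sample_rate=100)
```

The same two functions apply unchanged to a measured brain-stem response, which is what makes latencies from the two signals directly comparable without any learned transformation.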
Affiliation(s)
- Gašper Beguš
- Department of Linguistics, University of California, Berkeley, USA.
- Alan Zhou
- Department of Cognitive Science, Johns Hopkins University, Baltimore, USA
- T Christina Zhao
- Institute for Learning and Brain Sciences, University of Washington, Seattle, USA
- Department of Speech and Hearing Sciences, University of Washington, Seattle, USA
13. Spivey MJ. Cognitive Science Progresses Toward Interactive Frameworks. Top Cogn Sci 2023; 15:219-254. [PMID: 36949655] [PMCID: PMC10123086] [DOI: 10.1111/tops.12645]
Abstract
Despite its many twists and turns, the arc of cognitive science generally bends toward progress, thanks to its interdisciplinary nature. By glancing at the last few decades of experimental and computational advances, it can be argued that, far from failing to converge on a shared set of conceptual assumptions, the field is indeed making steady consensual progress toward what can broadly be referred to as interactive frameworks. This inclination is apparent in the subfields of psycholinguistics, visual perception, embodied cognition, extended cognition, neural networks, dynamical systems theory, and more. This pictorial essay briefly documents this steady progress both from a bird's eye view and from the trenches. The conclusion is one of optimism that cognitive science is getting there, albeit slowly and arduously, like any good science should.
Affiliation(s)
- Michael J Spivey
- Department of Cognitive and Information Sciences, University of California, Merced
14. Avcu E, Hwang M, Brown KS, Gow DW. A tale of two lexica: Investigating computational pressures on word representation with neural networks. Front Artif Intell 2023; 6:1062230. [PMID: 37051161] [PMCID: PMC10083378] [DOI: 10.3389/frai.2023.1062230]
Abstract
Introduction: The notion of a single localized store of word representations has become increasingly less plausible as evidence has accumulated for the widely distributed neural representation of wordform grounded in motor, perceptual, and conceptual processes. Here, we attempt to combine machine learning methods and neurobiological frameworks to propose a computational model of brain systems potentially responsible for wordform representation. We tested the hypothesis that the functional specialization of word representation in the brain is driven partly by computational optimization. This hypothesis directly addresses the unique problem of mapping sound and articulation vs. mapping sound and meaning.

Results: We found that artificial neural networks trained on the mapping between sound and articulation performed poorly in recognizing the mapping between sound and meaning and vice versa. Moreover, a network trained on both tasks simultaneously could not discover the features required for efficient mapping between sound and higher-level cognitive states compared to the other two models. Furthermore, these networks developed internal representations reflecting specialized task-optimized functions without explicit training.

Discussion: Together, these findings demonstrate that different task-directed representations lead to more focused responses and better performance of a machine or algorithm and, hypothetically, the brain. Thus, we imply that the functional specialization of word representation mirrors a computational optimization strategy given the nature of the tasks that the human brain faces.
Affiliation(s)
- Enes Avcu
- Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, United States
- Kevin Scott Brown
- Department of Pharmaceutical Sciences and School of Chemical, Biological, and Environmental Engineering, Oregon State University, Corvallis, OR, United States
- David W. Gow
- Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, United States
- Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, MA, United States
- Department of Psychology, Salem State University, Salem, MA, United States
- Harvard-MIT Division of Health Sciences and Technology, Boston, MA, United States
15. Sadagopan S, Kar M, Parida S. Quantitative models of auditory cortical processing. Hear Res 2023; 429:108697. [PMID: 36696724] [PMCID: PMC9928778] [DOI: 10.1016/j.heares.2023.108697]
Abstract
To generate insight from experimental data, it is critical to understand the inter-relationships between individual data points and place them in context within a structured framework. Quantitative modeling can provide the scaffolding for such an endeavor. Our main objective in this review is to provide a primer on the range of quantitative tools available to experimental auditory neuroscientists. Quantitative modeling is advantageous because it can provide a compact summary of observed data, make underlying assumptions explicit, and generate predictions for future experiments. Quantitative models may be developed to characterize or fit observed data, to test theories of how a task may be solved by neural circuits, to determine how observed biophysical details might contribute to measured activity patterns, or to predict how an experimental manipulation would affect neural activity. In complexity, quantitative models can range from those that are highly biophysically realistic and that include detailed simulations at the level of individual synapses, to those that use abstract and simplified neuron models to simulate entire networks. Here, we survey the landscape of recently developed models of auditory cortical processing, highlighting a small selection of models to demonstrate how they help generate insight into the mechanisms of auditory processing. We discuss examples ranging from models that use details of synaptic properties to explain the temporal pattern of cortical responses to those that use modern deep neural networks to gain insight into human fMRI data. We conclude by discussing a biologically realistic and interpretable model that our laboratory has developed to explore aspects of vocalization categorization in the auditory pathway.
Affiliation(s)
- Srivatsun Sadagopan
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, PA, USA; Center for Neuroscience, University of Pittsburgh, Pittsburgh, PA, USA; Center for the Neural Basis of Cognition, University of Pittsburgh, Pittsburgh, PA, USA; Department of Bioengineering, University of Pittsburgh, Pittsburgh, PA, USA; Department of Communication Science and Disorders, University of Pittsburgh, Pittsburgh, PA, USA.
- Manaswini Kar
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, PA, USA; Center for Neuroscience, University of Pittsburgh, Pittsburgh, PA, USA; Center for the Neural Basis of Cognition, University of Pittsburgh, Pittsburgh, PA, USA
- Satyabrata Parida
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, PA, USA; Center for Neuroscience, University of Pittsburgh, Pittsburgh, PA, USA
16
Nenadić F, Tucker BV, Ten Bosch L. Computational Modeling of an Auditory Lexical Decision Experiment Using DIANA. LANGUAGE AND SPEECH 2022:238309221111752. [PMID: 36000386 PMCID: PMC10394956 DOI: 10.1177/00238309221111752] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
We present an implementation of DIANA, a computational model of spoken word recognition, to model responses collected in the Massive Auditory Lexical Decision (MALD) project. DIANA is an end-to-end model, including an activation and decision component that takes the acoustic signal as input, activates internal word representations, and outputs lexicality judgments and estimated response latencies. Simulation 1 presents the process of creating acoustic models required by DIANA to analyze novel speech input. Simulation 2 investigates DIANA's performance in determining whether the input signal is a word present in the lexicon or a pseudoword. In Simulation 3, we generate estimates of response latency and correlate them with general tendencies in participant responses in MALD data. We find that DIANA performs fairly well in free word recognition and lexical decision. However, the current approach for estimating response latency provides estimates opposite to those found in behavioral data. We discuss these findings and offer suggestions as to what a contemporary model of spoken word recognition should be able to do.
Affiliation(s)
- Filip Nenadić
- University of Alberta, Canada; Singidunum University, Serbia
17
DIANA, a Process-Oriented Model of Human Auditory Word Recognition. Brain Sci 2022; 12:brainsci12050681. [PMID: 35625067 PMCID: PMC9140177 DOI: 10.3390/brainsci12050681] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2022] [Revised: 05/05/2022] [Accepted: 05/10/2022] [Indexed: 02/04/2023] Open
Abstract
This article presents DIANA, a new, process-oriented model of human auditory word recognition, which takes as its input the acoustic signal and can produce as its output word identifications and lexicality decisions, as well as reaction times. This makes it possible to compare its output with human listeners’ behavior in psycholinguistic experiments. DIANA differs from existing models in that it takes more of the available neuro-physiological evidence on speech processing into account. For instance, DIANA accounts for the effect of ambiguity in the acoustic signal on reaction times following the Hick–Hyman law, and it interprets the acoustic signal in the form of spectro-temporal receptive fields, which are attested in the human superior temporal gyrus, rather than in the form of abstract phonological units. The model consists of three components: activation, decision, and execution. The activation and decision components are described in detail, both at the conceptual level (in the running text) and at the computational level (in the Appendices). While the activation component is independent of the listener’s task, the functioning of the decision component depends on this task. The article also describes how DIANA could be improved in the future so that it resembles the behavior of human listeners even more closely.
18
Strauß A, Wu T, McQueen JM, Scharenborg O, Hintz F. The differential roles of lexical and sublexical processing during spoken-word recognition in clear and in noise. Cortex 2022; 151:70-88. [DOI: 10.1016/j.cortex.2022.02.011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2021] [Revised: 01/21/2022] [Accepted: 02/13/2022] [Indexed: 02/03/2023]
19
Karaminis T, Hintz F, Scharenborg O. The Presence of Background Noise Extends the Competitor Space in Native and Non-Native Spoken-Word Recognition: Insights from Computational Modeling. Cogn Sci 2022; 46:e13110. [PMID: 35188686 PMCID: PMC9286693 DOI: 10.1111/cogs.13110] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2021] [Revised: 12/17/2021] [Accepted: 01/23/2022] [Indexed: 11/29/2022]
Abstract
Oral communication often takes place in noisy environments, which challenge spoken-word recognition. Previous research has suggested that the presence of background noise extends the number of candidate words competing with the target word for recognition and that this extension affects the time course and accuracy of spoken-word recognition. In this study, we further investigated the temporal dynamics of competition processes in the presence of background noise, and how these vary in listeners with different language proficiency (i.e., native and non-native), using computational modeling. We developed ListenIN (Listen-In-Noise), a neural-network model based on an autoencoder architecture, which learns to map phonological forms onto meanings in two languages and simulates native and non-native spoken-word comprehension. We also examined the model’s activation states during online spoken-word recognition. These analyses demonstrated that the presence of background noise increases the number of competitor words engaged in phonological competition, and that this happens in similar ways both intra- and interlinguistically, in native and in non-native listening. Taken together, our results support accounts positing a “many-additional-competitors scenario” for the effects of noise on spoken-word recognition.
Affiliation(s)
- Florian Hintz
- Department of Psychology of Language, Max Planck Institute for Psycholinguistics
20
Kurumada C, Roettger TB. Thinking probabilistically in the study of intonational speech prosody. WILEY INTERDISCIPLINARY REVIEWS. COGNITIVE SCIENCE 2021; 13:e1579. [PMID: 34599647 DOI: 10.1002/wcs.1579] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/31/2021] [Revised: 08/09/2021] [Accepted: 08/26/2021] [Indexed: 11/07/2022]
Abstract
Speech prosody, the melodic and rhythmic properties of a language, plays a critical role in our everyday communication. Researchers have identified unique patterns of prosody that segment words and phrases, highlight focal elements in a sentence, and convey holistic meanings and speech acts that interact with the information shared in context. The mapping between the sound and meaning represented in prosody is suggested to be probabilistic: the same physical instance of sounds can support multiple meanings across talkers and contexts, while the same meaning can be encoded in physically distinct sound patterns (e.g., pitch movements). The current overview presents an analysis framework for probing the nature of this probabilistic relationship. Illustrated by examples from the literature and a dataset of German focus marking, we discuss production variability within and across talkers and consider the challenges that this variability imposes on the comprehension system. A better understanding of these challenges, we argue, will illuminate how human perceptual, cognitive, and computational mechanisms may navigate the variability to arrive at a coherent understanding of speech prosody. The current paper is intended as an introduction for those who are interested in thinking probabilistically about the sound-meaning mapping in prosody. Open questions for future research are discussed, with proposals for examining prosodic production and comprehension within a comprehensive, mathematically motivated framework of probabilistic inference under uncertainty. This article is categorized under: Linguistics > Language in Mind and Brain; Psychology > Language.
Affiliation(s)
- Chigusa Kurumada
- Department of Brain and Cognitive Sciences, University of Rochester, Rochester, New York, USA
- Timo B Roettger
- Department of Linguistics & Scandinavian Studies, Universitetet i Oslo, Oslo, Norway
21
Falandays JB, Nguyen B, Spivey MJ. Is prediction nothing more than multi-scale pattern completion of the future? Brain Res 2021; 1768:147578. [PMID: 34284021 DOI: 10.1016/j.brainres.2021.147578] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2020] [Revised: 05/28/2021] [Accepted: 06/29/2021] [Indexed: 11/18/2022]
Abstract
While the notion of the brain as a prediction machine has been extremely influential and productive in cognitive science, there are competing accounts of how best to model and understand the predictive capabilities of brains. One prominent framework is of a "Bayesian brain" that explicitly generates predictions and uses resultant errors to guide adaptation. We suggest that the prediction-generation component of this framework may involve little more than a pattern completion process. We first describe pattern completion in the domain of visual perception, highlighting its temporal extension, and show how this can entail a form of prediction in time. Next, we describe the forward momentum of entrained dynamical systems as a model for the emergence of predictive processing in non-predictive systems. Then, we apply this reasoning to the domain of language, where explicitly predictive models are perhaps most popular. Here, we demonstrate how a connectionist model, TRACE, exhibits hallmarks of predictive processing without any representations of predictions or errors. Finally, we present a novel neural network model, inspired by reservoir computing models, that is entirely unsupervised and memoryless, but nonetheless exhibits prediction-like behavior in its pursuit of homeostasis. These explorations demonstrate that brain-like systems can get prediction "for free," without the need to posit formal logical representations with Bayesian probabilities or an inference machine that holds them in working memory.
Affiliation(s)
- J Benjamin Falandays
- Department of Cognitive and Information Sciences, University of California, Merced, United States
- Benjamin Nguyen
- Department of Cognitive and Information Sciences, University of California, Merced, United States
- Michael J Spivey
- Department of Cognitive and Information Sciences, University of California, Merced, United States.
22
Does signal reduction imply predictive coding in models of spoken word recognition? Psychon Bull Rev 2021; 28:1381-1389. [PMID: 33852158 PMCID: PMC8367925 DOI: 10.3758/s13423-021-01924-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 03/24/2021] [Indexed: 12/29/2022]
Abstract
Pervasive behavioral and neural evidence for predictive processing has led to claims that language processing depends upon predictive coding. Formally, predictive coding is a computational mechanism where only deviations from top-down expectations are passed between levels of representation. In many cognitive neuroscience studies, a reduction of signal for expected inputs is taken as being diagnostic of predictive coding. In the present work, we show that despite not explicitly implementing prediction, the TRACE model of speech perception exhibits this putative hallmark of predictive coding, with reductions in total lexical activation, total lexical feedback, and total phoneme activation when the input conforms to expectations. These findings may indicate that interactive activation is functionally equivalent or approximant to predictive coding or that caution is warranted in interpreting neural signal reduction as diagnostic of predictive coding.
23
Fox NP, Leonard M, Sjerps MJ, Chang EF. Transformation of a temporal speech cue to a spatial neural code in human auditory cortex. eLife 2020; 9:e53051. [PMID: 32840483 PMCID: PMC7556862 DOI: 10.7554/elife.53051] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2019] [Accepted: 08/21/2020] [Indexed: 11/28/2022] Open
Abstract
In speech, listeners extract continuously-varying spectrotemporal cues from the acoustic signal to perceive discrete phonetic categories. Spectral cues are spatially encoded in the amplitude of responses in phonetically-tuned neural populations in auditory cortex. It remains unknown whether similar neurophysiological mechanisms encode temporal cues like voice-onset time (VOT), which distinguishes sounds like /b/ and /p/. We used direct brain recordings in humans to investigate the neural encoding of temporal speech cues with a VOT continuum from /ba/ to /pa/. We found that distinct neural populations respond preferentially to VOTs from one phonetic category, and are also sensitive to sub-phonetic VOT differences within a population's preferred category. In a simple neural network model, simulated populations tuned to detect either temporal gaps or coincidences between spectral cues captured encoding patterns observed in real neural data. These results demonstrate that a spatial/amplitude neural code underlies the cortical representation of both spectral and temporal speech cues.
Affiliation(s)
- Neal P Fox
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, United States
- Matthew Leonard
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, United States
- Matthias J Sjerps
- Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging, Radboud University, Nijmegen, Netherlands
- Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
- Edward F Chang
- Department of Neurological Surgery, University of California, San Francisco, San Francisco, United States
- Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, United States