1
Chernyak BR, Bradlow AR, Keshet J, Goldrick M. A perceptual similarity space for speech based on self-supervised speech representations. J Acoust Soc Am 2024; 155:3915-3929. PMID: 38904539. DOI: 10.1121/10.0026358.
Abstract
Speech recognition by both humans and machines frequently fails in non-optimal yet common situations. For example, word recognition error rates for second-language (L2) speech can be high, especially under conditions involving background noise. At the same time, both human and machine speech recognition sometimes shows remarkable robustness against signal- and noise-related degradation. Which acoustic features of speech explain this substantial variation in intelligibility? Current approaches align speech to text to extract a small set of pre-defined spectro-temporal properties from specific sounds in particular words. However, variation in these properties leaves much cross-talker variation in intelligibility unexplained. We examine an alternative approach utilizing a perceptual similarity space acquired using self-supervised learning. This approach encodes distinctions between speech samples without requiring pre-defined acoustic features or speech-to-text alignment. We show that L2 English speech samples are less tightly clustered in the space than L1 samples, reflecting variability in English proficiency among L2 talkers. Critically, distances in this similarity space are perceptually meaningful: L1 English listeners have lower recognition accuracy for L2 speakers whose speech is more distant in the space from L1 speech. These results indicate that perceptual similarity may form the basis for an entirely new speech and language analysis approach.
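The abstract's core measurements, cluster tightness and distance in an embedding space, can be sketched with toy vectors standing in for the self-supervised speech representations (the dimensionality, sample counts, and random vectors below are purely illustrative, not the paper's embeddings):

```python
import numpy as np

def centroid_distance(a, b):
    """Distance between the mean embeddings of two sets of samples."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

def cluster_tightness(x):
    """Mean distance of each sample to its cluster centroid (lower = tighter)."""
    return float(np.mean(np.linalg.norm(x - x.mean(axis=0), axis=1)))

# Toy 8-dimensional vectors standing in for speech embeddings.
rng = np.random.default_rng(0)
l1 = rng.normal(0.0, 0.5, size=(20, 8))   # tightly clustered "L1" samples
l2 = rng.normal(1.0, 1.5, size=(20, 8))   # more dispersed "L2" samples

t1 = cluster_tightness(l1)
t2 = cluster_tightness(l2)
d = centroid_distance(l1, l2)
```

On these toy samples the "L2" set is more dispersed than the "L1" set and sits at a nonzero distance from it, mirroring the looser clustering and the perceptually meaningful distances the study reports.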
Affiliation(s)
- Bronya R Chernyak
- Faculty of Electrical & Computer Engineering, Technion-Israel Institute of Technology, Haifa 3200003, Israel
- Ann R Bradlow
- Department of Linguistics, Northwestern University, Evanston, Illinois 60208, USA
- Joseph Keshet
- Faculty of Electrical & Computer Engineering, Technion-Israel Institute of Technology, Haifa 3200003, Israel
- Matthew Goldrick
- Department of Linguistics, Northwestern University, Evanston, Illinois 60208, USA
2
Wang FH, Luo M, Wang S. Statistical word segmentation succeeds given the minimal amount of exposure. Psychon Bull Rev 2024; 31:1172-1180. PMID: 37884777. DOI: 10.3758/s13423-023-02386-z.
Abstract
One of the first tasks in language acquisition is word segmentation, a process to extract word forms from continuous speech streams. Statistical approaches to word segmentation have been shown to be a powerful mechanism, in which word boundaries are inferred from sequence statistics. This approach requires the learner to represent the frequency of units from syllable sequences, though accounts differ on how much statistical exposure is required. In this study, we examined the computational limit with which words can be extracted from continuous sequences. First, we discussed why two occurrences of a word in a continuous sequence are the computational lower limit for this word to be statistically defined. Next, we created short syllable sequences that contained certain words either two or four times. Learners were presented with these syllable sequences one at a time, immediately followed by a test of the novel words from these sequences. We found that, with the computational minimum of two exposures, words were successfully segmented from continuous sequences. Moreover, longer syllable sequences providing four exposures to words generated more robust learning results. The implications of these results are discussed in terms of how learners segment and store word candidates from continuous sequences.
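The segmentation mechanism this abstract builds on, inferring word boundaries from dips in transitional probability, can be sketched as follows (the syllables and word order are illustrative, not the study's stimuli):

```python
from collections import Counter

def transitional_probabilities(stream):
    """TP(x -> y) = count of x followed by y, divided by count of x."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {(x, y): c / first_counts[x] for (x, y), c in pair_counts.items()}

def segment(stream):
    """Posit a word boundary at each local dip in transitional probability."""
    tps = transitional_probabilities(stream)
    seq = [tps[(stream[i], stream[i + 1])] for i in range(len(stream) - 1)]
    words, start = [], 0
    for i in range(1, len(seq) - 1):
        if seq[i] < seq[i - 1] and seq[i] < seq[i + 1]:
            words.append("".join(stream[start:i + 1]))
            start = i + 1
    words.append("".join(stream[start:]))
    return words

# Two trisyllabic "words" presented in varied order, as in a typical SL stream.
A, B = ["tu", "pi", "ro"], ["go", "la", "bu"]
stream = A + B + B + A + B + A + A + B
words = segment(stream)
```

Within-word TPs here stay at 1.0 while TPs at word junctures fall below them, so every juncture is a local minimum and both embedded words are recovered.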
Affiliation(s)
- Felix Hao Wang
- School of Psychology, Nanjing Normal University, Nanjing, Jiangsu, China.
- Meili Luo
- School of Psychology, Nanjing Normal University, Nanjing, Jiangsu, China
- Suiping Wang
- Philosophy and Social Science Laboratory of Reading and Development in Children and Adolescents, South China Normal University, Ministry of Education, Guangzhou, China.
3
Qi W, Zevin JD. Statistical learning of syllable sequences as trajectories through a perceptual similarity space. Cognition 2024; 244:105689. PMID: 38219453. DOI: 10.1016/j.cognition.2023.105689.
Abstract
Learning from sequential statistics is a general capacity common across many cognitive domains and species. One form of statistical learning (SL) - learning to segment "words" from continuous streams of speech syllables in which the only segmentation cue is ostensibly the transitional (or conditional) probability from one syllable to the next - has been studied in great detail. Typically, this phenomenon is modeled as the calculation of probabilities over discrete, featureless units. Here we present an alternative model, in which sequences are learned as trajectories through a similarity space. A simple recurrent network coding syllables with representations that capture the similarity relations among them correctly simulated the result of a classic SL study, as did a similar model that encoded syllables as three-dimensional points in a continuous similarity space. We then used the simulations to identify a sequence of "words" that produces the reverse of the typical SL effect, i.e., part-words are predicted to be more familiar than words. Results from two experiments with human participants are consistent with simulation results. Additional analyses identified features that drive differences in what is learned from a set of artificial languages that have the same transitional probabilities among syllables.
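The trajectory idea can be illustrated at a toy level: give each syllable a point in a similarity space and treat a word as the path its syllables trace, so that words built from similar syllables trace similar paths (the coordinates below are invented for illustration; the paper derives its dimensions from perceptual data):

```python
import numpy as np

# Hypothetical 3-D similarity-space coordinates for six syllables.
coords = {
    "pa": np.array([0.9, 0.1, 0.2]), "ba": np.array([0.8, 0.2, 0.3]),
    "ti": np.array([0.1, 0.9, 0.4]), "di": np.array([0.2, 0.8, 0.5]),
    "ku": np.array([0.4, 0.3, 0.9]), "gu": np.array([0.5, 0.2, 0.8]),
}

def trajectory(word):
    """A word is a path through similarity space: one point per syllable."""
    return np.stack([coords[s] for s in word])

def trajectory_distance(w1, w2):
    """Mean pointwise distance between two equal-length trajectories."""
    return float(np.mean(np.linalg.norm(trajectory(w1) - trajectory(w2), axis=1)))

# "ba-di-gu" traces nearly the same path as "pa-ti-ku"; "ku-di-pa" does not.
d_close = trajectory_distance(["pa", "ti", "ku"], ["ba", "di", "gu"])
d_far = trajectory_distance(["pa", "ti", "ku"], ["ku", "di", "pa"])
```

Under this encoding, two sequences can have identical transitional probabilities over discrete units yet trace very different paths, which is the contrast the paper's analyses exploit.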
Affiliation(s)
- Wendy Qi
- Department of Psychology, University of Southern California, 3620 S. McClintock Ave, Los Angeles, CA 90089, United States
- Jason D Zevin
- Department of Psychology, University of Southern California, 3620 S. McClintock Ave, Los Angeles, CA 90089, United States.
4
Swingley D, Algayres R. Computational Modeling of the Segmentation of Sentence Stimuli From an Infant Word-Finding Study. Cogn Sci 2024; 48:e13427. PMID: 38528789. DOI: 10.1111/cogs.13427.
Abstract
Computational models of infant word-finding typically operate over transcriptions of infant-directed speech corpora. It is now possible to test models of word segmentation on speech materials, rather than transcriptions of speech. We propose that such modeling efforts be conducted over the speech of the experimental stimuli used in studies measuring infants' capacity for learning from spoken sentences. Correspondence with infant outcomes in such experiments is an appropriate benchmark for models of infant learning. We demonstrate such an analysis by applying the DP-Parse model of Algayres and colleagues to auditory stimuli used in infant psycholinguistic experiments by Pelucchi and colleagues. The DP-Parse model takes speech as input and creates multiple overlapping embeddings from each utterance. Prospective words are identified as clusters of similar embedded segments. This allows segmentation of each utterance into possible words, using a dynamic programming method that maximizes the frequency of constituent segments. We show that DP-Parse mimics American English learners' performance in extracting words from Italian sentences, favoring the segmentation of words with high syllabic transitional probability. This kind of computational analysis over actual stimuli from infant experiments may be helpful in tuning future models to match human performance.
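The dynamic-programming idea, choosing the parse whose segments are collectively most frequent, can be sketched over discrete symbols. This toy uses invented syllables and a fixed per-segment cost; it is a simplified analogue of the frequency-maximizing objective described above, not the DP-Parse model, which operates on speech embeddings:

```python
import math
from collections import Counter

def dp_segment(stream, max_len=3, seg_cost=1.5):
    """Score each parse by the summed log-frequency of its segments minus a
    per-segment cost, and return the best-scoring parse via dynamic
    programming over prefixes."""
    s = tuple(stream)
    # Count every substring up to max_len as a candidate segment.
    counts = Counter(s[i:j] for i in range(len(s))
                     for j in range(i + 1, min(i + max_len, len(s)) + 1))
    best = {0: (0.0, [])}                    # prefix length -> (score, parse)
    for j in range(1, len(s) + 1):
        best[j] = max(
            (best[i][0] + math.log(counts[s[i:j]]) - seg_cost,
             best[i][1] + [s[i:j]])
            for i in range(max(0, j - max_len), j))
    return best[len(s)][1]

# Two trisyllabic "words" presented in varied order.
A, B = ("tu", "pi", "ro"), ("go", "la", "bu")
stream = A + B + B + A + B + A + A + B
parse = dp_segment(stream)
```

With the per-segment cost set just above the log-frequency of the embedded words, the parse that tiles the stream with those two words beats both over-segmented (more, shorter segments) and boundary-crossing (rarer segments) alternatives.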
5
Wang FH, Luo M, Wang S. Perceptual intake explains variability in statistical word segmentation. Cognition 2023; 241:105612. PMID: 37738711. DOI: 10.1016/j.cognition.2023.105612.
Abstract
One of the first problems in language learning is to segment words from continuous speech. Both prosodic and distributional information can be useful, and how the two types of information are integrated is an important question. In this paper, we propose that the distinction between input (the statistical properties of the syllable sequence) and intake (how learners perceptually represent the syllable sequence) is a useful framework for integrating different sources of information. We took a novel approach, observing how a large number of syllable sequences were segmented. These sequences carried the same transitional probability information for finding word boundaries but differed in the syllables that composed them. We found large variability in performance on the segmentation task, suggesting that factors other than the statistical properties of the sequences were at play. This variability was explored using the input/intake asymmetry framework, which predicted that factors shaping the representation of different syllable sequences could explain the variability of learning. We examined two factors: the saliency of the rhythm in these syllable sequences, and how familiar the novel word forms in each sequence were relative to the existing lexicon. Both factors explained variance in the learnability of different sequences, suggesting that processing of the sequences shaped learning. The implications of these results for computational models of statistical learning, and for language learning more broadly, are discussed.
Affiliation(s)
- Felix Hao Wang
- School of Psychology, Nanjing Normal University, Nanjing, Jiangsu, China.
- Meili Luo
- School of Psychology, Nanjing Normal University, Nanjing, Jiangsu, China
- Suiping Wang
- Philosophy and Social Science Laboratory of Reading and Development in Children and Adolescents (South China Normal University), Ministry of Education, China.
6
Henin S, Turk-Browne NB, Friedman D, Liu A, Dugan P, Flinker A, Doyle W, Devinsky O, Melloni L. Learning hierarchical sequence representations across human cortex and hippocampus. Sci Adv 2021; 7:eabc4530. PMID: 33608265. PMCID: PMC7895424. DOI: 10.1126/sciadv.abc4530.
Abstract
Sensory input arrives in continuous sequences that humans experience as segmented units, e.g., words and events. The brain's ability to discover regularities is called statistical learning. Structure can be represented at multiple levels, including transitional probabilities, ordinal position, and identity of units. To investigate sequence encoding in cortex and hippocampus, we recorded from intracranial electrodes in human subjects as they were exposed to auditory and visual sequences containing temporal regularities. We find neural tracking of regularities within minutes, with characteristic profiles across brain areas. Early processing tracked lower-level features (e.g., syllables) and learned units (e.g., words), while later processing tracked only learned units. Learning rapidly shaped neural representations, with a gradient of complexity from early brain areas encoding transitional probability, to associative regions and hippocampus encoding ordinal position and identity of units. These findings indicate the existence of multiple, parallel computational systems for sequence learning across hierarchically organized cortico-hippocampal circuits.
Affiliation(s)
- Simon Henin
- New York University Comprehensive Epilepsy Center, 223 34th Street, New York, NY 10016, USA.
- Department of Neurology, New York University School of Medicine, 240 East 38th Street, 20th Floor, New York, NY 10016, USA
- Daniel Friedman
- New York University Comprehensive Epilepsy Center, 223 34th Street, New York, NY 10016, USA
- Department of Neurology, New York University School of Medicine, 240 East 38th Street, 20th Floor, New York, NY 10016, USA
- Anli Liu
- New York University Comprehensive Epilepsy Center, 223 34th Street, New York, NY 10016, USA
- Department of Neurology, New York University School of Medicine, 240 East 38th Street, 20th Floor, New York, NY 10016, USA
- Patricia Dugan
- New York University Comprehensive Epilepsy Center, 223 34th Street, New York, NY 10016, USA
- Department of Neurology, New York University School of Medicine, 240 East 38th Street, 20th Floor, New York, NY 10016, USA
- Adeen Flinker
- New York University Comprehensive Epilepsy Center, 223 34th Street, New York, NY 10016, USA
- Department of Neurology, New York University School of Medicine, 240 East 38th Street, 20th Floor, New York, NY 10016, USA
- Werner Doyle
- New York University Comprehensive Epilepsy Center, 223 34th Street, New York, NY 10016, USA
- Department of Neurology, New York University School of Medicine, 240 East 38th Street, 20th Floor, New York, NY 10016, USA
- Orrin Devinsky
- New York University Comprehensive Epilepsy Center, 223 34th Street, New York, NY 10016, USA
- Department of Neurology, New York University School of Medicine, 240 East 38th Street, 20th Floor, New York, NY 10016, USA
- Lucia Melloni
- New York University Comprehensive Epilepsy Center, 223 34th Street, New York, NY 10016, USA.
- Department of Neurology, New York University School of Medicine, 240 East 38th Street, 20th Floor, New York, NY 10016, USA
- Department of Neuroscience, Max Planck Institute for Empirical Aesthetics, Grüneburgweg 14, 60322 Frankfurt am Main, Germany