1
Naeini SA, Simmatis L, Jafari D, Yunusova Y, Taati B. Improving Dysarthric Speech Segmentation With Emulated and Synthetic Augmentation. IEEE J Transl Eng Health Med 2024; 12:382-389. [PMID: 38606392] [PMCID: PMC11008804] [DOI: 10.1109/jtehm.2024.3375323]
Abstract
Acoustic features extracted from speech can help with the diagnosis of neurological diseases and the monitoring of symptoms over time. Temporal segmentation of audio signals into individual words is an important pre-processing step needed prior to extracting acoustic features. Machine learning techniques could be used to automate speech segmentation via automatic speech recognition (ASR) and sequence-to-sequence alignment. While state-of-the-art ASR models achieve good performance on healthy speech, their performance significantly drops when evaluated on dysarthric speech. Fine-tuning ASR models on impaired speech can improve performance in dysarthric individuals, but it requires representative clinical data, which is difficult to collect and may raise privacy concerns. This study explores the feasibility of using two augmentation methods to increase ASR performance on dysarthric speech: 1) healthy individuals varying their speaking rate and loudness (as is often used in assessments of pathological speech); 2) synthetic speech with variations in speaking rate and accent (to ensure more diverse vocal representations and fairness). Experimental evaluations showed that fine-tuning a pre-trained ASR model with data from these two sources outperformed a model fine-tuned only on real clinical data and matched the performance of a model fine-tuned on the combination of real clinical data and synthetic speech. When evaluated on held-out acoustic data from 24 individuals with various neurological diseases, the best-performing model achieved an average word error rate of 5.7% and a mean correct count accuracy of 94.4%. In segmenting the data into individual words, a mean intersection-over-union of 89.2% was obtained against manual parsing (ground truth). It can be concluded that emulated and synthetic augmentations can significantly reduce the need for real clinical data of dysarthric speech when fine-tuning ASR models and, in turn, for speech segmentation.
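The two evaluation metrics quoted here are easy to make concrete. Below is a minimal Python sketch, not the authors' code: word error rate as token-level edit distance, and temporal intersection-over-union between a predicted and a manually parsed word segment. The example strings and interval values are invented.

```python
import numpy as np

def word_error_rate(ref, hyp):
    """Token-level Levenshtein distance divided by reference length."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[len(ref), len(hyp)] / len(ref)

def segment_iou(a, b):
    """Temporal IoU of two (onset, offset) word segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# Invented example values, for illustration only
print(word_error_rate("the dog ran".split(), "the dog can".split()))  # 0.333...
print(segment_iou((0.50, 0.92), (0.48, 0.90)))                        # ~0.91
```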
Affiliation(s)
- Saeid Alavi Naeini
- KITE, Toronto Rehabilitation Institute, University Health Network (UHN), Toronto, ON M5G 2A2, Canada
- Institute of Biomedical Engineering, University of Toronto, Toronto, ON M5S 3G9, Canada
- Leif Simmatis
- KITE, Toronto Rehabilitation Institute, University Health Network (UHN), Toronto, ON M5G 2A2, Canada
- Deniz Jafari
- KITE, Toronto Rehabilitation Institute, University Health Network (UHN), Toronto, ON M5G 2A2, Canada
- Institute of Biomedical Engineering, University of Toronto, Toronto, ON M5S 3G9, Canada
- Yana Yunusova
- KITE, Toronto Rehabilitation Institute, University Health Network (UHN), Toronto, ON M5G 2A2, Canada
- Department of Speech Language Pathology, Rehabilitation Sciences Institute, University of Toronto, Toronto, ON M5G 1V7, Canada
- Hurvitz Brain Sciences Program, Sunnybrook Research Institute (SRI), Toronto, ON M4N 3M5, Canada
- Babak Taati
- KITE, Toronto Rehabilitation Institute, University Health Network (UHN), Toronto, ON M5G 2A2, Canada
- Institute of Biomedical Engineering, University of Toronto, Toronto, ON M5S 3G9, Canada
- Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4, Canada
2
Dal Ben R, Prequero IT, Souza DDH, Hay JF. Speech Segmentation and Cross-Situational Word Learning in Parallel. Open Mind (Camb) 2023; 7:510-533. [PMID: 37637304] [PMCID: PMC10449405] [DOI: 10.1162/opmi_a_00095]
Abstract
Language learners track conditional probabilities to find words in continuous speech and to map words and objects across ambiguous contexts. It remains unclear, however, whether learners can leverage the structure of the linguistic input to do both tasks at the same time. To explore this question, we combined speech segmentation and cross-situational word learning into a single task. In Experiment 1, when adults (N = 60) simultaneously segmented continuous speech and mapped the newly segmented words to objects, they demonstrated better performance than when either task was performed alone. However, when the speech stream had conflicting statistics, participants were able to correctly map words to objects, but were at chance level on speech segmentation. In Experiment 2, we used a more sensitive speech segmentation measure to find that adults (N = 35), exposed to the same conflicting speech stream, correctly identified non-words as such, but were still unable to discriminate between words and part-words. Again, mapping was above chance. Our study suggests that learners can track multiple sources of statistical information to find and map words to objects in noisy environments. It also prompts questions on how to effectively measure the knowledge arising from these learning experiences.
Affiliation(s)
- Rodrigo Dal Ben
- Universidade Federal de São Carlos, São Carlos, São Paulo, Brazil
3
Menn KH, Ward EK, Braukmann R, van den Boomen C, Buitelaar J, Hunnius S, Snijders TM. Neural Tracking in Infancy Predicts Language Development in Children With and Without Family History of Autism. Neurobiol Lang (Camb) 2022; 3:495-514. [PMID: 37216063] [PMCID: PMC10158647] [DOI: 10.1162/nol_a_00074]
Abstract
During speech processing, neural activity in non-autistic adults and infants tracks the speech envelope. Recent research in adults indicates that this neural tracking relates to linguistic knowledge and may be reduced in autism. Such reduced tracking, if present already in infancy, could impede language development. In the current study, we focused on children with a family history of autism, who often show a delay in first language acquisition. We investigated whether differences in tracking of sung nursery rhymes during infancy relate to language development and autism symptoms in childhood. We assessed speech-brain coherence at either 10 or 14 months of age in a total of 22 infants with high likelihood of autism due to family history and 19 infants without family history of autism. We analyzed the relationship between speech-brain coherence in these infants and their vocabulary at 24 months as well as autism symptoms at 36 months. Our results showed significant speech-brain coherence in the 10- and 14-month-old infants. We found no evidence for a relationship between speech-brain coherence and later autism symptoms. Importantly, speech-brain coherence at the stressed-syllable rate (1-3 Hz) predicted later vocabulary. Follow-up analyses showed evidence for a relationship between tracking and vocabulary only in 10-month-olds but not in 14-month-olds and indicated possible differences between the likelihood groups. Thus, early tracking of sung nursery rhymes is related to language development in childhood.
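For readers unfamiliar with the measure: speech-brain coherence is typically computed as magnitude-squared coherence between the speech amplitude envelope and the neural signal, averaged over the band of interest. A minimal scipy sketch, assuming one EEG channel and audio already resampled to the EEG rate; the signals below are random placeholders, not the study's data or pipeline:

```python
import numpy as np
from scipy.signal import coherence, hilbert

fs = 250                                 # Hz; shared rate assumed for both signals
rng = np.random.default_rng(0)
audio = rng.standard_normal(fs * 60)     # placeholder for the sung-speech waveform
eeg = rng.standard_normal(fs * 60)       # placeholder for one EEG channel

envelope = np.abs(hilbert(audio))        # amplitude envelope of the speech

# Magnitude-squared coherence (Welch's method), 4-s segments
f, cxy = coherence(envelope, eeg, fs=fs, nperseg=fs * 4)

band = (f >= 1) & (f <= 3)               # stressed-syllable rate, 1-3 Hz
print("mean 1-3 Hz speech-brain coherence:", cxy[band].mean())
```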
Affiliation(s)
- Katharina H. Menn
- Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
- Department of Neuropsychology, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Research Group Language Cycles, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- International Max Planck Research School on Neuroscience of Communication: Function, Structure, and Plasticity, Leipzig, Germany
- Emma K. Ward
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
- Ricarda Braukmann
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
- Carlijn van den Boomen
- Department of Experimental Psychology, Helmholtz Institute, Utrecht University, Utrecht, The Netherlands
- Jan Buitelaar
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
- Department of Cognitive Neuroscience, Radboud University Medical Center, Nijmegen, The Netherlands
- Sabine Hunnius
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
- Tineke M. Snijders
- Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
- Cognitive Neuropsychology Department, Tilburg University, Tilburg, The Netherlands
4
Abstract
To acquire language, infants must learn to segment words from running speech. A significant body of experimental research shows that infants use multiple cues to do so; however, little research has comprehensively examined the distribution of such cues in naturalistic speech. We conducted a comprehensive corpus analysis of German child-directed speech (CDS) using data from the Child Language Data Exchange System (CHILDES) database, investigating the availability of word stress, transitional probabilities (TPs), and lexical and sublexical frequencies as potential cues for word segmentation. Seven hours of data (~15,000 words) were coded, representing around an average day of speech to infants. The analysis revealed that for 97% of words, primary stress was carried by the initial syllable, implicating stress as a reliable cue to word onset in German CDS. Word identity was also marked by TPs between syllables, which were higher within than between words, and higher for backwards than forwards transitions. Words followed a Zipfian-like frequency distribution, and 78% of words were monosyllabic. Of the 50 most frequent words, 82% were function words, which accounted for 47% of word tokens in the entire corpus. Finally, 15% of all utterances comprised single words. These results give rich novel insights into the availability of segmentation cues in German CDS, and support the possibility that infants draw on multiple converging cues to segment their input. The data, which we make openly available to the research community, will help guide future experimental investigations on this topic.
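Transitional probabilities of the kind coded here are plain bigram statistics: the forward TP of a syllable pair xy is count(xy)/count(x), and the backward TP is count(xy)/count(y). A small sketch over an invented syllabified corpus (not the CHILDES data):

```python
from collections import Counter

# Invented toy corpus: utterances as lists of syllables
utterances = [["ba", "by", "mil", "ky"], ["ba", "by", "do", "ggy"]]

bigrams, first, second = Counter(), Counter(), Counter()
for utt in utterances:
    for x, y in zip(utt, utt[1:]):
        bigrams[(x, y)] += 1
        first[x] += 1     # occurrences of x as the left element
        second[y] += 1    # occurrences of y as the right element

def tp_forward(x, y):
    return bigrams[(x, y)] / first[x]    # P(y | x)

def tp_backward(x, y):
    return bigrams[(x, y)] / second[y]   # P(x | y)

print(tp_forward("ba", "by"), tp_backward("ba", "by"))  # 1.0 1.0 (within word)
print(tp_forward("by", "mil"))                          # 0.5 (across a boundary)
```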
Affiliation(s)
- Katja Stärk
- Language Development Department, Max Planck Institute for Psycholinguistics, Wundtlaan 1, 6525 XD Nijmegen, The Netherlands
- Evan Kidd
- Language Development Department, Max Planck Institute for Psycholinguistics, The Netherlands
- Research School of Psychology, The Australian National University, Australia
- ARC Centre of Excellence for the Dynamics of Language, Australia
- Rebecca L. A. Frost
- Language Development Department, Max Planck Institute for Psycholinguistics, The Netherlands
5
Abstract
Voice modulatory cues such as variations in fundamental frequency, duration and pauses are key factors for structuring vocal signals in human speech and vocal communication in other tetrapods. Voice modulation physiology is highly similar in humans and other tetrapods due to shared ancestry and shared functional pressures for efficient communication. This has led to similarly structured vocalizations across humans and other tetrapods. Nonetheless, in their details, structural characteristics may vary across species and languages. Because data concerning voice modulation in non-human tetrapod vocal production and especially perception are relatively scarce compared to human vocal production and perception, this review focuses on voice modulatory cues used for speech segmentation across human languages, highlighting comparative data where available. Cues that are used similarly across many languages may help indicate which cues may result from physiological or basic cognitive constraints, and which cues may be employed more flexibly and are shaped by cultural evolution. This suggests promising candidates for future investigation of cues to structure in non-human tetrapod vocalizations. This article is part of the theme issue 'Voice modulation: from origin and mechanism to social impact (Part I)'.
Affiliation(s)
- Theresa Matzinger
- Department of Behavioral and Cognitive Biology, University of Vienna, 1030 Vienna, Austria
- Department of English, University of Vienna, 1090 Vienna, Austria
- W. Tecumseh Fitch
- Department of Behavioral and Cognitive Biology, University of Vienna, 1030 Vienna, Austria
- Department of English, University of Vienna, 1090 Vienna, Austria
6
Gilbert AC, Lee JG, Coulter K, Wolpert MA, Kousaie S, Gracco VL, Klein D, Titone D, Phillips NA, Baum SR. Spoken Word Segmentation in First and Second Language: When ERP and Behavioral Measures Diverge. Front Psychol 2021; 12:705668. [PMID: 34603133] [PMCID: PMC8485064] [DOI: 10.3389/fpsyg.2021.705668]
Abstract
Previous studies of word segmentation in a second language have yielded equivocal results. This is not surprising given the differences in the bilingual experience and proficiency of the participants and the varied experimental designs that have been used. The present study tried to account for a number of relevant variables to determine if bilingual listeners are able to use native-like word segmentation strategies. Here, 61 French-English bilingual adults who varied in L1 (French or English) and language dominance took part in an audiovisual integration task while event-related brain potentials (ERPs) were recorded. Participants listened to sentences built around ambiguous syllable strings (which could be disambiguated based on different word segmentation patterns), during which an illustration was presented on screen. Participants were asked to determine if the illustration was related to the heard utterance or not. Each participant listened to both English and French utterances, providing segmentation patterns that included both their native language (used as reference) and their L2. Interestingly, different patterns of results were observed in the event-related potentials (online) and behavioral (offline) results, suggesting that L2 participants showed signs of being able to adapt their segmentation strategies to the specifics of the L2 (online ERP results), but that the extent of the adaptation varied as a function of listeners' language experience (offline behavioral results).
Affiliation(s)
- Annie C Gilbert
- School of Communication Sciences and Disorders, McGill University, Montréal, QC, Canada
- Center for Research on Brain, Language and Music, Montréal, QC, Canada
- Jasmine G Lee
- Center for Research on Brain, Language and Music, Montréal, QC, Canada
- Integrated Program in Neuroscience, McGill University, Montréal, QC, Canada
- Kristina Coulter
- Center for Research on Brain, Language and Music, Montréal, QC, Canada
- Department of Psychology, Concordia University, Montréal, QC, Canada
- Center for Research in Human Development, Montréal, QC, Canada
- Max A Wolpert
- Center for Research on Brain, Language and Music, Montréal, QC, Canada
- Integrated Program in Neuroscience, McGill University, Montréal, QC, Canada
- Shanna Kousaie
- Center for Research on Brain, Language and Music, Montréal, QC, Canada
- Montreal Neurological Institute, McGill University, Montréal, QC, Canada
- School of Psychology, University of Ottawa, Ottawa, ON, Canada
- Vincent L Gracco
- School of Communication Sciences and Disorders, McGill University, Montréal, QC, Canada
- Haskins Laboratories, Yale University, New Haven, CT, United States
- Denise Klein
- Center for Research on Brain, Language and Music, Montréal, QC, Canada
- Montreal Neurological Institute, McGill University, Montréal, QC, Canada
- Debra Titone
- Center for Research on Brain, Language and Music, Montréal, QC, Canada
- Department of Psychology, McGill University, Montréal, QC, Canada
- Natalie A Phillips
- Center for Research on Brain, Language and Music, Montréal, QC, Canada
- Department of Psychology, Concordia University, Montréal, QC, Canada
- Center for Research in Human Development, Montréal, QC, Canada
- Shari R Baum
- School of Communication Sciences and Disorders, McGill University, Montréal, QC, Canada
- Center for Research on Brain, Language and Music, Montréal, QC, Canada
7
Abstract
A prerequisite for spoken language learning is segmenting continuous speech into words. Amongst many possible cues to identify word boundaries, listeners can use both transitional probabilities between syllables and various prosodic cues. However, the relative importance of these cues remains unclear, and previous experiments have not directly compared the effects of contrasting multiple prosodic cues. We used artificial language learning experiments, in which native German-speaking participants extracted meaningless trisyllabic “words” from a continuous speech stream, to evaluate these factors. We compared a baseline condition (statistical cues only) to five test conditions, in which word-final syllables were either (a) followed by a pause, (b) lengthened, (c) shortened, (d) changed to a lower pitch, or (e) changed to a higher pitch. To evaluate robustness and generality we used three tasks varying in difficulty. Overall, pauses and final lengthening were perceived as converging with the statistical cues and facilitated speech segmentation, with pauses helping most. Final-syllable shortening hindered baseline speech segmentation, indicating that when cues conflict, prosodic cues can override statistical cues. Surprisingly, pitch cues had little effect, suggesting that duration may be more relevant for speech segmentation than pitch in our study context. We discuss our findings with regard to the contribution to speech segmentation of language-universal boundary cues vs. language-specific stress patterns.
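The construction of such streams is straightforward to sketch: nonsense words are concatenated in random order with no immediate repetition, so within-word syllable TPs are 1.0 while between-word TPs stay low; the prosodic manipulations then apply to word-final syllables at synthesis time. A toy Python version, with an invented lexicon; the actual experimental materials and constraints were certainly richer:

```python
import random

# Invented lexicon of trisyllabic nonsense words
words = [["tu", "pi", "ro"], ["go", "la", "bu"],
         ["bi", "da", "ku"], ["pa", "do", "ti"]]

def make_stream(n_tokens=100, seed=1):
    """Concatenate word tokens in random order with no immediate repeats,
    so within-word syllable TPs are 1.0 and between-word TPs stay low."""
    rng = random.Random(seed)
    stream, prev = [], None
    for _ in range(n_tokens):
        w = rng.choice([cand for cand in words if cand is not prev])
        stream.extend(w)
        prev = w
    return stream

stream = make_stream()
print(stream[:9])  # first three word tokens
# A cue condition would then lengthen, shorten, pitch-shift, or follow with a
# pause every third (word-final) syllable when synthesizing the audio.
```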
Affiliation(s)
- Theresa Matzinger
- Department of English, University of Vienna, Vienna, Austria
- Department of Behavioral and Cognitive Biology, University of Vienna, Vienna, Austria
- Nikolaus Ritt
- Department of English, University of Vienna, Vienna, Austria
- W Tecumseh Fitch
- Department of Behavioral and Cognitive Biology, University of Vienna, Vienna, Austria
- Cognitive Science Hub, University of Vienna, Vienna, Austria
8
Hu G, Determan SC, Dong Y, Beeve AT, Collins JE, Gai Y. Spectral and Temporal Envelope Cues for Human and Automatic Speech Recognition in Noise. J Assoc Res Otolaryngol 2019; 21:73-87. [PMID: 31758279] [DOI: 10.1007/s10162-019-00737-z]
Abstract
Acoustic features of speech include various spectral and temporal cues. It is known that the temporal envelope plays a critical role in speech recognition by human listeners, while automated speech recognition (ASR) relies heavily on spectral analysis. This study compared sentence-recognition scores of humans and a commercial ASR system (Dragon) when spectral and temporal-envelope cues were manipulated in background noise. Temporal fine structure of meaningful sentences was reduced by noise or tone vocoders. Three types of background noise were introduced: white noise, time-reversed multi-talker noise, and fake-formant noise. Spectral information was manipulated by changing the number of frequency channels. With a 20-dB signal-to-noise ratio (SNR) and four vocoding channels, white noise had a stronger disruptive effect than the fake-formant noise. The same observation with 22 channels was made when SNR was lowered to 0 dB. In contrast, ASR was unable to function with four vocoding channels even with a 20-dB SNR. Its performance was least affected by white noise and most affected by the fake-formant noise. Increasing the number of channels, which improved the spectral resolution, generated non-monotonic behaviors for the ASR with white noise but not with colored noise. The ASR also showed highly improved performance with tone vocoders. It is possible that fake-formant noise affected the software's performance by disrupting spectral cues, whereas white noise affected performance by compromising speech segmentation. Overall, these results suggest that human listeners and ASR utilize different listening strategies in noise.
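Noise and tone vocoding of this sort split the signal into frequency channels, extract each channel's envelope, and use it to modulate a carrier, discarding temporal fine structure. A generic noise-vocoder sketch follows; the filter orders, channel edges, and absence of envelope smoothing are illustrative choices, not the study's exact processing:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, n_channels=4, fmin=100.0, fmax=8000.0):
    """Replace temporal fine structure with band-limited noise carriers,
    keeping each channel's temporal envelope. fmax must stay below fs/2."""
    edges = np.geomspace(fmin, fmax, n_channels + 1)  # log-spaced channel edges
    rng = np.random.default_rng(0)
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))           # channel envelope
        carrier = sosfiltfilt(sos, rng.standard_normal(len(x)))
        out += env * carrier                  # envelope-modulated noise
    return out / (np.max(np.abs(out)) + 1e-12)
```

A tone vocoder is the same construction with a sinusoid at each channel's center frequency in place of the noise carrier.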
Affiliation(s)
- Guangxin Hu
- Biomedical Engineering Department, Saint Louis University, 3007 Lindell Blvd Suite 2007, St Louis, MO, 63103, USA
- Sarah C Determan
- Biomedical Engineering Department, Saint Louis University, 3007 Lindell Blvd Suite 2007, St Louis, MO, 63103, USA
- Yue Dong
- Biomedical Engineering Department, Saint Louis University, 3007 Lindell Blvd Suite 2007, St Louis, MO, 63103, USA
- Alec T Beeve
- Biomedical Engineering Department, Saint Louis University, 3007 Lindell Blvd Suite 2007, St Louis, MO, 63103, USA
- Joshua E Collins
- Biomedical Engineering Department, Saint Louis University, 3007 Lindell Blvd Suite 2007, St Louis, MO, 63103, USA
- Yan Gai
- Biomedical Engineering Department, Saint Louis University, 3007 Lindell Blvd Suite 2007, St Louis, MO, 63103, USA
9
Frost RLA, Monaghan P, Christiansen MH. Mark my words: High frequency marker words impact early stages of language learning. J Exp Psychol Learn Mem Cogn 2019; 45:1883-1898. [PMID: 30652894] [PMCID: PMC6746567] [DOI: 10.1037/xlm0000683]
Abstract
High frequency words have been suggested to benefit both speech segmentation and grammatical categorization of the words around them. Despite utilizing similar information, these tasks are usually investigated separately in studies examining learning. We determined whether including high frequency words in continuous speech could support categorization when words are being segmented for the first time. We familiarized learners with continuous artificial speech comprising repetitions of target words, which were preceded by high-frequency marker words. Crucially, marker words distinguished targets into 2 distributionally defined categories. We measured learning with segmentation and categorization tests and compared performance against a control group that heard the artificial speech without these marker words (i.e., just the targets, with no cues for categorization). Participants segmented the target words from speech in both conditions, but critically when the marker words were present, they influenced acquisition of word-referent mappings in a subsequent transfer task, with participants demonstrating better early learning for mappings that were consistent (rather than inconsistent) with the distributional categories. We propose that high-frequency words may assist early grammatical categorization, while speech segmentation is still being learned.
10
Abstract
Speech and action sequences are continuous streams of information that can be segmented into sub-units. In both domains, this segmentation can be facilitated by perceptual cues contained within the information stream. In speech, prosodic cues (e.g., a pause, pre-boundary lengthening, and pitch rise) mark boundaries between words and phrases, while boundaries between actions of an action sequence can be marked by kinematic cues (e.g., a pause, pre-boundary deceleration). The processing of prosodic boundary cues evokes an event-related potential (ERP) component known as the Closure Positive Shift (CPS), and it is possible that the CPS reflects domain-general cognitive processes involved in segmentation, given that the CPS is also evoked by boundaries between subunits of non-speech auditory stimuli. This study further probed the domain-generality of the CPS and its underlying processes by investigating electrophysiological correlates of the processing of boundary cues in sequences of spoken verbs (auditory stimuli; Experiment 1; N = 23 adults) and actions (visual stimuli; Experiment 2; N = 23 adults). The EEG data from both experiments revealed a CPS-like broadly distributed positivity during the 250 ms prior to the onset of the post-boundary word or action, indicating similar electrophysiological correlates of boundary processing across domains, suggesting that the cognitive processes underlying speech and action segmentation might also be shared.
Affiliation(s)
- Matt Hilton
- Department of Psychology, Cognitive Sciences, University of Potsdam, Potsdam, Germany
- Romy Räling
- Department of Linguistics, Cognitive Sciences, University of Potsdam, Potsdam, Germany
- Isabell Wartenburger
- Department of Linguistics, Cognitive Sciences, University of Potsdam, Potsdam, Germany
- Birgit Elsner
- Department of Psychology, Cognitive Sciences, University of Potsdam, Potsdam, Germany
11
Fló A, Brusini P, Macagno F, Nespor M, Mehler J, Ferry AL. Newborns are sensitive to multiple cues for word segmentation in continuous speech. Dev Sci 2019; 22:e12802. [PMID: 30681763] [DOI: 10.1111/desc.12802]
Abstract
Before infants can learn words, they must identify those words in continuous speech. Yet, the speech signal lacks obvious boundary markers, which poses a potential problem for language acquisition (Swingley, Philos Trans R Soc Lond. Series B, Biol Sci 364(1536), 3617-3632, 2009). By the middle of the first year, infants seem to have solved this problem (Bergelson & Swingley, Proc Natl Acad Sci 109(9), 3253-3258, 2012; Jusczyk & Aslin, Cogn Psychol 29, 1-23, 1995), but it is unknown if segmentation abilities are present from birth, or if they only emerge after sufficient language exposure and/or brain maturation. Here, in two independent experiments, we looked at two cues known to be crucial for the segmentation of human speech: the computation of statistical co-occurrences between syllables and the use of the language's prosody. After a brief familiarization of about 3 min with continuous speech, using functional near-infrared spectroscopy, neonates showed differential brain responses on a recognition test to words that violated either the statistical (Experiment 1) or prosodic (Experiment 2) boundaries of the familiarization, compared to words that conformed to those boundaries. Importantly, word recognition in Experiment 2 occurred even in the absence of prosodic information at test, meaning that newborns encoded the phonological content independently of its prosody. These data indicate that humans are born with operational language processing and memory capacities and can use at least two types of cues to segment otherwise continuous speech, a key first step in language acquisition.
Affiliation(s)
- Ana Fló
- Language, Cognition, and Development Laboratory, Scuola Internazionale di Studi Avanzati, Trieste, Italy
- Cognitive Neuroimaging Unit, Commissariat à l'Energie Atomique (CEA), Institut National de la Santé et de la Recherche Médicale (INSERM) U992, NeuroSpin Center, Gif-sur-Yvette, France
- Perrine Brusini
- Language, Cognition, and Development Laboratory, Scuola Internazionale di Studi Avanzati, Trieste, Italy
- Institute of Psychology Health and Society, University of Liverpool, Liverpool, UK
- Francesco Macagno
- Neonatology Unit, Azienda Ospedaliera Santa Maria della Misericordia, Udine, Italy
- Marina Nespor
- Language, Cognition, and Development Laboratory, Scuola Internazionale di Studi Avanzati, Trieste, Italy
- Jacques Mehler
- Language, Cognition, and Development Laboratory, Scuola Internazionale di Studi Avanzati, Trieste, Italy
- Alissa L Ferry
- Language, Cognition, and Development Laboratory, Scuola Internazionale di Studi Avanzati, Trieste, Italy
- Division of Human Communication, Hearing, and Development, University of Manchester, Manchester, UK
12
Abstract
Research has demonstrated distinct roles for consonants and vowels in speech processing. For example, consonants have been shown to support lexical processes, such as the segmentation of speech based on transitional probabilities (TPs), more effectively than vowels. Theory and data so far, however, have considered only non-tone languages, that is to say, languages that lack contrastive lexical tones. In the present work, we provide a first investigation of the role of consonants and vowels in statistical speech segmentation by native speakers of Cantonese, as well as assessing how tones modulate the processing of vowels. Results show that Cantonese speakers are unable to use statistical cues carried by consonants for segmentation, but they can use cues carried by vowels. This difference becomes more evident when considering tone-bearing vowels. Additional data from speakers of Russian and Mandarin suggest that the ability of Cantonese speakers to segment streams with statistical cues carried by tone-bearing vowels extends to other tone languages, but is much reduced in speakers of non-tone languages.
Affiliation(s)
- David M Gómez
- Institute for Educational Sciences, Universidad de O'Higgins, Chile
- Center for Advanced Research in Education (CIAE), Universidad de Chile, Chile
- Peggy Mok
- Department of Linguistics and Modern Languages, The Chinese University of Hong Kong, Hong Kong
- Mikhail Ordin
- Basque Centre on Cognition, Brain, and Language (BCBL), Spain
- Basque Foundation for Science (IKERBASQUE), Spain
- Jacques Mehler
- Language, Cognition, and Development Lab, International School for Advanced Studies (SISSA), Italy
- Marina Nespor
- Language, Cognition, and Development Lab, International School for Advanced Studies (SISSA), Italy
13
Sidiras C, Iliadou V, Nimatoudis I, Reichenbach T, Bamiou DE. Spoken Word Recognition Enhancement Due to Preceding Synchronized Beats Compared to Unsynchronized or Unrhythmic Beats. Front Neurosci 2017; 11:415. [PMID: 28769752] [PMCID: PMC5513984] [DOI: 10.3389/fnins.2017.00415]
Abstract
The relation between rhythm and language has been investigated over the last decades, with evidence that these share overlapping perceptual mechanisms emerging from several different strands of research. The Dynamic Attending Theory posits that neural entrainment to musical rhythm results in synchronized oscillations in attention, enhancing perception of other events occurring at the same rate. In this study, this prediction was tested in 10-year-old children by means of a psychoacoustic speech-recognition-in-babble paradigm. It was hypothesized that rhythm effects evoked via a short isochronous sequence of beats would provide optimal word recognition in babble when beats and words are in sync. We compared speech-recognition-in-babble performance in the presence of an isochronous, in-sync sequence of beats vs. a non-isochronous or out-of-sync sequence. Results showed that (a) word recognition was best when rhythm and word were in sync, and (b) the effect was not uniform across syllables and the gender of subjects. Our results suggest that pure tone beats affect speech recognition at early levels of sensory or phonemic processing.
Affiliation(s)
- Christos Sidiras
- Clinical Psychoacoustics Laboratory, Neuroscience Division, 3rd Psychiatric Department, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Vasiliki Iliadou
- Clinical Psychoacoustics Laboratory, Neuroscience Division, 3rd Psychiatric Department, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Ioannis Nimatoudis
- Clinical Psychoacoustics Laboratory, Neuroscience Division, 3rd Psychiatric Department, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Tobias Reichenbach
- Department of Bioengineering, Imperial College London, London, United Kingdom
- Doris-Eva Bamiou
- Faculty of Brain Sciences, UCL Ear Institute, University College London, London, United Kingdom
14
Abstract
Anecdotal evidence suggests that unfamiliar languages sound faster than one’s native language. Empirical evidence for this impression has, so far, come from explicit rate judgments. The aim of the present study was to test whether such perceived rate differences between native and foreign languages (FLs) have effects on implicit speech processing. Our measure of implicit rate perception was “normalization for speech rate”: an ambiguous vowel between short /a/ and long /a:/ is interpreted as /a:/ following a fast but as /a/ following a slow carrier sentence. That is, listeners did not judge speech rate itself; instead, they categorized ambiguous vowels whose perception was implicitly affected by the rate of the context. We asked whether a bias towards long /a:/ might be observed when the context is not actually faster but simply spoken in a FL. A fully symmetrical experimental design was used: Dutch and German participants listened to rate matched (fast and slow) sentences in both languages spoken by the same bilingual speaker. Sentences were followed by non-words that contained vowels from an /a-a:/ duration continuum. Results from Experiments 1 and 2 showed a consistent effect of rate normalization for both listener groups. Moreover, for German listeners, across the two experiments, foreign sentences triggered more /a:/ responses than (rate matched) native sentences, suggesting that foreign sentences were indeed perceived as faster. Moreover, this FL effect was modulated by participants’ ability to understand the FL: those participants that scored higher on a FL translation task showed less of a FL effect. However, opposite effects were found for the Dutch listeners. For them, their native rather than the FL induced more /a:/ responses. Nevertheless, this reversed effect could be reduced when additional spectral properties of the context were controlled for. Experiment 3, using explicit rate judgments, replicated the effect for German but not Dutch listeners. We therefore conclude that the subjective impression that FLs sound fast may have an effect on implicit speech processing, with implications for how language learners perceive spoken segments in a FL.
Affiliation(s)
- Hans Rutger Bosker
- Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, Netherlands
- Eva Reinisch
- Institute of Phonetics and Speech Processing, Ludwig Maximilian University of Munich, Munich, Germany
15
Abstract
The identification of words in continuous speech, known as speech segmentation, is a critical early step in language acquisition. This process is partially supported by statistical learning, the ability to extract patterns from the environment. Given that speech segmentation represents a potential bottleneck for language acquisition, patterns in speech may be extracted very rapidly, without extensive exposure. This hypothesis was examined by exposing participants to continuous speech streams composed of novel repeating nonsense words. Learning was measured on-line using a reaction time task. After merely one exposure to an embedded novel word, learners demonstrated significant learning effects, as revealed by faster responses to predictable than to unpredictable syllables. These results demonstrate that learners gained sensitivity to the statistical structure of unfamiliar speech on a very rapid timescale. This ability may play an essential role in early stages of language acquisition, allowing learners to rapidly identify word candidates and "break in" to an unfamiliar language.
16
Kösem A, Basirat A, Azizi L, van Wassenhove V. High-frequency neural activity predicts word parsing in ambiguous speech streams. J Neurophysiol 2016; 116:2497-2512. [PMID: 27605528] [DOI: 10.1152/jn.00074.2016]
Abstract
During speech listening, the brain parses a continuous acoustic stream of information into computational units (e.g., syllables or words) necessary for speech comprehension. Recent neuroscientific hypotheses have proposed that neural oscillations contribute to speech parsing, but whether they do so on the basis of acoustic cues (bottom-up acoustic parsing) or as a function of available linguistic representations (top-down linguistic parsing) is unknown. In this magnetoencephalography study, we contrasted acoustic and linguistic parsing using bistable speech sequences. While listening to the speech sequences, participants were asked to maintain one of the two possible speech percepts through volitional control. We predicted that the tracking of speech dynamics by neural oscillations would not only follow the acoustic properties but also shift in time according to the participant's conscious speech percept. Our results show that the latency of high-frequency activity (specifically, beta and gamma bands) varied as a function of the perceptual report. In contrast, the phase of low-frequency oscillations was not strongly affected by top-down control. Whereas changes in low-frequency neural oscillations were compatible with the encoding of prelexical segmentation cues, high-frequency activity specifically informed on an individual's conscious speech percept.
Affiliation(s)
- Anne Kösem
- Cognitive Neuroimaging Unit, CEA DRF/I2BM, Institut National de la Santé et de la Recherche Médicale, Université Paris-Sud, Université Paris-Saclay, Gif/Yvette, France
- Radboud University, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands
- Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Anahita Basirat
- Cognitive Neuroimaging Unit, CEA DRF/I2BM, Institut National de la Santé et de la Recherche Médicale, Université Paris-Sud, Université Paris-Saclay, Gif/Yvette, France
- SCALab, Centre National de la Recherche Scientifique UMR 9193, Université Lille, Lille, France
- Leila Azizi
- Cognitive Neuroimaging Unit, CEA DRF/I2BM, Institut National de la Santé et de la Recherche Médicale, Université Paris-Sud, Université Paris-Saclay, Gif/Yvette, France
- Virginie van Wassenhove
- Cognitive Neuroimaging Unit, CEA DRF/I2BM, Institut National de la Santé et de la Recherche Médicale, Université Paris-Sud, Université Paris-Saclay, Gif/Yvette, France
17
Tremblay A, Broersma M, Coughlin CE, Choi J. Effects of the Native Language on the Learning of Fundamental Frequency in Second-Language Speech Segmentation. Front Psychol 2016; 7:985. [PMID: 27445943] [PMCID: PMC4925665] [DOI: 10.3389/fpsyg.2016.00985]
Abstract
This study investigates whether the learning of prosodic cues to word boundaries in speech segmentation is more difficult if the native and second/foreign languages (L1 and L2) have similar (though non-identical) prosodies than if they have markedly different prosodies (Prosodic-Learning Interference Hypothesis). It does so by comparing French, Korean, and English listeners' use of fundamental-frequency (F0) rise as a cue to word-final boundaries in French. F0 rise signals phrase-final boundaries in French and Korean but word-initial boundaries in English. Korean-speaking and English-speaking L2 learners of French, who were matched in their French proficiency and French experience, and native French listeners completed a visual-world eye-tracking experiment in which they recognized words whose final boundary was or was not cued by an increase in F0. The results showed that Korean listeners had greater difficulty using F0 rise as a cue to word-final boundaries in French than French and English listeners. This suggests that L1-L2 prosodic similarity can make the learning of an L2 segmentation cue difficult, in line with the proposed Prosodic-Learning Interference Hypothesis. We consider mechanisms that may underlie this difficulty and discuss the implications of our findings for understanding listeners' phonological encoding of L2 words.
18
Abstract
Speech is inextricably multisensory: both auditory and visual components provide critical information for all aspects of speech processing, including speech segmentation, the visual components of which have been the target of a growing number of studies. In particular, a recent study (Mitchel and Weiss, 2014) established that adults can utilize facial cues (i.e., visual prosody) to identify word boundaries in fluent speech. The current study expanded upon these results, using an eye tracker to identify highly attended facial features of the audiovisual display used in Mitchel and Weiss (2014). Subjects spent the most time watching the eyes and mouth. A significant trend in gaze durations was found with the longest gaze duration on the mouth, followed by the eyes and then the nose. In addition, eye-gaze patterns changed across familiarization as subjects learned the word boundaries, showing decreased attention to the mouth in later blocks while attention on other facial features remained consistent. These findings highlight the importance of the visual component of speech processing and suggest that the mouth may play a critical role in visual speech segmentation.
Affiliation(s)
- Laina G Lusk
- Neuroscience Program, Bucknell University, Lewisburg, PA, USA
- Aaron D Mitchel
- Neuroscience Program, Bucknell University, Lewisburg, PA, USA
- Department of Psychology, Bucknell University, Lewisburg, PA, USA
19
Chait M, Greenberg S, Arai T, Simon JZ, Poeppel D. Multi-time resolution analysis of speech: evidence from psychophysics. Front Neurosci 2015; 9:214. [PMID: 26136650] [PMCID: PMC4468943] [DOI: 10.3389/fnins.2015.00214]
Abstract
How speech signals are analyzed and represented remains a foundational challenge both for cognitive science and neuroscience. A growing body of research, employing various behavioral and neurobiological experimental techniques, now points to the perceptual relevance of both phoneme-sized (10-40 Hz modulation frequency) and syllable-sized (2-10 Hz modulation frequency) units in speech processing. However, it is not clear how information associated with such different time scales interacts in a manner relevant for speech perception. We report behavioral experiments on speech intelligibility employing a stimulus that allows us to investigate how distinct temporal modulations in speech are treated separately and whether they are combined. We created sentences in which the slow (~4 Hz; S_low) and rapid (~33 Hz; S_high) modulations - corresponding to ~250 and ~30 ms, the average duration of syllables and certain phonetic properties, respectively - were selectively extracted. Although S_low and S_high have low intelligibility when presented separately, dichotic presentation of S_high with S_low results in supra-additive performance, suggesting a synergistic relationship between low- and high-modulation frequencies. A second experiment desynchronized presentation of the S_low and S_high signals. Desynchronizing signals relative to one another had no impact on intelligibility when delays were less than ~45 ms. Longer delays resulted in a steep intelligibility decline, providing further evidence of integration or binding of information within restricted temporal windows. Our data suggest that human speech perception uses multi-time resolution processing. Signals are concurrently analyzed on at least two separate time scales, the intermediate representations of these analyses are integrated, and the resulting bound percept has significant consequences for speech intelligibility - a view compatible with recent insights from neuroscience implicating multi-timescale auditory processing.
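The slow and rapid components can be pictured as band-limited versions of the temporal envelope. A rough illustration of separating syllable-rate from phoneme-rate envelope modulations follows; this is not the authors' resynthesis procedure (which produced intelligible signals), it only shows the two modulation bands on a placeholder signal:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_band(x, fs, lo, hi):
    """Band-limit the temporal envelope of x to a modulation-rate range."""
    env = np.abs(hilbert(x))                  # broadband temporal envelope
    sos = butter(2, [lo, hi], btype='bandpass', fs=fs, output='sos')
    return sosfiltfilt(sos, env)

fs = 16000
speech = np.random.default_rng(0).standard_normal(fs * 2)  # placeholder signal
slow = modulation_band(speech, fs, 2, 10)    # syllable-sized units (S_low range)
fast = modulation_band(speech, fs, 10, 40)   # phoneme-sized units (S_high range)
```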
Affiliation(s)
- Maria Chait
- Neuroscience and Cognitive Science Program, University of Maryland, College Park, MD, USA
- Department of Linguistics, University of Maryland, College Park, MD, USA
- Takayuki Arai
- Department of Information and Communication Sciences, Sophia University, Tokyo, Japan
- Jonathan Z Simon
- Neuroscience and Cognitive Science Program, University of Maryland, College Park, MD, USA
- Department of Biology, University of Maryland, College Park, MD, USA
- Department of Electrical and Computer Engineering, University of Maryland, College Park, MD, USA
- Institute for Systems Research, University of Maryland, College Park, MD, USA
- David Poeppel
- Neuroscience and Cognitive Science Program, University of Maryland, College Park, MD, USA
- Department of Linguistics, University of Maryland, College Park, MD, USA
- Department of Psychology, New York University, New York, NY, USA
- Department of Neuroscience, Max-Planck-Institute, Frankfurt, Germany
20
Peñaloza C, Benetello A, Tuomiranta L, Heikius IM, Järvinen S, Majos MC, Cardona P, Juncadella M, Laine M, Martin N, Rodríguez-Fornells A. Speech segmentation in aphasia. Aphasiology 2014; 29:724-743. [PMID: 28824218] [PMCID: PMC5560767] [DOI: 10.1080/02687038.2014.982500]
Abstract
BACKGROUND: Speech segmentation is one of the initial and mandatory phases of language learning. Although some people with aphasia have shown a preserved ability to learn novel words, their speech segmentation abilities have not been explored.
AIMS: We examined the ability of individuals with chronic aphasia to segment words from running speech via statistical learning. We also explored the relationships of speech segmentation with aphasia severity and short-term memory capacity, and further examined the role of lesion location in speech segmentation and short-term memory performance.
METHODS & PROCEDURES: The experimental task was first validated with a group of young adults (n = 120). Participants with chronic aphasia (n = 14) were exposed to an artificial language and were evaluated in their ability to segment words using a speech segmentation test. Their performance was contrasted against chance level and compared to that of a group of elderly matched controls (n = 14) using group and case-by-case analyses.
OUTCOMES & RESULTS: As a group, participants with aphasia were significantly above chance level in their ability to segment words from the novel language and did not significantly differ from the group of elderly controls. Speech segmentation ability in the aphasic participants was not associated with aphasia severity, although it significantly correlated with word pointing span, a measure of verbal short-term memory. Case-by-case analyses identified four individuals with aphasia who performed above chance level on the speech segmentation task, all with predominantly posterior lesions and mild fluent aphasia. Their short-term memory capacity was also better preserved than in the rest of the group.
CONCLUSIONS: Our findings indicate that speech segmentation via statistical learning can remain functional in people with chronic aphasia and suggest that this initial language learning mechanism is associated with the functionality of the verbal short-term memory system and the integrity of the left inferior frontal region.
Affiliation(s)
- Claudia Peñaloza
- Cognition and Brain Plasticity Group, Bellvitge Biomedical Research Institute – IDIBELL, Barcelona, Spain
- Annalisa Benetello
- Department of Communication Sciences and Disorders, Eleanor M. Saffran Center for Cognitive Neuroscience, Temple University, Philadelphia, PA, USA
- Department of Psychology, University of Milano-Bicocca, Milan, Italy
- Leena Tuomiranta
- Department of Psychology and Logopedics, Åbo Akademi University, Turku, Finland
- Ida-Maria Heikius
- Department of Psychology and Logopedics, Åbo Akademi University, Turku, Finland
- Sonja Järvinen
- Department of Psychology and Logopedics, Åbo Akademi University, Turku, Finland
- Maria Carmen Majos
- Hospital Universitari de Bellvitge (HUB), Rehabilitation Section, Campus Bellvitge, University of Barcelona, Barcelona, Spain
- Pedro Cardona
- Hospital Universitari de Bellvitge (HUB), Neurology Section, Campus Bellvitge, University of Barcelona, Barcelona, Spain
- Montserrat Juncadella
- Hospital Universitari de Bellvitge (HUB), Neurology Section, Campus Bellvitge, University of Barcelona, Barcelona, Spain
- Matti Laine
- Department of Psychology and Logopedics, Åbo Akademi University, Turku, Finland
- Nadine Martin
- Department of Communication Sciences and Disorders, Eleanor M. Saffran Center for Cognitive Neuroscience, Temple University, Philadelphia, PA, USA
- Antoni Rodríguez-Fornells
- Cognition and Brain Plasticity Group, Bellvitge Biomedical Research Institute – IDIBELL, Barcelona, Spain
- Department of Basic Psychology, Campus Bellvitge, University of Barcelona, Barcelona, Spain
- Institució Catalana de Recerca i Estudis Avançats, ICREA, Barcelona, Spain
21
Abstract
Recent behavioral and electrophysiological evidence has highlighted the long-term importance for language skills of an early ability to recognize words in continuous speech. We here present further tests of this long-term link in the form of follow-up studies conducted with two (separate) groups of infants who had earlier participated in speech segmentation tasks. Each study extends prior follow-up tests: Study 1 by using a novel follow-up measure that taps into online processing, Study 2 by assessing language performance relationships over a longer time span than previously tested. Results of Study 1 show that brain correlates of speech segmentation ability at 10 months are positively related to 16-month-olds' target fixations in a looking-while-listening task. Results of Study 2 show that infant speech segmentation ability no longer directly predicts language profiles at the age of five. However, a meta-analysis across our results and those of similar studies (Study 3) reveals that age at follow-up does not moderate effect size. Together, the results suggest that infants' ability to recognize words in speech certainly benefits early vocabulary development; further observed relationships of later language skills to early word recognition may be consequent upon this vocabulary size effect.
Affiliation(s)
- Caroline Junge
- Utrecht University, Heidelberglaan 1, 3584 CS Utrecht, The Netherlands
- Anne Cutler
- MARCS Institute, University of Western Sydney, Locked Bag 1797, Penrith, NSW 2751, Australia
22
Breen M, Dilley LC, McAuley JD, Sanders LD. Auditory evoked potentials reveal early perceptual effects of distal prosody on speech segmentation. Lang Cogn Neurosci 2014; 29:1132-1146. [PMID: 29911124] [PMCID: PMC5998818] [DOI: 10.1080/23273798.2014.894642]
Abstract
Prosodic context several syllables prior (i.e., distal) to an ambiguous word boundary influences speech segmentation. To assess whether distal prosody influences early perceptual processing or later lexical competition, EEG was recorded while subjects listened to eight-syllable sequences with ambiguous word boundaries for the last four syllables (e.g., tie murder bee vs. timer derby). Pitch and duration of the first 5 syllables were manipulated to induce sequence segmentation with either a monosyllabic or disyllabic final word. Behavioral results confirmed a successful manipulation. Moreover, penultimate syllables (e.g., der) elicited a larger anterior positivity 200-500 ms after onset for prosodic contexts predicted to induce word-initial perception of these syllables. Final syllables (e.g., bee) elicited a similar anterior positivity in the context predicted to induce word-initial perception of these syllables. Additionally, these final syllables elicited a larger positive-to-negative deflection (P1-N1) 60-120 ms after onset, and a larger N400. The finding that prosodic characteristics of speech several syllables prior to ambiguous word boundaries modulate both early and late ERPs elicited by subsequent syllable onsets provides evidence that distal prosody influences early perceptual processing, and later lexical competition.
Affiliation(s)
- Mara Breen
- Mount Holyoke College, Department of Psychology and Education
- University of Massachusetts, Department of Psychology
- Laura C. Dilley
- Michigan State University, Department of Communicative Sciences and Disorders
- Michigan State University, Department of Psychology
- Michigan State University, Department of Linguistics and Germanic, Slavic, Asian, and African Languages
23
Abstract
Speech is typically a multimodal phenomenon, yet few studies have focused on the exclusive contributions of visual cues to language acquisition. To address this gap, we investigated whether visual prosodic information can facilitate speech segmentation. Previous research has demonstrated that language learners can use lexical stress and pitch cues to segment speech and that learners can extract this information from talking faces. Thus, we created an artificial speech stream that contained minimal segmentation cues and paired it with two synchronous facial displays in which visual prosody was either informative or uninformative for identifying word boundaries. Across three familiarisation conditions (audio stream alone, facial streams alone, and paired audiovisual), learning occurred only when the facial displays were informative to word boundaries, suggesting that facial cues can help learners solve the early challenges of language acquisition.
Affiliation(s)
- Aaron D. Mitchel
- Department of Psychology, Bucknell University, Lewisburg, PA 17837, USA
- Daniel J. Weiss
- Department of Psychology and Program in Linguistics, The Pennsylvania State University, 643 Moore Building, University Park, PA 16802, USA
24
Abstract
The ability to extract word forms from continuous speech is a prerequisite for constructing a vocabulary and emerges in the first year of life. Electrophysiological (ERP) studies of speech segmentation by 9- to 12-month-old listeners in several languages have found a left-localized negativity linked to word onset as a marker of word detection. We report an ERP study showing significant evidence of speech segmentation in Dutch-learning 7-month-olds. In contrast to the left-localized negative effect reported with older infants, the observed overall mean effect had a positive polarity. Inspection of individual results revealed two participant subgroups: a majority showing a positive-going response, and a minority showing the left negativity observed in older age groups. We retested participants at age three on vocabulary comprehension and on word and sentence production. On every test, children who at 7 months had shown the negativity associated with segmentation of words from speech outperformed those who had produced positive-going brain responses to the same input. The earlier that infants show the left-localized brain responses typically indicating detection of words in speech, the better their early childhood language skills.
Affiliation(s)
- Valesca Kooijman
- Food and Biobased Research, Wageningen University and Research Centre, Wageningen, Netherlands
25
White L, Mattys SL, Wiget L. Segmentation cues in conversational speech: robust semantics and fragile phonotactics. Front Psychol 2012; 3:375. [PMID: 23060839 PMCID: PMC3464055 DOI: 10.3389/fpsyg.2012.00375] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2012] [Accepted: 09/12/2012] [Indexed: 11/13/2022] Open
Abstract
Multiple cues influence listeners' segmentation of connected speech into words, but most previous studies have used stimuli elicited in careful readings rather than natural conversation. Discerning word boundaries in conversational speech may differ from doing so with laboratory speech. In particular, a speaker's articulatory effort (hyperarticulation vs. hypoarticulation, H&H) may vary according to communicative demands, suggesting a compensatory relationship whereby acoustic-phonetic cues are attenuated when other information sources strongly guide segmentation. We examined how listeners' interpretation of segmentation cues is affected by speech style (spontaneous conversation vs. read speech), using cross-modal identity priming. To elicit spontaneous stimuli, we used a map task in which speakers discussed routes around stylized landmarks. These landmarks were two-word phrases in which the strength of potential segmentation cues (semantic likelihood and cross-boundary diphone phonotactics) was systematically varied. Landmark-carrying utterances were transcribed and later re-recorded as read speech. Independent of speech style, we found an interaction between cue valence (favorable/unfavorable) and cue type (phonotactics/semantics): there was an effect of semantic plausibility but no effect of cross-boundary phonotactics, indicating that the importance of phonotactic cues to segmentation may have been overstated in studies where lexical information was artificially suppressed. These patterns were unaffected by whether the stimuli were elicited in a spontaneous or read context, even though the difference in speech styles was evident in a main effect. Durational analyses suggested speaker-driven cue trade-offs congruent with an H&H account, but these modulations did not affect listener behavior. We conclude that previous research exploiting read speech is reliable in indicating the primacy of lexically based cues in the segmentation of natural conversational speech.
26
Yurovsky D, Yu C, Smith LB. Statistical speech segmentation and word learning in parallel: scaffolding from child-directed speech. Front Psychol 2012; 3:374. [PMID: 23162487 PMCID: PMC3498894 DOI: 10.3389/fpsyg.2012.00374] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2012] [Accepted: 09/11/2012] [Indexed: 11/29/2022] Open
Abstract
In order to acquire their native languages, children must learn richly structured systems with regularities at multiple levels. While structure at different levels could be learned serially, e.g., speech segmentation coming before word-object mapping, redundancies across levels make parallel learning more efficient. For instance, a series of syllables is likely to be a word not only because of high transitional probabilities, but also because of a consistently co-occurring object. But additional statistics require additional processing, and thus might not be useful to cognitively constrained learners. We show that the structure of child-directed speech makes simultaneous speech segmentation and word learning tractable for human learners. First, a corpus of child-directed speech was recorded from parents and children engaged in a naturalistic free-play task. Analyses revealed two consistent regularities in the sentence structure of naming events. These regularities were subsequently encoded in an artificial language to which adult participants were exposed in the context of simultaneous statistical speech segmentation and word learning. Either regularity was independently sufficient to support successful learning, but no learning occurred in the absence of both regularities. Thus, the structure of child-directed speech plays an important role in scaffolding speech segmentation and word learning in parallel.
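For readers unfamiliar with the transitional-probability (TP) statistic at the core of this paradigm, the sketch below illustrates it in Python. Everything here is an illustrative assumption rather than material from the study: the three-"word" lexicon (pabiku, tibudo, golatu), the 0.5 boundary threshold, the stream length, and the helper names transitional_probabilities and segment are all invented for this example.

```python
# Minimal sketch of transitional-probability (TP) speech segmentation:
# TP(A -> B) = count(A followed by B) / count(A). Syllable pairs inside a
# word have high TPs; pairs spanning a word boundary have low TPs.
import random
from collections import Counter

def transitional_probabilities(syllables):
    # Count adjacent pairs and normalize by the frequency of the first member.
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

def segment(syllables, threshold=0.5):
    # Posit a word boundary wherever the TP between adjacent syllables
    # falls below the (illustrative) threshold.
    tps = transitional_probabilities(syllables)
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tps[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# Build a continuous stream from an invented three-word artificial lexicon.
random.seed(1)
lexicon = ["pabiku", "tibudo", "golatu"]
stream = []
for _ in range(90):
    word = random.choice(lexicon)
    stream.extend(word[i:i + 2] for i in range(0, len(word), 2))

print(sorted(set(segment(stream))))  # recovers golatu, pabiku, tibudo
```

With randomly ordered words, within-word TPs are 1.0 while cross-boundary TPs hover around 1/3, so the threshold cleanly recovers the lexicon. The abstract's point is that a consistently co-occurring object supplies a second, redundant cue on top of exactly this statistic, which is what makes parallel segmentation and word learning tractable.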
Affiliation(s)
- Daniel Yurovsky
- Department of Psychology, Stanford University, Stanford, CA, USA
- Chen Yu
- Department of Psychological and Brain Sciences and Program in Cognitive Science, Indiana University, Bloomington, IN, USA
- Linda B. Smith
- Department of Psychological and Brain Sciences and Program in Cognitive Science, Indiana University, Bloomington, IN, USA
27
Abstract
Linguistic stress and sequential statistical cues to word boundaries interact during speech segmentation in infancy. However, little is known about how the different acoustic components of stress constrain statistical learning. The current studies were designed to investigate whether intensity and duration each function independently as cues to initial prominence (trochaic-based hypothesis) or whether, as predicted by the Iambic-Trochaic Law (ITL), intensity and duration have characteristic and separable effects on rhythmic grouping (ITL-based hypothesis) in a statistical learning task. Infants were familiarized with an artificial language (Experiments 1 & 3) or a tone stream (Experiment 2) in which there was an alternation in either intensity or duration. In addition to potential acoustic cues, the familiarization sequences also contained statistical cues to word boundaries. In speech (Experiment 1) and non-speech (Experiment 2) conditions, 9-month-old infants demonstrated discrimination patterns consistent with an ITL-based hypothesis: intensity signaled initial prominence and duration signaled final prominence. The results of Experiment 3, in which 6.5-month-old infants were familiarized with the speech streams from Experiment 1, suggest that there is a developmental change in infants' willingness to treat increased duration as a cue to word offsets in fluent speech. Infants' perceptual systems interact with linguistic experience to constrain how infants learn from their auditory environment.
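The ITL-based hypothesis tested in this abstract can be stated as a compact grouping rule, sketched below. The function itl_group, the boolean prominence marks, and the toy syllable stream are hypothetical simplifications for illustration, not the study's stimuli or analysis.

```python
# Illustrative sketch of the Iambic-Trochaic Law (ITL) grouping rule: elements
# prominent in INTENSITY are heard as group-initial (trochaic grouping), while
# elements prominent in DURATION are heard as group-final (iambic grouping).

def itl_group(elements, prominent, cue):
    """Group a stream into chunks under the ITL.

    elements:  the syllables or tones, in order.
    prominent: parallel list of booleans marking the acoustically
               prominent elements (alternating, as in the experiments).
    cue:       "intensity" (prominent element opens a chunk) or
               "duration"  (prominent element closes a chunk).
    """
    groups, current = [], []
    for element, is_prominent in zip(elements, prominent):
        if cue == "intensity" and is_prominent and current:
            groups.append(current)      # loud element starts a new chunk
            current = []
        current.append(element)
        if cue == "duration" and is_prominent:
            groups.append(current)      # long element ends the current chunk
            current = []
    if current:
        groups.append(current)
    return groups

stream = ["ba", "di", "ba", "di", "ba", "di"]
alternating = [True, False, True, False, True, False]
print(itl_group(stream, alternating, "intensity"))  # [['ba', 'di']] * 3
print(itl_group(stream, alternating, "duration"))   # [['ba'], ['di', 'ba'], ['di', 'ba'], ['di']]
```

The same alternating stream thus yields prominence-initial pairs under an intensity cue and prominence-final pairs under a duration cue, which is the asymmetry the 9-month-olds' discrimination patterns were consistent with.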
Affiliation(s)
- Jessica F Hay
- Department of Psychology, University of Tennessee, Knoxville
28
Abstract
It is currently unknown whether statistical learning is supported by modality-general or modality-specific mechanisms. One issue within this debate concerns the independence of learning in one modality from learning in other modalities. In the present study, the authors examined the extent to which statistical learning across modalities is independent by simultaneously presenting learners with auditory and visual streams. After establishing baseline rates of learning for each stream independently, they systematically varied the amount of audiovisual correspondence across 3 experiments. They found that learners were able to segment both streams successfully only when the boundaries of the audio and visual triplets were in alignment. This pattern of results suggests that learners are able to extract multiple statistical regularities across modalities provided that there is some degree of cross-modal coherence. They discuss the implications of their results in light of recent claims that multisensory statistical learning is guided by modality-independent mechanisms.
Affiliation(s)
- Aaron D Mitchel
- Department of Psychology and Program in Linguistics, Pennsylvania State University, 643 Moore Building, University Park, PA 16802, USA.
29
Schmidt-Kassow M, Roncaglia-Denissen MP, Kotz SA. Why pitch sensitivity matters: event-related potential evidence of metric and syntactic violation detection among Spanish late learners of German. Front Psychol 2011; 2:131. [PMID: 21734898 PMCID: PMC3120976 DOI: 10.3389/fpsyg.2011.00131] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2011] [Accepted: 06/05/2011] [Indexed: 11/13/2022] Open
Abstract
Event-related potential (ERP) data in monolingual German speakers have shown that sentential metric expectancy violations elicit a biphasic ERP pattern consisting of an anterior negativity and a posterior positivity (P600). This pattern is comparable to that elicited by syntactic violations. However, proficient French late learners of German do not detect violations of metric expectancy in German, and they show qualitatively and quantitatively different ERP responses to metric and syntactic violations. We followed up on the questions of whether (1) the latter finding results from a potential insensitivity to pitch cues in speech segmentation among French speakers, or (2) the result is founded in rhythmic differences between the languages. We therefore tested Spanish late learners of German, as Spanish, contrary to French, uses pitch as a segmentation cue even though the basic segmentation unit is the same in both languages (the syllable). We report ERP responses showing that Spanish L2 learners are sensitive to syntactic as well as metric violations in German sentences, independent of attention to the task, as reflected in a P600 response. Overall, their behavioral performance resembles that of German native speakers. The current data suggest that Spanish L2 learners are able to extract metric units (trochees) in their L2 (German) even though their basic segmentation unit in Spanish is the syllable. In addition, Spanish L2 learners of German, in contrast to French L2 learners, are sensitive to syntactic violations, indicating a tight link between syntactic and metric competence. This finding emphasizes the relevant role of metric cues not only in L2 prosodic processing but also in syntactic processing.
Affiliation(s)
- Maren Schmidt-Kassow
- Institute of Medical Psychology, Goethe University Frankfurt, Frankfurt am Main, Germany
30
Rodríguez-Fornells A, Cunillera T, Mestres-Missé A, de Diego-Balaguer R. Neurophysiological mechanisms involved in language learning in adults. Philos Trans R Soc Lond B Biol Sci 2009; 364:3711-35. [PMID: 19933142 PMCID: PMC2846313 DOI: 10.1098/rstb.2009.0130] [Citation(s) in RCA: 136] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Little is known about the brain mechanisms involved in word learning during infancy and in second language acquisition, or about the way these new words become stable representations that sustain language processing. In several studies we have adopted the human simulation perspective, studying the effects of brain lesions and combining different neuroimaging techniques, such as event-related potentials and functional magnetic resonance imaging, in order to examine the language learning (LL) process. In the present article, we review this evidence focusing on how different brain signatures relate to (i) the extraction of words from speech, (ii) the discovery of their embedded grammatical structure, and (iii) how meaning derived from verbal contexts can inform us about the cognitive mechanisms underlying the learning process. We compile these findings and frame them into an integrative neurophysiological model that tries to delineate the major neural networks that might be involved in the initial stages of LL. Finally, we propose that LL simulations can help us to understand natural language processing and how the recovery from language disorders in infants and adults can be accomplished.
31
Sanders LD, Neville HJ, Woldorff MG. Speech segmentation by native and non-native speakers: the use of lexical, syntactic, and stress-pattern cues. J Speech Lang Hear Res 2002; 45:519-530. [PMID: 12069004 PMCID: PMC2532534 DOI: 10.1044/1092-4388(2002/041)] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/23/2023]
Abstract
Varying degrees of plasticity in different subsystems of language have been demonstrated by studies showing that some aspects of language are processed similarly by native speakers and late learners, whereas other aspects are processed differently by the two groups. The study of speech segmentation provides a means by which the ability to process different types of linguistic information can be measured within the same task, because lexical, syntactic, and stress-pattern information can all indicate where one word ends and the next begins in continuous speech. In this study, native Japanese and native Spanish late learners of English (as well as near-monolingual Japanese and Spanish speakers) were asked to determine whether specific sounds fell at the beginning or in the middle of words in English sentences. Like native English speakers, late learners employed lexical information to perform the segmentation task. However, nonnative speakers did not use syntactic information to the same extent as native English speakers. Although both groups of late learners of English used stress pattern as a segmentation cue, the extent to which this cue was relied upon depended on the stress-pattern characteristics of their native language. These findings support the hypothesis that learning a second language later in life has differential effects on subsystems within language.