1. Simantiraki O, Wagner AE, Cooke M. The impact of speech type on listening effort and intelligibility for native and non-native listeners. Front Neurosci 2023; 17:1235911. PMID: 37841688; PMCID: PMC10568627; DOI: 10.3389/fnins.2023.1235911.
Abstract
Listeners are routinely exposed to many different types of speech, including artificially-enhanced and synthetic speech, styles which deviate to a greater or lesser extent from naturally-spoken exemplars. While the impact of differing speech types on intelligibility is well-studied, it is less clear how such types affect cognitive processing demands, and in particular whether those speech forms with the greatest intelligibility in noise have a commensurately lower listening effort. The current study measured intelligibility, self-reported listening effort, and a pupillometry-based measure of cognitive load for four distinct types of speech: (i) plain, i.e., natural unmodified speech; (ii) Lombard speech, a naturally-enhanced form which occurs when speaking in the presence of noise; (iii) artificially-enhanced speech, which involves spectral shaping and dynamic range compression; and (iv) speech synthesized from text. In the first experiment, a cohort of 26 native listeners responded to the four speech types in three levels of speech-shaped noise. In a second experiment, 31 non-native listeners underwent the same procedure at more favorable signal-to-noise ratios, chosen since second-language listening in noise has a more detrimental effect on intelligibility than listening in a first language. For both native and non-native listeners, artificially-enhanced speech was the most intelligible and led to the lowest subjective effort ratings, while the reverse was true for synthetic speech. However, pupil data suggested that Lombard speech elicited the lowest processing demands overall. These outcomes indicate that the relationship between intelligibility and cognitive processing demands is not a simple inverse, but is mediated by speech type. The findings of the current study motivate the search for speech modification algorithms that are optimized for both intelligibility and listening effort.
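As an editorial illustration of the noise manipulation described in this abstract, the following minimal Python sketch shows one common way to mix speech with speech-shaped noise at a target signal-to-noise ratio. It is not the study's own code, and all function and variable names are hypothetical.

```python
# Minimal sketch (not the study's code): mixing a speech signal with
# (speech-shaped) noise at a target SNR in dB. All names are hypothetical.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    noise = noise[:len(speech)]                  # assume noise is at least as long as speech
    p_speech = np.mean(speech.astype(float) ** 2)
    p_noise = np.mean(noise.astype(float) ** 2)
    # We want p_speech / (gain**2 * p_noise) == 10 ** (snr_db / 10)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

# Example: a hypothetical -3 dB condition
# mixed = mix_at_snr(speech_samples, speech_shaped_noise, snr_db=-3.0)
```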
Affiliation(s)
- Olympia Simantiraki: Institute of Applied and Computational Mathematics, Foundation for Research & Technology-Hellas, Heraklion, Greece
- Anita E. Wagner: Department of Otorhinolaryngology/Head and Neck Surgery, University Medical Center Groningen, University of Groningen, Groningen, Netherlands
- Martin Cooke: Ikerbasque (Basque Science Foundation), Vitoria-Gasteiz, Spain
2. Kaur N, Singh P. Conventional and contemporary approaches used in text to speech synthesis: a review. Artif Intell Rev 2022. DOI: 10.1007/s10462-022-10315-0.
3. Cao B, Wisler A, Wang J. Speaker Adaptation on Articulation and Acoustics for Articulation-to-Speech Synthesis. Sensors (Basel) 2022; 22:6056. PMID: 36015817; PMCID: PMC9416444; DOI: 10.3390/s22166056.
Abstract
Silent speech interfaces (SSIs) convert non-audio bio-signals, such as articulatory movement, to speech. This technology has the potential to recover the speech ability of individuals who have lost their voice but can still articulate (e.g., laryngectomees). Articulation-to-speech (ATS) synthesis is an SSI algorithm design that has the advantages of easy implementation and low latency, and is therefore becoming more popular. Current ATS studies focus on speaker-dependent (SD) models to avoid large variations of articulatory patterns and acoustic features across speakers. However, these designs are limited by the small data size available from individual speakers. Speaker adaptation designs that include multiple speakers' data have the potential to address the issue of limited data size from single speakers; however, few prior studies have investigated their performance in ATS. In this paper, we investigated speaker adaptation on both the input articulation and the output acoustic signals (with or without direct inclusion of data from test speakers) using the publicly available electromagnetic articulography (EMA) dataset. We used Procrustes matching and voice conversion for articulation and voice adaptation, respectively. The performance of the ATS models was measured objectively by mel-cepstral distortion (MCD). The synthetic speech samples were generated and are provided in the supplementary material. The results demonstrated the improvement brought by both Procrustes matching and voice conversion on speaker-independent ATS. With the direct inclusion of target speaker data in the training process, speaker-adaptive ATS achieved performance comparable to speaker-dependent ATS. To our knowledge, this is the first study to demonstrate that speaker-adaptive ATS can achieve performance that is not statistically different from speaker-dependent ATS.
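For readers unfamiliar with the objective measure named in this abstract, the sketch below computes a standard form of mel-cepstral distortion (MCD) between time-aligned reference and synthesized mel-cepstral frames. It is an illustrative approximation, not the paper's implementation, and assumes the frames have already been aligned (e.g., by dynamic time warping).

```python
# Illustrative MCD computation (not the paper's code). Inputs are
# (n_frames, n_coefficients) arrays of time-aligned mel-cepstra; the 0th
# (energy) coefficient is conventionally excluded from the distortion.
import numpy as np

def mel_cepstral_distortion(ref_mcep, syn_mcep):
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]             # drop c0
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(per_frame)    # average MCD in dB
```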
Affiliation(s)
- Beiming Cao: Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712, USA; Department of Speech, Language, and Hearing Sciences, University of Texas at Austin, Austin, TX 78712, USA
- Alan Wisler: Department of Mathematics and Statistics, Utah State University, Logan, UT 84322, USA
- Jun Wang: Department of Speech, Language, and Hearing Sciences, University of Texas at Austin, Austin, TX 78712, USA; Department of Neurology, Dell Medical School, University of Texas at Austin, Austin, TX 78712, USA
4. Zhu X, Xue L. Building a controllable expressive speech synthesis system with multiple emotion strengths. Cogn Syst Res 2020. DOI: 10.1016/j.cogsys.2019.09.009.
5. Language-independent acoustic cloning of HTS voices. Comput Speech Lang 2019. DOI: 10.1016/j.csl.2018.12.006.
6. Bouchard KE, Conant DF, Anumanchipalli GK, Dichter B, Chaisanguanthum KS, Johnson K, Chang EF. High-Resolution, Non-Invasive Imaging of Upper Vocal Tract Articulators Compatible with Human Brain Recordings. PLoS One 2016; 11:e0151327. PMID: 27019106; PMCID: PMC4809489; DOI: 10.1371/journal.pone.0151327.
Abstract
A complete neurobiological understanding of speech motor control requires determination of the relationship between simultaneously recorded neural activity and the kinematics of the lips, jaw, tongue, and larynx. Many speech articulators are internal to the vocal tract, and therefore simultaneously tracking the kinematics of all articulators is nontrivial, especially in the context of human electrophysiology recordings. Here, we describe a noninvasive, multi-modal imaging system to monitor vocal tract kinematics, demonstrate this system in six speakers during production of nine American English vowels, and provide new analysis of such data. Classification and regression analysis revealed considerable variability in the articulator-to-acoustic relationship across speakers. Non-negative matrix factorization extracted basis sets capturing vocal tract shapes, allowing for higher vowel classification accuracy than traditional methods. Statistical speech synthesis generated speech from vocal tract measurements, and we demonstrate perceptual identification of the synthesized speech. We demonstrate the capacity to predict lip kinematics from ventral sensorimotor cortical activity. These results demonstrate a multi-modal system to non-invasively monitor articulator kinematics during speech production, describe novel analytic methods for relating kinematic data to speech acoustics, and provide the first decoding of speech kinematics from electrocorticography. These advances will be critical for understanding the cortical basis of speech production and the creation of vocal prosthetics.
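As a rough illustration of the factorization-plus-classification approach this abstract describes, the sketch below applies non-negative matrix factorization to a placeholder matrix of non-negative vocal-tract shape features and classifies vowels from the resulting activations. It is not the authors' pipeline; the data, feature dimensions, and component count are invented for the example.

```python
# Illustrative sketch (not the authors' pipeline): NMF basis extraction from
# non-negative vocal-tract shape features, followed by vowel classification
# from the per-frame activations. The data below are random placeholders.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((900, 40))                   # (frames, shape features), non-negative
y = rng.integers(0, 9, size=900)            # labels for nine American English vowels

nmf = NMF(n_components=8, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                    # per-frame activations of the basis shapes
H = nmf.components_                         # basis set capturing vocal-tract shapes

clf = LogisticRegression(max_iter=1000)
print("Vowel accuracy from NMF activations:", cross_val_score(clf, W, y, cv=5).mean())
```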
Affiliation(s)
- Kristofer E. Bouchard: Biological Systems and Engineering Division & Computational Research Division, Lawrence Berkeley National Laboratories (LBNL), Berkeley, California, United States of America; Department of Neurological Surgery, University of California San Francisco (UCSF), San Francisco, California, United States of America
- David F. Conant: Department of Neurological Surgery, University of California San Francisco (UCSF), San Francisco, California, United States of America; Center for Integrative Neuroscience, UCSF, San Francisco, California, United States of America
- Gopala K. Anumanchipalli: Department of Neurological Surgery, University of California San Francisco (UCSF), San Francisco, California, United States of America; Center for Integrative Neuroscience, UCSF, San Francisco, California, United States of America
- Benjamin Dichter: Department of Neurological Surgery, University of California San Francisco (UCSF), San Francisco, California, United States of America; Center for Integrative Neuroscience, UCSF, San Francisco, California, United States of America
- Kris S. Chaisanguanthum: Department of Neurological Surgery, University of California San Francisco (UCSF), San Francisco, California, United States of America; Center for Integrative Neuroscience, UCSF, San Francisco, California, United States of America
- Keith Johnson: Department of Linguistics, University of California, Berkeley (UCB), California, United States of America
- Edward F. Chang: Department of Neurological Surgery, University of California San Francisco (UCSF), San Francisco, California, United States of America; Center for Integrative Neuroscience, UCSF, San Francisco, California, United States of America
7. Mills T, Bunnell HT, Patel R. Towards Personalized Speech Synthesis for Augmentative and Alternative Communication. Augment Altern Commun 2014; 30:226-36. DOI: 10.3109/07434618.2014.924026.
8. Picart B, Drugman T, Dutoit T. HMM-based speech synthesis with various degrees of articulation: A perceptual study. Neurocomputing 2014. DOI: 10.1016/j.neucom.2012.10.040.
10. Dines J, Liang H, Saheer L, Gibson M, Byrne W, Oura K, Tokuda K, Yamagishi J, King S, Wester M, Hirsimäki T, Karhila R, Kurimo M. Personalising speech-to-speech translation: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. Comput Speech Lang 2013. DOI: 10.1016/j.csl.2011.08.003.