1. Henriksen N, Greenley S, Galvano A. Sociophonetic Investigation of the Spanish Alveolar Trill /r/ in Two Canonical-Trill Varieties. Language and Speech 2023;66:896-934. [PMID: 36573543] [DOI: 10.1177/00238309221137326]
Abstract
The "hyper-variation" present in rhotic sounds makes them particularly apt for sociophonetic research. This paper investigates the variable realization of the voiced alveolar-trill phoneme /r/ through an acoustic analysis of unscripted speech produced by 80 speakers of Spanish. Although the most common phonetic variant of /r/ contained two lingual constrictions, we find substantial inter-speaker variation in our data, ranging from zero to five lingual contacts. The results demonstrate that the variation in Spanish results from a systematic interaction of factors, deriving from well-documented processes of consonantal lenition (e.g., weakening in unstressed syllables) as well as from processes inherent to the trill's articulation (e.g., high-vowel antagonism). Importantly, speaker sex displayed the strongest effect among all predictors, which leads us to consider the role of sociolinguistic factors, in addition to possible biomechanical differences, in /r/ production. We contextualize the findings within a literature that theorizes rhotic consonants as a single class of sounds despite remarkable patterns of cross-language and speaker-specific variation.

2. Domain-Adversarial Based Model with Phonological Knowledge for Cross-Lingual Speech Recognition. Electronics 2021. [DOI: 10.3390/electronics10243172]
Abstract
Phonological features (articulatory features, AFs) describe movements of the vocal organs that are shared across languages. This paper investigates a domain-adversarial neural network (DANN) for extracting reliable AFs, together with different multi-stream techniques for cross-lingual speech recognition. First, a universal definition of phonological attributes is proposed for Mandarin, English, German, and French. A DANN-based AF detector is then trained on the source languages (English, German, and French). For cross-lingual speech recognition, the AF detectors transfer phonological knowledge from the source languages to the target language (Mandarin). Two multi-stream approaches are introduced to fuse the acoustic features and cross-lingual AFs. In addition, a monolingual AF system (i.e., AFs extracted directly from the target language) is also investigated. Experiments show that the performance of the AF detector can be improved by using convolutional neural networks (CNNs) with domain-adversarial learning. The multi-head attention (MHA)-based multi-stream model reaches the best performance compared with the baseline, a cross-lingual adaptation approach, and the other approaches. More specifically, the MHA model with cross-lingual AFs yields significant improvements over monolingual AFs when training data are limited, and the approach can easily be extended to other low-resource languages.
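For readers unfamiliar with domain-adversarial training, its core mechanism is a gradient-reversal layer inserted between the shared feature extractor and the domain classifier: identity in the forward pass, gradients scaled by -λ in the backward pass, so the extractor learns domain-invariant (here, language-invariant) features. A minimal NumPy sketch, not the paper's implementation; `lam` stands for the usual trade-off weight λ:

```python
import numpy as np

class GradientReversal:
    """Gradient-reversal layer used in DANNs: identity in the forward
    pass, gradient multiplied by -lambda in the backward pass, so the
    feature extractor is trained to *confuse* the domain classifier."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # identity: activations pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed, scaled gradient

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
g = np.array([0.2, 0.2, 0.2])
y = grl.forward(x)        # unchanged activations
dx = grl.backward(g)      # -0.5 * g flows back to the feature extractor
```

In a full DANN, the reversed gradient updates the feature extractor while the task classifier (here, the AF detector) receives its gradient unmodified.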

3. Figueroa Saavedra C, Otzen Hernández T, Alarcón Godoy C, Ríos Pérez A, Frugone Salinas D, Lagos Hernández R. Association between suicidal ideation and acoustic parameters of university students' voice and speech: a pilot study. Logoped Phoniatr Vocol 2020;46:55-62. [PMID: 32138570] [DOI: 10.1080/14015439.2020.1733075]
Abstract
PURPOSE Worldwide, suicide is a public health problem; although rates are trending downward in several regions, in many countries they have increased. One of the elements that contributes to its prevention is early and dynamic evaluation. The objective is therefore to determine the association between acoustic parameters of voice and speech (F0, F1, F2, F3, dB, and jitter) and the emergence of suicidal ideation among university students from the city of Temuco, Chile. METHODS A cross-sectional study was conducted with a non-probabilistic sample of sixty 18- and 19-year-old adolescents from Temuco, who underwent an acoustic evaluation of voice and speech after completing a test to determine suicidal ideation. Data were then analyzed with IBM SPSS version 23.0 (IBM SPSS Statistics, Armonk, NY) by means of exploratory, descriptive, and inferential statistics, taking into account the variables' levels of measurement and types of distribution. RESULTS The results show that 30% of the adolescents, of both genders, displayed suicidal ideation. Among the acoustic measures, the fundamental frequency (F0), the formants F1 and F2, and jitter were most strongly associated with the presence of suicidal ideation, in both women and men (p < .05); F3 was associated with suicidal ideation only in men (p < .05). CONCLUSIONS The acoustic parameters of voice and speech differ in adolescents with suicidal behavior, suggesting that they may be a useful tool in assessing suicide risk.
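Of the acoustic parameters above, jitter quantifies cycle-to-cycle irregularity of the glottal period. A minimal sketch of the common "local jitter" definition (mean absolute difference between consecutive pitch periods, divided by the mean period, in percent); the period values in the example are illustrative, not from the study:

```python
def local_jitter(periods):
    """Local jitter (%): mean absolute difference between consecutive
    glottal periods, divided by the mean period. `periods` are pitch
    periods in seconds, e.g. extracted from a sustained voiced vowel."""
    if len(periods) < 2:
        raise ValueError("need at least two periods")
    diffs = [abs(a - b) for a, b in zip(periods[1:], periods[:-1])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

# perfectly periodic voice -> 0% jitter
print(local_jitter([0.005, 0.005, 0.005]))   # 0.0
# slightly irregular periods -> small positive jitter
print(local_jitter([0.0050, 0.0052, 0.0049]))
```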
Affiliation(s)
- Carla Figueroa Saavedra
- Carrera de Fonoaudiología, Universidad Autónoma de Chile sede Temuco, Temuco, Chile; Programa de Doctorado en Ciencias Médicas, Universidad de La Frontera, Temuco, Chile
- Camila Alarcón Godoy
- Carrera de Fonoaudiología, Universidad Autónoma de Chile sede Temuco, Temuco, Chile
- Arlette Ríos Pérez
- Carrera de Fonoaudiología, Universidad Autónoma de Chile sede Temuco, Temuco, Chile

4. Meltzner GS, Heaton JT, Deng Y, De Luca G, Roy SH, Kline JC. Silent Speech Recognition as an Alternative Communication Device for Persons with Laryngectomy. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2017;25:2386-2398. [PMID: 29552581] [PMCID: PMC5851476] [DOI: 10.1109/taslp.2017.2740000]
Abstract
Each year, thousands of individuals require surgical removal of the larynx (voice box) due to trauma or disease, and thereby require an alternative voice source or assistive device to communicate verbally. Although natural voice is lost after laryngectomy, most muscles controlling speech articulation remain intact. Surface electromyographic (sEMG) activity of the speech musculature can be recorded from the neck and face and used for automatic speech recognition to provide speech-to-text or synthesized speech as an alternative means of communication. This is true even when speech is mouthed or spoken in a silent (subvocal) manner, making it an appropriate communication platform after laryngectomy. In this study, 8 individuals at least 6 months after total laryngectomy were recorded using 8 sEMG sensors on the face (4) and neck (4) while reading phrases constructed from a 2,500-word vocabulary. A unique set of phrases was used to train phoneme-based recognition models for each of the 39 commonly used phonemes in English, and the remaining phrases were used to test word recognition based on phoneme identification from running speech. Word error rates averaged 10.3% for the full 8-sensor set (9.5% for the top 4 participants) and 13.6% when the sensor set was reduced to 4 locations per individual (n=7). This study provides a compelling proof of concept for sEMG-based alaryngeal speech recognition, with strong potential for further improvement in recognition performance.
Affiliation(s)
- James T Heaton
- Harvard Medical School, Department of Surgery, Massachusetts General Hospital Voice Center, Boston, MA 02114, USA
- Serge H Roy
- Delsys, Inc., and Altec, Inc., Natick, MA 01760, USA

5. Gonzalez JA, Cheah LA, Gilbert JM, Bai J, Ell SR, Green PD, Moore RK. A silent speech system based on permanent magnet articulography and direct synthesis. Computer Speech & Language 2016. [DOI: 10.1016/j.csl.2016.02.002]

6. Kreuzer W, Kasess CH. Tuning of vocal tract model parameters for nasals using sensitivity functions. The Journal of the Acoustical Society of America 2015;137:1021-1031. [PMID: 25698033] [DOI: 10.1121/1.4906158]
Abstract
Determining the cross-sectional areas of vocal tract models from linear predictive coding or autoregressive moving-average analysis of vowel speech signals has been a topic of research for several decades. To tune the shape of the vocal tract to a given set of formant frequencies, iterative methods using sensitivity functions have been developed. In this paper, the idea of sensitivity functions is extended to a three-tube model used for nasals, and the energy-based sensitivity function is compared with a Jacobian-based sensitivity function for the branched-tube model. It is shown that the difference between the two functions is negligible if the sensitivity is taken with respect to the formant frequency only. Results are given for iteratively tuning a three-tube vocal tract model for a nasal (/m/) based on the sensitivity functions. It is shown that, besides the polar angle, the absolute value of the poles and zeros of the rational transfer function also needs to be considered in the tuning process. To test the effectiveness of the iterative solver, the steepest-descent method is compared with the Gauss-Newton method; the Gauss-Newton method converges faster when a good starting value for the iteration is available.
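The Gauss-Newton iteration compared above can be illustrated on a toy formant-matching least-squares problem. The two-parameter exponential "resonance model" below is hypothetical and merely stands in for the real three-tube transfer function:

```python
import numpy as np

def residual(p, target):
    # toy nonlinear "formant" model: two resonance frequencies driven by
    # two tube parameters (hypothetical stand-in for the real model)
    return np.array([500.0 * np.exp(p[0]), 1500.0 * np.exp(p[1])]) - target

def jacobian(p):
    # partial derivatives of the two resonances w.r.t. the two parameters
    return np.diag([500.0 * np.exp(p[0]), 1500.0 * np.exp(p[1])])

def gauss_newton(p, target, iters=20):
    """Gauss-Newton: solve the linearized normal equations each step."""
    for _ in range(iters):
        r, J = residual(p, target), jacobian(p)
        p = p - np.linalg.solve(J.T @ J, J.T @ r)
    return p

target = np.array([600.0, 1200.0])       # desired "formants" (Hz)
p = gauss_newton(np.zeros(2), target)    # residual shrinks to ~0 quickly
```

Steepest descent would instead take small steps along -Jᵀr, which typically needs many more iterations, consistent with the comparison reported above.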
Affiliation(s)
- W Kreuzer
- Acoustics Research Institute, Austrian Academy of Sciences, Wohllebengasse 12-14, A-1040 Vienna, Austria
- C H Kasess
- Acoustics Research Institute, Austrian Academy of Sciences, Wohllebengasse 12-14, A-1040 Vienna, Austria

7. Huang G, Er MJ. An adaptive neural control scheme for articulatory synthesis of CV sequences. Computer Speech & Language 2014. [DOI: 10.1016/j.csl.2013.04.004]

8. Demange S, Ouni S. An episodic memory-based solution for the acoustic-to-articulatory inversion problem. The Journal of the Acoustical Society of America 2013;133:2921-2930. [PMID: 23654397] [DOI: 10.1121/1.4798665]
Abstract
This paper presents an acoustic-to-articulatory inversion method based on an episodic memory. An episodic memory is an interesting model for two reasons. First, it does not rely on any assumptions about the mapping function; rather, it relies on real synchronized acoustic and articulatory data streams. Second, the memory inherently represents the real articulatory dynamics as observed. It is argued that computational models of episodic memory, as they are usually designed, cannot provide a satisfying solution to the acoustic-to-articulatory inversion problem because of the insufficient quantity of training data. Therefore, an episodic memory is proposed, called generative episodic memory (G-Mem), which is able to produce articulatory trajectories that do not belong to the set of episodes the memory is based on. The generative episodic memory is evaluated using two electromagnetic articulography corpora: one for English and one for French. Comparisons with a codebook-based method and with a classical episodic memory (termed concatenative episodic memory) are presented in order to evaluate the proposed generative episodic memory in terms of both its modeling of articulatory dynamics and its generalization capabilities. The results show the effectiveness of the method: an overall root-mean-square error of 1.65 mm and a correlation of 0.71 are obtained with G-Mem, comparable to those of recently proposed methods.
Affiliation(s)
- Sébastien Demange
- Université de Lorraine, Laboratoire Lorrain de Recherche en Informatique et ses Applications, Unité de Recherche Mixte 7503, Vandœuvre-lès-Nancy, F-54506, France

9. Lammert A, Goldstein L, Narayanan S, Iskarous K. Statistical Methods for Estimation of Direct and Differential Kinematics of the Vocal Tract. Speech Communication 2013;55:147-161. [PMID: 24052685] [PMCID: PMC3774006] [DOI: 10.1016/j.specom.2012.08.001]
Abstract
We present and evaluate two statistical methods for estimating kinematic relationships of the speech production system: artificial neural networks and locally-weighted regression. The work is motivated by the need to characterize this motor system, with particular focus on estimating differential aspects of kinematics. Kinematic analysis will facilitate progress in a variety of areas, including the nature of speech production goals, articulatory redundancy and, relatedly, acoustic-to-articulatory inversion. Statistical methods must be used to estimate these relationships from data, since they are infeasible to express in closed form. Statistical models are optimized and evaluated, using a held-out data validation procedure, on two sets of synthetic speech data. The theoretical and practical advantages of both methods are also discussed. It is shown that both direct and differential kinematics can be estimated with high accuracy, even for complex, nonlinear relationships. Locally-weighted regression displays the best overall performance, which may be due to practical advantages in its training procedure. Moreover, accurate estimation can be achieved using only a modest amount of training data, as judged by convergence of performance. The algorithms are also applied to real-time MRI data, and the results are generally consistent with those obtained from synthetic data.
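Locally-weighted regression, the better-performing of the two methods, fits a separate weighted least-squares model around each query point. A one-dimensional sketch; the Gaussian bandwidth `tau` and the sine test function are illustrative, not from the paper:

```python
import numpy as np

def lwr_predict(X, y, xq, tau=0.1):
    """Locally-weighted linear regression at a single query point xq:
    weight the training samples with a Gaussian kernel of bandwidth tau,
    then solve the weighted normal equations for a local line."""
    w = np.exp(-((X - xq) ** 2) / (2.0 * tau ** 2))
    A = np.stack([np.ones_like(X), X], axis=1)   # [bias, slope] design matrix
    AtW = A.T * w                                # scale each sample by its weight
    theta = np.linalg.solve(AtW @ A, AtW @ y)
    return theta[0] + theta[1] * xq

X = np.linspace(-1.0, 1.0, 200)
y = np.sin(2.0 * X)              # stand-in for a nonlinear kinematic map
pred = lwr_predict(X, y, 0.5)
# pred tracks the local value sin(1.0) despite the global nonlinearity
```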
Affiliation(s)
- Adam Lammert
- Signal Analysis & Interpretation Laboratory (SAIL), University of Southern California, 3710 McClintock Ave., Los Angeles, CA 90089, USA
- Louis Goldstein
- Department of Linguistics, University of Southern California, Grace Ford Salvatory 301, Los Angeles, CA 90089-1693, USA
- Haskins Laboratories, 300 George Street, Suite 900, New Haven, CT 06511, USA
- Shrikanth Narayanan
- Signal Analysis & Interpretation Laboratory (SAIL), University of Southern California, 3710 McClintock Ave., Los Angeles, CA 90089, USA
- Department of Linguistics, University of Southern California, Grace Ford Salvatory 301, Los Angeles, CA 90089-1693, USA
- Khalil Iskarous
- Department of Linguistics, University of Southern California, Grace Ford Salvatory 301, Los Angeles, CA 90089-1693, USA
- Haskins Laboratories, 300 George Street, Suite 900, New Haven, CT 06511, USA

10. Toutios A, Ouni S, Laprie Y. Estimating the control parameters of an articulatory model from electromagnetic articulograph data. The Journal of the Acoustical Society of America 2011;129:3245-3257. [PMID: 21568426] [DOI: 10.1121/1.3569714]
Abstract
Finding the control parameters of an articulatory model that result in given acoustics is an important problem in speech research. However, one should also be able to derive the same parameters from measured articulatory data. In this paper, a method is presented to estimate the control parameters of Maeda's model from electromagnetic articulography (EMA) data, which allows the derivation of full sagittal vocal tract slices from sparse flesh-point information. First, the articulatory grid system involved in the model's definition is adapted to the speaker involved in the experiment, and the EMA data are registered to it automatically. Then, articulatory variables that correspond to measurements defined by Maeda on the grid are extracted. An initial solution for the articulatory control parameters is found by a least-squares method, under constraints ensuring vocal tract shape naturalness. Dynamic smoothness of the parameter trajectories is then imposed by a variational regularization method. Generated vocal tract slices for vowels are compared with slices appearing in magnetic resonance images of the same speaker or found in the literature. Formants synthesized on the basis of these generated slices are adequately close to those tracked in real speech recorded concurrently with the EMA data.
Affiliation(s)
- Asterios Toutios
- Laboratoire Lorrain de Recherche en Informatique et ses Applications, Unité de Recherche Mixte 7503, Boîte Postale 239, 54506 Vandœuvre-lès-Nancy Cedex, France

11. Panchapagesan S, Alwan A. A study of acoustic-to-articulatory inversion of speech by analysis-by-synthesis using chain matrices and the Maeda articulatory model. The Journal of the Acoustical Society of America 2011;129:2144-2162. [PMID: 21476670] [PMCID: PMC3188964] [DOI: 10.1121/1.3514544]
Abstract
In this paper, a quantitative study of acoustic-to-articulatory inversion for vowel speech sounds by analysis-by-synthesis using the Maeda articulatory model is performed. For chain matrix calculation of vocal tract (VT) acoustics, the chain matrix derivatives with respect to area function are calculated and used in a quasi-Newton method for optimizing articulatory trajectories. The cost function includes a distance measure between natural and synthesized first three formants, and parameter regularization and continuity terms. Calibration of the Maeda model to two speakers, one male and one female, from the University of Wisconsin x-ray microbeam (XRMB) database, using a cost function, is discussed. Model adaptation includes scaling the overall VT and the pharyngeal region and modifying the outer VT outline using measured palate and pharyngeal traces. The inversion optimization is initialized by a fast search of an articulatory codebook, which was pruned using XRMB data to improve inversion results. Good agreement between estimated midsagittal VT outlines and measured XRMB tongue pellet positions was achieved for several vowels and diphthongs for the male speaker, with average pellet-VT outline distances around 0.15 cm, smooth articulatory trajectories, and less than 1% average error in the first three formants.
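The cost function described above combines a formant-distance term with parameter-regularization and continuity terms over the articulatory trajectory. A sketch of that structure; the weights, the relative formant error, and the stand-in synthesizer below are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def inversion_cost(params, formants_nat, synth_formants,
                   lam_reg=0.01, lam_cont=0.1):
    """Analysis-by-synthesis cost over an articulatory trajectory.
    params: (T, P) trajectory of articulatory parameters.
    formants_nat: (T, 3) natural first three formants per frame.
    synth_formants: function mapping one parameter frame to three formants."""
    synth = np.array([synth_formants(p) for p in params])
    formant_term = np.mean(((synth - formants_nat) / formants_nat) ** 2)
    reg_term = np.mean(params ** 2)                    # stay near the neutral shape
    cont_term = np.mean(np.diff(params, axis=0) ** 2)  # smooth trajectories
    return formant_term + lam_reg * reg_term + lam_cont * cont_term

base = np.array([500.0, 1500.0, 2500.0])   # neutral "formants" (Hz), illustrative
synth = lambda p: base * (1.0 + p)         # hypothetical stand-in synthesizer
cost = inversion_cost(np.zeros((5, 3)), np.tile(base, (5, 1)), synth)
# a trajectory that already matches the natural formants has zero cost
```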
Affiliation(s)
- Sankaran Panchapagesan
- Department of Electrical Engineering, University of California, Los Angeles, California 90095, USA

12. Turicchia L, Sarpeshkar R. An analog integrated-circuit vocal tract. IEEE Transactions on Biomedical Circuits and Systems 2008;2:316-327. [PMID: 23853134] [DOI: 10.1109/tbcas.2008.2005296]
Abstract
We present the first experimental integrated-circuit vocal tract by mapping fluid volume velocity to current, fluid pressure to voltage, and linear and nonlinear mechanical impedances to linear and nonlinear electrical impedances. The 275 μW analog vocal tract chip includes a 16-stage cascade of two-port π-elements that forms a tunable transmission line, electronically variable impedances, and a current source as the glottal source. A nonlinear resistor models laminar and turbulent flow in the vocal tract. The measured SNR at the output of the analog vocal tract is 64, 66, and 63 dB for the first three formant resonances of a vocal tract with uniform cross-sectional area. The analog vocal tract can be used with auditory processors in a feedback speech-locked loop (analogous to a phase-locked loop) to implement speech recognition that is potentially robust in noise. Our use of a physiological model of the human vocal tract enables the analog vocal tract chip to synthesize speech signals of interest, using articulatory parameters that are intrinsically compact and linearly interpolatable.

14. Potard B, Laprie Y, Ouni S. Incorporation of phonetic constraints in acoustic-to-articulatory inversion. The Journal of the Acoustical Society of America 2008;123:2310-2323. [PMID: 18397035] [DOI: 10.1121/1.2885747]
Abstract
This study investigates the use of constraints upon articulatory parameters in the context of acoustic-to-articulatory inversion. These speaker-independent constraints, referred to as phonetic constraints, were derived from standard phonetic knowledge for French vowels and express authorized domains for one or several articulatory parameters. They were evaluated within an existing inversion framework that utilizes Maeda's articulatory model and a hypercubic articulatory-acoustic table. Phonetic constraints give rise to a phonetic score rendering the phonetic consistency of vocal tract shapes recovered by inversion. Inversion was applied to vowels articulated by a speaker for whom corresponding x-ray images are available. Constraints were evaluated by measuring the distance between vocal tract shapes recovered through inversion and real vocal tract shapes obtained from x-ray images, by investigating the spreading of inverse solutions in terms of place of articulation and constriction degree, and finally by studying the articulatory variability. Results show that these constraints capture interdependencies and synergies between speech articulators and favor vocal tract shapes close to those realized by the human speaker. In addition, this study also shows how acoustic-to-articulatory inversion can be used to explore acoustic and compensatory articulatory properties of an articulatory model.
Affiliation(s)
- Blaise Potard
- Speech Team, LORIA, UMR 7503, BP 239, 54506 Vandœuvre-lès-Nancy Cedex, France

15. Howard DM, Tyrrell AM, Murphy DT, Cooper C, Mullen J. Bio-inspired evolutionary oral tract shape modeling for physical modeling vocal synthesis. Journal of Voice 2007;23:11-20. [PMID: 17981014] [DOI: 10.1016/j.jvoice.2007.03.003]
Abstract
Physical modeling using digital waveguide mesh (DWM) models is an audio synthesis method that has been shown to produce an acoustic output in music synthesis applications that is often described as "organic," "warm," or "intimate." This paper describes work that takes its inspiration from physical modeling music synthesis and applies it to speech synthesis through a physical modeling mesh model of the human oral tract. Oral tract shapes are found using a computational technique based on the principles of biological evolution. Essential to successful speech synthesis using this method are accurate measurements of the cross-sectional area of the human oral tract, and these are usually derived from magnetic resonance imaging (MRI). However, such images are nonideal because of the lengthy exposure time (relative to the time of articulation of speech sounds) required, the ambient acoustic noise of the MRI machine itself, and the required supine position of the subject. An alternative method is described in which a bio-inspired computing technique that simulates the process of evolution is used to evolve oral tract shapes. This technique is able to produce appropriate oral tract shapes for open vowels, using acoustic and excitation data from two adult males and two adult females, but the shapes it produces for close vowels are less appropriate. The technique has none of the drawbacks associated with MRI, because all it requires from the subject is an acoustic and electrolaryngograph (or electroglottograph) recording. Appropriate oral tract shapes do enable the model to produce excellent-quality synthetic speech for vowel sounds, and sounds that involve dynamic oral tract shape changes, such as diphthongs, can also be synthesized using an impedance-mapped technique. Efforts to improve performance by reducing mesh quantization for close vowels had little effect, and further work is required.
Affiliation(s)
- David M Howard
- Intelligent Systems Research Group, Department of Electronics, University of York, Heslington, York, United Kingdom

16. Genetic learning of vocal tract area functions for articulatory synthesis of Spanish vowels. Applied Soft Computing 2007. [DOI: 10.1016/j.asoc.2006.05.004]

17. Shah MS, Pandey PC. Estimation of vocal tract shape for VCV syllables for a speech training aid. Conf Proc IEEE Eng Med Biol Soc 2005:6642-5. [PMID: 17281795] [DOI: 10.1109/iembs.2005.1616025]
Abstract
Display of the vocal tract shape can be used in speech-training aids for hearing-impaired children, as it provides visual feedback of articulatory efforts. Estimation of the vocal tract shape, based on LPC and other analysis techniques, works satisfactorily for vowels but generally fails during stop closures. Indication of the correct place of articulation is very important, particularly for non-labial consonants. In order to study the dynamics of vocal tract shape estimation during transitions at vowel-consonant boundaries, we have used the "areagram," a spectrogram-like two-dimensional (2D) display of estimated vocal tract cross-sectional area as a function of time and position along the tract length. Area estimation is based on reflection coefficients obtained from LPC analysis of speech. Based on the estimated areas during the transition segments preceding and following the stop closure, bivariate polynomial surfaces are obtained, and these are used to estimate the vocal tract shape during the stop closure by 2D interpolation. The place of closure for various stop consonants could be estimated satisfactorily from the conic surface approximation.
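The reflection-coefficient-based area estimation mentioned above follows the classical lossless-tube recursion, in which each LPC reflection coefficient relates the areas of adjacent tube sections. A minimal sketch (direction and sign conventions vary across texts, and areas are recovered only up to an overall scale):

```python
def areas_from_reflection(ks, a_first=1.0):
    """Piecewise-uniform tube areas from LPC reflection coefficients.
    Each coefficient k relates adjacent sections by
    A[i+1] = A[i] * (1 - k) / (1 + k); convention-dependent, and areas
    are relative (scaled by the first section's area a_first)."""
    areas = [a_first]
    for k in ks:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas

# k = 0 -> no impedance change -> uniform tube
print(areas_from_reflection([0.0, 0.0]))   # [1.0, 1.0, 1.0]
# positive k -> area shrinks along the tract (a constriction)
print(areas_from_reflection([0.5]))        # ≈ [1.0, 0.333]
```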
Affiliation(s)
- Milind S Shah
- Department of Electrical Engineering, IIT Bombay, Mumbai 400 076, India

18. Adaptive Kalman Filtering and Smoothing for Tracking Vocal Tract Resonances Using a Continuous-Valued Hidden Dynamic Model. IEEE Transactions on Audio, Speech, and Language Processing 2007. [DOI: 10.1109/tasl.2006.876724]

19. Forbes BJ, Pike ER, Sharp DB, Aktosun T. Inverse potential scattering in duct acoustics. The Journal of the Acoustical Society of America 2006;119:65-73. [PMID: 16454265] [DOI: 10.1121/1.2139618]
Abstract
The inverse problem of the noninvasive measurement of the shape of an acoustical duct in which one-dimensional wave propagation can be assumed is examined within the theoretical framework of the governing Klein-Gordon equation. Previous deterministic methods developed over the last 40 years have all required direct measurement of the reflectance or input impedance but now, by application of the methods of inverse quantum scattering to the acoustical system, it is shown that the reflectance can be algorithmically derived from the radiated wave. The potential and area functions of the duct can subsequently be reconstructed. The results are discussed with particular reference to acoustic pulse reflectometry.
Affiliation(s)
- Barbara J Forbes
- Phonologica Ltd., P.O. Box 43925, London NW2 1DJ, United Kingdom

20. Hiroya S, Honda M. Estimation of Articulatory Movements From Speech Acoustics Using an HMM-Based Speech Production Model. IEEE Transactions on Speech and Audio Processing 2004. [DOI: 10.1109/tsa.2003.822636]

21. Deng L. Switching Dynamic System Models for Speech Articulation and Acoustics. In: Mathematical Foundations of Speech and Language Processing. 2004. [DOI: 10.1007/978-1-4419-9017-4_6]

22. Girin L, Schwartz JL, Feng G. Audio-visual enhancement of speech in noise. The Journal of the Acoustical Society of America 2001;109:3007-3020. [PMID: 11425143] [DOI: 10.1121/1.1358887]
Abstract
A key problem for telecommunication and human-machine communication systems is speech enhancement in noise. A number of techniques exist in this domain, all based on an acoustic-only approach, that is, processing the corrupted audio signal using audio information alone (from the corrupted signal only or from additional audio information). In this paper, an audio-visual approach to the problem is considered, since several studies have demonstrated that viewing the speaker's face improves message intelligibility, especially in noisy environments. A speech enhancement prototype system that takes advantage of visual inputs is developed. A filtering approach is proposed that uses enhancement filters estimated with the help of lip-shape information. The estimation process is based on linear regression or simple neural networks, using a training corpus. A set of experiments assessed by Gaussian classification and perceptual tests demonstrates that it is indeed possible to enhance simple stimuli (vowel-plosive-vowel sequences) embedded in white Gaussian noise.
Affiliation(s)
- L Girin
- Institut de la Communication Parlée, INPG/Université Stendhal/CNRS UMR 5009, Grenoble, France

23. McGowan RS, Cushing S. Vocal tract normalization for midsagittal articulatory recovery with analysis-by-synthesis. The Journal of the Acoustical Society of America 1999;106:1090-1105. [PMID: 10462814] [DOI: 10.1121/1.427117]
Abstract
A method is presented that accounts for differences in the acoustics of vowel production caused by human talkers' vocal-tract anatomies and postural settings. Such a method is needed by an analysis-by-synthesis procedure designed to recover midsagittal articulatory movement from speech acoustics, because the procedure employs an articulatory model as an internal model. The normalization procedure involves the adjustment of parameters of the articulatory model that are not of interest for the midsagittal movement recovery procedure. These parameters are adjusted so that acoustic signals produced by the human and the articulatory model match as closely as possible over an initial set of pairs of corresponding human and model midsagittal shapes. Further, these initial midsagittal shape correspondences need to be generalized so that all midsagittal shapes of the human can be obtained from midsagittal shapes of the model. Once these procedures are complete, the midsagittal articulatory movement recovery algorithm can be used to derive model articulatory trajectories that, subsequently, can be transformed into human articulatory trajectories. In this paper the proposed normalization procedure is outlined, and results of experiments with data from two talkers in the X-ray Microbeam Speech Production Database are presented. It was found to be possible to characterize these vocal tracts during vowel production with the proposed procedure and to generalize the initial midsagittal correspondences over a set of vowels to other vowels. The procedure was also found to aid midsagittal articulatory movement recovery from speech acoustics in vowel-to-vowel productions for the two subjects.
Affiliation(s)
- R S McGowan
- Sensimetrics Corporation, Somerville, Massachusetts 02144, USA