1. Barrientos E, Cataldo E. Estimating Formant Frequencies of Vowels Sung by Sopranos Using Weighted Linear Prediction. J Voice 2023:S0892-1997(23)00322-3. PMID: 38000960. DOI: 10.1016/j.jvoice.2023.10.018.
Abstract
This study introduces the weighted linear prediction adapted to high-pitched singing voices (WLP-HPSV) method for accurately estimating formant frequencies of vowels sung by lyric sopranos. The WLP-HPSV method combines a variant of WLP analysis with the zero-frequency filtering (ZFF) technique to address specific challenges in formant estimation from singing signals. Compared with the LPC method, WLP-HPSV more accurately captured the spectral characteristics of synthetic /u/ vowels and of naturally sung /a/ and /u/ vowels. The quasi-closed phase (QCP) parameters used in the WLP-HPSV method varied with pitch, revealing insights into the interplay between vocal tract and glottal characteristics during vowel production. The comparison between the LPC and WLP-HPSV methods highlighted the robustness of WLP-HPSV in estimating formant frequencies across different pitches.
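The core of any WLP variant is a temporally weighted least-squares fit of an all-pole model. A minimal sketch of that idea follows; it is not the authors' WLP-HPSV implementation (the weighting function, prediction order, and test signal here are illustrative assumptions, and the ZFF stage is omitted):

```python
import numpy as np

def wlp_coefficients(x, order, w):
    """Weighted linear prediction: minimize sum_n w[n] * (x[n] - x_hat[n])^2,
    where x_hat[n] = sum_k a[k] * x[n-k]. A uniform weight w reduces this to
    ordinary covariance-method linear prediction."""
    R = np.zeros((order, order))
    r = np.zeros(order)
    for n in range(order, len(x)):
        past = x[n - order:n][::-1]          # x[n-1], ..., x[n-order]
        R += w[n] * np.outer(past, past)
        r += w[n] * x[n] * past
    return np.linalg.solve(R, r)

# Sanity check on a synthetic AR(2) signal with known coefficients (1.5, -0.7)
rng = np.random.default_rng(0)
x = np.zeros(4000)
for n in range(2, len(x)):
    x[n] = 1.5 * x[n - 1] - 0.7 * x[n - 2] + rng.standard_normal()
a_est = wlp_coefficients(x, 2, np.ones(len(x)))
```

In a QCP-style weighting, `w` would instead attenuate samples near the glottal closure instants so that the fit emphasizes the closed-glottis portion of each cycle.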
Affiliation(s)
- Eduardo Barrientos
- Postgraduate Program in Electrical and Telecommunications Engineering (PPGEET), R. Passo da Pátria, Niterói, RJ, Brazil.
- Edson Cataldo
- Postgraduate Program in Electrical and Telecommunications Engineering (PPGEET), R. Passo da Pátria, Niterói, RJ, Brazil.
2. Grawunder S, Uomini N, Samuni L, Bortolato T, Girard-Buttoz C, Wittig RM, Crockford C. Correction: 'Chimpanzee vowel-like sounds and voice quality suggest formant space expansion through the hominoid lineage' (2021), by Grawunder et al. Philos Trans R Soc Lond B Biol Sci 2023; 378:20230319. PMID: 37778391. PMCID: PMC10542441. DOI: 10.1098/rstb.2023.0319.
Affiliation(s)
- Sven Grawunder
- Department of Human Behavioural Ecology, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103 Leipzig, Germany
- Department of Empirical Linguistics, Goethe University, 60323 Frankfurt am Main, Germany
- Natalie Uomini
- Department of Linguistic and Cultural Evolution, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103 Leipzig, Germany
- Liran Samuni
- Department of Human Evolutionary Biology, Harvard University, Cambridge, MA, MCZ 540B, USA
- Taï Chimpanzee Project, Centre Suisse de Recherches Scientifiques, 01 BP 1303, Abidjan, Côte d'Ivoire
- Tatiana Bortolato
- Department of Human Behavioural Ecology, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103 Leipzig, Germany
- The Ape Social Mind Lab, Institut des Sciences Cognitives, CNRS, 67 Boulevard Pinel, 69675 Bron, Lyon, France
- Taï Chimpanzee Project, Centre Suisse de Recherches Scientifiques, 01 BP 1303, Abidjan, Côte d'Ivoire
- Cédric Girard-Buttoz
- Department of Human Behavioural Ecology, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103 Leipzig, Germany
- The Ape Social Mind Lab, Institut des Sciences Cognitives, CNRS, 67 Boulevard Pinel, 69675 Bron, Lyon, France
- Taï Chimpanzee Project, Centre Suisse de Recherches Scientifiques, 01 BP 1303, Abidjan, Côte d'Ivoire
- Roman M. Wittig
- Department of Human Behavioural Ecology, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103 Leipzig, Germany
- Taï Chimpanzee Project, Centre Suisse de Recherches Scientifiques, 01 BP 1303, Abidjan, Côte d'Ivoire
- Catherine Crockford
- Department of Human Behavioural Ecology, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103 Leipzig, Germany
- The Ape Social Mind Lab, Institut des Sciences Cognitives, CNRS, 67 Boulevard Pinel, 69675 Bron, Lyon, France
- Taï Chimpanzee Project, Centre Suisse de Recherches Scientifiques, 01 BP 1303, Abidjan, Côte d'Ivoire
3. Alku P, Kadiri SR, Gowda D. Refining a deep learning-based formant tracker using linear prediction methods. Comput Speech Lang 2023. DOI: 10.1016/j.csl.2023.101515.
4. Ape Vowel-like Sounds Remain Elusive: A Comment on Grawunder et al. (2022). Int J Primatol 2022. DOI: 10.1007/s10764-022-00335-6.
5. Whalen DH, Chen WR, Shadle CH, Fulop SA. Formants are easy to measure; resonances, not so much: Lessons from Klatt (1986). J Acoust Soc Am 2022; 152:933. PMID: 36050157. PMCID: PMC9374483. DOI: 10.1121/10.0013410.
Abstract
Formants in speech signals are easily identified, largely because formants are defined to be local maxima in the wideband sound spectrum. Sadly, this is not what is of most interest in analyzing speech; instead, resonances of the vocal tract are of interest, and they are much harder to measure. Klatt [(1986). in Proceedings of the Montreal Satellite Symposium on Speech Recognition, 12th International Congress on Acoustics, edited by P. Mermelstein (Canadian Acoustical Society, Montreal), pp. 5-7] showed that estimates of resonances are biased by harmonics while the human ear is not. Several analysis techniques placed the formant closer to a strong harmonic than to the center of the resonance. This "harmonic attraction" can persist with newer algorithms and in hand measurements, and systematic errors can persist even in large corpora. Research has shown that the reassigned spectrogram is less subject to these errors than linear predictive coding and similar measures, but it has not been satisfactorily automated, making its wider use unrealistic. Pending better techniques, the recommendations are (1) acknowledge limitations of current analyses regarding influence of F0 and limits on granularity, (2) report settings more fully, (3) justify settings chosen, and (4) examine the pattern of F0 vs F1 for possible harmonic bias.
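The "harmonic attraction" described above is easy to reproduce: synthesize a vowel-like signal whose true resonance sits between two harmonics, and the spectral maximum lands on the nearest harmonic rather than on the resonance. A minimal sketch (the 400 Hz F0, 500 Hz resonance, and 80 Hz bandwidth are illustrative values, not taken from the paper):

```python
import numpy as np

fs = 16000
f0, res, bw = 400, 500.0, 80.0   # pitch, true resonance, bandwidth (Hz)

# Two-pole resonator excited by an impulse train (a crude voice source)
r = np.exp(-np.pi * bw / fs)
ang = 2 * np.pi * res / fs
a1, a2 = 2 * r * np.cos(ang), -r * r
x = np.zeros(fs)                  # 1 second of source signal
x[:: fs // f0] = 1.0              # harmonics at exact multiples of 400 Hz
y = np.zeros_like(x)
y[0] = x[0]
y[1] = x[1] + a1 * y[0]
for n in range(2, len(y)):
    y[n] = x[n] + a1 * y[n - 1] + a2 * y[n - 2]

# The spectral maximum below 1 kHz falls on the 400 Hz harmonic,
# about 100 Hz below the true 500 Hz resonance.
spec = np.abs(np.fft.rfft(y * np.hanning(len(y))))
freqs = np.fft.rfftfreq(len(y), 1 / fs)
peak_hz = freqs[np.argmax(np.where(freqs < 1000, spec, 0.0))]
```

Any method that effectively picks spectral maxima (including hand measurement from narrowband spectra) inherits this bias; it grows as F0 rises and the harmonics thin out.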
Affiliation(s)
- D H Whalen
- Haskins Laboratories, New Haven, Connecticut 06511, USA
- Wei-Rong Chen
- Haskins Laboratories, New Haven, Connecticut 06511, USA
- Sean A Fulop
- Department of Linguistics, California State University Fresno, Fresno, California 93740, USA
6.
Abstract
The human voice carries socially relevant information, such as how authoritative, dominant, and attractive the speaker sounds. However, some speakers may be able to manipulate listeners by modulating the shape and size of their vocal tract to exaggerate certain characteristics of their voice. We analysed the veridical size of speakers’ vocal tracts using real-time magnetic resonance imaging as they volitionally modulated their voice to sound larger or smaller, the corresponding changes to the size implied by the acoustics of their voice, and their influence over the perceptions of listeners. Individual differences in this ability were marked, spanning from nearly incapable to nearly perfect vocal modulation, and were consistent across modalities of measurement. Further research is needed to determine whether speakers who are effective at vocal size exaggeration are better able to manipulate their social environment, and whether this variation is an inherited quality of the individual or the result of life experiences such as vocal training.
7. Kadiri SR, Alku P. Glottal features for classification of phonation type from speech and neck surface accelerometer signals. Comput Speech Lang 2021. DOI: 10.1016/j.csl.2021.101232.
8. Saldías M, Laukkanen AM, Guzmán M, Miranda G, Stoney J, Alku P, Sundberg J. The Vocal Tract in Loud Twang-Like Singing While Producing High and Low Pitches. J Voice 2021; 35:807.e1-807.e23. DOI: 10.1016/j.jvoice.2020.02.005.
9. Lynn E, Narayanan SS, Lammert AC. Dark tone quality and vocal tract shaping in soprano song production: Insights from real-time MRI. JASA Express Lett 2021; 1:075202. PMID: 34291230. PMCID: PMC8273971. DOI: 10.1121/10.0005109.
Abstract
Tone quality termed "dark" is an aesthetically important property of Western classical voice performance and has been associated with lowered formant frequencies, lowered larynx, and widened pharynx. The present study uses real-time magnetic resonance imaging with synchronous audio recordings to investigate dark tone quality in four professionally trained sopranos with enhanced ecological validity and a relatively complete view of the vocal tract. Findings differ from traditional accounts, indicating that labial narrowing may be the primary driver of dark tone quality across performers, while many other aspects of vocal tract shaping are shown to differ significantly in a performer-specific way.
Affiliation(s)
- Elisabeth Lynn
- Department of Biomedical Engineering, Worcester Polytechnic Institute, Worcester, Massachusetts 01690, USA
- Shrikanth S Narayanan
- Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, California 95616, USA
- Adam C Lammert
- Department of Biomedical Engineering, Worcester Polytechnic Institute, Worcester, Massachusetts 01690, USA
10. Xue Y, Marxen M, Akagi M, Birkholz P. Acoustic and articulatory analysis and synthesis of shouted vowels. Comput Speech Lang 2021. DOI: 10.1016/j.csl.2020.101156.
11. Milenkovic PH, Wagner M, Kent RD, Story BH, Vorperian HK. Effects of sampling rate and type of anti-aliasing filter on linear-predictive estimates of formant frequencies in men, women, and children. J Acoust Soc Am 2020; 147:EL221. PMID: 32237805. PMCID: PMC7056453. DOI: 10.1121/10.0000824.
Abstract
The purpose of this study was to assess the effect of downsampling the acoustic signal on the accuracy of linear-predictive (LPC) formant estimation. Based on speech produced by men, women, and children, the first four formant frequencies were estimated at sampling rates of 48, 16, and 10 kHz using different anti-alias filtering. With proper selection of the number of LPC coefficients, the anti-alias filter, and between-frame averaging, the results suggest that accuracy is not improved by rates substantially below 48 kHz. Any downsampling should not go below 16 kHz, with a filter cut-off centered at 8 kHz.
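Formant estimates of the kind compared above are typically read off the roots of an LPC polynomial. A minimal autocorrelation-method sketch follows; the 10 kHz rate, model order, and two-resonance test signal are illustrative assumptions, not the study's protocol:

```python
import numpy as np

def lpc_formants(x, order, fs):
    """Autocorrelation-method LPC; formants are the angles of complex pole pairs."""
    x = x * np.hamming(len(x))
    ac = np.correlate(x, x, "full")[len(x) - 1:len(x) + order]   # lags 0..order
    R = np.array([[ac[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, ac[1:order + 1])          # prediction coefficients
    poles = np.roots(np.concatenate(([1.0], -a)))
    freqs = [fs * np.angle(p) / (2 * np.pi) for p in poles if p.imag > 1e-3]
    return sorted(freqs)

# Two-resonance test signal: impulse train through cascaded two-pole resonators
fs, f0 = 10000, 100
x = np.zeros(fs // 2)
x[:: fs // f0] = 1.0
for fr, bw in ((500.0, 60.0), (1500.0, 90.0)):
    r, w = np.exp(-np.pi * bw / fs), 2 * np.pi * fr / fs
    y = np.zeros_like(x)
    y[0], y[1] = x[0], x[1] + 2 * r * np.cos(w) * y[0]
    for n in range(2, len(x)):
        y[n] = x[n] + 2 * r * np.cos(w) * y[n - 1] - r * r * y[n - 2]
    x = y
est = lpc_formants(x, 4, fs)   # should sit near the true 500 and 1500 Hz
```

The sampling-rate question enters through `fs` and `order`: at 48 kHz, many more poles are needed to cover the wider band, which is part of why the study varies the LPC order with the rate.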
Affiliation(s)
- Paul H Milenkovic
- Department of Electrical and Computer Engineering, University of Wisconsin-Madison, 1415 Engineering Drive, Madison, Wisconsin 53706, USA
- Madison Wagner
- Vocal Tract Development Laboratory, Waisman Center, University of Wisconsin-Madison, 1500 Highland Avenue, Madison, Wisconsin 53706, USA
- Raymond D Kent
- Vocal Tract Development Laboratory, Waisman Center, University of Wisconsin-Madison, 1500 Highland Avenue, Madison, Wisconsin 53706, USA
- Brad H Story
- Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85718, USA
- Houri K Vorperian
- Vocal Tract Development Laboratory, Waisman Center, University of Wisconsin-Madison, 1500 Highland Avenue, Madison, Wisconsin 53706, USA
12. Hosbach-Cannon CJ, Lowell SY, Colton RH, Kelley RT, Bao X. Assessment of Tongue Position and Laryngeal Height in Two Professional Voice Populations. J Speech Lang Hear Res 2020; 63:109-124. PMID: 31944876. DOI: 10.1044/2019_jslhr-19-00164.
Abstract
Purpose: To advance our current knowledge of singer physiology by using ultrasonography in combination with acoustic measures to compare physiological differences between musical theater (MT) and opera (OP) singers under controlled phonation conditions. Primary objectives were (a) to determine whether differences in hyolaryngeal and vocal fold contact dynamics occur between the two professional voice populations during singing tasks and (b) to determine whether MT and OP singers differ in oral configuration and associated acoustic resonance during singing tasks.
Method: Twenty-one singers (10 MT and 11 OP) were included. All participants were currently enrolled in a music program. Experimental procedures consisted of sustained phonation on the vowels /i/ and /ɑ/ during both a low-pitch task and a high-pitch task. Measures of hyolaryngeal elevation, tongue height, and tongue advancement were assessed using ultrasonography. Vocal fold contact dynamics were measured using electroglottography. Simultaneous acoustic recordings were obtained during all ultrasonography procedures for analysis of the first two formant frequencies.
Results: Significant oral configuration differences, reflected by measures of tongue height and tongue advancement, were seen between groups. Measures of acoustic resonance also showed significant differences between groups during specific tasks. Both singer groups significantly raised their hyoid position when singing high-pitched vowels, but hyoid elevation was not statistically different between groups. Likewise, vocal fold contact dynamics did not significantly differentiate the two singer groups.
Conclusions: These findings suggest that, under controlled phonation conditions, MT singers alter their oral configuration and achieve differing resultant formants as compared with OP singers. Because singers are at high risk of developing a voice disorder, understanding how these two groups adjust their vocal tract configuration during their specific singing genre may help to identify risky vocal behavior and provide a basis for prevention of voice disorders.
Affiliation(s)
- Soren Y Lowell
- Department of Communication Sciences and Disorders, Syracuse University, NY
- Raymond H Colton
- Department of Communication Sciences and Disorders, Syracuse University, NY
- Richard T Kelley
- Department of Otolaryngology, Upstate Medical University, Syracuse, NY
- Xue Bao
- Department of Speech-Language Pathology, MGH-IHP, Boston, MA
13. Alku P, Murtola T, Malinen J, Geneid A, Vilkman E. Skewing of the glottal flow with respect to the glottal area measured in natural production of vowels. J Acoust Soc Am 2019; 146:2501. PMID: 31671985. DOI: 10.1121/1.5129121.
Abstract
In the production of voiced speech, glottal flow skewing refers to the tilting of the glottal flow pulses to the right, often characterized as a delay of the peak, compared to the glottal area. Over the past four decades, several studies have addressed this phenomenon by modeling voice production with analog circuits and computer simulations. However, previous studies measuring flow skewing in natural speech production are sparse, and they contain little quantitative data about the degree of skewing between flow and area. In the current study, flow skewing was measured from natural production of 40 vowel utterances by 10 speakers. Glottal flow was estimated from speech using glottal inverse filtering, and glottal area was captured with high-speed videoendoscopy. The estimated glottal flow and area waveforms were parameterized with four robust parameters that quantify pulse skewness. Statistical tests on all four parameters showed that the flow pulse was significantly more skewed to the right than the area pulse. Hence, this study corroborates the existence of flow skewing using measurements from natural speech production, and it yields quantitative data about pulse skewness in simultaneously measured glottal flow and area.
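Pulse skewness of the kind parameterized above can be quantified, in the simplest case, as the normalized position of the pulse peak: 0.5 for a symmetric pulse, above 0.5 for rightward (delayed-peak) skew. This toy measure and the two synthetic pulses are illustrative, not the paper's four parameters:

```python
import numpy as np

def peak_position(pulse):
    """Normalized peak location in [0, 1]; 0.5 means a symmetric pulse,
    > 0.5 means the pulse is skewed (peak delayed) to the right."""
    return np.argmax(pulse) / (len(pulse) - 1)

t = np.linspace(0.0, 1.0, 1001)
area = np.sin(np.pi * t) ** 2      # symmetric "glottal area" pulse
flow = t ** 2 * (1.0 - t)          # right-skewed "glottal flow" pulse, peak at t = 2/3
skew_area = peak_position(area)
skew_flow = peak_position(flow)
```

The paper's finding corresponds to `skew_flow > skew_area` holding systematically across simultaneously recorded flow and area pulses.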
Affiliation(s)
- Paavo Alku
- Department of Signal Processing and Acoustics, Aalto University, Espoo, FI-00076, Finland
- Tiina Murtola
- Department of Signal Processing and Acoustics, Aalto University, Espoo, FI-00076, Finland
- Jarmo Malinen
- Department of Mathematics and Systems Analysis, Aalto University, Espoo, FI-00076, Finland
- Ahmed Geneid
- Department of Otorhinolaryngology and Phoniatrics-Head and Neck Surgery, Helsinki University Hospital and University of Helsinki, Helsinki, FI-00240, Finland
- Erkki Vilkman
- Department of Otorhinolaryngology and Phoniatrics-Head and Neck Surgery, Helsinki University Hospital and University of Helsinki, Helsinki, FI-00240, Finland
14. Birkholz P, Gabriel F, Kürbis S, Echternach M. How the peak glottal area affects linear predictive coding-based formant estimates of vowels. J Acoust Soc Am 2019; 146:223. PMID: 31370636. DOI: 10.1121/1.5116137.
Abstract
The estimation of formant frequencies from acoustic speech signals is mostly based on Linear Predictive Coding (LPC) algorithms. Since LPC is based on the source-filter model of speech production, the formant frequencies obtained are often implicitly regarded as those for an infinite glottal impedance, i.e., a closed glottis. However, previous studies have indicated that LPC-based formant estimates of vowels generated with a realistically varying glottal area may substantially differ from the resonances of the vocal tract with a closed glottis. In the present study, the deviation between closed-glottis resonances and LPC-estimated formants during phonation with different peak glottal areas has been systematically examined both using physical vocal tract models excited with a self-oscillating rubber model of the vocal folds, and by computer simulations of interacting source and filter models. Ten vocal tract resonators representing different vowels have been analyzed. The results showed that F1 increased with the peak area of the time-varying glottis, while F2 and F3 were not systematically affected. The effect of the peak glottal area on F1 was strongest for close-mid to close vowels, and more moderate for mid to open vowels.
Affiliation(s)
- Peter Birkholz
- Institute of Acoustics and Speech Communication, TU Dresden, 01062 Dresden, Germany
- Falk Gabriel
- Institute of Acoustics and Speech Communication, TU Dresden, 01062 Dresden, Germany
- Steffen Kürbis
- Institute of Acoustics and Speech Communication, TU Dresden, 01062 Dresden, Germany
- Matthias Echternach
- Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology, Munich University Hospital, LMU, Munich, Germany
15. Chen WR, Whalen DH, Shadle CH. F0-induced formant measurement errors result in biased variabilities. J Acoust Soc Am 2019; 145:EL360. PMID: 31153348. PMCID: PMC6909981. DOI: 10.1121/1.5103195.
Abstract
Many developmental studies attribute reduction of acoustic variability to increasing motor control. However, linear prediction-based formant measurements are known to be biased toward the nearest harmonic of F0, especially at high F0s. Thus, the amount of reported formant variability generated by changes in F0 is unknown. Here, 470,000 vowels were synthesized, mimicking statistics reported in four developmental studies, to estimate the proportion of formant variability that can be attributed to F0 bias, as well as to other formant measurement errors. Results showed that the F0-induced formant measurement errors are large and systematic, and cannot be eliminated by a large sample size.
Affiliation(s)
- Wei-Rong Chen
- Haskins Laboratories, 300 George Street, New Haven, Connecticut 06511, USA
- D H Whalen
- Haskins Laboratories, 300 George Street, New Haven, Connecticut 06511, USA
- Christine H Shadle
- Haskins Laboratories, 300 George Street, New Haven, Connecticut 06511, USA
16. Cortés JP, Espinoza VM, Ghassemi M, Mehta DD, Van Stan JH, Hillman RE, Guttag JV, Zañartu M. Ambulatory assessment of phonotraumatic vocal hyperfunction using glottal airflow measures estimated from neck-surface acceleration. PLoS One 2018; 13:e0209017. PMID: 30571719. PMCID: PMC6301575. DOI: 10.1371/journal.pone.0209017.
Abstract
Phonotraumatic vocal hyperfunction (PVH) is associated with chronic misuse and/or abuse of voice that can result in lesions such as vocal fold nodules. The clinical aerodynamic assessment of vocal function has recently been shown to differentiate between patients with PVH and healthy controls, providing meaningful insight into the pathophysiological mechanisms associated with these disorders. However, current clinical assessment of PVH remains incomplete because it cannot objectively identify the type and extent of detrimental phonatory function associated with PVH during daily voice use. The current study sought to address this issue by incorporating, for the first time in a comprehensive ambulatory assessment, glottal airflow parameters estimated from a neck-mounted accelerometer and recorded to a smartphone-based voice monitor. We tested this approach on 48 patients with vocal fold nodules and 48 matched healthy-control subjects, each of whom wore the voice monitor for a week. Seven glottal airflow features were estimated every 50 ms using an impedance-based inverse filtering scheme, and seven high-order summary statistics of each feature were computed every 5 minutes over voiced segments. Based on univariate hypothesis testing, eight glottal airflow summary statistics were found to be statistically different between the patient and healthy-control groups. L1-regularized logistic regression for a supervised classification task yielded a mean (standard deviation) area under the ROC curve of 0.82 (0.25) and an accuracy of 0.83 (0.14). These results outperform the state of the art for the same classification task and provide a new avenue for improving the assessment and treatment of hyperfunctional voice disorders.
Affiliation(s)
- Juan P. Cortés
- Department of Electronic Engineering, Universidad Técnica Federico Santa María, Valparaíso, Chile
- Víctor M. Espinoza
- Department of Electronic Engineering, Universidad Técnica Federico Santa María, Valparaíso, Chile
- Department of Sound, Universidad de Chile, Santiago, Chile
- Marzyeh Ghassemi
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, United States of America
- Daryush D. Mehta
- Center for Laryngeal Surgery and Voice Rehabilitation, and MGH Institute of Health Professions, Massachusetts General Hospital, Boston, MA, United States of America
- Department of Surgery, Harvard Medical School, Boston, MA, United States of America
- Jarrad H. Van Stan
- Center for Laryngeal Surgery and Voice Rehabilitation, and MGH Institute of Health Professions, Massachusetts General Hospital, Boston, MA, United States of America
- Department of Surgery, Harvard Medical School, Boston, MA, United States of America
- Robert E. Hillman
- Center for Laryngeal Surgery and Voice Rehabilitation, and MGH Institute of Health Professions, Massachusetts General Hospital, Boston, MA, United States of America
- John V. Guttag
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, United States of America
- Matías Zañartu
- Department of Electronic Engineering, Universidad Técnica Federico Santa María, Valparaíso, Chile
17. Saba JN, Ali H, Hansen JHL. Formant priority channel selection for an "n-of-m" sound processing strategy for cochlear implants. J Acoust Soc Am 2018; 144:3371. PMID: 30599685. PMCID: PMC6296909. DOI: 10.1121/1.5080257.
Abstract
The Advanced Combination Encoder (ACE) signal processing strategy is used in the majority of cochlear implant (CI) sound processors manufactured by Cochlear Corporation. This "n-of-m" strategy selects the "n" of "m" available frequency channels with the highest spectral energy in each stimulation cycle. It is hypothesized that at low signal-to-noise ratio (SNR) conditions, noise-dominant frequency channels are susceptible to selection, neglecting channels containing target speech cues. To improve speech segregation in noise, explicit encoding of formant frequency locations within the standard channel selection framework of ACE is suggested. Two strategies using direct formant estimation algorithms are developed in this study: FACE (formant-ACE) and VFACE (voiced-activated-formant-ACE). Speech intelligibility from eight CI users is compared across 11 acoustic conditions, including mixtures of noise and reverberation at multiple SNRs. Significant intelligibility gains were observed with VFACE over ACE in 5 dB babble noise; results with FACE and VFACE in all other conditions were comparable to standard ACE. An increased selection of channels associated with the second formant frequency is observed for FACE and VFACE. Both proposed methods may serve as supplementary channel selection techniques for the ACE sound processing strategy.
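The selection rule itself is simple: each stimulation cycle keeps the n highest-energy channels, and a formant-priority variant force-includes the channels holding formant peaks before filling the remaining slots by energy rank. A minimal sketch of that idea (the channel counts and energies are made up, and this is not Cochlear's ACE implementation):

```python
import numpy as np

def n_of_m(energies, n):
    """Standard n-of-m: keep the n channels with the highest spectral energy."""
    return sorted(np.argsort(energies)[::-1][:n].tolist())

def formant_priority(energies, n, formant_channels):
    """Formant-priority variant: force-include channels carrying formant peaks,
    then fill the remaining slots by energy rank."""
    keep = set(formant_channels)
    for ch in np.argsort(energies)[::-1].tolist():
        if len(keep) >= n:
            break
        keep.add(ch)
    return sorted(keep)

# 8 channels; noise dominates channels 6-7, and suppose F2 sits in channel 3
e = np.array([0.2, 0.1, 0.3, 0.12, 0.1, 0.15, 0.9, 0.8])
plain = n_of_m(e, 4)                   # channel 3 loses out to noise-dominant channels
prior = formant_priority(e, 4, [3])    # channel 3 is retained
```

This illustrates the failure mode the abstract hypothesizes (noise crowding out speech-cue channels at low SNR) and the remedy of encoding formant locations into the selection step.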
Affiliation(s)
- Juliana N Saba
- Cochlear Implant Processing Laboratory-Center for Robust Speech Systems, University of Texas at Dallas, 800 West Campbell Road, Richardson, Texas 75080, USA
- Hussnain Ali
- Cochlear Implant Processing Laboratory-Center for Robust Speech Systems, University of Texas at Dallas, 800 West Campbell Road, Richardson, Texas 75080, USA
- John H L Hansen
- Cochlear Implant Processing Laboratory-Center for Robust Speech Systems, University of Texas at Dallas, 800 West Campbell Road, Richardson, Texas 75080, USA
18. Kent RD, Vorperian HK. Static measurements of vowel formant frequencies and bandwidths: A review. J Commun Disord 2018; 74:74-97. PMID: 29891085. PMCID: PMC6002811. DOI: 10.1016/j.jcomdis.2018.05.004.
Abstract
Purpose: Data on vowel formants have been derived primarily from static measures representing an assumed steady state. This review summarizes data on formant frequencies and bandwidths for American English and also addresses (a) sources of variability (focusing on speech sample and time sampling point) and (b) methods of data reduction such as vowel area and dispersion.
Method: Searches were conducted with CINAHL, Google Scholar, MEDLINE/PubMed, SCOPUS, and other online sources, including legacy articles and references. The primary search terms were vowels, vowel space area, vowel dispersion, formants, formant frequency, and formant bandwidth.
Results: Data on formant frequencies and bandwidths are available for both sexes over the lifespan, but considerable variability in results across studies affects even features of the basic vowel quadrilateral. Origins of variability likely include differences in speech sample and time sampling point. The data reveal the emergence of sex differences by 4 years of age, maturational reductions in formant bandwidth, and decreased formant frequencies with advancing age in some persons. It appears that a combination of data reduction methods provides for optimal data interpretation.
Conclusion: The lifespan database on vowel formants shows considerable variability within specific age-sex groups, pointing to the need for standardized procedures.
Affiliation(s)
- Raymond D Kent
- Waisman Center, University of Wisconsin-Madison, United States.
19. Jacewicz E, Fox RA. Regional Variation in Fundamental Frequency of American English Vowels. Phonetica 2018; 75:273-309. PMID: 29649804. DOI: 10.1159/000484610.
Abstract
We examined whether the fundamental frequency (f0) of vowels is influenced by regional variation, aiming to (1) establish how the relationship between vowel height and f0 ("intrinsic f0") is utilized in regional vowel systems and (2) determine whether regional varieties differ in their implementation of the effects of phonetic context on f0 variations. An extended set of acoustic measures explored f0 in vowels in isolated tokens (experiment 1) and in connected speech (experiment 2) from 36 women representing 3 different varieties of American English. Regional differences were found in f0 shape in isolated tokens, in the magnitude of intrinsic f0 difference between high and low vowels, in the nature of f0 contours in stressed vowels, and in the completion of f0 contours in the context of coda voicing. Regional varieties utilize f0 control in vowels in different ways, including regional f0 ranges and variation in f0 shape.
20
Gowda D, Airaksinen M, Alku P. Quasi-closed phase forward-backward linear prediction analysis of speech for accurate formant detection and estimation. J Acoust Soc Am 2017;142:1542. PMID: 28964072. DOI: 10.1121/1.5001512.
Abstract
Recently, a quasi-closed phase (QCP) analysis of speech signals for accurate glottal inverse filtering was proposed. However, QCP analysis, which belongs to the family of temporally weighted linear prediction (WLP) methods, uses the conventional forward type of sample prediction. This may not be the best choice, especially when computing WLP models with a hard-limiting weighting function: a sample-selective minimization of the prediction error in WLP reduces the effective number of samples available within a given window frame. To counter this problem, a modified quasi-closed phase forward-backward (QCP-FB) analysis is proposed, wherein each sample is predicted from both its past and future samples, thereby using the available samples more effectively. Formant detection and estimation experiments on synthetic vowels generated using a physical modeling approach, as well as on natural speech utterances, show that the proposed QCP-FB method yields statistically significant improvements over the conventional linear prediction and QCP methods.
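The forward-backward idea can be sketched as a weighted least-squares problem in which each sample is regressed on both its preceding and its following neighbors, sharing one coefficient vector. The code below is a minimal illustration with a uniform weight, not the authors' QCP-FB implementation, which derives its weighting function from the quasi-closed phase of the glottal cycle.

```python
import numpy as np

def wlp_fb(x, order, w):
    """Weighted forward-backward linear prediction (uniform-weight sketch).
    Minimizes the weighted sum of forward and backward prediction errors."""
    rows, targets, weights = [], [], []
    N = len(x)
    # forward: predict x[n] from the 'order' preceding samples
    for n in range(order, N):
        rows.append(x[n - order:n][::-1])
        targets.append(x[n])
        weights.append(w[n])
    # backward: predict x[n] from the 'order' following samples (same coefficients)
    for n in range(N - order):
        rows.append(x[n + 1:n + order + 1])
        targets.append(x[n])
        weights.append(w[n])
    A = np.asarray(rows)
    b = np.asarray(targets)
    sw = np.sqrt(np.asarray(weights))
    a, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return a  # predictor coefficients; pole angles give formant frequencies

# Toy check: a single decaying resonance near 500 Hz at fs = 8000 Hz
fs = 8000
n = np.arange(400)
x = 0.98 ** n * np.cos(2 * np.pi * 500 * n / fs)
a = wlp_fb(x, 2, np.ones(len(x)))
poles = np.roots(np.concatenate(([1.0], -a)))
f_est = abs(np.angle(poles[0])) * fs / (2 * np.pi)
print(round(f_est))  # estimated resonance, close to 500 Hz
```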
Affiliation(s)
- Dhananjaya Gowda, Department of Signal Processing and Acoustics, Aalto University, Otakaari 5, FI-00076 Espoo, Finland
- Manu Airaksinen, Department of Signal Processing and Acoustics, Aalto University, Otakaari 5, FI-00076 Espoo, Finland
- Paavo Alku, Department of Signal Processing and Acoustics, Aalto University, Otakaari 5, FI-00076 Espoo, Finland
21
Alzamendi GA, Schlotthauer G. Modeling and joint estimation of glottal source and vocal tract filter by state-space methods. Biomed Signal Process Control 2017. DOI: 10.1016/j.bspc.2016.12.022.
22
Chien YR, Mehta DD, Guðnason J, Zañartu M, Quatieri TF. Evaluation of Glottal Inverse Filtering Algorithms Using a Physiologically Based Articulatory Speech Synthesizer. IEEE/ACM Trans Audio Speech Lang Process 2017;25:1718-1730. PMID: 34268444. PMCID: PMC8279087. DOI: 10.1109/taslp.2017.2714839.
Abstract
Glottal inverse filtering aims to estimate the glottal airflow signal from a speech signal for applications such as speaker recognition and clinical voice assessment. Nonetheless, evaluation of inverse filtering algorithms has been challenging because of the practical difficulty of directly measuring glottal airflow. Moreover, the performance of many methods is known to degrade in voice conditions of great interest, such as breathiness, high pitch, soft voice, and running speech. This paper presents a comprehensive, objective, and comparative evaluation of state-of-the-art inverse filtering algorithms that takes advantage of speech and glottal airflow signals generated by a physiological speech synthesizer. The synthesizer provides a physics-based simulation of the voice production process and thus an adequate test bed for revealing the temporal and spectral performance characteristics of each algorithm. The synthetic data include continuous speech utterances and sustained vowels produced with multiple voice qualities (pressed, slightly pressed, modal, slightly breathy, and breathy), fundamental frequencies, and subglottal pressures to simulate the natural variation in real speech. In evaluating the accuracy of a glottal flow estimate, multiple error measures are used, including an error in the estimated signal that measures overall waveform deviation, as well as an error in each of several clinically relevant features extracted from the glottal flow estimate. Waveform errors calculated from the glottal flow estimation experiments had mean values of around 30% of the amplitude of the true glottal flow derivative for sustained vowels and around 40% for continuous speech. Closed-phase approaches showed remarkable stability across different voice qualities and subglottal pressures. The algorithms of choice, as suggested by significance tests, are closed-phase covariance analysis for sustained vowels and sparse linear prediction for continuous speech. Data subset analysis suggests that close rounded vowels pose an additional challenge for glottal flow estimation.
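As an illustration of the waveform-error idea, the sketch below normalizes the mean absolute deviation of an estimate by the peak-to-peak amplitude of the true glottal flow derivative. This is one plausible reading of such a measure; the paper's exact error definition may differ.

```python
import numpy as np

def waveform_error(est, true):
    """Mean absolute deviation of the estimate, as a percentage of the
    peak-to-peak amplitude of the reference waveform (illustrative measure)."""
    amp = np.max(true) - np.min(true)
    return 100.0 * np.mean(np.abs(est - true)) / amp

t = np.linspace(0.0, 1.0, 1000)
true = np.sin(2 * np.pi * 5 * t)   # stand-in for a glottal flow derivative
est = true + 0.1                   # estimate with a constant offset
print(waveform_error(est, true))   # ≈ 5.0
```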
Affiliation(s)
- Yu-Ren Chien, Center for Analysis and Design of Intelligent Agents, Reykjavik University, Menntavegur 1, Iceland
- Daryush D Mehta, Center for Laryngeal Surgery and Voice Rehabilitation, and Institute of Health Professions, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Surgery, Harvard Medical School, Boston, MA 02115, USA; and MIT Lincoln Laboratory, Lexington, MA
- Jón Guðnason, Center for Analysis and Design of Intelligent Agents, Reykjavik University, Menntavegur 1, Iceland
- Matías Zañartu, Department of Electronic Engineering, Universidad Técnica Federico Santa María, Valparaíso, Chile, 2390123
23
Oohashi H, Watanabe H, Taga G. Acquisition of vowel articulation in childhood investigated by acoustic-to-articulatory inversion. Infant Behav Dev 2017;46:178-193. PMID: 28222332. DOI: 10.1016/j.infbeh.2017.01.007.
Abstract
While the acoustic features of speech sounds in children have been studied extensively, limited information is available about their articulation during speech production. Instead of directly measuring articulatory movements, this study used an acoustic-to-articulatory inversion model with scalable vocal tract size to estimate developmental changes in articulatory state during vowel production. Using a pseudo-inverse Jacobian matrix of a model mapping seven articulatory parameters to acoustic ones, the formant frequencies of each vowel produced by three Japanese children at ages between 6 and 60 months were transformed into articulatory parameters. A discriminant analysis was conducted to reveal differences in articulatory states for the production of each vowel. The analysis suggested that the development of vowel production proceeds through gradual functionalization of articulatory parameters. At 6-9 months, coordination of tongue-body position and lip aperture forms three vowels: front, back, and central. At 10-17 months, recruitment of the jaw and tongue apex enables differentiation of these three vowels into five. At 18 months and older, recruitment of tongue shape produces more distinct vowels specific to Japanese. These results suggest that the jaw and tongue apex contribute to speech production by young children regardless of vowel type. Moreover, initial articulatory states for each vowel could be distinguished by the manner of coordination between lip and tongue, and these initial states are differentiated and refined into articulations adjusted to the native language over the course of development.
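The inversion step described here, mapping a formant mismatch back onto articulatory parameters through a pseudo-inverse Jacobian, can be sketched in a few lines. The Jacobian below is a random stand-in for the model's true sensitivity matrix, and the formant values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=(3, 7))  # hypothetical Jacobian: d(F1..F3)/d(7 articulatory params)
current_formants = np.array([500.0, 1500.0, 2500.0])  # model output at current posture (Hz)
target_formants = np.array([520.0, 1450.0, 2550.0])   # formants measured from the child (Hz)

# minimum-norm articulatory update that accounts for the formant mismatch
delta = np.linalg.pinv(J) @ (target_formants - current_formants)
print(delta.shape)  # (7,)
```

Because the system is underdetermined (three formants, seven parameters), the pseudo-inverse picks the smallest articulatory change consistent with the acoustic evidence.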
Affiliation(s)
- Hiroki Oohashi, Graduate School of Education, The University of Tokyo, Bunkyo-ku 7-3-1, 113-0033 Tokyo, Japan
- Hama Watanabe, Graduate School of Education, The University of Tokyo, Bunkyo-ku 7-3-1, 113-0033 Tokyo, Japan
- Gentaro Taga, Graduate School of Education, The University of Tokyo, Bunkyo-ku 7-3-1, 113-0033 Tokyo, Japan
24
Using Innovative Acoustic Analysis to Predict the Postoperative Outcomes of Unilateral Vocal Fold Paralysis. Biomed Res Int 2016;2016:7821415. PMID: 27738634. PMCID: PMC5050388. DOI: 10.1155/2016/7821415.
Abstract
Objective. Autologous fat injection laryngoplasty is ineffective for some patients with iatrogenic vocal fold paralysis, and additional laryngeal framework surgery is often required. An acoustically measurable outcome predictor for lipoinjection laryngoplasty would assist phonosurgeons in formulating treatment strategies. Methods. Seventeen thyroid surgery patients with unilateral vocal fold paralysis participated in this study. All subjects underwent lipoinjection laryngoplasty to treat postsurgical vocal hoarseness. After treatment, patients were assigned to success and failure groups on the basis of voice improvement. Linear prediction (LP) analysis was used to construct a new voice quality indicator, the number of irregular peaks (NIrrP). It was compared with measures used in the Multi-Dimensional Voice Program (MDVP), such as jitter (frequency perturbation) and shimmer (amplitude perturbation). Results. For the [i] vowel produced by patients before lipoinjection laryngoplasty, the receiver operating characteristic curve identified NIrrP (AUC = 0.98, 95% CI = 0.78–0.99) as a more accurate predictor of long-term surgical outcomes than jitter (AUC = 0.73, 95% CI = 0.47–0.91) or shimmer (AUC = 0.63, 95% CI = 0.37–0.85). Conclusions. NIrrP measured using the LP model could be a more accurate outcome predictor than the parameters used in the MDVP.
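The AUC values reported here are areas under receiver operating characteristic curves, equivalent to the Mann-Whitney probability that a randomly chosen success-group score outranks a randomly chosen failure-group score. A minimal sketch, with hypothetical predictor scores rather than the study's data:

```python
def roc_auc(pos, neg):
    """AUC as the probability that a positive outranks a negative
    (Mann-Whitney U statistic), with ties counted as 0.5."""
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))

# hypothetical predictor scores for surgical-success vs failure patients
success = [12, 15, 9, 14]
failure = [8, 10, 7]
print(roc_auc(success, failure))  # 11/12 ≈ 0.917
```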
25
Story BH, Bunton K. Formant measurement in children's speech based on spectral filtering. Speech Commun 2016;76:93-111. PMID: 26855461. PMCID: PMC4743040. DOI: 10.1016/j.specom.2015.11.001.
Abstract
Children's speech presents a challenging problem for formant frequency measurement. In part, this is because the high fundamental frequencies typical of children's speech generate widely spaced harmonic components that may undersample the spectral shape of the vocal tract transfer function. In addition, there is often a weakening of upper harmonic energy and a noise component due to glottal turbulence. The purpose of this study was to develop a formant measurement technique based on cepstral analysis that requires neither modification of the cepstrum itself nor transformation back to the spectral domain. Instead, a narrow-band spectrum is low-pass filtered with a cutoff point (i.e., a cutoff "quefrency," in the terminology of cepstral analysis) chosen to preserve only the spectral envelope. To test the method, speech representative of a 2-3-year-old child was simulated with an airway modulation model of speech production. The model, which includes physiologically scaled vocal folds and vocal tract, generates sound output analogous to a microphone signal. The vocal tract resonance frequencies can be calculated independently of the output signal and thus provide test cases for assessing the accuracy of the formant tracking algorithm. When applied to the simulated child-like speech, the spectral filtering approach provided a clear spectrographic representation of formant change over the time course of the signal and facilitated tracking of formant frequencies for further analysis.
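The core of the spectral filtering idea, low-pass filtering a spectrum along its frequency axis so that only the slowly varying envelope survives, can be sketched on a toy log spectrum. The paper applies this to narrow-band spectra of simulated and real speech; the synthetic spectrum below is only for illustration.

```python
import numpy as np

def envelope_from_spectrum(log_spec, cutoff_bin):
    """Low-pass filter a log-magnitude spectrum along frequency: transform it,
    zero components above the cutoff 'quefrency' bin, and invert. This keeps
    the slowly varying envelope and discards harmonic ripple."""
    c = np.fft.rfft(log_spec)
    c[cutoff_bin:] = 0.0
    return np.fft.irfft(c, n=len(log_spec))

# toy log spectrum: slow "formant" envelope plus fast harmonic ripple
k = np.arange(512)
slow = np.cos(2 * np.pi * 2 * k / 512)           # envelope (2 cycles across the band)
ripple = 0.5 * np.cos(2 * np.pi * 50 * k / 512)  # harmonic structure (50 cycles)
env = envelope_from_spectrum(slow + ripple, 10)
print(np.max(np.abs(env - slow)) < 1e-8)  # True: ripple removed, envelope kept
```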
Affiliation(s)
- Brad H. Story, Speech Acoustics Laboratory, Department of Speech, Language, and Hearing Sciences, University of Arizona, P.O. Box 210071, Tucson, AZ 85721
- Kate Bunton, Speech Acoustics Laboratory, Department of Speech, Language, and Hearing Sciences, University of Arizona, P.O. Box 210071, Tucson, AZ 85721
26
Shadle CH, Nam H, Whalen DH. Comparing measurement errors for formants in synthetic and natural vowels. J Acoust Soc Am 2016;139:713-727. PMID: 26936555. PMCID: PMC4752539. DOI: 10.1121/1.4940665.
Abstract
The measurement of formant frequencies of vowels is among the most common measurements in speech studies, but such measurements are known to be biased by the particular fundamental frequency (F0) exciting the formants. Approaches to reducing the errors were assessed in two experiments. In the first, synthetic vowels were constructed with five different first formant (F1) values and nine different F0 values; formant bandwidths and higher formant frequencies were held constant. Input formant values were compared with manual measurements and with automatic measures using the linear prediction coding-Burg algorithm, linear prediction closed-phase covariance, the weighted linear prediction-attenuated main excitation (WLP-AME) algorithm [Alku, Pohjalainen, Vainio, Laukkanen, and Story (2013). J. Acoust. Soc. Am. 134(2), 1295-1313], and spectra smoothed cepstrally and by averaging repeated discrete Fourier transforms. Formants were also measured manually from pruned reassigned spectrograms (RSs) [Fulop (2011). Speech Spectrum Analysis (Springer, Berlin)]. All methods but WLP-AME and RS had large errors in the direction of the strongest harmonic; the smallest errors occurred with WLP-AME and RS. In the second experiment, these methods were used on vowels in isolated words spoken by four speakers. Results for the natural speech show that F0 bias affects all automatic methods, including WLP-AME; only the formants measured manually from RS appeared to be accurate. In addition, RS coped better with weaker formants and glottal fry.
Affiliation(s)
- Christine H Shadle, Haskins Laboratories, 300 George Street, New Haven, Connecticut 06511, USA
- Hosung Nam, Haskins Laboratories, 300 George Street, New Haven, Connecticut 06511, USA
- D H Whalen, Haskins Laboratories, 300 George Street, New Haven, Connecticut 06511, USA
27
Montgomery A, Reed PE, Crass KA, Hubbard HI, Stith J. The effects of measurement error and vowel selection on the locus equation measure of coarticulation. J Acoust Soc Am 2014;136:2747-2750. PMID: 25373974. DOI: 10.1121/1.4896460.
Abstract
The effects on locus equation (LE) slopes of introducing error into the F2 values and of using fewer than ten vowels were investigated. For each of the initial consonants /b, d, g/, 2000 simulated sets were generated using Monte Carlo techniques. The sets were altered by randomly applying 50, 100, or 200 Hz of error to each F2 value within a set. Selected vowels were then removed from the sets, and the effects on the slopes were measured. Results suggest that LE slopes are generally resistant to both error and a reduced number of vowels. The effects of adding 50 Hz of random error to the F2 values in sets of eight or ten vowels were minimal, yielding a mean absolute change in slope of less than 0.07.
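The Monte Carlo design can be sketched directly: fit the LE slope by least squares, perturb every F2 value with random error, and measure the slope change over many simulated sets. The vowel values, LE parameters, and the choice of uniform ±50 Hz error below are illustrative assumptions, not the paper's exact settings.

```python
import random

def le_slope(f2_vowel, f2_onset):
    """Ordinary least-squares slope of F2 onset regressed on F2 vowel midpoint."""
    n = len(f2_vowel)
    mx = sum(f2_vowel) / n
    my = sum(f2_onset) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(f2_vowel, f2_onset))
    sxx = sum((x - mx) ** 2 for x in f2_vowel)
    return sxy / sxx

random.seed(1)
vowels = [900, 1100, 1300, 1500, 1700, 1900, 2100, 2300]  # hypothetical F2 midpoints (Hz)
true_slope, intercept = 0.7, 500.0                         # /b/-like LE, for illustration
onsets = [true_slope * v + intercept for v in vowels]

# perturb each F2 value with uniform +/-50 Hz error, 2000 simulated sets
changes = []
for _ in range(2000):
    v = [x + random.uniform(-50, 50) for x in vowels]
    o = [y + random.uniform(-50, 50) for y in onsets]
    changes.append(abs(le_slope(v, o) - true_slope))
print(sum(changes) / len(changes))  # mean absolute slope change; stays well below 0.07
```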
Affiliation(s)
- Allen Montgomery, Department of Communication Sciences and Disorders, University of South Carolina, Columbia, South Carolina 29208
- Paul E Reed, Linguistics Program, University of South Carolina, Columbia, South Carolina 29208
- Kimberlee A Crass, Department of Communication Sciences and Disorders, University of South Carolina, Columbia, South Carolina 29208
- H Isabel Hubbard, Department of Communication Sciences and Disorders, University of South Carolina, Columbia, South Carolina 29208
- Joanna Stith, Listen Foundation, Greenwood Village, Colorado 80111