1
Mahmood SA. Multi-Dimensional Features Extraction for Voice Pathology Detection Based on Deep Learning Methods. J Voice 2025:S0892-1997(24)00486-7. PMID: 39894721. DOI: 10.1016/j.jvoice.2024.12.048.
Abstract
PURPOSE Voice pathology detection is a rapidly evolving field of research focused on the identification and diagnosis of voice disorders. Early detection and diagnosis of these disorders are critical, as they increase the likelihood of effective treatment and reduce the burden on medical professionals. METHODS The objective of this paper is to develop a comprehensive model that utilizes deep learning techniques to improve the detection of voice pathology. To this end, several techniques are employed to extract a set of sensitive features from the original voice signal by analyzing its time-frequency characteristics. A state-of-the-art approach combining Gammatonegram features with scalogram Teager-Kaiser Energy Operator (TKEO) features is proposed, and the resulting feature extraction scheme is named Combined Gammatonegram with TKEO Scalogram (CGT Scalogram). A ResNet deep learning model is used to distinguish healthy voices from pathological voices. To evaluate its performance, the proposed model is trained and tested on the Saarbrücken Voice Database. RESULTS The proposed system yielded an accuracy of 96%, a precision of 96.3%, and a recall of 96.1% for binary classification, and an accuracy of 94.4%, a precision of 94.5%, and a recall of 94% for multi-class classification. CONCLUSION The results of the experiments demonstrate the effectiveness of the feature selection technique in maximizing prediction accuracy in both binary and multi-class classification.
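The Teager-Kaiser Energy Operator named above has a standard discrete form, ψ[x](n) = x(n)² − x(n−1)·x(n+1), which jointly tracks instantaneous amplitude and frequency. A minimal sketch of that operator alone (not the authors' full Gammatonegram/scalogram pipeline):

```python
import numpy as np

def tkeo(x):
    """Discrete Teager-Kaiser Energy Operator:
    psi[x](n) = x(n)^2 - x(n-1)*x(n+1), defined for 1 <= n <= N-2."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure sinusoid x(n) = A*cos(w*n), the operator is exactly
# A^2 * sin(w)^2, i.e. it encodes both amplitude and frequency.
n = np.arange(1000)
x = 2.0 * np.cos(0.1 * n)
psi = tkeo(x)
print(psi.mean())  # ~ 4 * sin(0.1)^2 ≈ 0.0399
```

Applying this operator to each scale of a wavelet scalogram gives the "TKEO scalogram" component of the proposed feature set.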
Affiliation(s)
- Sozan Abdullah Mahmood
- Computer Department, College of Science, University of Sulaimani, Sulaimaniyah 46001, Kurdistan, Iraq.
2
Schraut T, Döllinger M, Kunduk M, Echternach M, Dürr S, Werz J, Schützenberger A. Machine Learning-Based Estimation of Hoarseness Severity Using Acoustic Signals Recorded During High-Speed Videoendoscopy. J Voice 2025:S0892-1997(24)00437-5. PMID: 39755525. DOI: 10.1016/j.jvoice.2024.12.008.
Abstract
OBJECTIVES This study investigates the use of sustained phonations recorded during high-speed videoendoscopy (HSV) for machine learning-based assessment of hoarseness severity (H). The performance of this approach is compared with conventional recordings obtained during voice therapy to evaluate key differences and limitations of HSV-derived acoustic recordings. METHODS A database of 617 voice recordings with a duration of 250 ms was gathered during HSV examination (HS). Two databases comprising 809 vowels recorded during voice therapy were used for comparison, examining recording durations of 1 second (VT-1) and 250 ms (VT-2). A total of 490 features were extracted, including perturbation and noise characteristics, spectral and cepstral coefficients, as well as features based on modulation spectrum, nonlinear dynamic analysis, entropy, and empirical mode decomposition. Model development focused on selecting a minimal-optimal feature subset and suitable classification algorithms. Recordings were classified into two groups of hoarseness based on auditory-perceptual ratings by experts, yielding a continuous hoarseness score ŷ. Model performance was evaluated based on classification accuracy, correlation between predicted scores ŷ ∈ [0,1] and subjective ratings H ∈ {0,1,2,3}, and correlation between the relative change in quantitative and subjective ratings. RESULTS Logistic regression combined with five acoustic features achieved a classification accuracy of 0.863 (VT-1), 0.847 (VT-2), and 0.742 (HS) on the test sets. A correlation of 0.797 (VT-1), 0.763 (VT-2), and 0.637 (HS) was obtained between ŷ and H, respectively. For 21 test subjects with two recordings, the model yielded a correlation of 0.592 (VT-1), 0.486 (VT-2), and 0.088 (HS) between Δŷ and ΔH.
CONCLUSION While acoustic signals recorded during HSV show potential for quantitative hoarseness assessment, they are less reliable than voice therapy recordings due to practical challenges associated with oral laryngeal examination. Addressing these limitations, for example, through the use of flexible nasal endoscopy, could improve the quality of HSV-derived acoustic recordings and voice assessments.
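The continuous score ŷ ∈ [0,1] described above is, in essence, the positive-class probability of a logistic regression over a small acoustic feature set. A self-contained sketch with synthetic stand-in features (the study's actual five features and recordings are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for five acoustic features (e.g. perturbation or
# cepstral measures); values and weights are illustrative, not the study's.
n = 400
X = rng.normal(size=(n, 5))
w_true = np.array([1.5, -2.0, 0.8, 0.0, 1.0])
y = (X @ w_true + 0.3 * rng.normal(size=n) > 0).astype(float)  # H >= 2 vs H < 2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Plain full-batch gradient descent for logistic regression.
w, b = np.zeros(5), 0.0
for _ in range(2000):
    p = sigmoid(X @ w + b)
    w -= 0.1 * X.T @ (p - y) / n
    b -= 0.1 * np.mean(p - y)

score = sigmoid(X @ w + b)          # continuous hoarseness score in [0, 1]
pred = (score >= 0.5).astype(float)  # binary decision at threshold 0.5
print("train accuracy:", np.mean(pred == y))
```

The same fitted model yields both the binary classification (via thresholding) and the continuous score correlated against subjective ratings.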
Affiliation(s)
- Tobias Schraut
- Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany.
- Michael Döllinger
- Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
- Melda Kunduk
- Department of Communication Sciences and Disorders, Louisiana State University, Baton Rouge, LA 70803
- Matthias Echternach
- Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Munich, Ludwig-Maximilian-Universität München, 81377 Munich, Germany
- Stephan Dürr
- Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, University Hospital Regensburg, Universität Regensburg, 93053 Regensburg, Germany
- Julia Werz
- Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
- Anne Schützenberger
- Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
3
Barlow J, Sragi Z, Rivera-Rivera G, Al-Awady A, Daşdöğen Ü, Courey MS, Kirke DN. The Use of Deep Learning Software in the Detection of Voice Disorders: A Systematic Review. Otolaryngol Head Neck Surg 2024; 170:1531-1543. PMID: 38168017. DOI: 10.1002/ohn.636.
Abstract
OBJECTIVE To summarize the use of deep learning in the detection of voice disorders using acoustic and laryngoscopic input, compare specific neural networks in terms of accuracy, and assess their effectiveness compared with expert clinical visual examination. DATA SOURCES Embase, MEDLINE, and Cochrane Central. REVIEW METHODS Databases were screened through November 11, 2023 for relevant studies. The inclusion criteria required studies to utilize a specified deep learning method, use laryngoscopy or acoustic input, and measure accuracy of binary classification between healthy patients and those with voice disorders. RESULTS Thirty-four studies met the inclusion criteria, with 18 focusing on voice analysis, 15 on imaging analysis, and 1 on both. Across the 18 acoustic studies, 21 programs were used for identification of organic and functional voice disorders. These technologies included 10 convolutional neural networks (CNNs), 6 multilayer perceptrons (MLPs), and 5 other neural networks. The binary classification systems yielded a mean accuracy of 89.0% overall, including 93.7% for MLP programs and 84.5% for CNNs. Among the 15 imaging analysis studies, a total of 23 programs were utilized, resulting in a mean accuracy of 91.3%. Specifically, the 20 CNNs achieved a mean accuracy of 92.6% compared with 83.0% for the 3 MLPs. CONCLUSION Deep learning models were shown to be highly accurate in the detection of voice pathology, with CNNs most effective for assessing laryngoscopy images and MLPs most effective for assessing acoustic input. While deep learning methods outperformed expert clinical examination in limited comparisons, further studies incorporating external validation are necessary.
Affiliation(s)
- Joshua Barlow
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Zara Sragi
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Gabriel Rivera-Rivera
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Abdurrahman Al-Awady
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Ümit Daşdöğen
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Mark S Courey
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Diana N Kirke
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
4
Schraut T, Schützenberger A, Arias-Vergara T, Kunduk M, Echternach M, Döllinger M. Machine learning based estimation of hoarseness severity using sustained vowels. J Acoust Soc Am 2024; 155:381-395. PMID: 38240668. DOI: 10.1121/10.0024341.
Abstract
Auditory perceptual evaluation is considered the gold standard for assessing voice quality, but its reliability is limited due to inter-rater variability and coarse rating scales. This study investigates a continuous, objective approach to evaluating hoarseness severity combining machine learning (ML) and sustained phonation. For this purpose, 635 acoustic recordings of the sustained vowel /a/ and subjective ratings based on the roughness, breathiness, and hoarseness scale were collected from 595 subjects. A total of 50 temporal, spectral, and cepstral features were extracted from each recording and used to identify suitable ML algorithms. Using variance and correlation analysis followed by backward elimination, a subset of relevant features was selected. Recordings were classified into two levels of hoarseness, H < 2 and H ≥ 2, yielding a continuous probability score ŷ ∈ [0,1]. An accuracy of 0.867 and a correlation of 0.805 between the model's predictions and subjective ratings were obtained using only five acoustic features and logistic regression (LR). Further examination of recordings pre- and post-treatment revealed high qualitative agreement with the change in subjectively determined hoarseness levels. Quantitatively, a moderate correlation of 0.567 was obtained. This quantitative approach to hoarseness severity estimation shows promising results and potential for improving the assessment of voice quality.
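The variance- and correlation-based filtering described above can be sketched as follows: drop near-constant features, then drop one of each highly correlated pair. Thresholds and data here are illustrative, not the study's, and the subsequent backward-elimination step is omitted.

```python
import numpy as np

def filter_features(X, var_thresh=1e-4, corr_thresh=0.95):
    """Variance filter followed by a pairwise-correlation filter.
    Returns indices of retained columns. Thresholds are illustrative."""
    # 1) Drop features with (near-)zero variance.
    keep = [j for j in range(X.shape[1]) if X[:, j].var() > var_thresh]
    Xk = X[:, keep]
    # 2) Greedily drop any feature highly correlated with one already kept.
    corr = np.abs(np.corrcoef(Xk, rowvar=False))
    retained = []
    for j in range(len(keep)):
        if all(corr[j, r] < corr_thresh for r in retained):
            retained.append(j)
    return [keep[j] for j in retained]

rng = np.random.default_rng(1)
a = rng.normal(size=(100, 1))
X = np.hstack([
    a,
    a + 1e-5 * rng.normal(size=(100, 1)),  # near-duplicate of column 0
    rng.normal(size=(100, 1)),             # independent feature
    np.full((100, 1), 3.0),                # constant feature
])
print(filter_features(X))  # -> [0, 2]: constant and duplicate columns dropped
```

Starting from 50 extracted features, filters of this kind shrink the candidate set before backward elimination selects the final five.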
Affiliation(s)
- Tobias Schraut
- Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
- Anne Schützenberger
- Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
- Tomás Arias-Vergara
- Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
- Melda Kunduk
- Department of Communication Sciences and Disorders, Louisiana State University, Baton Rouge, Louisiana 70803, USA
- Matthias Echternach
- Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Munich, Ludwig-Maximilians-Universität München, 81377 Munich, Germany
- Michael Döllinger
- Division of Phoniatrics and Pediatric Audiology at the Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
5
Shaikh AAS, Bhargavi MS, Naik GR. Unraveling the complexities of pathological voice through saliency analysis. Comput Biol Med 2023; 166:107566. PMID: 37857135. DOI: 10.1016/j.compbiomed.2023.107566.
Abstract
The human voice is an essential communication tool, but various disorders and habits can disrupt it, making the diagnosis of pathological and abnormal voices important. Conventional diagnosis of these voice pathologies can be invasive and costly. Voice pathology disorders can be effectively detected using artificial intelligence and computer-aided voice pathology classification tools. Previous studies focused primarily on binary classification, leaving limited attention to multi-class classification. This study proposes three neural network architectures to investigate the feature characteristics of three voice pathologies (Hyperkinetic Dysphonia, Hypokinetic Dysphonia, and Reflux Laryngitis) and healthy voices using multi-class classification and the VOice ICar fEDerico II (VOICED) dataset. To overcome noisy data, the study proposes UNet++ autoencoder-based denoising techniques for accurate feature extraction. The architectures include a Multi-Layer Perceptron (MLP) trained on structured feature sets, a Short-Time Fourier Transform (STFT) model, and a Mel-Frequency Cepstral Coefficients (MFCC) model. The MLP model, trained on 143 features, achieved 97.1% accuracy, while the STFT model showed similar performance with an increased sensitivity of 99.8%. The MFCC model maintained 97.1% accuracy with a smaller model size and improved accuracy on the Reflux Laryngitis class. Through saliency analysis, the study identifies crucial features and reveals that detecting voice abnormalities requires identifying regions of inaudible high-pitch sounds. Additionally, the study highlights the challenges posed by limited and disjointed pathological voice databases and proposes solutions for enhancing the performance of voice abnormality classification. Overall, the study's findings have potential applications in clinical settings and specialized audio-capturing tools.
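Saliency analysis attributes a classifier's decision to regions of its input. A simple occlusion-based variant (mask each patch of a spectrogram and measure the score drop) illustrates the idea; the toy classifier below merely mimics the reported sensitivity to high-pitch regions and is not the study's trained network.

```python
import numpy as np

def occlusion_saliency(score_fn, spec, patch=4):
    """Occlusion saliency map: zero out each patch of a (time x freq)
    spectrogram and record how much the classifier score drops."""
    base = score_fn(spec)
    sal = np.zeros_like(spec)
    for i in range(0, spec.shape[0], patch):
        for j in range(0, spec.shape[1], patch):
            masked = spec.copy()
            masked[i:i + patch, j:j + patch] = 0.0
            sal[i:i + patch, j:j + patch] = base - score_fn(masked)
    return sal

# Toy "classifier" whose score depends only on a high-frequency band,
# mimicking the finding that high-pitch regions drive pathology detection.
def toy_score(spec):
    return spec[:, 24:32].mean()

spec = np.ones((16, 32))               # stand-in spectrogram (time x freq)
sal = occlusion_saliency(toy_score, spec)
# Saliency is nonzero only in the high-frequency columns (24-31).
print(sal[:, 24:].mean() > sal[:, :24].mean())  # True
```

Gradient-based saliency on the trained networks follows the same principle, replacing the occlusion loop with a derivative of the class score with respect to the input.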
Affiliation(s)
- Abdullah Abdul Sattar Shaikh
- Department of Computer Science and Engineering, Bangalore Institute of Technology, Bangalore, 560004, Karnataka, India.
- M S Bhargavi
- Department of Computer Science and Engineering, Bangalore Institute of Technology, Bangalore, 560004, Karnataka, India.
- Ganesh R Naik
- Adelaide Institute for Sleep Health, Flinders University, Bedford Park 5042, Adelaide, SA, Australia.