1. Rahman MU, Direkoglu C. A hybrid approach for binary and multi-class classification of voice disorders using a pre-trained model and ensemble classifiers. BMC Med Inform Decis Mak 2025; 25:177. [PMID: 40312383] [PMCID: PMC12044829] [DOI: 10.1186/s12911-025-02978-w] (Open Access)
Abstract
Recent advances in artificial intelligence-based audio and speech processing have increasingly focused on the binary and multi-class classification of voice disorders. Despite progress, achieving high accuracy in multi-class classification remains challenging. This paper proposes a novel hybrid approach using a two-stage framework to enhance voice disorder classification performance and achieve state-of-the-art accuracies in multi-class classification. Our hybrid approach combines deep learning features with several powerful classifiers. In the first stage, high-level feature embeddings are extracted from voice data spectrograms using a pre-trained VGGish model. In the second stage, these embeddings are used as input to four different classifiers: Support Vector Machine (SVM), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and an Ensemble Classifier (EC). Experiments are conducted on a subset of the Saarbruecken Voice Database (SVD) for male, female, and combined speakers. For binary classification, VGGish-SVM achieved the highest accuracy for male speakers (82.45% for healthy vs. disordered; 75.45% for hyperfunctional dysphonia vs. vocal fold paresis), while VGGish-EC performed best for female speakers (71.54% for healthy vs. disordered; 68.42% for hyperfunctional dysphonia vs. vocal fold paresis). In multi-class classification, VGGish-SVM outperformed the other models, achieving mean accuracies of 77.81% for male speakers, 63.11% for female speakers, and 70.53% for combined genders. We conducted a comparative analysis against related works, including Mel frequency cepstral coefficient (MFCC) features, MFCC-glottal features, and features extracted using the wav2vec and HuBERT models with an SVM classifier. Results demonstrate that our hybrid approach consistently outperforms these models, especially in multi-class classification tasks.
The results show the feasibility of a hybrid framework for voice disorder classification, offering a foundation for refining automated tools that could support clinical assessments with further validation.
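The two-stage framework this entry describes (a pre-trained model produces fixed embeddings; a classical classifier then separates them) can be sketched as follows. This is a minimal illustration, not the paper's setup: random 128-dimensional vectors stand in for VGGish embeddings, which would really be computed from log-mel spectrograms of the recordings, and the class means, sample counts, and SVM settings are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stage 1 stand-in: 128-d embeddings per recording (VGGish's output size).
# Two synthetic classes (e.g. healthy vs. disordered) with shifted means.
n_per_class, dim = 100, 128
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, dim)),
               rng.normal(0.8, 1.0, (n_per_class, dim))])
y = np.repeat([0, 1], n_per_class)

# Stage 2: scale the embeddings and fit an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Swapping `SVC` for `LogisticRegression`, `MLPClassifier`, or a voting ensemble reproduces the paper's four stage-two variants at the sketch level.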
Affiliation(s)
- Mehtab Ur Rahman
- Department of Language and Communication, Radboud University, Houtlaan, Nijmegen, Gelderland, 6525, Netherlands.
- Electrical and Electronics Engineering Department, Middle East Technical University, Northern Cyprus Campus, Kalkanli, Güzelyurt, Mersin 10, 99738, Turkey.
- Cem Direkoglu
- Electrical and Electronics Engineering Department, Middle East Technical University, Northern Cyprus Campus, Kalkanli, Güzelyurt, Mersin 10, 99738, Turkey
2. Aljarallah NA, Dutta AK, Sait ARW. Image classification-driven speech disorder detection using deep learning technique. SLAS Technol 2025; 32:100261. [PMID: 40057233] [DOI: 10.1016/j.slast.2025.100261]
Abstract
Speech disorders affect an individual's ability to generate sounds or use the voice appropriately. Neurological, developmental, and physical conditions, as well as trauma, can cause speech disorders. Speech impairments influence communication, social interaction, education, and quality of life. Successful intervention requires early and precise diagnosis to allow prompt treatment of these conditions. However, clinical examinations by speech-language pathologists are time-consuming and subjective, motivating the need for an automated speech disorder detection (SDD) model. Mel-spectrogram images present a visual representation of multiple speech disorders; by classifying Mel-spectrograms, various speech disorders can be identified. In this study, the authors proposed an image classification-based automated SDD model that classifies Mel-spectrograms to identify multiple speech disorders. Initially, a Wavelet Transform (WT) hybridization technique was employed to generate Mel-spectrograms from the voice samples. A feature extraction approach was developed using an enhanced LeViT transformer. Finally, the extracted features were classified using an ensemble learning (EL) approach containing CatBoost and XGBoost as base learners and Extremely Randomized Trees as a meta-learner. To reduce computational resources, the authors used quantization-aware training (QAT), and they employed Shapley Additive Explanations (SHAP) values to offer model interpretability. The proposed model was evaluated for generalization on the Voice ICar fEDerico II (VOICED) and LANNA datasets. An exceptional accuracy of 99.1% with only 8.2 million parameters demonstrates the significance of the proposed approach. The proposed model enhances speech disorder classification and offers novel prospects for building accessible, accurate, and efficient diagnostic tools. Researchers may integrate multimodal data to increase the model's use across languages and dialects, refining the proposed model for real-time clinical and telehealth deployment.
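The ensemble layout this entry describes (two boosting base learners stacked under an Extremely Randomized Trees meta-learner) can be sketched with scikit-learn alone. As a hedge, two `GradientBoostingClassifier` instances stand in for CatBoost and XGBoost to avoid extra dependencies, and the synthetic features stand in for transformer-derived spectrogram features; none of this reproduces the paper's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for features extracted from Mel-spectrograms.
X, y = make_classification(n_samples=400, n_features=32, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Two gradient-boosting base learners (dependency-free stand-ins for
# CatBoost and XGBoost) stacked under an Extremely Randomized Trees
# meta-learner, mirroring the ensemble layout described in the abstract.
stack = StackingClassifier(
    estimators=[("gb1", GradientBoostingClassifier(random_state=0)),
                ("gb2", GradientBoostingClassifier(learning_rate=0.05,
                                                   random_state=1))],
    final_estimator=ExtraTreesClassifier(random_state=0),
)
stack.fit(X_tr, y_tr)
print(f"held-out accuracy: {stack.score(X_te, y_te):.3f}")
```

`StackingClassifier` trains the meta-learner on cross-validated base-learner predictions, which is the standard guard against the meta-learner overfitting to base-learner training error.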
Affiliation(s)
- Nasser Ali Aljarallah
- Department of Computer Science and Information Systems, College of Applied Sciences, AlMaarefa University, Ad Diriyah, Riyadh, 13713, Saudi Arabia
- Ashit Kumar Dutta
- Department of Computer Science and Information Systems, College of Applied Sciences, AlMaarefa University, Ad Diriyah, Riyadh, 13713, Saudi Arabia
- Abdul Rahaman Wahab Sait
- Department of Documents and Archive, Center of Documents and Administrative Communication, King Faisal University, Al Hofuf, 31982, Al-Ahsa, Saudi Arabia
3. Gulsen P, Gulsen A, Alci M. Machine Learning Models With Hyperparameter Optimization for Voice Pathology Classification on Saarbrücken Voice Database. J Voice 2025:S0892-1997(24)00438-7. [PMID: 39779407] [DOI: 10.1016/j.jvoice.2024.12.009]
Abstract
Early diagnosis and referral are crucial in the treatment of voice disorders. Recent investigations have shown that voice pathology detection systems can contribute significantly to the evaluation of voice disorders and facilitate early diagnosis of such pathologies. These systems leverage machine learning methodologies, widely applied across diverse domains, and show particular promise in voice pathology classification. However, the machine learning models and performance metrics employed in these studies vary significantly, making it challenging to determine the optimal model for voice pathology classification. In this study, healthy and pathological voices were classified with state-of-the-art machine learning models, and the performance results of the models were compared. The voice samples employed in our research were sourced from the Saarbrücken Voice Database, a reputable German database. Feature extraction from voice signals was conducted using the Mel Frequency Cepstral Coefficients method. To assess and enhance the models' performance adequately, we employed hyperparameter optimization and a 10-fold cross-validation approach. The outcomes revealed that the support vector machine model exhibited the highest accuracy, achieving 99.19% and 99.50% in the classification of male and female voice pathologies, respectively.
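The evaluation protocol this entry relies on (hyperparameter optimization combined with 10-fold cross-validation over MFCC features) can be sketched with `GridSearchCV`. Random vectors stand in here for the 13 MFCCs a real pipeline would compute from each voice signal (e.g. with `librosa.feature.mfcc`); the class means and parameter grid are illustrative assumptions, not the paper's search space.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Stand-in for 13 MFCCs averaged per recording; real features would be
# computed from the voice signal itself.
n, dim = 200, 13
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, dim)),
               rng.normal(1.0, 1.2, (n // 2, dim))])
y = np.repeat([0, 1], n // 2)

# Hyperparameter optimization with 10-fold cross-validation: every
# (C, gamma) pair is scored as the mean accuracy over the 10 folds.
grid = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("svm", SVC())]),
    param_grid={"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]},
    cv=10,
)
grid.fit(X, y)
print(grid.best_params_, f"best CV accuracy: {grid.best_score_:.3f}")
```

Scaling inside the pipeline matters: fitting the scaler inside each fold prevents test-fold statistics from leaking into training, which would inflate the cross-validated score.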
Affiliation(s)
- Pervin Gulsen
- Department of Electrical-Electronics Engineering, Erciyes University, Kayseri, Turkey.
- Abdulkadir Gulsen
- Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey.
- Mustafa Alci
- Department of Electrical-Electronics Engineering, Erciyes University, Kayseri, Turkey.
4. Wang CT, Chen TM, Lee NT, Fang SH. AI Detection of Glottic Neoplasm Using Voice Signals, Demographics, and Structured Medical Records. Laryngoscope 2024; 134:4585-4592. [PMID: 38864282] [DOI: 10.1002/lary.31563]
Abstract
OBJECTIVE: This study investigated whether artificial intelligence (AI) models combining voice signals, demographics, and structured medical records can distinguish glottic neoplasm from benign voice disorders.
METHODS: We used a primary dataset containing 2-3 s of the vowel "ah", demographics, and 26 items of structured medical records (e.g., symptoms, comorbidity, smoking and alcohol consumption, vocal demand) from 60 patients with pathology-proven glottic neoplasm (i.e., squamous cell carcinoma, carcinoma in situ, and dysplasia) and 1940 patients with benign voice disorders. The validation dataset comprised data from 23 patients with glottic neoplasm and 1331 patients with benign disorders. The AI model combined convolutional neural networks, gated recurrent units, and attention layers. We used 10-fold cross-validation (training-validation-testing: 8-1-1) and preserved the percentage between neoplasm and benign disorders in each fold.
RESULTS: The AI model using voice signals alone reached an area under the ROC curve (AUC) of 0.631; adding demographics increased this to 0.807. The highest AUC of 0.878 was achieved when combining voice, demographics, and medical records (sensitivity: 0.783, specificity: 0.816, accuracy: 0.815). External validation yielded an AUC of 0.785 (voice plus demographics; sensitivity: 0.739, specificity: 0.745, accuracy: 0.745). Subanalysis showed that AI had higher sensitivity but lower specificity than human assessment (p < 0.01). The accuracy of AI detection with additional medical records was comparable with human assessment (82% vs. 83%, p = 0.78).
CONCLUSIONS: Voice signals alone were insufficient for AI differentiation between glottic neoplasm and benign voice disorders, but additional demographics and medical records notably improved AI performance and approximated the prediction accuracy of humans.
LEVEL OF EVIDENCE: NA. Laryngoscope, 134:4585-4592, 2024.
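The central measurement pattern in this entry, comparing AUC for a single modality against AUC for fused modalities, can be reproduced on toy data. This sketch is not the paper's CNN-GRU-attention model: a plain logistic regression and synthetic "voice" and "demographic" features (assumptions of this example) are enough to show how `roc_auc_score` quantifies the gain from adding a second modality.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 2000

# Toy data: a weak "voice" feature and a stronger "demographic" feature;
# labels depend on both, mimicking voice alone carrying a weak signal.
voice = rng.normal(size=(n, 1))
demo = rng.normal(size=(n, 1))
logit = 0.5 * voice[:, 0] + 1.5 * demo[:, 0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_all = np.hstack([voice, demo])
Xv_tr, Xv_te, Xa_tr, Xa_te, y_tr, y_te = train_test_split(
    voice, X_all, y, test_size=0.3, random_state=0)

# Fit one model per modality set and compare held-out AUC.
auc_voice = roc_auc_score(
    y_te, LogisticRegression().fit(Xv_tr, y_tr).predict_proba(Xv_te)[:, 1])
auc_all = roc_auc_score(
    y_te, LogisticRegression().fit(Xa_tr, y_tr).predict_proba(Xa_te)[:, 1])
print(f"voice only: {auc_voice:.3f}, voice + demographics: {auc_all:.3f}")
```

The fused model's AUC exceeds the voice-only AUC by construction here; in the study the analogous gap (0.631 to 0.807) motivated the multimodal design.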
Affiliation(s)
- Chi-Te Wang
- Department of Otolaryngology Head and Neck Surgery, Far Eastern Memorial Hospital, Taipei, Taiwan
- Center of Artificial Intelligence, Far Eastern Memorial Hospital, Taipei, Taiwan
- Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
- Tsai-Min Chen
- Graduate Program of Data Science, National Taiwan University and Academia Sinica, Taipei, Taiwan
- Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
- Nien-Ting Lee
- Center of Artificial Intelligence, Far Eastern Memorial Hospital, Taipei, Taiwan
- Shih-Hau Fang
- Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
- Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan
5. Maskeliūnas R, Damaševičius R, Kulikajevas A, Pribuišis K, Uloza V. Alaryngeal Speech Enhancement for Noisy Environments Using a Pareto Denoising Gated LSTM. J Voice 2024:S0892-1997(24)00228-5. [PMID: 39107213] [DOI: 10.1016/j.jvoice.2024.07.016]
Abstract
Loss of the larynx significantly alters natural voice production, requiring alternative communication modalities and rehabilitation methods to restore speech intelligibility and improve the quality of life of affected individuals. This paper explores advances in alaryngeal speech enhancement to improve signal quality and reduce background noise, focusing on individuals who have undergone laryngectomy. In this study, speech samples were obtained from 23 Lithuanian males who had undergone laryngectomy with secondary implantation of a tracheoesophageal prosthesis (TEP). A Pareto-optimized gated long short-term memory (LSTM) network was trained on tracheoesophageal speech data to recognize complex temporal connections and contextual information in speech signals. The system was able to distinguish between actual speech and various forms of noise and artifacts, resulting in a 25% drop in the mean signal-to-noise ratio compared to other approaches. According to acoustic analysis, the system significantly decreased the proportion of unvoiced frames from 40% to 10% while maintaining stable proportions of voiced speech frames and stable average voicing evidence in voiced frames, indicating the accuracy of the approach in selectively attenuating noise and undesired speech artifacts while preserving important speech information.
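Since this entry reports its enhancement results through the signal-to-noise ratio, the metric itself is worth pinning down. The sketch below defines SNR in dB against a known clean reference, with a sine tone and white noise as assumed stand-ins for speech and the enhancement residual; it is a metric illustration, not the paper's gated-LSTM denoiser.

```python
import numpy as np

def snr_db(clean: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-noise ratio in dB, treating (estimate - clean) as noise."""
    noise = estimate - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

# Toy check: a 220 Hz sine as "speech" plus white noise at two levels,
# the lower-noise version standing in for an enhanced output.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 220 * t)
rng = np.random.default_rng(3)
noisy = clean + 0.10 * rng.normal(size=t.size)
denoised = clean + 0.05 * rng.normal(size=t.size)
print(f"before: {snr_db(clean, noisy):.1f} dB, "
      f"after: {snr_db(clean, denoised):.1f} dB")
```

Halving the noise amplitude quarters the noise power, so the toy "enhanced" signal gains about 6 dB over the noisy one.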
Affiliation(s)
- Rytis Maskeliūnas
- Centre of Real Time Computer Systems, Kaunas University of Technology, Kaunas, Lithuania.
- Robertas Damaševičius
- Centre of Real Time Computer Systems, Kaunas University of Technology, Kaunas, Lithuania
- Audrius Kulikajevas
- Centre of Real Time Computer Systems, Kaunas University of Technology, Kaunas, Lithuania
- Kipras Pribuišis
- Department of Otolaryngology, Lithuanian University of Health Sciences, Kaunas, Lithuania
- Virgilijus Uloza
- Department of Otolaryngology, Lithuanian University of Health Sciences, Kaunas, Lithuania
6. Shih DH, Liao CH, Wu TW, Xu XY, Shih MH. Dysarthria Speech Detection Using Convolutional Neural Networks with Gated Recurrent Unit. Healthcare (Basel) 2022; 10:1956. [PMID: 36292403] [PMCID: PMC9602047] [DOI: 10.3390/healthcare10101956] (Open Access)
Abstract
In recent years, with population growth and aging, the prevalence of neurological diseases has been increasing year by year. Dysarthria often appears in patients with Parkinson's disease, stroke, cerebral palsy, and other neurological conditions. If dysarthria is not detected and treated quickly, disease course management becomes difficult, and worsening symptoms can affect the patient's psychological and physical well-being. Most past studies on dysarthria detection used machine learning or deep learning models as classification models. This study proposes an integrated CNN-GRU model combining convolutional neural networks and gated recurrent units to detect dysarthria. The experimental results show that the proposed CNN-GRU model achieves the highest accuracy of 98.38%, which is superior to other research models.
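The CNN-GRU combination named in this entry is a common pattern: convolutions summarize local spectro-temporal patterns, and a recurrent unit models their order. A minimal PyTorch sketch of that wiring is below; the layer sizes, 40-mel input, and two output classes are assumptions for illustration, not the paper's published architecture.

```python
import torch
import torch.nn as nn

class CNNGRU(nn.Module):
    """Minimal CNN-GRU: a 1-D conv stack summarizes local spectral
    patterns, a GRU models their temporal order, a linear head classifies."""

    def __init__(self, n_mels: int = 40, hidden: int = 32, n_classes: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.gru = nn.GRU(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.conv(x)          # (batch, 64, time // 2)
        z = z.transpose(1, 2)     # (batch, time // 2, 64) for the GRU
        _, h = self.gru(z)        # h: (num_layers, batch, hidden)
        return self.head(h[-1])   # (batch, n_classes) logits

model = CNNGRU()
logits = model(torch.randn(8, 40, 100))  # 8 spectrogram excerpts
print(tuple(logits.shape))
```

Using the GRU's final hidden state as the utterance summary keeps the head independent of input length, which suits variable-duration speech clips.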
Affiliation(s)
- Dong-Her Shih
- Department of Information Management, National Yunlin University of Science and Technology, Douliu 64002, Taiwan
- Ching-Hsien Liao
- Department of Information Management, National Yunlin University of Science and Technology, Douliu 64002, Taiwan
- Ting-Wei Wu (corresponding author)
- Department of Information Management, National Yunlin University of Science and Technology, Douliu 64002, Taiwan
- Xiao-Yin Xu
- Department of Information Management, National Yunlin University of Science and Technology, Douliu 64002, Taiwan
- Ming-Hung Shih
- Department of Electrical and Computer Engineering, Iowa State University, 2520 Osborn Drive, Ames, IA 50011, USA
7. Verde L, Campanile L, Marulli F, Marrone S. Speech-based Evaluation of Emotions-Depression Correlation. 2022 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech) 2022:1-6. [DOI: 10.1109/dasc/picom/cbdcom/cy55231.2022.9927758]
Affiliation(s)
- Laura Verde
- Università della Campania "Luigi Vanvitelli", Dept. of Mathematics and Physics, Caserta, Italy
- Lelio Campanile
- Università della Campania "Luigi Vanvitelli", Dept. of Mathematics and Physics, Caserta, Italy
- Fiammetta Marulli
- Università della Campania "Luigi Vanvitelli", Dept. of Mathematics and Physics, Caserta, Italy
- Stefano Marrone
- Università della Campania "Luigi Vanvitelli", Dept. of Mathematics and Physics, Caserta, Italy