1. Rahman MU, Direkoglu C. A hybrid approach for binary and multi-class classification of voice disorders using a pre-trained model and ensemble classifiers. BMC Med Inform Decis Mak 2025;25:177. PMID: 40312383; PMCID: PMC12044829; DOI: 10.1186/s12911-025-02978-w.
Abstract
Recent advances in artificial intelligence-based audio and speech processing have increasingly focused on binary and multi-class classification of voice disorders. Despite this progress, achieving high accuracy in multi-class classification remains challenging. This paper proposes a novel hybrid approach that uses a two-stage framework to improve voice disorder classification performance and achieve state-of-the-art accuracy in multi-class classification. The hybrid approach combines deep learning features with several powerful classifiers. In the first stage, high-level feature embeddings are extracted from voice-data spectrograms using a pre-trained VGGish model. In the second stage, these embeddings are fed to four different classifiers: a Support Vector Machine (SVM), Logistic Regression (LR), a Multi-Layer Perceptron (MLP), and an Ensemble Classifier (EC). Experiments are conducted on a subset of the Saarbruecken Voice Database (SVD) for male, female, and combined speakers. For binary classification, VGGish-SVM achieved the highest accuracy for male speakers (82.45% for healthy vs. disordered; 75.45% for hyperfunctional dysphonia vs. vocal fold paresis), while VGGish-EC performed best for female speakers (71.54% for healthy vs. disordered; 68.42% for hyperfunctional dysphonia vs. vocal fold paresis). In multi-class classification, VGGish-SVM outperformed the other models, achieving mean accuracies of 77.81% for male speakers, 63.11% for female speakers, and 70.53% for combined genders. We conducted a comparative analysis against related work, including Mel-frequency cepstral coefficients (MFCCs), MFCC-glottal features, and features extracted using the wav2vec and HuBERT models, all with an SVM classifier. Results demonstrate that our hybrid approach consistently outperforms these baselines, especially in multi-class classification tasks. The results show the feasibility of a hybrid framework for voice disorder classification, offering a foundation for refining automated tools that, with further validation, could support clinical assessments.
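The two-stage pipeline described above maps naturally onto scikit-learn once the VGGish embeddings are in hand. Below is a minimal sketch of the second stage, assuming 128-dimensional embeddings have already been extracted and saved; the file names, kernel choice, and hyperparameters are illustrative, not the authors' settings:

```python
# Sketch of the second stage: classical classifiers on pre-extracted
# VGGish embeddings (one 128-d vector per recording). X.npy / y.npy are
# hypothetical files holding embeddings and disorder labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import VotingClassifier

X, y = np.load("X.npy"), np.load("y.npy")
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
mlp = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(64,), max_iter=500))
ec = VotingClassifier([("svm", svm), ("lr", lr), ("mlp", mlp)], voting="soft")

for name, clf in [("SVM", svm), ("LR", lr), ("MLP", mlp), ("EC", ec)]:
    clf.fit(X_tr, y_tr)
    print(name, clf.score(X_te, y_te))
```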
Affiliation(s)
- Mehtab Ur Rahman
- Department of Language and Communication, Radboud University, Houtlaan, Nijmegen, Gelderland, 6525, Netherlands.
- Electrical and Electronics Engineering Department, Middle East Technical University, Northern Cyprus Campus, Kalkanli, Güzelyurt, Mersin 10, 99738, Turkey.
- Cem Direkoglu
- Electrical and Electronics Engineering Department, Middle East Technical University, Northern Cyprus Campus, Kalkanli, Güzelyurt, Mersin 10, 99738, Turkey.
2. Kuo HC, Hsieh YP, Tseng HH, Wang CT, Fang SH, Tsao Y. Toward Real-World Voice Disorder Classification. IEEE Trans Biomed Eng 2023;70:2922-2932. PMID: 37099463; DOI: 10.1109/tbme.2023.3270532.
Abstract
OBJECTIVE: Voice disorders significantly compromise individuals' ability to speak in their daily lives. Without early diagnosis and treatment, these disorders may deteriorate drastically. Automatic classification systems usable at home are therefore desirable for people without access to clinical disease assessments. However, the performance of such systems may be weakened by constrained resources and by the domain mismatch between clinical data and noisy real-world data. METHODS: This study develops a compact and domain-robust voice disorder classification system that identifies utterances as healthy, neoplasm, or benign structural disease. The proposed system uses a feature extractor composed of factorized convolutional neural networks and then applies domain adversarial training to reconcile the domain mismatch by extracting domain-invariant features. RESULTS: Unweighted average recall in the noisy real-world domain improved by 13% and remained at 80% in the clinic domain with only slight degradation; the domain mismatch was effectively eliminated. Moreover, the proposed system reduced both memory and computation usage by over 73.9%. CONCLUSION: By deploying factorized convolutional neural networks and domain adversarial training, domain-invariant features can be derived for voice disorder classification with limited resources. The promising results confirm that the proposed system can significantly reduce resource consumption and improve classification accuracy by accounting for the domain mismatch. SIGNIFICANCE: To the best of our knowledge, this is the first study to jointly consider model compression and noise robustness in real-world voice disorder classification. The proposed system is intended for embedded systems with limited resources.
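The domain adversarial training referred to here is commonly built around a gradient reversal layer (GRL) placed between the feature extractor and a domain discriminator. The PyTorch sketch below shows the GRL mechanism under that assumption; the λ scaling value and the toy tensors are illustrative, not taken from the paper:

```python
# Minimal gradient reversal layer (GRL), the standard building block of
# domain adversarial training: features pass through unchanged in the
# forward pass, but the gradient flowing back to the feature extractor
# is negated and scaled, pushing it toward domain-invariant features.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Toy usage: the domain head sees reversed gradients from the features.
feats = torch.randn(8, 128, requires_grad=True)       # extractor output
domain_logits = torch.nn.Linear(128, 2)(grad_reverse(feats, lambd=0.5))
```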
3. Maskeliūnas R, Kulikajevas A, Damaševičius R, Pribuišis K, Ulozaitė-Stanienė N, Uloza V. Lightweight Deep Learning Model for Assessment of Substitution Voicing and Speech after Laryngeal Carcinoma Surgery. Cancers (Basel) 2022;14:2366. PMID: 35625971; PMCID: PMC9139213; DOI: 10.3390/cancers14102366.
Abstract
Laryngeal carcinoma is the most common malignant tumor of the upper respiratory tract. Total laryngectomy completely and permanently separates the upper and lower airways, causing loss of voice and leaving the patient unable to communicate verbally in the postoperative period. This paper applies modern deep learning research to objectively classify, extract, and measure substitution voicing after laryngeal oncosurgery from the audio signal. We propose using well-known convolutional neural networks (CNNs), originally developed for image classification, to analyze the voice audio signal, taking a Mel-frequency spectrogram as the input to the deep neural network architecture. A database of digital speech recordings of 367 male subjects (279 normal and 88 pathological speech samples) was used. Our approach showed the best true-positive rate of all the compared state-of-the-art approaches, achieving an overall accuracy of 89.47%.
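The audio front end described (waveform to Mel spectrogram to image-style CNN) can be sketched with librosa and torchvision. The sample rate, Mel settings, and ResNet-18 backbone below are assumptions for illustration, not the paper's exact configuration:

```python
# Sketch: turn a waveform into a log-Mel spectrogram and feed it to an
# image CNN. Settings (16 kHz, 128 Mel bands, ResNet-18) are assumed;
# "sample.wav" is a hypothetical recording.
import librosa
import numpy as np
import torch
from torchvision.models import resnet18

y, sr = librosa.load("sample.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
logmel = librosa.power_to_db(mel, ref=np.max)              # (128, frames)

x = torch.tensor(logmel, dtype=torch.float32)[None, None]  # (1, 1, 128, T)
x = x.repeat(1, 3, 1, 1)          # replicate to 3 channels for an image CNN

model = resnet18(num_classes=2)   # normal vs. pathological speech
logits = model(x)
```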
Affiliation(s)
- Rytis Maskeliūnas
- Faculty of Informatics, Kaunas University of Technology, 51368 Kaunas, Lithuania
- Audrius Kulikajevas
- Faculty of Informatics, Kaunas University of Technology, 51368 Kaunas, Lithuania
- Robertas Damaševičius (corresponding author)
- Faculty of Informatics, Kaunas University of Technology, 51368 Kaunas, Lithuania
- Kipras Pribuišis
- Department of Otorhinolaryngology, Lithuanian University of Health Sciences, 50061 Kaunas, Lithuania
- Nora Ulozaitė-Stanienė
- Department of Otorhinolaryngology, Lithuanian University of Health Sciences, 50061 Kaunas, Lithuania
- Virgilijus Uloza
- Department of Otorhinolaryngology, Lithuanian University of Health Sciences, 50061 Kaunas, Lithuania
4. Petmezas G, Stefanopoulos L, Kilintzis V, Tzavelis A, Rogers JA, Katsaggelos AK, Maglaveras N. State-of-the-art Deep Learning Methods on Electrocardiogram Data: A Systematic Review. JMIR Med Inform 2022;10:e38454. PMID: 35969441; PMCID: PMC9425174; DOI: 10.2196/38454.
Abstract
Background: Electrocardiogram (ECG) is one of the most common noninvasive diagnostic tools and can provide useful information about a patient's health status. Deep learning (DL) is an area of intense exploration that leads most attempts to create powerful diagnostic models based on physiological signals. Objective: This study aimed to provide a systematic review of DL methods applied to ECG data for various clinical applications. Methods: The PubMed search engine was systematically searched by combining "deep learning" with keywords such as "ecg," "ekg," "electrocardiogram," "electrocardiography," and "electrocardiology." Articles were excluded after title and abstract screening if they were irrelevant, written in a language other than English, lacked ECG data or DL methods, or lacked a quantitative evaluation of the proposed approaches; the remaining articles were reviewed in full. Results: We identified 230 relevant articles published between January 2020 and December 2021 and grouped them into 6 distinct medical applications: blood pressure estimation, cardiovascular disease diagnosis, ECG analysis, biometric recognition, sleep analysis, and other clinical analyses. We give a complete account of the state-of-the-art DL strategies for each field of application, as well as the major ECG data sources. We also present open research problems, such as the lack of attempts to address blood pressure variability in training data sets, and point out potential gaps in the design and implementation of DL models. Conclusions: We expect this review to provide insights into state-of-the-art DL methods applied to ECG data and to point to future research directions for building robust models that can assist medical experts in clinical decision-making.
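For readers who want to reproduce the search strategy, a sketch of the described PubMed query using Biopython's Entrez interface is shown below; the e-mail address is a placeholder required by NCBI policy, and the exact Boolean form of the query is an assumption based on the keywords listed in the Methods:

```python
# Sketch of the review's PubMed search via Biopython's Entrez API.
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI requires an e-mail
term = ('"deep learning" AND (ecg OR ekg OR electrocardiogram '
        'OR electrocardiography OR electrocardiology)')
handle = Entrez.esearch(db="pubmed", term=term, retmax=500,
                        mindate="2020/01/01", maxdate="2021/12/31",
                        datetype="pdat")
record = Entrez.read(handle)
print(record["Count"], record["IdList"][:5])
```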
Affiliation(s)
- Georgios Petmezas
- Lab of Computing, Medical Informatics and Biomedical-Imaging Technologies, The Medical School, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Leandros Stefanopoulos
- Lab of Computing, Medical Informatics and Biomedical-Imaging Technologies, The Medical School, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Vassilis Kilintzis
- Lab of Computing, Medical Informatics and Biomedical-Imaging Technologies, The Medical School, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Andreas Tzavelis
- Department of Biomedical Engineering, Northwestern University, Evanston, IL, United States
- John A Rogers
- Department of Material Science, Northwestern University, Evanston, IL, United States
- Aggelos K Katsaggelos
- Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, United States
- Nicos Maglaveras
- Lab of Computing, Medical Informatics and Biomedical-Imaging Technologies, The Medical School, Aristotle University of Thessaloniki, Thessaloniki, Greece
5. Wang SS, Wang CT, Lai CC, Tsao Y, Fang SH. Continuous Speech for Improved Learning Pathological Voice Disorders. IEEE Open J Eng Med Biol 2022;3:25-33. PMID: 35399790; PMCID: PMC8940190; DOI: 10.1109/ojemb.2022.3151233.
Affiliation(s)
- Syu-Siang Wang
- Department of Electrical Engineering, Yuan Ze University, Taoyuan 320, Taiwan
- Chi-Te Wang
- Department of Electrical Engineering, Yuan Ze University, Taoyuan 320, Taiwan
- Department of Otolaryngology Head and Neck Surgery, Far Eastern Memorial Hospital, New Taipei 220, Taiwan
- Chih-Chung Lai
- Department of Electrical Engineering, Yuan Ze University, Taoyuan 320, Taiwan
- Yu Tsao
- Research Center for Information Technology Innovation, Academia Sinica, Taipei 115, Taiwan
- Shih-Hau Fang
- Department of Electrical Engineering, Yuan Ze University, Taoyuan 320, Taiwan
6. Geng L, Shan H, Xiao Z, Wang W, Wei M. Voice pathology detection and classification from speech signals and EGG signals based on a multimodal fusion method. Biomed Eng / Biomed Tech 2021;66:613-625. PMID: 34845886; DOI: 10.1515/bmt-2021-0112.
Abstract
Automatic voice pathology detection and classification play an important role in the diagnosis and prevention of voice disorders. To accurately describe the pronunciation characteristics of patients with dysarthria and improve pathological voice detection, this study proposes a detection method based on a multimodal network structure. First, speech signals and electroglottography (EGG) signals are mapped from the time domain to frequency-domain spectrograms via a short-time Fourier transform (STFT); a Mel filter bank is applied to the spectrograms to enhance the signal's harmonics and suppress noise. Second, a pre-trained convolutional neural network (CNN) is used as the backbone to extract sound-state features and vocal fold vibration features from the two signals. To improve classification, the fused features are fed into a long short-term memory (LSTM) network for voice feature selection and enhancement. The proposed system achieves 95.73% accuracy, with a 96.10% F1-score and 96.73% recall, on the Saarbruecken Voice Database (SVD), providing a new method for pathological speech detection.
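The described architecture (one CNN stream per modality, fused and passed through an LSTM) might look roughly like the PyTorch sketch below; the encoder layout, feature sizes, and input shapes are illustrative assumptions rather than the paper's exact design:

```python
# Sketch of a speech + EGG fusion model: two small CNN encoders over
# spectrograms, per-frame features concatenated, then an LSTM over time.
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        def encoder():  # CNN over (1, n_mels, frames) spectrograms
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d((2, 1)),            # pool frequency, keep time
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, None)), # -> (B, 32, 1, T)
            )
        self.speech_enc, self.egg_enc = encoder(), encoder()
        self.lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, speech, egg):              # each: (B, 1, n_mels, T)
        fs = self.speech_enc(speech).squeeze(2).transpose(1, 2)  # (B, T, 32)
        fe = self.egg_enc(egg).squeeze(2).transpose(1, 2)        # (B, T, 32)
        fused = torch.cat([fs, fe], dim=-1)      # (B, T, 64)
        _, (h, _) = self.lstm(fused)
        return self.head(h[-1])                  # (B, n_classes)

logits = FusionNet()(torch.randn(4, 1, 80, 200), torch.randn(4, 1, 80, 200))
```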
Affiliation(s)
- Lei Geng
- School of Life Sciences, Tiangong University, Tianjin, China; Tianjin Key Laboratory of Optoelectronic Detection Technology and System, Tianjin, China
- Hongfeng Shan
- School of Electronic and Information Engineering, Tiangong University, Tianjin, China; Tianjin Key Laboratory of Optoelectronic Detection Technology and System, Tianjin, China
- Zhitao Xiao
- School of Life Sciences, Tiangong University, Tianjin, China; Tianjin Key Laboratory of Optoelectronic Detection Technology and System, Tianjin, China
- Wei Wang
- Department of Otorhinolaryngology Head and Neck Surgery, Tianjin First Central Hospital, Tianjin, China; Institute of Otolaryngology of Tianjin, Tianjin, China; Key Laboratory of Auditory Speech and Balance Medicine, Tianjin, China; Key Clinical Discipline of Tianjin (Otolaryngology), Tianjin, China; Otolaryngology Clinical Quality Control Centre, Tianjin, China
- Mei Wei
- Department of Otorhinolaryngology Head and Neck Surgery, Tianjin First Central Hospital, Tianjin, China; Institute of Otolaryngology of Tianjin, Tianjin, China; Key Laboratory of Auditory Speech and Balance Medicine, Tianjin, China; Key Clinical Discipline of Tianjin (Otolaryngology), Tianjin, China; Otolaryngology Clinical Quality Control Centre, Tianjin, China
7. Hu HC, Chang SY, Wang CH, Li KJ, Cho HY, Chen YT, Lu CJ, Tsai TP, Lee OKS. Deep Learning Application for Vocal Fold Disease Prediction Through Voice Recognition: Preliminary Development Study. J Med Internet Res 2021;23:e25247. PMID: 34100770; PMCID: PMC8241431; DOI: 10.2196/25247.
Abstract
Background: Dysphonia affects quality of life by interfering with communication, yet laryngoscopic examination is expensive, not readily accessible in primary care units, and requires an experienced laryngologist for an accurate diagnosis. Objective: This study sought to detect various vocal fold diseases through pathological voice recognition using artificial intelligence. Methods: We collected 189 normal voice samples and 552 samples from individuals with voice disorders, including vocal atrophy (n=224), unilateral vocal paralysis (n=50), organic vocal fold lesions (n=248), and adductor spasmodic dysphonia (n=30). The 741 samples were divided into a training set of 593 samples and a testing set of 148 samples. A convolutional neural network approach was applied to train the model, and its findings were compared with those of human specialists. Results: The convolutional neural network model achieved a sensitivity of 0.66, a specificity of 0.91, and an overall accuracy of 66.9% for distinguishing normal voice, vocal atrophy, unilateral vocal paralysis, organic vocal fold lesions, and adductor spasmodic dysphonia. By comparison, the overall accuracy of the human specialists was 60.1% and 56.1% for the two laryngologists and 51.4% and 43.2% for the two general ear, nose, and throat doctors. Conclusions: Voice alone can be used to recognize common vocal fold diseases through a deep learning approach after training on our Mandarin pathological voice database. This artificial intelligence approach could be clinically useful for screening general vocal fold disease by voice, for example in quick surveys, general health examinations, and telemedicine in areas where primary care units lack laryngoscopic capabilities. It could also support physicians in prescreening, reserving invasive examinations for cases in which automatic or auditory recognition is problematic or other clinical findings raise doubts about the presence of pathology.
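Sensitivity and specificity for a multi-class screen like this are typically derived per class from the confusion matrix in one-vs-rest fashion; a short scikit-learn sketch with toy placeholder labels (not the study's data):

```python
# Sketch: per-class sensitivity/specificity from a multi-class confusion
# matrix, one-vs-rest. y_true / y_pred are toy placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["normal", "atrophy", "paralysis", "organic", "ADSD"]
y_true = np.array([0, 1, 2, 3, 4, 0, 1, 3])
y_pred = np.array([0, 1, 2, 3, 0, 0, 3, 3])

cm = confusion_matrix(y_true, y_pred, labels=range(len(labels)))
for i, name in enumerate(labels):
    tp = cm[i, i]
    fn = cm[i].sum() - tp          # missed cases of this class
    fp = cm[:, i].sum() - tp       # other classes predicted as this one
    tn = cm.sum() - tp - fn - fp
    print(f"{name}: sensitivity={tp/(tp+fn):.2f} specificity={tn/(tn+fp):.2f}")
print("accuracy:", np.trace(cm) / cm.sum())
```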
Affiliation(s)
- Hao-Chun Hu
- Institute of Clinical Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan; Department of Otorhinolaryngology-Head and Neck Surgery, Fu Jen Catholic University Hospital, Fu Jen Catholic University, New Taipei City, Taiwan; School of Medicine, College of Medicine, Fu Jen Catholic University, New Taipei City, Taiwan
- Shyue-Yih Chang
- Voice Center, Department of Otolaryngology, Cheng Hsin General Hospital, Taipei, Taiwan
- Chuen-Heng Wang
- Muen Biomedical and Optoelectronic Technologist Inc, Taipei, Taiwan
- Kai-Jun Li
- Department of Otorhinolaryngology-Head and Neck Surgery, Fu Jen Catholic University Hospital, Fu Jen Catholic University, New Taipei City, Taiwan
- Hsiao-Yun Cho
- Department of Otorhinolaryngology-Head and Neck Surgery, Fu Jen Catholic University Hospital, Fu Jen Catholic University, New Taipei City, Taiwan; Graduate Institute of Business Administration, Fu Jen Catholic University, New Taipei City, Taiwan
- Yi-Ting Chen
- Muen Biomedical and Optoelectronic Technologist Inc, Taipei, Taiwan
- Chang-Jung Lu
- Voice Center, Department of Otolaryngology, Cheng Hsin General Hospital, Taipei, Taiwan
- Tzu-Pei Tsai
- Voice Center, Department of Otolaryngology, Cheng Hsin General Hospital, Taipei, Taiwan
- Oscar Kuang-Sheng Lee
- Institute of Clinical Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan; Department of Orthopedics, China Medical University Hospital, Taichung, Taiwan; Stem Cell Research Center, National Yang Ming Chiao Tung University, Taipei, Taiwan; Department of Medical Research, Taipei Veterans General Hospital, Taipei, Taiwan
8. Bensoussan Y, Pinto J, Crowson M, Walden PR, Rudzicz F, Johns M. Deep Learning for Voice Gender Identification: Proof-of-concept for Gender-Affirming Voice Care. Laryngoscope 2021;131:E1611-E1615. PMID: 33219707; DOI: 10.1002/lary.29281.
Abstract
OBJECTIVES/HYPOTHESIS: The need for gender-affirming voice care in the transgender population has increased over the last decade, yet objective outcome measures for assessing the success of these interventions are lacking. This study uses neural network models to predict binary gender from short audio samples of "male" and "female" voices, as preliminary proof-of-concept work toward an AI-assisted treatment outcome measure for gender-affirming voice care. STUDY DESIGN: Retrospective cohort study. METHODS: Two hundred seventy-eight voices from male and female speakers in the Perceptual Voice Qualities Database were used to train a deep neural network to classify voices as male or female. Each audio sample was mapped to the frequency domain using Mel spectrograms. To optimize model performance, we performed 10-fold cross-validation of the entire dataset; the data were split into 80% training, 10% validation, and 10% test. RESULTS: An overall accuracy of 92% was obtained on both the per-spectrogram and per-patient metrics. The model was more accurate at recognizing female voices (F1 score of 0.94) than male voices (F1 score of 0.87). CONCLUSIONS: This proof-of-concept study shows promising performance for the further development of an AI-assisted tool providing objective treatment outcome measurements for gender-affirming voice care. LEVEL OF EVIDENCE: 3.
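The evaluation protocol (10-fold cross-validation with per-spectrum and per-patient accuracy) can be sketched as below; grouping folds by speaker and aggregating spectrogram-level predictions by majority vote are assumptions about how the per-patient metric might be computed, and all data here are synthetic placeholders:

```python
# Sketch: 10-fold CV over spectrogram-level features, reporting
# per-spectrum accuracy and majority-vote per-patient accuracy.
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))            # toy spectrogram features
patients = rng.integers(0, 60, size=300)  # speaker id per spectrogram
y = patients % 2                          # toy binary gender label

cv = StratifiedGroupKFold(n_splits=10)
spec_acc, pat_acc = [], []
for tr, te in cv.split(X, y, groups=patients):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    pred = clf.predict(X[te])
    spec_acc.append((pred == y[te]).mean())
    votes = {}                            # majority vote per patient
    for p, yhat, yt in zip(patients[te], pred, y[te]):
        votes.setdefault(p, ([], yt))[0].append(yhat)
    pat_acc.append(np.mean([round(np.mean(v)) == yt
                            for v, yt in votes.values()]))
print("per-spectrum:", np.mean(spec_acc), "per-patient:", np.mean(pat_acc))
```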
Affiliation(s)
- Yael Bensoussan
- Caruso Department of Otolaryngology - Head and Neck Surgery, University of Southern California, Los Angeles, California, U.S.A.
- Jeremy Pinto
- Mila, Quebec Artificial Intelligence Institute, Montreal, Quebec, Canada
- Matthew Crowson
- Harvard Department of Otolaryngology - Head and Neck Surgery, Massachusetts Eye and Ear Infirmary, Boston, Massachusetts, U.S.A.
- Patrick R Walden
- Department of Communication Sciences and Disorders, St. John's University, Queens, New York, U.S.A.
- Frank Rudzicz
- Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
- Michael Johns
- Caruso Department of Otolaryngology - Head and Neck Surgery, University of Southern California, Los Angeles, California, U.S.A.