1
Geng L, Liang Y, Shan H, Xiao Z, Wang W, Wei M. Pathological Voice Detection and Classification Based on Multimodal Transmission Network. J Voice 2025;39:591-601. [PMID: 36470823] [DOI: 10.1016/j.jvoice.2022.11.018]
Abstract
OBJECTIVES Describing pronunciation features from multiple perspectives can help doctors accurately diagnose the pathological type of a patient's voice. Based on two modalities, the acoustic signal and the electroglottography (EGG) signal, this paper proposes a pathological voice detection and classification algorithm built on a multimodal transmission network. METHODS First, the short-time Fourier transform (STFT) maps both signals to the time-frequency domain, and a designed Mel filter bank is applied to obtain the Mel spectrograms. Then, the multimodal transmission network extracts features from the Mel spectrograms and applies the Multimodal Transfer Module (MMTM). Finally, a fusion layer integrates the multimodal information, and a fully connected layer diagnoses and classifies voice pathology from the fused features. RESULTS The experiment was based on 1179 subjects from the Saarbrücken Voice Database (SVD); the average accuracy, recall, specificity and F1 score of pathological voice classification reached 98.02%, 98.23%, 97.82% and 97.95%, respectively. Compared with other algorithms, the classification accuracy is significantly improved. CONCLUSIONS The proposed model can integrate multiple modalities to obtain more comprehensive and stable voice features and improve the accuracy of pathological voice classification. Future research will focus on reducing the model's complexity and runtime.
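The front end described in this abstract (STFT followed by a Mel filter bank, applied to both the speech and EGG channels) can be approximated as follows. This is a minimal illustrative sketch, not the authors' code; librosa is assumed, and the file names and parameter values are placeholders.

```python
# Minimal sketch of the described front end: STFT + Mel filter bank applied
# to a speech recording and its EGG channel. File names and parameters are
# illustrative, not taken from the paper.
import librosa
import numpy as np

def mel_spectrogram(path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    """Load a signal, compute its STFT power spectrum, and apply a Mel filter bank."""
    y, _ = librosa.load(path, sr=sr)
    stft_power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)) ** 2
    mel = librosa.feature.melspectrogram(S=stft_power, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # log-Mel spectrogram

# One Mel spectrogram per modality; both would feed the two-stream network.
speech_mel = mel_spectrogram("sample_speech.wav")
egg_mel = mel_spectrogram("sample_egg.wav")
print(speech_mel.shape, egg_mel.shape)            # (n_mels, n_frames)
```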
Affiliation(s)
- Lei Geng
- School of Life Sciences, Tiangong University, Tianjin, China; Tianjin Key Laboratory of Optoelectronic Detection Technology and System, Tianjin, China
- Yan Liang
- Tianjin Key Laboratory of Optoelectronic Detection Technology and System, Tianjin, China; School of Electronic and Information Engineering, Tiangong University, Tianjin, China
- Hongfeng Shan
- Tianjin Key Laboratory of Optoelectronic Detection Technology and System, Tianjin, China; School of Electronic and Information Engineering, Tiangong University, Tianjin, China
- Zhitao Xiao
- School of Life Sciences, Tiangong University, Tianjin, China; Tianjin Key Laboratory of Optoelectronic Detection Technology and System, Tianjin, China
- Wei Wang
- Department of Otorhinolaryngology Head and Neck Surgery, Tianjin First Central Hospital, Tianjin, China; Institute of Otolaryngology of Tianjin, Tianjin, China; Key Laboratory of Auditory Speech and Balance Medicine, Tianjin, China; Key Clinical Discipline of Tianjin (Otolaryngology), Tianjin, China; Otolaryngology Clinical Quality Control Centre, Tianjin, China
- Mei Wei
- Department of Otorhinolaryngology Head and Neck Surgery, Tianjin First Central Hospital, Tianjin, China; Institute of Otolaryngology of Tianjin, Tianjin, China; Key Laboratory of Auditory Speech and Balance Medicine, Tianjin, China; Key Clinical Discipline of Tianjin (Otolaryngology), Tianjin, China; Otolaryngology Clinical Quality Control Centre, Tianjin, China
2
Vrba J, Steinbach J, Jirsa T, Verde L, De Fazio R, Zeng Y, Ichiji K, Hájek L, Sedláková Z, Urbániová Z, Chovanec M, Mareš J, Homma N. Reproducible Machine Learning-Based Voice Pathology Detection: Introducing the Pitch Difference Feature. J Voice 2025:S0892-1997(25)00122-5. [PMID: 40221253] [DOI: 10.1016/j.jvoice.2025.03.028]
Abstract
PURPOSE We introduce a novel methodology for voice pathology detection using the publicly available Saarbrücken Voice Database and a robust feature set combining commonly used handcrafted acoustic features with two novel ones: the pitch difference (relative variation in fundamental frequency) and the NaN feature (failed fundamental frequency estimation). METHODS We evaluate six machine learning (ML) algorithms-support vector machine, k-nearest neighbors, naive Bayes, decision tree, random forest, and AdaBoost-using grid search over feasible hyperparameters and 20,480 different feature subsets. The top 1000 model-feature subset combinations for each ML algorithm are validated with repeated stratified cross-validation. To address class imbalance, we apply the k-means synthetic minority oversampling technique to augment the training data. RESULTS Our approach achieves 85.61%, 84.69%, and 85.22% unweighted average recall for females, males, and combined results, respectively. We intentionally omit accuracy, as it is a highly biased metric for imbalanced data. CONCLUSION Our study demonstrates that, by following the proposed methodology and feature engineering, there is potential for detecting various voice pathologies with ML models applied to the simplest vocal task, a sustained utterance of the vowel /a:/. To enable easier use of our methodology and to support our claims, we provide a publicly available GitHub repository with DOI 10.5281/zenodo.13771573. Finally, we provide a REFORMS checklist to enhance readability, reproducibility, and justification of our approach.
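The two novel features named above lend themselves to a compact illustration. The sketch below is an approximation under stated assumptions, not the authors' implementation (their exact formulas may differ): the pitch difference is approximated as the coefficient of variation of the voiced f0 contour, and the NaN feature as the fraction of frames where pYIN fails to estimate f0; librosa is assumed and the file name is a placeholder.

```python
# Illustrative sketch of "pitch difference" and "NaN feature", computed with
# librosa's pYIN pitch tracker. Treat these as approximations of "relative f0
# variation" and "fraction of failed f0 estimates", not the paper's exact code.
import librosa
import numpy as np

def pitch_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    nan_feature = np.mean(np.isnan(f0))          # share of frames with no f0 estimate
    voiced = f0[~np.isnan(f0)]
    if voiced.size < 2:
        return np.nan, nan_feature               # too few voiced frames to measure variation
    pitch_difference = np.std(voiced) / np.mean(voiced)   # relative variation in f0
    return pitch_difference, nan_feature

pd_feat, nan_feat = pitch_features("sustained_a_vowel.wav")  # sustained /a:/ recording
print(f"pitch difference: {pd_feat:.3f}, NaN feature: {nan_feat:.3f}")
```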
Affiliation(s)
- Jan Vrba
- Department of Mathematics, Informatics, and Cybernetics, University of Chemistry and Technology, Technická 5, Prague 166 28, Czech Republic; Department of Radiological Imaging and Informatics, Tohoku University Graduate School of Medicine, 2-1-1 Katahira, Aoba-ku, Sendai 980-8577, Japan.
- Jakub Steinbach
- Department of Mathematics, Informatics, and Cybernetics, University of Chemistry and Technology, Technická 5, Prague 166 28, Czech Republic; Department of Radiological Imaging and Informatics, Tohoku University Graduate School of Medicine, 2-1-1 Katahira, Aoba-ku, Sendai 980-8577, Japan.
- Tomáš Jirsa
- Department of Mathematics, Informatics, and Cybernetics, University of Chemistry and Technology, Technická 5, Prague 166 28, Czech Republic.
- Laura Verde
- Department of Mathematics and Physics, University of Campania "Luigi Vanvitelli", Viale Abramo Lincoln 5, Caserta 81100, Italy.
- Roberta De Fazio
- Department of Mathematics and Physics, University of Campania "Luigi Vanvitelli", Viale Abramo Lincoln 5, Caserta 81100, Italy.
- Yuwen Zeng
- Department of Radiological Imaging and Informatics, Tohoku University Graduate School of Medicine, 2-1-1 Katahira, Aoba-ku, Sendai 980-8577, Japan.
- Kei Ichiji
- Department of Radiological Imaging and Informatics, Tohoku University Graduate School of Medicine, 2-1-1 Katahira, Aoba-ku, Sendai 980-8577, Japan.
- Lukáš Hájek
- Department of Mathematics, Informatics, and Cybernetics, University of Chemistry and Technology, Technická 5, Prague 166 28, Czech Republic.
- Zuzana Sedláková
- Department of Mathematics, Informatics, and Cybernetics, University of Chemistry and Technology, Technická 5, Prague 166 28, Czech Republic.
- Zuzana Urbániová
- Department of Otorhinolaryngology, Faculty Hospital Královské Vinohrady, Šrobárova 1150/50, Prague 100 34, Czech Republic.
- Martin Chovanec
- Department of Otorhinolaryngology, Faculty Hospital Královské Vinohrady, Šrobárova 1150/50, Prague 100 34, Czech Republic.
- Jan Mareš
- Department of Mathematics, Informatics, and Cybernetics, University of Chemistry and Technology, Technická 5, Prague 166 28, Czech Republic.
- Noriyasu Homma
- Department of Radiological Imaging and Informatics, Tohoku University Graduate School of Medicine, 2-1-1 Katahira, Aoba-ku, Sendai 980-8577, Japan.
3
Liu Y, Zhang C, Liu Z, Li J. Reliability and Validity of GRBASzero in Clinical Environments. J Voice 2024:S0892-1997(24)00199-1. [PMID: 38987039] [DOI: 10.1016/j.jvoice.2024.06.018]
Abstract
PURPOSE This study aimed to evaluate the reliability and validity of GRBASzero in a real clinical setting. METHODS The reliability and validity of GRBASzero were assessed using two independent datasets. Dataset 1 included 283 outpatients who underwent both GRBASzero assessment and human expert evaluation. Dataset 2 from the Perceptual Voice Qualities Database comprised 287 voice samples that underwent evaluation by GRBASzero and were subsequently compared with GRBAS (Grade, Roughness, Breathiness, Asthenicity, Strain) ratings provided by human experts. The reliability of GRBASzero was assessed using Fleiss Kappa, while the validity of GRBASzero was examined using the intraclass correlation coefficient. RESULTS In dataset 1, the test-retest reliability of GRBASzero was poor, with the consistency of features A and S approaching random allocation. Consistency analysis with human experts showed a poor agreement for all features except for B. In dataset 2, there was also a poor agreement between GRBASzero and human experts. CONCLUSION The reliability and validity of GRBASzero in a real clinical environment are poor and do not meet the requirements for clinical testing, indicating the need for further optimization and improvement.
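The reliability analysis above relies on Fleiss' kappa over repeated categorical ratings. The sketch below is a generic illustration of that statistic, not tied to the study's data; statsmodels is assumed and the ratings matrix is made up.

```python
# Generic sketch of Fleiss' kappa for agreement over repeated categorical
# ratings. The data below are invented for illustration only.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = voice samples, columns = repeated ratings of one GRBAS feature (0-3 scale)
ratings = np.array([
    [1, 1, 2],
    [0, 0, 0],
    [3, 2, 3],
    [2, 2, 2],
    [1, 0, 1],
])
table, _ = aggregate_raters(ratings)     # per-sample counts of each category
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")     # values near 0 indicate chance-level agreement
```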
Affiliation(s)
- Yang Liu
- Department of Otolaryngology-Head and Neck Surgery, The 6th Medical Center of Chinese PLA General Hospital, Beijing, China
- Chun Zhang
- Department of Otolaryngology-Head and Neck Surgery, The 6th Medical Center of Chinese PLA General Hospital, Beijing, China
- Zhi Liu
- Department of Otolaryngology-Head and Neck Surgery, The 6th Medical Center of Chinese PLA General Hospital, Beijing, China
- JinRang Li
- Department of Otolaryngology-Head and Neck Surgery, The 6th Medical Center of Chinese PLA General Hospital, Beijing, China.
4
Kuo HC, Hsieh YP, Tseng HH, Wang CT, Fang SH, Tsao Y. Toward Real-World Voice Disorder Classification. IEEE Trans Biomed Eng 2023;70:2922-2932. [PMID: 37099463] [DOI: 10.1109/tbme.2023.3270532]
Abstract
OBJECTIVE Voice disorders significantly compromise individuals' ability to speak in their daily lives. Without early diagnosis and treatment, these disorders may deteriorate drastically. Thus, automatic classification systems that work at home are desirable for people who cannot access clinical assessments. However, the performance of such systems may be weakened by constrained resources and by the domain mismatch between clinical data and noisy real-world data. METHODS This study develops a compact and domain-robust voice disorder classification system to identify utterances as healthy, neoplasm, or benign structural disease. Our proposed system uses a feature extractor composed of factorized convolutional neural networks and then applies domain adversarial training to reconcile the domain mismatch by extracting domain-invariant features. RESULTS The results show that the unweighted average recall in the noisy real-world domain improved by 13% and remained at 80% in the clinic domain with only slight degradation. The domain mismatch was effectively eliminated. Moreover, the proposed system reduced both memory and computation usage by over 73.9%. CONCLUSION By deploying factorized convolutional neural networks and domain adversarial training, domain-invariant features can be derived for voice disorder classification with limited resources. The promising results confirm that the proposed system can significantly reduce resource consumption and improve classification accuracy by accounting for the domain mismatch. SIGNIFICANCE To the best of our knowledge, this is the first study that jointly considers real-world model compression and noise-robustness issues in voice disorder classification. The proposed system is intended for embedded systems with limited resources.
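Domain adversarial training, as described above, is commonly realized with a gradient reversal layer between a shared feature extractor and a domain classifier. The sketch below illustrates only that mechanism; it is not the authors' factorized CNN, and all layer sizes are placeholders. PyTorch is assumed.

```python
# Sketch of domain adversarial training via a gradient reversal layer (GRL).
# Layer sizes are stand-ins, not the paper's compressed architecture.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients so the extractor learns domain-invariant features.
        return -ctx.lambd * grad_output, None

features = nn.Sequential(nn.Linear(128, 64), nn.ReLU())   # stand-in feature extractor
label_head = nn.Linear(64, 3)                             # healthy / neoplasm / benign structural
domain_head = nn.Linear(64, 2)                            # clinic vs. real-world domain

x = torch.randn(8, 128)
z = features(x)
label_logits = label_head(z)
domain_logits = domain_head(GradReverse.apply(z, 1.0))    # adversarial branch
print(label_logits.shape, domain_logits.shape)
```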
5
Warule P, Mishra SP, Deb S, Krajewski J. Sinusoidal model-based diagnosis of the common cold from the speech signal. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2023.104653]
6
A review on voice pathology: Taxonomy, diagnosis, medical procedures and detection techniques, open challenges, limitations, and recommendations for future directions. Journal of Intelligent Systems 2022. [DOI: 10.1515/jisys-2022-0058]
Abstract
Speech is a primary means of human communication and one of the most basic features of human conduct. Voice is an important part of its subsystems. A speech disorder is a condition that affects the ability of a person to speak normally, which occasionally results in voice impairment with psychological and emotional consequences. Early detection of voice problems is a crucial factor. Computer-based procedures are less costly and easier to administer for such purposes than traditional methods. This study highlights the following issues: recent studies, methods of voice pathology detection, machine learning and deep learning (DL) methods used in data classification, main datasets utilized, and the role of Internet of things (IoT) systems employed in voice pathology diagnosis. Moreover, this study presents different applications, open challenges, and recommendations for future directions of IoT systems and artificial intelligence (AI) approaches in the voice pathology diagnosis. Finally, this study highlights some limitations of voice pathology datasets in comparison with the role of IoT in the healthcare sector, which shows the urgent need to provide efficient approaches and easy and ideal medical diagnostic procedures and treatments of disease identification for doctors and patients. This review covered voice pathology taxonomy, detection techniques, open challenges, limitations, and recommendations for future directions to provide a clear background for doctors and patients. Standard databases, including the Massachusetts Eye and Ear Infirmary, Saarbruecken Voice Database, and the Arabic Voice Pathology Database, were used in most articles reviewed in this article. The classes, features, and main purpose for voice pathology identification are also highlighted. This study focuses on the extraction of voice pathology features, especially speech analysis, extends feature vectors comprising static and dynamic features, and converts these extended feature vectors into solid vectors before passing them to the recognizer.
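The review's closing point about extending static feature vectors with dynamic features is commonly realized by appending delta and delta-delta trajectories to frame-level features such as MFCCs. The sketch below is a generic illustration of that idea, not drawn from any particular system covered by the review; librosa is assumed and the file name is a placeholder.

```python
# Illustrative sketch of "static + dynamic" features: MFCCs (static) extended
# with delta and delta-delta trajectories (dynamic), then summarized into one
# fixed-length vector for a conventional classifier.
import librosa
import numpy as np

y, sr = librosa.load("voice_sample.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # static features
delta = librosa.feature.delta(mfcc)                      # first-order dynamics
delta2 = librosa.feature.delta(mfcc, order=2)            # second-order dynamics

extended = np.vstack([mfcc, delta, delta2])              # 39 x n_frames
feature_vector = np.concatenate([extended.mean(axis=1),  # summary statistics give a
                                 extended.std(axis=1)])  # fixed-length vector
print(feature_vector.shape)                              # (78,)
```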
7
Maskeliūnas R, Kulikajevas A, Damaševičius R, Pribuišis K, Ulozaitė-Stanienė N, Uloza V. Lightweight Deep Learning Model for Assessment of Substitution Voicing and Speech after Laryngeal Carcinoma Surgery. Cancers (Basel) 2022;14:2366. [PMID: 35625971] [PMCID: PMC9139213] [DOI: 10.3390/cancers14102366]
Abstract
Laryngeal carcinoma is the most common malignant tumor of the upper respiratory tract. Total laryngectomy results in complete and permanent detachment of the upper and lower airways, causing the loss of voice and leaving the patient unable to communicate verbally in the postoperative period. This paper applies modern deep learning methods to objectively classify, extract, and measure substitution voicing after laryngeal oncosurgery from the audio signal. We propose using well-known convolutional neural networks (CNNs), originally developed for image classification, to analyze the voice audio signal. Our approach takes a Mel-frequency spectrogram (MFCC) as the input to a deep neural network architecture. A database of digital speech recordings from 367 male subjects (279 normal and 88 pathological speech samples) was used. Our approach showed the best true-positive rate of all compared state-of-the-art approaches, achieving an overall accuracy of 89.47%.
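Reusing an image-classification CNN for spectrogram input, as proposed above, typically only requires adjusting the input channels and the output layer. The sketch below shows that adaptation with a ResNet-18 backbone as a stand-in; the paper's exact architectures, input shapes, and training setup are not reproduced here, and PyTorch/torchvision are assumed.

```python
# Sketch of adapting an image-classification CNN to 2-class (normal vs.
# pathological) spectrogram classification. ResNet-18 and the input shape
# are placeholders, not the paper's exact models.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)
# Accept single-channel spectrogram "images" instead of RGB photos.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Two output classes: normal speech vs. substitution-voicing / pathological speech.
model.fc = nn.Linear(model.fc.in_features, 2)

spectrogram_batch = torch.randn(4, 1, 128, 256)   # (batch, channel, mel bins, frames)
logits = model(spectrogram_batch)
print(logits.shape)                               # torch.Size([4, 2])
```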
Affiliation(s)
- Rytis Maskeliūnas
- Faculty of Informatics, Kaunas University of Technology, 51368 Kaunas, Lithuania
- Audrius Kulikajevas
- Faculty of Informatics, Kaunas University of Technology, 51368 Kaunas, Lithuania
- Robertas Damaševičius
- Faculty of Informatics, Kaunas University of Technology, 51368 Kaunas, Lithuania
- Kipras Pribuišis
- Department of Otorhinolaryngology, Lithuanian University of Health Sciences, 50061 Kaunas, Lithuania
- Nora Ulozaitė-Stanienė
- Department of Otorhinolaryngology, Lithuanian University of Health Sciences, 50061 Kaunas, Lithuania
- Virgilijus Uloza
- Department of Otorhinolaryngology, Lithuanian University of Health Sciences, 50061 Kaunas, Lithuania
8
Geng L, Shan H, Xiao Z, Wang W, Wei M. Voice pathology detection and classification from speech signals and EGG signals based on a multimodal fusion method. Biomed Eng-Biomed Tech 2021;66:613-625. [PMID: 34845886] [DOI: 10.1515/bmt-2021-0112]
Abstract
Automatic voice pathology detection and classification plays an important role in the diagnosis and prevention of voice disorders. To accurately describe the pronunciation characteristics of patients with dysarthria and improve pathological voice detection, this study proposes a detection method based on a multimodal network structure. First, speech signals and electroglottography (EGG) signals are mapped from the time domain to spectrograms via a short-time Fourier transform (STFT), and a Mel filter bank is applied to the spectrograms to emphasize the signal's harmonics and reduce noise. Second, a pre-trained convolutional neural network (CNN) is used as the backbone network to extract sound state features and vocal cord vibration features from the two signals. To obtain a better classification result, the fused features are fed into a long short-term memory (LSTM) network for voice feature selection and enhancement. The proposed system achieves 95.73% accuracy, a 96.10% F1-score, and 96.73% recall on the Saarbrücken Voice Database (SVD), thus providing a new method for pathological speech detection.
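The pipeline described above (per-modality CNN feature extraction, fusion, then an LSTM over the fused frame sequence) can be outlined as follows. This is a rough sketch under stated assumptions, not the paper's network: the pre-trained backbone is replaced by a tiny CNN, fusion is plain concatenation, and all layer sizes are placeholders. PyTorch is assumed.

```python
# Rough sketch of a two-stream CNN + LSTM classifier over speech and EGG
# Mel spectrograms. Layer sizes and the fusion rule are placeholders.
import torch
import torch.nn as nn

class MultimodalFusionNet(nn.Module):
    def __init__(self, hidden=128, n_classes=2):
        super().__init__()
        # One small CNN per modality, applied to (batch, 1, mel, frames) spectrograms.
        def branch():
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, None)),   # collapse the Mel axis, keep time
            )
        self.speech_cnn, self.egg_cnn = branch(), branch()
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, speech_mel, egg_mel):
        s = self.speech_cnn(speech_mel).squeeze(2).transpose(1, 2)  # (batch, frames, 16)
        e = self.egg_cnn(egg_mel).squeeze(2).transpose(1, 2)        # (batch, frames, 16)
        fused = torch.cat([s, e], dim=-1)                           # simple concatenation fusion
        _, (h, _) = self.lstm(fused)
        return self.classifier(h[-1])                               # healthy vs. pathological

net = MultimodalFusionNet()
logits = net(torch.randn(2, 1, 80, 200), torch.randn(2, 1, 80, 200))
print(logits.shape)   # torch.Size([2, 2])
```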
Affiliation(s)
- Lei Geng
- School of Life Sciences, Tiangong University, Tianjin, China; Tianjin Key Laboratory of Optoelectronic Detection Technology and System, Tianjin, China
- Hongfeng Shan
- School of Electronic and Information Engineering, Tiangong University, Tianjin, China; Tianjin Key Laboratory of Optoelectronic Detection Technology and System, Tianjin, China
- Zhitao Xiao
- School of Life Sciences, Tiangong University, Tianjin, China; Tianjin Key Laboratory of Optoelectronic Detection Technology and System, Tianjin, China
- Wei Wang
- Department of Otorhinolaryngology Head and Neck Surgery, Tianjin First Central Hospital, Tianjin, China; Institute of Otolaryngology of Tianjin, Tianjin, China; Key Laboratory of Auditory Speech and Balance Medicine, Tianjin, China; Key Clinical Discipline of Tianjin (Otolaryngology), Tianjin, China; Otolaryngology Clinical Quality Control Centre, Tianjin, China
- Mei Wei
- Department of Otorhinolaryngology Head and Neck Surgery, Tianjin First Central Hospital, Tianjin, China; Institute of Otolaryngology of Tianjin, Tianjin, China; Key Laboratory of Auditory Speech and Balance Medicine, Tianjin, China; Key Clinical Discipline of Tianjin (Otolaryngology), Tianjin, China; Otolaryngology Clinical Quality Control Centre, Tianjin, China