1
Wujian Y, Yingcong Z, Yuehai C, Yijun L, Zhiwei M. Post-Stroke Dysarthria Voice Recognition based on Fusion Feature MSA and 1D. Comput Methods Biomech Biomed Engin 2024:1-11. [PMID: 39422438] [DOI: 10.1080/10255842.2024.2410228]
Abstract
Post-stroke dysarthria (PSD) is one of the common sequelae of stroke. PSD can harm patients' quality of life and, in severe cases, be life-threatening. Most existing methods rely on frequency-domain features to recognize pathological voice, which makes it hard to fully represent its characteristics; although some results have been achieved, practical application remains distant. Therefore, an improved deep learning model is proposed to distinguish pathological from normal voice, using a novel fusion feature (MSA) and an improved 1D ResNet hybridized with a bidirectional LSTM and dilated convolution (named 1D DRN-biLSTM). The experimental results show that the fusion features yield a greater improvement in pathological speech recognition than analysis of MFCC features alone and better capture the hidden features that characterize pathological speech. In terms of model structure, introducing dilated convolution and LSTM further improves the performance of the 1D ResNet over ordinary networks such as plain CNNs and LSTMs. The accuracy of the method reaches 82.41% at the syllable level and 100% at the speaker level. The scheme outperforms existing methods in feature-learning capability and recognition rate and should play an important role in the assessment and diagnosis of PSD in China.
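To make the architecture above concrete, here is a minimal PyTorch sketch of a 1D residual block with dilated convolution feeding a bidirectional LSTM, in the spirit of the 1D DRN-biLSTM; all layer sizes, the 64-dimensional fused-feature input, and the time-pooling strategy are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DilatedResBlock1d(nn.Module):
    """One 1D residual block whose convolutions use a given dilation."""
    def __init__(self, channels, dilation):
        super().__init__()
        # padding = dilation keeps the sequence length unchanged for kernel size 3
        self.conv1 = nn.Conv1d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.bn1, self.bn2 = nn.BatchNorm1d(channels), nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual connection

class DRNBiLSTM(nn.Module):
    """Dilated 1D residual stack followed by a bidirectional LSTM classifier."""
    def __init__(self, n_features=64, n_classes=2):
        super().__init__()
        self.stem = nn.Conv1d(n_features, 64, 3, padding=1)
        self.blocks = nn.Sequential(
            DilatedResBlock1d(64, dilation=1),
            DilatedResBlock1d(64, dilation=2),
            DilatedResBlock1d(64, dilation=4),  # growing dilation widens the receptive field
        )
        self.lstm = nn.LSTM(64, 32, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(64, n_classes)  # 64 = 2 directions x 32 hidden units

    def forward(self, x):                      # x: (batch, n_features, frames)
        h = self.blocks(self.stem(x))
        h, _ = self.lstm(h.transpose(1, 2))    # to (batch, frames, channels)
        return self.fc(h.mean(dim=1))          # average over time, then classify

logits = DRNBiLSTM()(torch.randn(4, 64, 200))  # 4 utterances, 200 feature frames each
```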
Affiliation(s)
- Ye Wujian
- School of Integrated Circuit, Guangdong University of Technology, Guangzhou, Guangdong, China
- Zheng Yingcong
- School of Integrated Circuit, Guangdong University of Technology, Guangzhou, Guangdong, China
- Chen Yuehai
- School of Integrated Circuit, Guangdong University of Technology, Guangzhou, Guangdong, China
- Liu Yijun
- School of Integrated Circuit, Guangdong University of Technology, Guangzhou, Guangdong, China
- Mou Zhiwei
- Department of Rehabilitation, Guangzhou Red Cross Hospital of Jinan University, Guangzhou 510240, China
2
Cai J, Song Y, Wu J, Chen X. Voice Disorder Classification Using Wav2vec 2.0 Feature Extraction. J Voice 2024:S0892-1997(24)00293-5. [PMID: 39327203] [DOI: 10.1016/j.jvoice.2024.09.002]
Abstract
OBJECTIVES: The study aims to classify normal and pathological voices by leveraging the wav2vec 2.0 model as a feature extraction method in conjunction with machine learning classifiers.
METHODS: Voice recordings were sourced from the publicly accessible VOICED database. The data underwent preprocessing, including normalization and data augmentation, before being input into the wav2vec 2.0 model for feature extraction. The extracted features were then used to train four machine learning models: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree (DT), and Random Forest (RF). These were evaluated using stratified K-fold cross-validation. Performance metrics such as accuracy, precision, recall, F1-score, macro average, micro average, receiver-operating characteristic (ROC) curve, and confusion matrix were used to assess model performance.
RESULTS: The RF model achieved the highest accuracy (0.98 ± 0.02), alongside strong recall (0.97 ± 0.04), F1-score (0.95 ± 0.05), and consistently high area under the curve (AUC) values approaching 1.00, indicating superior classification performance. The DT model also performed excellently, particularly in precision (0.97 ± 0.02) and F1-score (0.96 ± 0.02), with AUC values ranging from 0.86 to 1.00. Macro- and micro-averaged analyses showed that the DT model provided the most balanced and consistent performance across all classes, while the RF model exhibited robust performance across multiple metrics. Data augmentation significantly enhanced the performance of all models, with marked improvements in accuracy, recall, F1-score, and AUC, especially for the RF and DT models. ROC curve analysis further confirmed the consistency and reliability of the RF and SVM models across folds, while confusion matrix analysis revealed that the RF and SVM models had the fewest misclassifications in distinguishing "Normal" and "Pathological" samples. Consequently, the RF and DT models emerged as the most robust performers, making them particularly well suited for the voice classification task in this study.
CONCLUSIONS: Combining wav2vec 2.0 feature extraction with machine learning models proved highly effective in classifying normal and pathological voices, achieving exceptional accuracy and robustness across the evaluation metrics.
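As a rough illustration of this pipeline, the sketch below extracts mean-pooled wav2vec 2.0 embeddings and evaluates a Random Forest under stratified K-fold cross-validation; the checkpoint name, the mean pooling, and the placeholder data are assumptions rather than the paper's exact setup.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

CKPT = "facebook/wav2vec2-base-960h"  # assumed checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
encoder = Wav2Vec2Model.from_pretrained(CKPT).eval()

def embed(waveform, sr=16000):
    """One fixed-length embedding per recording: mean over wav2vec 2.0 frames."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Placeholder recordings; in practice these would be preprocessed VOICED samples.
recordings = [np.random.randn(16000).astype(np.float32) for _ in range(20)]
labels = np.array([0, 1] * 10)  # 0 = normal, 1 = pathological

X = np.stack([embed(w) for w in recordings])
scores = cross_val_score(RandomForestClassifier(n_estimators=200), X, labels,
                         cv=StratifiedKFold(n_splits=5))
print(f"mean CV accuracy: {scores.mean():.2f}")
```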
Affiliation(s)
- Jie Cai
- Department of Otorhinolaryngology, Head and Neck Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China
- Yuliang Song
- Department of Otorhinolaryngology, Head and Neck Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China; Department of Ear, Nose, and Throat-Head and Neck Surgery, University Hospital of Nancy, Hospital of Brabois, Vandoeuvre-les-Nancy, France
- Jianghao Wu
- Department of Otorhinolaryngology, Head and Neck Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China
- Xiong Chen
- Department of Otorhinolaryngology, Head and Neck Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China
3
Barlow J, Sragi Z, Rivera-Rivera G, Al-Awady A, Daşdöğen Ü, Courey MS, Kirke DN. The Use of Deep Learning Software in the Detection of Voice Disorders: A Systematic Review. Otolaryngol Head Neck Surg 2024; 170:1531-1543. [PMID: 38168017] [DOI: 10.1002/ohn.636]
Abstract
OBJECTIVE: To summarize the use of deep learning in the detection of voice disorders using acoustic and laryngoscopic input, compare specific neural networks in terms of accuracy, and assess their effectiveness relative to expert clinical visual examination.
DATA SOURCES: Embase, MEDLINE, and Cochrane Central.
REVIEW METHODS: Databases were screened through November 11, 2023 for relevant studies. The inclusion criteria required studies to utilize a specified deep learning method, use laryngoscopic or acoustic input, and measure the accuracy of binary classification between healthy patients and those with voice disorders.
RESULTS: Thirty-four studies met the inclusion criteria: 18 focusing on voice analysis, 15 on imaging analysis, and 1 on both. Across the 18 acoustic studies, 21 programs were used to identify organic and functional voice disorders. These included 10 convolutional neural networks (CNNs), 6 multilayer perceptrons (MLPs), and 5 other neural networks. The binary classification systems yielded a mean accuracy of 89.0% overall, including 93.7% for the MLP programs and 84.5% for the CNNs. Among the 15 imaging studies, 23 programs were utilized, with a mean accuracy of 91.3%; the 20 CNNs achieved a mean accuracy of 92.6% compared with 83.0% for the 3 MLPs.
CONCLUSION: Deep learning models were shown to be highly accurate in the detection of voice pathology, with CNNs most effective for assessing laryngoscopic images and MLPs most effective for assessing acoustic input. While deep learning methods outperformed expert clinical examination in limited comparisons, further studies incorporating external validation are necessary.
Affiliation(s)
- Joshua Barlow
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Zara Sragi
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Gabriel Rivera-Rivera
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Abdurrahman Al-Awady
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Ümit Daşdöğen
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Mark S Courey
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
- Diana N Kirke
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York City, New York, USA
4
Shaikh AAS, Bhargavi MS, Naik GR. Unraveling the complexities of pathological voice through saliency analysis. Comput Biol Med 2023; 166:107566. [PMID: 37857135] [DOI: 10.1016/j.compbiomed.2023.107566]
Abstract
The human voice is an essential communication tool, but various disorders and habits can disrupt it, making the diagnosis of pathological and abnormal voices very important. Conventional diagnosis of these voice pathologies can be invasive and costly, whereas artificial intelligence and computer-aided classification tools can detect voice pathology effectively and non-invasively. Previous studies focused primarily on binary classification, leaving limited attention to multi-class classification. This study proposes three different neural network architectures to investigate the feature characteristics of three voice pathologies (Hyperkinetic Dysphonia, Hypokinetic Dysphonia, and Reflux Laryngitis) and healthy voices using multi-class classification on the VOice ICar fEDerico II (VOICED) dataset. To cope with noisy data, the study proposes a UNet++ autoencoder-based denoiser for accurate feature extraction. The architectures comprise a Multi-Layer Perceptron (MLP) trained on structured feature sets, a Short-Time Fourier Transform (STFT) model, and a Mel-Frequency Cepstral Coefficients (MFCC) model. The MLP model trained on 143 features achieved 97.1% accuracy, while the STFT model showed similar performance with an increased sensitivity of 99.8%. The MFCC model maintained 97.1% accuracy with a smaller model size and improved accuracy on the Reflux Laryngitis class. Through saliency analysis, the study identifies crucial features and reveals that detecting voice abnormalities requires identifying regions of inaudible high-pitched sound. The study also highlights the challenges posed by limited and disjointed pathological voice databases and proposes solutions for improving the performance of voice-abnormality classification. Overall, the findings have potential applications in clinical practice and specialized audio-capturing tools.
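The saliency analysis mentioned above can be approximated with plain input gradients; the sketch below computes gradient saliency for a toy MLP over the 143 structured features named in the abstract. The layer sizes and the four-class head (three pathologies plus healthy) are assumptions, not the published model.

```python
import torch
import torch.nn as nn

# Toy stand-in for the MLP: 143 input features, 4 output classes.
mlp = nn.Sequential(nn.Linear(143, 64), nn.ReLU(), nn.Linear(64, 4))

def gradient_saliency(model, features, target_class):
    """Absolute gradient of the target-class score w.r.t. each input feature."""
    x = features.clone().requires_grad_(True)
    model(x)[target_class].backward()
    return x.grad.abs()

feats = torch.randn(143)                             # placeholder feature vector
sal = gradient_saliency(mlp, feats, target_class=2)  # e.g., one pathology class
print(sal.topk(5).indices)  # the five features that most influenced the score
```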
Affiliation(s)
- Abdullah Abdul Sattar Shaikh
- Department of Computer Science and Engineering, Bangalore Institute of Technology, Bangalore, 560004, Karnataka, India
- M S Bhargavi
- Department of Computer Science and Engineering, Bangalore Institute of Technology, Bangalore, 560004, Karnataka, India
- Ganesh R Naik
- Adelaide Institute for Sleep Health, Flinders University, Bedford Park 5042, Adelaide, SA, Australia
5
Liu GS, Hodges JM, Yu J, Sung CK, Erickson-DiRenzo E, Doyle PC. End-to-end deep learning classification of vocal pathology using stacked vowels. Laryngoscope Investig Otolaryngol 2023; 8:1312-1318. [PMID: 37899847] [PMCID: PMC10601590] [DOI: 10.1002/lio2.1144]
Abstract
Objectives: Advances in artificial intelligence (AI) have increased the feasibility of classifying voice disorders from voice recordings as a screening tool. This work builds upon previous models that take in single-vowel recordings by analyzing multiple vowel recordings simultaneously to enhance prediction of vocal pathology.
Methods: Voice samples from the Saarbruecken Voice Database, including three sustained vowels (/a/, /i/, /u/) from 687 healthy participants and 334 dysphonic patients, were used to train 1-dimensional convolutional neural network models for multiclass classification of healthy, hyperfunctional dysphonia, and laryngitis voice recordings. Three models were trained: (1) a baseline model that analyzed individual vowels in isolation, (2) a stacked vowel model that analyzed the three vowels (/a/, /i/, /u/) at neutral pitch simultaneously, and (3) a stacked pitch model that analyzed the /a/ vowel at three pitches (low, neutral, and high) simultaneously.
Results: For multiclass classification of healthy, hyperfunctional dysphonia, and laryngitis voice recordings, the stacked vowel model outperformed the baseline and stacked pitch models (F1 score 0.81 vs. 0.77 and 0.78, respectively). In particular, it achieved higher class-specific performance on hyperfunctional dysphonia samples (F1 score 0.56 vs. 0.49 and 0.50, respectively).
Conclusions: This study demonstrates the feasibility and potential of analyzing multiple sustained vowel recordings simultaneously to improve AI-driven screening and classification of vocal pathology. The stacked vowel model architecture in particular offers promise for enhancing this approach.
Lay Summary: AI analysis of multiple vowel recordings can improve classification of voice pathologies compared with models using a single sustained vowel and offers a strategy to enhance AI-driven screening of voice disorders.
Level of Evidence: 3.
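A compact way to realize the stacked vowel idea is to treat the three vowel recordings as three input channels of a 1-dimensional CNN, so the network sees /a/, /i/, and /u/ jointly; the sketch below is an illustrative assumption about the architecture, not the published model.

```python
import torch
import torch.nn as nn

class StackedVowelCNN(nn.Module):
    def __init__(self, n_classes=3):  # healthy, hyperfunctional dysphonia, laryngitis
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(3, 16, kernel_size=9, stride=2), nn.ReLU(),  # 3 channels = 3 vowels
            nn.Conv1d(16, 32, kernel_size=9, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # collapse the time axis
        )
        self.fc = nn.Linear(32, n_classes)

    def forward(self, x):              # x: (batch, 3, samples)
        return self.fc(self.features(x).squeeze(-1))

# Each vowel trimmed or padded to a common length, then stacked channel-wise.
a, i, u = (torch.randn(16000) for _ in range(3))
x = torch.stack([a, i, u]).unsqueeze(0)   # (1, 3, 16000)
print(StackedVowelCNN()(x).shape)         # torch.Size([1, 3]) class logits
```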
Affiliation(s)
- George S. Liu
- Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, California, USA
- Division of Laryngology, Stanford University School of Medicine, Stanford, California, USA
- Jordan M. Hodges
- Computer Science Department, School of Engineering, Stanford University, Stanford, California, USA
- Jingzhi Yu
- Biomedical Informatics, Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, California, USA
- C. Kwang Sung
- Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, California, USA
- Division of Laryngology, Stanford University School of Medicine, Stanford, California, USA
- Elizabeth Erickson-DiRenzo
- Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, California, USA
- Division of Laryngology, Stanford University School of Medicine, Stanford, California, USA
- Philip C. Doyle
- Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, California, USA
- Division of Laryngology, Stanford University School of Medicine, Stanford, California, USA
6
Zhang J, Wu J, Qiu Y, Song A, Li W, Li X, Liu Y. Intelligent speech technologies for transcription, disease diagnosis, and medical equipment interactive control in smart hospitals: A review. Comput Biol Med 2023; 153:106517. [PMID: 36623438] [PMCID: PMC9814440] [DOI: 10.1016/j.compbiomed.2022.106517]
Abstract
The growth and aging of the world population have driven a shortage of medical resources in recent years, especially during the COVID-19 pandemic. Fortunately, the rapid development of robotics and artificial intelligence is helping the healthcare field adapt to these challenges. Among these technologies, intelligent speech technology (IST) has served doctors and patients by improving the efficiency of medical workflows and alleviating the medical burden. However, problems such as noise interference in complex medical scenarios and pronunciation differences between patients and healthy speakers hamper the broad application of IST in hospitals. In recent years, technologies such as machine learning have developed rapidly in intelligent speech recognition and are expected to solve these problems. This paper first introduces IST's procedure and system architecture and analyzes its application in medical scenarios. Secondly, we review existing IST applications in smart hospitals in detail, including electronic medical documentation, disease diagnosis and evaluation, and human-medical equipment interaction. In addition, we elaborate on an application case of IST in the early recognition, diagnosis, rehabilitation training, evaluation, and daily care of stroke patients. Finally, we discuss IST's limitations, challenges, and future directions in the medical field, and propose a novel medical voice analysis system architecture that employs active hardware, active software, and human-computer interaction to realize intelligent and evolvable speech recognition. This comprehensive review and the proposed architecture offer directions for future studies on IST and its applications in smart hospitals.
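For the transcription use case, a minimal automatic-speech-recognition call of the kind an electronic-documentation IST module might wrap is sketched below; the Whisper checkpoint and the file name are assumed stand-ins, not systems discussed in the review.

```python
from transformers import pipeline

# Assumed checkpoint; any ASR model adapted to clinical speech could be swapped in.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
result = asr("consultation.wav")  # a recorded doctor-patient exchange (hypothetical file)
print(result["text"])             # raw transcript, before clinical post-processing
```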
Affiliation(s)
- Jun Zhang
- The State Key Laboratory of Bioelectronics, School of Instrument Science and Engineering, Southeast University, Nanjing, 210096, China (corresponding author)
- Jingyue Wu
- The State Key Laboratory of Bioelectronics, School of Instrument Science and Engineering, Southeast University, Nanjing, 210096, China
- Yiyi Qiu
- The State Key Laboratory of Bioelectronics, School of Instrument Science and Engineering, Southeast University, Nanjing, 210096, China
- Aiguo Song
- The State Key Laboratory of Bioelectronics, School of Instrument Science and Engineering, Southeast University, Nanjing, 210096, China
- Weifeng Li
- Department of Emergency Medicine, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, 510080, China
- Xin Li
- Department of Emergency Medicine, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, 510080, China
- Yecheng Liu
- Emergency Department, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Science and Peking Union Medical College, Beijing, 100730, China
7
Wang J, Xu H, Peng X, Liu J, He C. Pathological voice classification based on multi-domain features and deep hierarchical extreme learning machine. J Acoust Soc Am 2023; 153:423. [PMID: 36732280] [DOI: 10.1121/10.0016869]
Abstract
The intelligent, data-driven screening of pathological voice signals is a non-invasive, real-time tool for computer-aided diagnosis that has attracted increasing attention from researchers and clinicians. In this paper, the authors propose multi-domain features and the hierarchical extreme learning machine (H-ELM) for the automatic identification of voice disorders. A sufficient number of sensitive features are first extracted from the original voice signal through multi-domain feature extraction (time-domain features, sample entropy based on ensemble empirical mode decomposition, and gammatone frequency cepstral coefficients). To eliminate redundancy in the high-dimensional features, neighborhood component analysis is then applied to select sensitive features from the high-dimensional feature vectors, improving the efficiency of network training and reducing overfitting. The selected features are then used to train the H-ELM for pathological voice classification. Experiments showed that the sensitivity, specificity, F1 score, and accuracy of the H-ELM were 99.37%, 98.61%, 99.37%, and 98.99%, respectively. The proposed method is therefore feasible for the initial classification of pathological voice signals.
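For readers unfamiliar with extreme learning machines, the NumPy sketch below shows the single-layer building block: random, untrained hidden weights and a closed-form least-squares solution for the output weights. The H-ELM stacks such layers; the sizes and data here are placeholder assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, Y, n_hidden=128):
    """Fit a basic ELM: random hidden layer, least-squares output weights."""
    W = rng.standard_normal((X.shape[1], n_hidden))  # random projection, never trained
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)                           # hidden-layer activations
    beta = np.linalg.pinv(H) @ Y                     # Moore-Penrose least squares
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# X: (samples, features retained after NCA selection); Y: one-hot labels.
X = rng.standard_normal((200, 30))
Y = np.eye(2)[rng.integers(0, 2, 200)]
W, b, beta = elm_fit(X, Y)
accuracy = (elm_predict(X, W, b, beta).argmax(1) == Y.argmax(1)).mean()
print(f"training accuracy: {accuracy:.2f}")
```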
Affiliation(s)
- Junlang Wang
- School of Mechanical Engineering, Southwest Jiaotong University, Chengdu, 610031, China
- Huoyao Xu
- School of Mechanical Engineering, Southwest Jiaotong University, Chengdu, 610031, China
- Xiangyu Peng
- School of Mechanical Engineering, Southwest Jiaotong University, Chengdu, 610031, China
- Jie Liu
- School of Mechanical Engineering, Southwest Jiaotong University, Chengdu, 610031, China
- Chaoming He
- School of Mechanical Engineering, Southwest Jiaotong University, Chengdu, 610031, China
8
Maskeliūnas R, Kulikajevas A, Damaševičius R, Pribuišis K, Ulozaitė-Stanienė N, Uloza V. Lightweight Deep Learning Model for Assessment of Substitution Voicing and Speech after Laryngeal Carcinoma Surgery. Cancers (Basel) 2022; 14:2366. [PMID: 35625971] [PMCID: PMC9139213] [DOI: 10.3390/cancers14102366]
Abstract
Laryngeal carcinoma is the most common malignant tumor of the upper respiratory tract. Total laryngectomy causes complete and permanent separation of the upper and lower airways, resulting in loss of voice and leaving the patient unable to communicate verbally in the postoperative period. This paper aims to exploit modern deep learning research to objectively classify, extract, and measure substitution voicing after laryngeal oncosurgery from the audio signal. We propose applying well-known convolutional neural networks (CNNs), developed for image classification, to the analysis of the voice audio signal, taking a Mel-frequency spectrogram as the input to the deep neural network architecture. A database of digital speech recordings of 367 male subjects (279 normal and 88 pathological speech samples) was used. Our approach showed the best true-positive rate of all the compared state-of-the-art approaches, achieving an overall accuracy of 89.47%.
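A minimal sketch of this front end follows, assuming librosa for the Mel spectrogram and ResNet-18 as a stand-in for the unspecified image CNN; the sample rate, Mel-band count, and channel replication are illustrative choices, not the authors' settings.

```python
import librosa
import numpy as np
import torch
from torchvision.models import resnet18

# Load a recording and render it as a log-Mel spectrogram "image".
wav, sr = librosa.load("speech_sample.wav", sr=16000)  # hypothetical file
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Replicate the single spectrogram across 3 channels to fit an RGB-input CNN.
x = torch.tensor(mel_db).unsqueeze(0).repeat(3, 1, 1).unsqueeze(0).float()

model = resnet18(num_classes=2)  # normal speech vs. substitution voicing
print(model(x).shape)            # torch.Size([1, 2]) class logits
```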
Affiliation(s)
- Rytis Maskeliūnas
- Faculty of Informatics, Kaunas University of Technology, 51368 Kaunas, Lithuania
- Audrius Kulikajevas
- Faculty of Informatics, Kaunas University of Technology, 51368 Kaunas, Lithuania
- Robertas Damaševičius
- Faculty of Informatics, Kaunas University of Technology, 51368 Kaunas, Lithuania (corresponding author)
- Kipras Pribuišis
- Department of Otorhinolaryngology, Lithuanian University of Health Sciences, 50061 Kaunas, Lithuania
- Nora Ulozaitė-Stanienė
- Department of Otorhinolaryngology, Lithuanian University of Health Sciences, 50061 Kaunas, Lithuania
- Virgilijus Uloza
- Department of Otorhinolaryngology, Lithuanian University of Health Sciences, 50061 Kaunas, Lithuania