1. Camarena F, Gonzalez-Mendoza M, Chang L. Knowledge Distillation in Video-Based Human Action Recognition: An Intuitive Approach to Efficient and Flexible Model Training. J Imaging 2024; 10:85. [PMID: 38667983] [PMCID: PMC11051277] [DOI: 10.3390/jimaging10040085]
Abstract
Training a model to recognize human actions in videos is computationally intensive. While modern strategies employ transfer learning to make the process more efficient, they still face challenges regarding flexibility and efficiency: existing solutions are limited in functionality and rely heavily on pretrained architectures, which can restrict their applicability to diverse scenarios. Our work explores knowledge distillation (KD) for enhancing the training of self-supervised video models in three aspects: improving classification accuracy, accelerating model convergence, and increasing model flexibility under regular and limited-data scenarios. We tested our method on the UCF101 dataset using different data proportions: 100%, 50%, 25%, and 2%. We found that using knowledge distillation to guide the model's training outperforms traditional training: classification accuracy is preserved while the time needed for the model to converge is reduced, both in standard settings and in a data-scarce environment. Additionally, knowledge distillation enables cross-architecture flexibility, allowing model customization for applications ranging from resource-limited to high-performance scenarios.
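The paper's exact distillation setup is not reproduced in this listing, but the core mechanism it relies on, training a student against a teacher's temperature-softened predictions alongside the hard labels, can be sketched as follows. All names, the temperature, and the mixing weight are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted sum of a soft KL term against the teacher's softened
    distribution and a hard cross-entropy term against ground-truth labels."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student), scaled by T^2 to keep gradient magnitudes comparable
    soft = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                               - np.log(p_student + 1e-12)), axis=-1) * T * T
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return np.mean(alpha * soft + (1 - alpha) * hard)

# Toy example: 2 video clips, 3 action classes
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])
student = np.array([[2.0, 1.5, 0.3], [0.5, 2.0, 0.4]])
loss = distillation_loss(student, teacher, labels=np.array([0, 1]))
print(loss)
```

The loss shrinks as the student's logits approach the teacher's, which is what lets the teacher guide convergence.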
Affiliation(s)
- Fernando Camarena
- School of Engineering and Science, Tecnologico de Monterrey, Nuevo León 64700, Mexico
2. Argade D, Khairnar V, Vora D, Patil S, Kotecha K, Alfarhood S. Multimodal Abstractive Summarization using bidirectional encoder representations from transformers with attention mechanism. Heliyon 2024; 10:e26162. [PMID: 38420442] [PMCID: PMC10900395] [DOI: 10.1016/j.heliyon.2024.e26162]
Abstract
In recent decades, abstractive text summarization using multimodal input has attracted many researchers because it can gather information from various sources to create a concise summary. However, existing multimodal summarization methods produce summaries only for short videos and give poor results on lengthy ones. To address these issues, this research presents Multimodal Abstractive Summarization using Bidirectional Encoder Representations from Transformers (MAS-BERT) with an attention mechanism. The purpose of video summarization is to speed up searching through a large collection of videos, so that users can quickly decide whether a video is relevant by reading its summary. Initially, the data is obtained from the publicly available How2 dataset. The textual data, embedded in the embedding layer, is encoded with a bidirectional Gated Recurrent Unit (Bi-GRU) encoder, while the audio and video features are encoded with a Long Short-Term Memory (LSTM) encoder. A BERT-based attention mechanism then combines the modalities, and finally a Bi-GRU-based decoder summarizes the fused multimodal representation. Experimental results show that the proposed MAS-BERT achieves a Rouge-1 score of 60.2, whereas the existing Decoder-only Multimodal Transformer (D-MmT) and the Factorized Multimodal Transformer based Decoder Only Language model (FLORAL) achieve 49.58 and 56.89, respectively. Our work provides users with better contextual information and a better experience, and could help video-sharing platforms retain customers by letting users find relevant videos from their summaries.
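The MAS-BERT architecture itself is not available here, but the fusion idea the abstract describes, letting one modality's encoded features attend over the others before decoding, reduces to scaled dot-product attention. A minimal sketch, with all dimensions and feature tensors as synthetic stand-ins:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
text_feats = rng.normal(size=(5, d))   # stand-in for Bi-GRU outputs (5 tokens)
audio_feats = rng.normal(size=(3, d))  # stand-in for LSTM audio outputs
video_feats = rng.normal(size=(4, d))  # stand-in for LSTM video outputs

# Text queries attend over the concatenated audio/video memory, producing a
# fused per-token representation a decoder could summarize from.
memory = np.concatenate([audio_feats, video_feats], axis=0)
fused, weights = scaled_dot_product_attention(text_feats, memory, memory)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to 1, i.e. each text token distributes its attention across the seven audio/video timesteps.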
Affiliation(s)
- Dakshata Argade
- Terna Engineering College, Nerul, Navi Mumbai, 400706, India
- Deepali Vora
- Symbiosis Institute of Technology, Pune Campus, Symbiosis International (Deemed University), Pune, 412115, India
- Shruti Patil
- Symbiosis Institute of Technology, Pune Campus, Symbiosis International (Deemed University), Pune, 412115, India
- Symbiosis Centre for Applied Artificial Intelligence (SCAAI), Symbiosis Institute of Technology Pune Campus, Symbiosis International (Deemed University) (SIU), Lavale, Pune, 412115, India
- Ketan Kotecha
- Symbiosis Centre for Applied Artificial Intelligence (SCAAI), Symbiosis Institute of Technology Pune Campus, Symbiosis International (Deemed University) (SIU), Lavale, Pune, 412115, India
- Sultan Alfarhood
- Department of Computer Science, College of Computer and Information Sciences, King Saud University, P.O.Box 51178, Riyadh, 11543, Saudi Arabia
3. Paramasivam K, Sindha MMR, Balakrishnan SB. KNN-Based Machine Learning Classifier Used on Deep Learned Spatial Motion Features for Human Action Recognition. Entropy (Basel) 2023; 25:844. [PMID: 37372188] [DOI: 10.3390/e25060844]
Abstract
Human action recognition (HAR) is an essential process in surveillance video analysis, used to understand people's behavior and ensure safety. Most existing methods for HAR use computationally heavy networks such as 3D CNNs and two-stream networks. To alleviate the challenges in implementing and training 3D deep learning networks, which have more parameters, a customized lightweight directed-acyclic-graph-based residual 2D CNN with fewer parameters was designed from scratch and named HARNet. A novel pipeline for constructing spatial motion data from raw video input is presented for the latent representation learning of human actions. The constructed input is fed to the network for simultaneous operation over spatial and motion information in a single stream, and the latent representation learned at the fully connected layer is extracted and fed to conventional machine learning classifiers for action recognition. The proposed work was empirically verified, and the experimental results were compared with those of existing methods. The results show that the proposed method outperforms state-of-the-art (SOTA) methods with improvements of 2.75% on UCF101, 10.94% on HMDB51, and 0.18% on the KTH dataset.
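HARNet itself is not available in this listing, but the final step the abstract describes, feeding fully-connected-layer features to a conventional k-NN classifier, reduces to the following. The feature vectors here are synthetic stand-ins for the learned latent representations:

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Classify a query feature vector by majority vote among its
    k nearest training features (Euclidean distance)."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = np.bincount(train_y[nearest])
    return int(np.argmax(votes))

rng = np.random.default_rng(42)
# Synthetic "latent representations": two well-separated action clusters
walk = rng.normal(loc=0.0, scale=0.5, size=(20, 16))
jump = rng.normal(loc=3.0, scale=0.5, size=(20, 16))
X = np.vstack([walk, jump])
y = np.array([0] * 20 + [1] * 20)

query = rng.normal(loc=3.0, scale=0.5, size=16)  # feature near the "jump" cluster
print(knn_predict(X, y, query, k=5))
```

The point of the paper's design is that when the CNN produces well-separated latent features, even this simple classifier recovers the action label reliably.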
Affiliation(s)
- Kalaivani Paramasivam
- Department of Electronics and Communication Engineering, Government College of Engineering, Bodinayakanur 625582, Tamilnadu, India
- Mohamed Mansoor Roomi Sindha
- Department of Electronics and Communication Engineering, Thiagarajar College of Engineering, Madurai 625015, Tamilnadu, India
- Sathya Bama Balakrishnan
- Department of Electronics and Communication Engineering, Thiagarajar College of Engineering, Madurai 625015, Tamilnadu, India
4. Ottakath N, Al-Maadeed S. Vehicle Instance Segmentation Polygonal Dataset for a Private Surveillance System. Sensors (Basel) 2023; 23:3642. [PMID: 37050701] [PMCID: PMC10098633] [DOI: 10.3390/s23073642]
Abstract
Vehicle identification and re-identification are essential tools for traffic surveillance. However, with cameras at every street corner, there is a growing need for privacy-aware surveillance. Automated surveillance can be achieved through computer vision tasks such as vehicle segmentation, classification of the vehicle's make and model, and license plate detection. To obtain a unique representation of every vehicle on the road with only the region of interest extracted, instance segmentation is applied. With the frontal part of the vehicle segmented for privacy, the vehicle make is identified along with the license plate. To achieve this, a dataset was annotated with a polygonal bounding box of the frontal region and license plate localization. The state-of-the-art method Mask R-CNN was utilized to identify the best-performing model. Further, data augmentation using multiple techniques was evaluated for better generalization of the dataset. The results showed improved classification and a high mAP compared to previous approaches on the same dataset: a classification accuracy of 99.2% was obtained, and segmentation achieved a high mAP of 99.67%. Of the data augmentation approaches employed to balance and generalize the dataset, the mosaic-tiled approach produced the highest accuracy.
5. Atif O, Lee J, Park D, Chung Y. Behavior-Based Video Summarization System for Dog Health and Welfare Monitoring. Sensors (Basel) 2023; 23:2892. [PMID: 36991606] [PMCID: PMC10054391] [DOI: 10.3390/s23062892]
Abstract
The popularity of dogs has been increasing owing to factors such as the physical and mental health benefits associated with raising them. While owners care about their dogs' health and welfare, it is difficult for them to assess these, and frequent veterinary checkups represent a growing financial burden. In this study, we propose a behavior-based video summarization and visualization system for monitoring a dog's behavioral patterns to help assess its health and welfare. The system proceeds in four modules: (1) a video data collection and preprocessing module; (2) an object detection-based module for retrieving image sequences where the dog is alone and cropping them to reduce background noise; (3) a dog behavior recognition module using two-stream EfficientNetV2 to extract appearance and motion features from the cropped images and their respective optical flow, followed by a long short-term memory (LSTM) model to recognize the dog's behaviors; and (4) a summarization and visualization module to provide effective visual summaries of the dog's location and behavior information to help assess and understand its health and welfare. The experimental results show that the system achieved an average F1 score of 0.955 for behavior recognition, with an execution time allowing real-time processing, while the summarization and visualization results demonstrate how the system can help owners assess and understand their dog's health and welfare.
Affiliation(s)
- Othmane Atif
- Department of Computer and Information Science, Korea University, Sejong City 30019, Republic of Korea
- Jonguk Lee
- Department of Computer Convergence Software, Sejong Campus, Korea University, Sejong City 30019, Republic of Korea
- Daihee Park
- Department of Computer Convergence Software, Sejong Campus, Korea University, Sejong City 30019, Republic of Korea
- Yongwha Chung
- Department of Computer Convergence Software, Sejong Campus, Korea University, Sejong City 30019, Republic of Korea
6. Yue R, Tian Z, Du S. Action recognition based on RGB and skeleton data sets: A survey. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.09.071]
7. Robust appearance modeling for object detection and tracking: a survey of deep learning approaches. Prog Artif Intell 2022. [DOI: 10.1007/s13748-022-00290-6]
8. Wang F, Chen J, Xie Z, Ai Y, Zhang W. Local sharpness failure detection of camera module lens based on image blur assessment. Appl Intell 2022. [DOI: 10.1007/s10489-022-03948-9]
9. Arshad MH, Bilal M, Gani A. Human Activity Recognition: Review, Taxonomy and Open Challenges. Sensors (Basel) 2022; 22:6463. [PMID: 36080922] [PMCID: PMC9460866] [DOI: 10.3390/s22176463]
Abstract
Nowadays, Human Activity Recognition (HAR) is widely used in a variety of domains, and vision- and sensor-based data enable cutting-edge technologies to detect, recognize, and monitor human activities. Several reviews and surveys on HAR have already been published, but due to the constantly growing literature, their status needed to be updated. Hence, this review aims to provide insights on the state of the HAR literature published since 2018. The ninety-five articles reviewed in this study are classified to highlight application areas, data sources, techniques, and open research challenges in HAR. The majority of existing research appears to have concentrated on activities of daily living, followed by individual and group-based user activities. However, there is little literature on detecting real-time activities such as suspicious activity, surveillance, and healthcare. A major portion of existing studies used Closed-Circuit Television (CCTV) videos and mobile sensor data. Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Support Vector Machines (SVM) are the most prominent techniques utilized for HAR in the literature reviewed. Lastly, the limitations and open challenges that need to be addressed are discussed.
Affiliation(s)
- Muhammad Haseeb Arshad
- Department of Computer Science, National University of Computer and Emerging Sciences, Chiniot-Faisalabad Campus, Chiniot 35400, Pakistan
- Muhammad Bilal
- Department of Software Engineering, National University of Computer and Emerging Sciences, Chiniot-Faisalabad Campus, Chiniot 35400, Pakistan
- Abdullah Gani
- Faculty of Computing and Informatics, University Malaysia Sabah, Kota Kinabalu 88400, Sabah, Malaysia
10. Zhou D, Chen G, Xu F. Application of Deep Learning Technology in Strength Training of Football Players and Field Line Detection of Football Robots. Front Neurorobot 2022; 16:867028. [PMID: 35845757] [PMCID: PMC9278879] [DOI: 10.3389/fnbot.2022.867028]
Abstract
The purpose of the study is to improve the performance of intelligent football training. Based on deep learning (DL), the training of football players and detection by football robots are analyzed. First, the research status of football player training and football robots is introduced, and the basic structure of the neuron model, convolutional neural networks (CNN), and the mainstream DL frameworks are expounded. Second, combined with the spatial stream network, a CNN-based action recognition system is constructed in the context of artificial intelligence (AI). Finally, for the football robot, a field line detection model based on a fully convolutional network (FCN) is proposed, and the applicability of the system is evaluated. The results demonstrate that the dual-stream network has the best recognition effect, reaching 92.8%. The recognition rate of the time-stream network is lower, with a maximum of 88%, and the spatial stream network has the lowest recognition rate at 86.5%. All four algorithms process the dataset more effectively than the ordinary video set. Among the comparison methods, the time-segmented dual-stream fusion network has the highest recognition rate, second only to the designed network; the basic dual-stream network reaches 88.6%, and the 3D CNN has the lowest rate at 86.2%. Under the intelligent training system, the recognition accuracies for jumping, kicking, grabbing, and starting actions are 97.6%, 94.5%, 92.5%, and 89.8%, respectively; the recognition accuracy for the passing action is 91.3%, and the maximum improvement rate from intelligent training is 25.7%. Both the pixel accuracy and the mean intersection over union (MIoU) of the improved field line detection model are increased by 5%. Intelligent training systems and field line detection by football robots are thus feasible. The research provides a reference for the development of AI in sports training.
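The field line detection results above are reported as pixel accuracy and mean intersection over union (MIoU). These metrics are standard and can be computed from predicted and ground-truth label maps as follows (the tiny masks are illustrative, not from the paper):

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return float((pred == gt).mean())

def mean_iou(pred, gt, num_classes):
    """Per-class IoU = |pred ∩ gt| / |pred ∪ gt|, averaged over classes
    that appear in either map."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 4x4 segmentation: class 1 = field line, class 0 = background
gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1],
               [0, 0, 0, 0],
               [0, 0, 0, 0]])
pred = np.array([[0, 0, 1, 1],
                 [0, 0, 0, 1],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])
print(pixel_accuracy(pred, gt), mean_iou(pred, gt, num_classes=2))
```

MIoU penalizes missed thin structures like field lines much more sharply than pixel accuracy does, which is why segmentation papers report both.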
Affiliation(s)
- Daliang Zhou
- School of PE, Nanjing Xiaozhuang University, Nanjing, China
- Gang Chen
- School of PE, Nanjing Xiaozhuang University, Nanjing, China
- Fei Xu
- School of Physical Education, Hangzhou Normal University, Hangzhou, China
11. The Design of the Lightweight Smart Home System and Interaction Experience of Products for Middle-Aged and Elderly Users in Smart Cities. Comput Intell Neurosci 2022; 2022:1279351. [PMID: 35755765] [PMCID: PMC9217567] [DOI: 10.1155/2022/1279351]
Abstract
The research aims to improve the comfort and safety of the smart home by adding a motion recognition algorithm to the smart home system. First, the research status of motion recognition is introduced. Second, based on the requirements of the smart home system, a smart home system is designed for middle-aged and elderly users. Its software includes an intelligent control subsystem, an intelligent monitoring subsystem, and an intelligent protection subsystem. Finally, to increase the security of the smart home, the intelligent monitoring subsystem is improved and an intelligent security subsystem based on a small-scale motion detection algorithm is proposed. The system uses three three-dimensional (3D) convolutional neural networks (CNNs) to extract three kinds of image features, so that the information in the video can be fully extracted. The performance of the proposed intelligent security subsystem is compared and analyzed. The results show that the accuracy of the system is 94.64% on the University of Central Florida (UCF101) dataset and 90.11% on the HMDB51 dataset, comparable to other advanced algorithms. Detecting, through motion recognition technology, whether dangers such as falls occur inside or outside the home has very important application significance for protecting people's safety, life, and health.
12. Xing Y, Zhu J, Li Y, Huang J, Song J. An improved spatial temporal graph convolutional network for robust skeleton-based action recognition. Appl Intell 2022. [DOI: 10.1007/s10489-022-03589-y]
13. Teng Y, Song C, Wu B. Toward jointly understanding social relationships and characters from videos. Appl Intell 2022. [DOI: 10.1007/s10489-021-02738-z]
14. Lee I, Kim D, Wee D, Lee S. An Efficient Human Instance-Guided Framework for Video Action Recognition. Sensors (Basel) 2021; 21:8309. [PMID: 34960404] [PMCID: PMC8709376] [DOI: 10.3390/s21248309]
Abstract
In recent years, human action recognition has been studied by many computer vision researchers. Recent studies have attempted to use two-stream networks based on appearance and motion features, but most of these approaches focused on clip-level video action recognition. In contrast to traditional methods, which generally use entire images, we propose a new human instance-level video action recognition framework. In this framework, we represent instance-level features using human boxes and keypoints, and our action region features are used as the inputs of the temporal action head network, which makes our framework more discriminative. We also propose novel temporal action head networks consisting of various modules that reflect various temporal dynamics well. In the experiments, the proposed models achieve performance comparable with state-of-the-art approaches on two challenging datasets. Furthermore, we evaluate the proposed features and networks to verify their effectiveness. Finally, we analyze the confusion matrix and visualize the recognized actions at the human instance level when several people are present.
Affiliation(s)
- Inwoong Lee
- Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Korea
- Clova AI Research, NAVER Corporation, Seongnam 13561, Korea
- Doyoung Kim
- Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Korea
- Dongyoon Wee
- Clova AI Research, NAVER Corporation, Seongnam 13561, Korea
- Sanghoon Lee
- Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Korea
- Department of Radiology, College of Medicine, Yonsei University, Seoul 03722, Korea
15. Al-Ali A, Elharrouss O, Qidwai U, Al-Maaddeed S. ANFIS-Net for automatic detection of COVID-19. Sci Rep 2021; 11:17318. [PMID: 34453082] [PMCID: PMC8397755] [DOI: 10.1038/s41598-021-96601-3]
Abstract
Infectious diseases are among the leading causes of mortality across the globe, and the latest, coronavirus (COVID-19), has become the most recent challenging issue. The extreme nature of this infectious virus and its ability to spread without control have made it mandatory to find an efficient auto-diagnosis system to assist people who work in contact with patients. As fuzzy logic is considered a powerful technique for modeling vagueness in medical practice, an Adaptive Neuro-Fuzzy Inference System (ANFIS) is proposed in this paper as the key component for automatic COVID-19 detection from chest X-ray images, based on characteristics derived by texture analysis using the gray level co-occurrence matrix (GLCM) technique. Unlike existing methods, especially deep learning-based approaches, the proposed ANFIS-based method can work on small datasets. The results showed promising accuracy: compared with other state-of-the-art techniques, the proposed method gives the same performance as deep learning models with complex architectures using many backbones.
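The feature-extraction step the abstract names, GLCM texture analysis, is well defined independently of the paper. A minimal sketch for one offset direction, with two classic Haralick-style statistics (contrast and energy); the tiny image and the gray-level count are illustrative:

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0):
    """Normalized gray level co-occurrence matrix for pixel pairs
    at offset (dx, dy)."""
    M = np.zeros((levels, levels))
    h, w = image.shape
    for y in range(h - dy):
        for x in range(w - dx):
            M[image[y, x], image[y + dy, x + dx]] += 1
    return M / M.sum()

def glcm_features(P):
    """Contrast and energy computed over a normalized GLCM."""
    i, j = np.indices(P.shape)
    contrast = float(((i - j) ** 2 * P).sum())  # weighted by gray-level distance
    energy = float((P ** 2).sum())              # uniformity of the texture
    return contrast, energy

# Toy 4x4 image quantized to 4 gray levels
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
P = glcm(img, levels=4)
print(glcm_features(P))
```

In practice several offsets and more statistics are concatenated into a feature vector, which is what a classifier like ANFIS would then consume.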
Affiliation(s)
- Afnan Al-Ali
- Department of Computer Science and Engineering, Qatar University, Doha, Qatar.
- Omar Elharrouss
- Department of Computer Science and Engineering, Qatar University, Doha, Qatar
- Uvais Qidwai
- Department of Computer Science and Engineering, Qatar University, Doha, Qatar
- Somaya Al-Maaddeed
- Department of Computer Science and Engineering, Qatar University, Doha, Qatar
16. Applications, databases and open computer vision research from drone videos and images: a survey. Artif Intell Rev 2021. [DOI: 10.1007/s10462-020-09943-1]