1. Kwon J, Hwang J, Sung JE, Im CH. Speech synthesis from three-axis accelerometer signals using conformer-based deep neural network. Comput Biol Med 2024;182:109090. PMID: 39232406. DOI: 10.1016/j.compbiomed.2024.109090.
Abstract
Silent speech interfaces (SSIs) have emerged as innovative non-acoustic communication methods, and our previous study demonstrated the significant potential of three-axis accelerometer-based SSIs to identify silently spoken words with high classification accuracy. The developed accelerometer-based SSI with only four accelerometers and a small training dataset outperformed a conventional surface electromyography (sEMG)-based SSI. In this study, motivated by the promising initial results, we investigated the feasibility of synthesizing spoken speech from three-axis accelerometer signals. This exploration aimed to assess the potential of accelerometer-based SSIs for practical silent communication applications. Nineteen healthy individuals participated in our experiments. Five accelerometers were attached to the face to acquire speech-related facial movements while the participants read 270 Korean sentences aloud. For the speech synthesis, we used a convolution-augmented Transformer (Conformer)-based deep neural network model to convert the accelerometer signals into a Mel spectrogram, from which an audio waveform was synthesized using HiFi-GAN. To evaluate the quality of the generated Mel spectrograms, ten-fold cross-validation was performed, and the Mel cepstral distortion (MCD) was chosen as the evaluation metric. As a result, an average MCD of 5.03 ± 0.65 was achieved using four optimized accelerometers based on our previous study. Furthermore, the quality of generated Mel spectrograms was significantly enhanced by adding one more accelerometer attached under the chin, achieving an average MCD of 4.86 ± 0.65 (p < 0.001, Wilcoxon signed-rank test). Although an objective comparison is difficult, these results surpass those obtained using conventional SSIs based on sEMG, electromagnetic articulography, and electropalatography with the fewest sensors and a similar or smaller number of sentences to train the model. Our proposed approach will contribute to the widespread adoption of accelerometer-based SSIs, leveraging the advantages of accelerometers like low power consumption, invulnerability to physiological artifacts, and high portability.
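As a concrete note on the evaluation metric: Mel cepstral distortion is typically computed as below. This is a minimal sketch of the standard MCD formulation, not necessarily the authors' exact variant; the function name and array layout are illustrative.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """Frame-averaged MCD in dB between two time-aligned mel-cepstral
    sequences of shape (frames, coeffs); the 0th (energy) coefficient
    is conventionally dropped before calling."""
    assert ref_mcep.shape == syn_mcep.shape
    diff = ref_mcep - syn_mcep
    # MCD = (10 / ln 10) * sqrt(2 * sum_d (c_ref,d - c_syn,d)^2), per frame
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float((10.0 / np.log(10.0)) * per_frame.mean())
```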
Affiliation(s)
- Jinuk Kwon: Department of Electronic Engineering, Hanyang University, Seoul, South Korea
- Jihun Hwang: Department of Electronic Engineering, Hanyang University, Seoul, South Korea
- Jee Eun Sung: Department of Communication Disorders, Ewha Womans University, Seoul, South Korea
- Chang-Hwan Im: Department of Electronic Engineering, Hanyang University, Seoul, South Korea; Department of Biomedical Engineering, Hanyang University, Seoul, South Korea; Department of Artificial Intelligence, Hanyang University, Seoul, South Korea; Department of HY-KIST Bio-Convergence, Hanyang University, Seoul, South Korea
2. Cao H, Xu Y, Mao K, Xie L, Yin J, See S, Xu Q, Yang J. Self-Supervised Video Representation Learning by Video Incoherence Detection. IEEE Trans Cybern 2024;54:3810-3822. PMID: 37079425. DOI: 10.1109/tcyb.2023.3265393.
Abstract
This article introduces a novel self-supervised method that leverages incoherence detection for video representation learning. It stems from the observation that the human visual system can easily identify video incoherence based on a comprehensive understanding of videos. Specifically, we construct incoherent clips from multiple subclips hierarchically sampled from the same raw video, with various lengths of incoherence. The network is trained to learn a high-level representation by predicting the location and length of the incoherence given the incoherent clip as input. Additionally, we introduce intra-video contrastive learning to maximize the mutual information between incoherent clips from the same raw video. We evaluate our proposed method through extensive experiments on action recognition and video retrieval using various backbone networks. Experiments show that our proposed method achieves remarkable performance across different backbone networks and different datasets compared with previous coherence-based methods.
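The pretext task above can be made concrete with a small sampling routine. The sketch below, with illustrative parameter names, shows one way to build a clip containing a single temporal jump together with its location/length targets; the paper's hierarchical multi-subclip sampling is more elaborate.

```python
import random

def make_incoherent_clip(num_frames: int, clip_len: int, max_skip: int):
    """Sample frame indices for a clip with one temporal discontinuity.

    Returns the indices plus the two pretext targets: the position inside
    the clip where the incoherence occurs and the number of skipped
    frames (its length). Assumes num_frames >= clip_len + max_skip.
    """
    skip = random.randint(1, max_skip)                  # incoherence length
    start = random.randint(0, num_frames - clip_len - skip)
    loc = random.randint(1, clip_len - 1)               # incoherence location
    first = list(range(start, start + loc))
    second = list(range(start + loc + skip, start + clip_len + skip))
    return first + second, loc, skip

indices, loc, skip = make_incoherent_clip(num_frames=300, clip_len=16, max_skip=32)
print(len(indices), loc, skip)  # 16 frames, jump position, jump length
```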
3. Chen T, Hong R, Guo Y, Hao S, Hu B. MS²-GNN: Exploring GNN-Based Multimodal Fusion Network for Depression Detection. IEEE Trans Cybern 2023;53:7749-7759. PMID: 36194716. DOI: 10.1109/tcyb.2022.3197127.
Abstract
Major depressive disorder (MDD) is one of the most common and severe mental illnesses, posing a huge burden on society and families. Recently, some multimodal methods have been proposed to learn a multimodal embedding for MDD detection and achieved promising performance. However, these methods ignore the heterogeneity/homogeneity among various modalities. Besides, earlier attempts ignore interclass separability and intraclass compactness. Inspired by the above observations, we propose a graph neural network (GNN)-based multimodal fusion strategy named modal-shared modal-specific GNN, which investigates the heterogeneity/homogeneity among various psychophysiological modalities as well as explores the potential relationship between subjects. Specifically, we develop a modal-shared and modal-specific GNN architecture to extract the inter/intramodal characteristics. Furthermore, a reconstruction network is employed to ensure fidelity within the individual modality. Moreover, we impose an attention mechanism on various embeddings to obtain a multimodal compact representation for the subsequent MDD detection task. We conduct extensive experiments on two public depression datasets and the favorable results demonstrate the effectiveness of the proposed algorithm.
4. Wang Q, Huang W, Zhang X, Li X. GLCM: Global-Local Captioning Model for Remote Sensing Image Captioning. IEEE Trans Cybern 2023;53:6910-6922. PMID: 36446004. DOI: 10.1109/tcyb.2022.3222606.
Abstract
Remote sensing image captioning (RSIC), which describes a remote sensing image with a semantically related sentence, has been a cross-modal challenge between computer vision and natural language processing. For visual features extracted from remote sensing images, global features provide the complete and comprehensive visual relevance of all the words of a sentence simultaneously, while local features can emphasize the discrimination of these words individually. Therefore, not only are global features important for caption generation, but local features are also meaningful for making the words more discriminative. In order to make full use of the advantages of both global and local features, in this article we propose an attention-based global-local captioning model (GLCM) to obtain a global-local visual feature representation for RSIC. Based on the proposed GLCM, the correlation of all the generated words and the relation of each separate word to the most related local visual features can be visualized in a similarity-based manner, which provides more interpretability for RSIC. In extensive experiments, our method achieves comparable results on UCM-captions and superior results on Sydney-captions and RSICD, the largest RSIC dataset.
5. Wu C, Zhang Y, Nie S, Hong D, Zhu J, Chen Z, Liu B, Liu H, Yang Q, Li H, Xu G, Weng J, Kong Y, Wan Q, Zha Y, Chen C, Xu H, Hu Y, Shi Y, Zhou Y, Su G, Tang Y, Gong M, Wang L, Hou F, Liu Y, Li G. Predicting in-hospital outcomes of patients with acute kidney injury. Nat Commun 2023;14:3739. PMID: 37349292. PMCID: PMC10287760. DOI: 10.1038/s41467-023-39474-6.
Abstract
Acute kidney injury (AKI) is prevalent and a leading cause of in-hospital death worldwide. Early prediction of AKI-related clinical events and timely intervention for high-risk patients could improve outcomes. We develop a deep learning model based on a nationwide multicenter cooperative network across China that includes 7,084,339 hospitalized patients, to dynamically predict the risk of in-hospital death (primary outcome) and dialysis (secondary outcome) for patients who developed AKI during hospitalization. A total of 137,084 eligible patients with AKI constitute the analysis set. In the derivation cohort, the areas under the receiver operating characteristic curve (AUROC) for 24-h, 48-h, 72-h, and 7-day death are 95.05%, 94.23%, 93.53%, and 93.09%, respectively. For the dialysis outcome, the AUROCs for the same time spans are 88.32%, 83.31%, 83.20%, and 77.99%, respectively. The predictive performance is consistent in both internal and external validation cohorts. The model can predict important outcomes of patients with AKI, which could be helpful for the early management of AKI.
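For readers unfamiliar with horizon-wise evaluation: the sketch below shows how AUROC can be computed per prediction horizon with scikit-learn. The data are synthetic placeholders and the variable layout is hypothetical, not the authors' pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic stand-in: each row is one prediction time point with the
# model's risk score and binary outcome labels for each horizon.
rng = np.random.default_rng(0)
scores = rng.random(1000)
labels = {h: (rng.random(1000) < 0.1).astype(int)
          for h in ("24h", "48h", "72h", "7d")}

for horizon, y in labels.items():
    # With real data, scores come from the model at the prediction time
    # and y indicates whether the event occurred within the horizon.
    print(horizon, f"AUROC = {roc_auc_score(y, scores):.4f}")
```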
Affiliation(s)
- Changwei Wu: Department of Nephrology and Nephrology Institute, Sichuan Provincial People's Hospital, School of Medicine, University of Electronic Science and Technology of China, 610072, Chengdu, China
- Yun Zhang: Knowledge and Data Engineering Laboratory of Chinese Medicine, School of Information and Software Engineering, University of Electronic Science and Technology of China, 610054, Chengdu, China
- Sheng Nie: National Clinical Research Center for Kidney Disease, State Laboratory of Organ Failure Research, Division of Nephrology, Nanfang Hospital, Southern Medical University, 510515, Guangzhou, China
- Daqing Hong: Department of Nephrology and Nephrology Institute, Sichuan Provincial People's Hospital, School of Medicine, University of Electronic Science and Technology of China, 610072, Chengdu, China
- Jiajing Zhu: Knowledge and Data Engineering Laboratory of Chinese Medicine, School of Information and Software Engineering, University of Electronic Science and Technology of China, 610054, Chengdu, China
- Zhi Chen: Knowledge and Data Engineering Laboratory of Chinese Medicine, School of Information and Software Engineering, University of Electronic Science and Technology of China, 610054, Chengdu, China
- Bicheng Liu: Institute of Nephrology, Zhongda Hospital, Southeast University School of Medicine, 210000, Nanjing, China
- Huafeng Liu: Key Laboratory of Prevention and Management of Chronic Kidney Disease of Zhanjiang City, Institute of Nephrology, Affiliated Hospital of Guangdong Medical University, 524000, Zhanjiang, China
- Qiongqiong Yang: Department of Nephrology, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, 510515, Guangzhou, China
- Hua Li: Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, 310000, Hangzhou, China
- Gang Xu: Division of Nephrology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, 430000, Wuhan, China
- Jianping Weng: Department of Endocrinology, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, 230000, Hefei, China
- Yaozhong Kong: Department of Nephrology, the First People's Hospital of Foshan, 528000, Foshan, China
- Qijun Wan: The Second People's Hospital of Shenzhen, Shenzhen University, 518000, Shenzhen, China
- Yan Zha: Guizhou Provincial People's Hospital, Guizhou University, 550000, Guiyang, China
- Chunbo Chen: Department of Critical Care Medicine, Maoming People's Hospital, 525000, Maoming, China
- Hong Xu: Children's Hospital of Fudan University, 200000, Shanghai, China
- Ying Hu: The Second Affiliated Hospital of Zhejiang University School of Medicine, 310000, Hangzhou, China
- Yongjun Shi: Huizhou Municipal Central Hospital, Sun Yat-Sen University, 516000, Huizhou, China
- Yilun Zhou: Department of Nephrology, Beijing Tiantan Hospital, Capital Medical University, 100000, Beijing, China
- Guobin Su: Department of Nephrology, Guangdong Provincial Hospital of Chinese Medicine, The Second Affiliated Hospital, The Second Clinical College, Guangzhou University of Chinese Medicine, 510000, Guangzhou, China
- Ying Tang: The Third Affiliated Hospital of Southern Medical University, 510000, Guangzhou, China
- Mengchun Gong: Institute of Health Management, Southern Medical University, 510000, Guangzhou, China; DHC Technologies, 100000, Beijing, China
- Li Wang: Department of Nephrology and Nephrology Institute, Sichuan Provincial People's Hospital, School of Medicine, University of Electronic Science and Technology of China, 610072, Chengdu, China
- Fanfan Hou: National Clinical Research Center for Kidney Disease, State Laboratory of Organ Failure Research, Division of Nephrology, Nanfang Hospital, Southern Medical University, 510515, Guangzhou, China
- Yongguo Liu: Knowledge and Data Engineering Laboratory of Chinese Medicine, School of Information and Software Engineering, University of Electronic Science and Technology of China, 610054, Chengdu, China
- Guisen Li: Department of Nephrology and Nephrology Institute, Sichuan Provincial People's Hospital, School of Medicine, University of Electronic Science and Technology of China, 610072, Chengdu, China
6. Gao X, Yan L, Wang G, Gerada C. Hybrid Recurrent Neural Network Architecture-Based Intention Recognition for Human-Robot Collaboration. IEEE Trans Cybern 2023;53:1578-1586. PMID: 34637387. DOI: 10.1109/tcyb.2021.3106543.
Abstract
Human-robot collaboration requires the robot to proactively and intelligently recognize the intention of the human operator. Although deep learning approaches have achieved certain results in feature learning and in modeling long-term temporal dependencies, motion prediction is still not accurate enough, which unavoidably compromises the accomplishment of tasks. Therefore, a hybrid recurrent neural network architecture is proposed for intention recognition, so that assembly tasks can be conducted cooperatively. Specifically, improved LSTM (ILSTM) and improved Bi-LSTM (IBi-LSTM) networks are first explored with state activation and gate activation functions to improve network performance. The employment of IBi-LSTM units in the first layers of the hybrid architecture helps to learn features effectively and fully from complex sequential data, while the LSTM-based cell in the last layer contributes to capturing the forward dependency. This hybrid network architecture effectively improves the prediction performance of intention recognition. An experimental platform with a UR5 collaborative robot and a human motion capture device is set up to test the performance of the proposed method. A quartile-based amplitude-limiting filter operating in a sliding window is designed to handle abnormal samples in the spatiotemporal data and thus improve the accuracy of network training and testing. The experimental results show that the hybrid network predicts the motion of the human operator in the collaborative workspace more precisely than representative deep learning methods.
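A minimal sketch of the layer arrangement described above, using stock PyTorch cells. The paper's ILSTM/IBi-LSTM units modify the state and gate activation functions internally, which standard nn.LSTM does not expose, so this only illustrates the Bi-LSTM-then-LSTM structure; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class HybridRecurrentNet(nn.Module):
    """Bi-LSTM feature layers followed by a unidirectional LSTM layer."""
    def __init__(self, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, x):               # x: (batch, time, in_dim)
        h, _ = self.bilstm(x)           # learn features from both directions
        h, _ = self.lstm(h)             # capture the forward dependency
        return self.head(h[:, -1])      # predict the next motion frame

model = HybridRecurrentNet(in_dim=21, hidden=64, out_dim=21)  # e.g., 7 joints x 3-D
print(model(torch.randn(8, 30, 21)).shape)  # torch.Size([8, 21])
```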
7. Shalini R, Gopi VP. Deep learning approaches based improved light weight U-Net with attention module for optic disc segmentation. Phys Eng Sci Med 2022;45:1111-1122. PMID: 36094722. DOI: 10.1007/s13246-022-01178-4.
Abstract
Glaucoma is a major cause of blindness worldwide, and its early detection is essential for the timely management of the condition. Glaucoma-induced anomalies of the optic nerve head may cause variation in the Optic Disc (OD) size. Therefore, robust OD segmentation techniques are necessary for glaucoma screening. Computer-aided segmentation has become a promising diagnostic tool for the early detection of glaucoma, and there has been much interest in recent years in using neural networks for medical image segmentation. This study proposed an enhanced lightweight U-Net model with an Attention Gate (AG) to segment OD images. We also used a transfer-learning strategy, extracting relevant features with a pre-trained EfficientNet-B0 CNN that preserves the receptive field size, while the AG reduces the impact of vanishing gradients and overfitting. Additionally, training the network with the binary focal loss function improved segmentation accuracy. The pre-trained Attention U-Net was validated on publicly available datasets, namely DRIONS-DB, DRISHTI-GS, and MESSIDOR. The model significantly reduced parameter quantity by around 0.53 M and had inference times of 40.3 ms, 44.2 ms, and 60.6 ms on the three datasets, respectively.
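The binary focal loss mentioned above is a standard, well-documented objective. Below is a minimal PyTorch sketch with the commonly used alpha/gamma defaults, which may differ from the authors' settings.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                      alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss (Lin et al., 2017): down-weights easy examples so
    training focuses on hard, misclassified pixels."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Example with a hypothetical 64x64 single-channel segmentation output:
loss = binary_focal_loss(torch.randn(4, 1, 64, 64),
                         torch.randint(0, 2, (4, 1, 64, 64)).float())
print(float(loss))
```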
Affiliation(s)
- R Shalini: Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, Tamilnadu, 620015, India
- Varun P Gopi: Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, Tamilnadu, 620015, India
8. Shahir Zaoad M, Rushadul Mannan M, Mandol AB, Rahman M, Adnanul Islam M, Mahbubur Rahman M. An Attention-Based Hybrid Deep Learning Approach For Bengali Video Captioning. J King Saud Univ Comput Inf Sci 2022. DOI: 10.1016/j.jksuci.2022.11.015.
9. Zhao Y. Risk Prediction for Internet Financial Enterprises by Deep Learning Algorithm and Sustainable Development of Business Transformation. J Glob Inf Manag 2022. DOI: 10.4018/jgim.300741.
Abstract
New ideas are needed for the business transformation of traditional financial enterprises against the background of Internet finance. Based on deep learning (DL), a back-propagation neural network (BPNN) model and a vector autoregression model are used to analyze the business conflict between commercial banks, as representatives of traditional financial enterprises, and Internet finance. The business integration point of the two is found through impulse-response analysis of the impact of Internet financial business on the traditional financial industry. Then, a BPNN-based DL algorithm is used to obtain the optimal solution for business integration, so as to promote the transformation of traditional financial services in the context of Internet finance. The results show a close correlation between Internet finance and traditional financial business. The initial conflicts between the two are serious, but over time they show a trend of mutual integration.
10. Chen D, Chen L, Zhang Y, Wen B, Yang C. A Multiscale Interactive Recurrent Network for Time-Series Forecasting. IEEE Trans Cybern 2022;52:8793-8803. PMID: 33710967. DOI: 10.1109/tcyb.2021.3055951.
Abstract
Time-series forecasting is a key component in the automation and optimization of intelligent applications. It is not a trivial task, as there are various short-term and/or long-term temporal dependencies. Multiscale modeling has been considered as a promising strategy to solve this problem. However, the existing multiscale models either apply an implicit way to model the temporal dependencies or ignore the interrelationships between multiscale subseries. In this article, we propose a multiscale interactive recurrent network (MiRNN) to jointly capture multiscale patterns. MiRNN employs a deep wavelet decomposition network to decompose the raw time series into multiscale subseries. MiRNN introduces three key strategies (truncation, initialization, and message passing) to model the inherent interrelationships between multiscale subseries, as well as a dual-stage attention mechanism to capture multiscale temporal dependencies. Experiments on four real-world datasets demonstrate that our model achieves promising performance compared with the state-of-the-art methods.
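To illustrate the multiscale split that MiRNN builds on: the sketch below uses a fixed discrete wavelet transform (PyWavelets) to decompose a series into per-scale subseries. The paper instead learns the decomposition with a deep wavelet decomposition network, so this is only an approximation of the idea; wavelet choice and names are illustrative.

```python
import numpy as np
import pywt

def multiscale_subseries(x: np.ndarray, wavelet: str = "db4", level: int = 3):
    """Decompose a 1-D series into per-scale components that sum back
    (approximately) to the original signal."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    parts = []
    for i in range(len(coeffs)):
        # Keep one coefficient band, zero the rest, and reconstruct.
        kept = [c if j == i else np.zeros_like(c) for j, c in enumerate(coeffs)]
        parts.append(pywt.waverec(kept, wavelet)[: len(x)])
    return parts  # [approximation, coarse detail, ..., finest detail]

series = np.sin(np.linspace(0, 20 * np.pi, 512)) + 0.1 * np.random.randn(512)
subseries = multiscale_subseries(series)
print(len(subseries), subseries[0].shape)  # 4 (512,)
```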
11. Bu K, Liu Y, Wang F. Operating performance assessment based on multi-source heterogeneous information with deep learning for smelting process of electro-fused magnesium furnace. ISA Trans 2022;128:357-371. PMID: 34776227. DOI: 10.1016/j.isatra.2021.10.024.
Abstract
Operating performance assessment is critical for the smelting process of electro-fused magnesium furnaces to improve the quality of the magnesia product and pursue the best overall economic benefit. This paper proposes a new multi-source heterogeneous information deep feature fusion (MSHIDFF) method to achieve more accurate operating performance assessment in the electro-fused magnesium smelting process. First, we utilize a convolutional neural network, a bidirectional long short-term memory network, and a stacked auto-encoder to extract deep features from the raw image, sound, and current signals of different performance grades. The multi-source deep features are then fused, and softmax regression with an attention mechanism is employed to train a neural-network classifier on the fused deep features of the different performance grades. The simulation results show that the proposed MSHIDFF method obtains superior assessment accuracy.
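A minimal sketch of attention-based fusion over per-modality embeddings, assuming the image/sound/current features have already been projected to a common dimension; module and parameter names are illustrative, not the authors'.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Scores each modality embedding and classifies the softmax-weighted sum."""
    def __init__(self, dim: int, n_classes: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.classify = nn.Linear(dim, n_classes)

    def forward(self, feats):                        # (batch, n_modalities, dim)
        w = torch.softmax(self.score(feats), dim=1)  # per-modality weights
        fused = (w * feats).sum(dim=1)               # compact fused representation
        return self.classify(fused)

fusion = AttentionFusion(dim=128, n_classes=4)       # e.g., 4 performance grades
logits = fusion(torch.randn(16, 3, 128))             # image / sound / current features
print(logits.shape)                                  # torch.Size([16, 4])
```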
Affiliation(s)
- Kaiqing Bu: College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110819, China
- Yan Liu: College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110819, China
- Fuli Wang: State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, Liaoning, 110819, China; College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110819, China
12. Wang Q, Liu F, Zhao X, Tan Q. A CTR prediction model based on session interest. PLoS One 2022;17:e0273048. PMID: 35976962. PMCID: PMC9385038. DOI: 10.1371/journal.pone.0273048.
Abstract
Click-through rate prediction has become a hot research direction in the field of advertising, and it is important to build an effective CTR prediction model. However, most existing models ignore the fact that a user's behavior sequence is composed of sessions, and that user behaviors are highly correlated within each session but not across sessions. In this paper, we focus on users' multiple session interests and propose a hierarchical model based on session interest (SIHM) for CTR prediction. First, we divide the user's sequential behavior into sessions. Then, we employ a self-attention network to obtain an accurate expression of interest for each session. Since different session interests may be related to each other or follow a sequential pattern, we next utilize a bidirectional long short-term memory network (BLSTM) to capture the interactions of different session interests. Finally, an attention-based LSTM (A-LSTM) is used to aggregate the session interests with respect to the target ad and find the influence of each session interest. Experimental results show that the model performs better than other models.
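The session-division step can be illustrated with a simple gap-based splitter; the 30-minute threshold below is a common heuristic assumption, not something stated in the abstract.

```python
def split_sessions(timestamps, gap_seconds=1800):
    """Split an ordered click sequence into sessions: a new session starts
    whenever the time between consecutive behaviors exceeds the gap."""
    sessions, current = [], [0]
    for i in range(1, len(timestamps)):
        if timestamps[i] - timestamps[i - 1] > gap_seconds:
            sessions.append(current)
            current = []
        current.append(i)
    sessions.append(current)
    return sessions  # lists of behavior indices, one list per session

print(split_sessions([0, 60, 120, 7200, 7230, 20000]))
# [[0, 1, 2], [3, 4], [5]]
```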
Affiliation(s)
- Fang’ai Liu: Shandong Normal University, Jinan, China
13. Zhou X, Feng J, Wang J, Pan J. Privacy-preserving household load forecasting based on non-intrusive load monitoring: A federated deep learning approach. PeerJ Comput Sci 2022;8:e1049. PMID: 36092014. PMCID: PMC9455055. DOI: 10.7717/peerj-cs.1049.
Abstract
Load forecasting is essential in the analysis and grid planning of power systems. For this reason, we propose a household load forecasting method based on federated deep learning and non-intrusive load monitoring (NILM). To our knowledge, this is the first research on federated learning (FL) for household load forecasting based on NILM. In this method, the aggregate power is decomposed into individual device power by non-intrusive load monitoring, and the power of each appliance is predicted separately using a federated deep learning model. Finally, the predicted power values of the individual appliances are aggregated to form the total power prediction. Predicting each electrical device separately avoids the error caused by the strong time dependence in the power signal of a single device. In the federated deep learning prediction model, the households that own the power data share the parameters of their local models instead of the local power data, guaranteeing the privacy of household user data. The case results demonstrate that the proposed approach provides a better prediction than the traditional methodology that directly predicts the aggregated signal as a whole. In addition, experiments in various federated learning environments are designed and implemented to verify the validity of this methodology.
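Parameter sharing without data sharing is typically realized by server-side averaging of client weights. The sketch below shows a FedAvg-style weighted average; the abstract does not name the exact aggregation rule, so this is an assumed choice, and all names are illustrative.

```python
import torch

def fed_avg(client_states, client_sizes):
    """Aggregate client model state_dicts by a data-size-weighted average,
    so only parameters (never raw power data) leave each household."""
    total = float(sum(client_sizes))
    avg = {}
    for key in client_states[0]:
        # .float() keeps the weighted sum well-defined for integer buffers too.
        avg[key] = sum(state[key].float() * (n / total)
                       for state, n in zip(client_states, client_sizes))
    return avg  # load into the global model with model.load_state_dict(avg)
```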
Affiliation(s)
- Xinxin Zhou: School of Computer Science, Northeast Electric Power University, Jilin, China
- Jingru Feng: School of Computer Science, Northeast Electric Power University, Jilin, China
- Jian Wang: School of Computer Science, Northeast Electric Power University, Jilin, China
- Jianhong Pan: State Grid Jilin Electric Power Company Limited, Changchun, China
14. Cheng J, Wang L, Wu J, Hu X, Jeon G, Tao D, Zhou M. Visual Relationship Detection: A Survey. IEEE Trans Cybern 2022;52:8453-8466. PMID: 35077387. DOI: 10.1109/tcyb.2022.3142013.
Abstract
Visual relationship detection (VRD) is a newly developed computer vision task that aims to recognize relations or interactions between objects in an image. It is a further learning task after object recognition and is important for fully understanding images and, indeed, the visual world. It has numerous applications, such as image retrieval, machine vision in robotics, visual question answering (VQA), and visual reasoning. However, this problem is difficult, since relationships are not definite and the number of possible relations is much larger than that of objects, so completely annotating visual relationships is much harder, making this task hard to learn. Many approaches have been proposed to tackle this problem, especially with the development of deep neural networks in recent years. In this survey, we first introduce the background of visual relations. Then, we present a categorization and frameworks of deep learning models for visual relationship detection. High-level applications, benchmark datasets, and an empirical analysis are also introduced for a comprehensive understanding of this task.
15. Huo L, Bai L, Zhou SM. Automatically Generating Natural Language Descriptions of Images by a Deep Hierarchical Framework. IEEE Trans Cybern 2022;52:7441-7452. PMID: 33400668. DOI: 10.1109/tcyb.2020.3041595.
Abstract
Automatically generating an accurate and meaningful description of an image is very challenging. However, the recent scheme of generating an image caption by maximizing the likelihood of target sentences lacks the capacity to recognize human-object interactions (HOIs) and the semantic relationships between HOIs and scenes, which are essential parts of an image caption. This article proposes a novel two-phase framework, comprising 1) a hybrid deep-learning phase and 2) an image-description-generation phase, to generate image captions while addressing the above challenges. In the hybrid deep-learning phase, a novel factored three-way interaction machine is proposed to learn the relational features of human-object pairs hierarchically. In this way, the image recognition problem is transformed into a latent structured-labeling task. In the image-description-generation phase, a lexicalized probabilistic context-free tree-growing scheme is innovatively integrated with a description generator to transform the description-generation task into a syntactic-tree generation process. Extensively comparing state-of-the-art image captioning methods on benchmark datasets, we demonstrate that our proposed framework outperforms existing captioning methods, for example by significantly improving the prediction of HOIs and of relationships between HOIs and scenes (RHIS), and by improving the quality of generated image captions in a semantically and structurally coherent manner.
16. Comparison between Physical and Empirical Methods for Simulating Surface Brightness Temperature Time Series. Remote Sens 2022. DOI: 10.3390/rs14143385.
Abstract
Land surface temperature (LST) is a vital parameter in the surface energy budget and water cycle. One of the most important foundations for LST studies is a theory of how to model LST with its various influencing factors, such as canopy structure, solar radiation, and atmospheric conditions. Both physics-based and empirical methods have been widely applied, yet few studies have compared the two categories of methods. In this paper, a physics-based method, the soil canopy observation of photochemistry and energy fluxes (SCOPE) model, and two empirical methods, random forest (RF) and long short-term memory (LSTM), were selected as representatives for comparison. Based on a series of measurements from meteorological stations in the Heihe River Basin, these methods were evaluated along different dimensions, i.e., differences within the same surface type, between different years, and between different climate types. The comparison indicates a relatively stable performance of SCOPE, with a root mean square error (RMSE) of approximately 2.0 K regardless of surface type and year, but it requires many inputs and a high computational cost. The empirical methods performed relatively well on cases within the same surface type and on changes in temporal scale, with an RMSE of approximately 1.50 K, yet they were less compatible across different climate types. Although their overall accuracy is not as stable as that of the physics-based method, they have the advantages of fast calculation and requiring little knowledge of the internal structure of the model.
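As an illustration of the empirical branch of this comparison: a random-forest regressor evaluated by RMSE. The feature set and data below are synthetic stand-ins, not the Heihe River Basin measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for station data: rows of meteorological drivers
# (e.g., radiation, air temperature) and an LST target in kelvin.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
y = 290 + 10 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=1.5, size=2000)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:1500], y[:1500])                      # simple train/test split
rmse = mean_squared_error(y[1500:], model.predict(X[1500:])) ** 0.5
print(f"RMSE = {rmse:.2f} K")
```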
17. An J. Wearable Device-Based Data Collection and Feature Analysis Method for Outdoor Sports. Int J Distrib Syst Technol 2022. DOI: 10.4018/ijdst.307992.
Abstract
In recent years, with the rapid popularization of smartphones and wearable smart devices, it is no longer difficult to obtain large amounts of human motion data related to people's heart rate and geographical location. This has spawned a series of running fitness applications, led to a nationwide running wave, and promoted the rapid development of the sports industry. Based on a long short-term memory recurrent neural network, this paper processes, identifies, and analyzes the motion data collected by wearable devices. Through training on massive data, a set of accurate auxiliary models for outdoor sports is obtained to help optimize and improve the effect of outdoor sports. The results show that the proposed method achieves a higher degree of sports action and feature recognition and can better assist in the completion of outdoor sports.
18. Liu Y, Zhang X, Zhao Z, Zhang B, Cheng L, Li Z. ALSA: Adversarial Learning of Supervised Attentions for Visual Question Answering. IEEE Trans Cybern 2022;52:4520-4533. PMID: 33175690. DOI: 10.1109/tcyb.2020.3029423.
Abstract
Visual question answering (VQA) has gained increasing attention in both natural language processing and computer vision. The attention mechanism plays a crucial role in relating the question to meaningful image regions for answer inference. However, most existing VQA methods 1) learn the attention distribution either from free-form regions or from detection boxes in the image, which makes it intractable to answer questions about the foreground object and the background form, respectively, and 2) neglect the prior knowledge of human attention, learning the attention distribution with an unguided strategy. To fully exploit the advantages of attention, the learned attention distribution should focus more on the question-related image regions, as human attention does, for questions about both foreground objects and background forms. To achieve this, this article proposes a novel VQA model, called adversarial learning of supervised attentions (ALSA). Specifically, two supervised attention modules, 1) free-form-based and 2) detection-based, are designed to exploit the prior knowledge for attention-distribution learning. To effectively learn the correlations between the question and the image from different views, that is, free-form regions and detection boxes, an adversarial learning mechanism is implemented as an interplay between the two supervised attention modules. The adversarial learning reinforces the two attention modules mutually, making the learned multiview features more effective for answer inference. The experiments performed on three commonly used VQA datasets confirm the favorable performance of ALSA.
19. Wang X, Li T, Cheng Y, Chen CLP. Inference-Based Posteriori Parameter Distribution Optimization. IEEE Trans Cybern 2022;52:3006-3017. PMID: 33027029. DOI: 10.1109/tcyb.2020.3023127.
Abstract
Encouraging the agent to explore has always been an important and challenging topic in the field of reinforcement learning (RL). Distributional representation of network parameters or value functions is usually an effective way to improve the exploration ability of an RL agent. However, directly changing the representation of network parameters from fixed values to function distributions may cause algorithm instability and low learning efficiency. Therefore, to accelerate and stabilize parameter distribution learning, a novel inference-based posteriori parameter distribution optimization (IPPDO) algorithm is proposed. From the perspective of solving the evidence lower bound of the probability, we design inference-based objective functions for parameter-distribution optimization in continuous-action and discrete-action tasks, respectively. To alleviate overestimation of the value function, we use multiple neural networks to estimate value functions with Retrace, and the smaller estimate participates in the network parameter update; thus, the network parameter distribution can be learned. After that, we design a method for sampling weights from the network parameter distribution by adding an activation function to the standard deviation of the parameter distribution, which achieves adaptive adjustment between fixed values and distributions. Furthermore, IPPDO is an off-policy deep RL (DRL) algorithm, which means that it can effectively improve data efficiency by using off-policy techniques such as experience replay. We compare IPPDO with other prevailing DRL algorithms on the OpenAI Gym and MuJoCo platforms. Experiments on both continuous-action and discrete-action tasks indicate that IPPDO can explore more of the action space, obtain higher rewards faster, and ensure algorithm stability.
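The adaptive fixed-value-to-distribution mechanism can be sketched as a reparameterized draw in which an activation keeps the standard deviation positive; softplus is an assumed choice here, since the abstract does not name the activation.

```python
import torch
import torch.nn.functional as F

def sample_weight(mu: torch.Tensor, rho: torch.Tensor) -> torch.Tensor:
    """Reparameterized draw from a parameter distribution: softplus keeps
    the standard deviation positive, and driving rho toward -inf recovers
    a fixed-value parameter (sigma -> 0)."""
    sigma = F.softplus(rho)
    eps = torch.randn_like(mu)
    return mu + sigma * eps  # differentiable w.r.t. both mu and rho

w = sample_weight(torch.zeros(4, 3), torch.full((4, 3), -3.0))
print(w.shape)  # torch.Size([4, 3])
```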
20. Ye X, Huang Y, Lu Q. Automatic Multichannel Electrocardiogram Record Classification Using XGBoost Fusion Model. Front Physiol 2022;13:840011. PMID: 35492618. PMCID: PMC9049587. DOI: 10.3389/fphys.2022.840011.
Abstract
There is an increasing demand for automatic classification of standard 12-lead electrocardiogram signals in the medical field. Considering that different channels and temporal segments of a feature map extracted from a 12-lead electrocardiogram record contribute differently to cardiac arrhythmia detection and classification performance, we propose a 12-lead electrocardiogram signal automatic classification model based on model fusion (CBi-DF-XGBoost) that focuses on representative features along both the spatial and temporal axes. The algorithm extracts local features through a convolutional neural network and then extracts temporal features through a bi-directional long short-term memory network. Finally, eXtreme Gradient Boosting (XGBoost) is used to fuse the 12 single-lead models and domain-specific features to obtain the classification results. The 5-fold cross-validation results show that in classifying nine categories of electrocardiogram signals, the macro-average accuracy of the fusion model is 0.968, the macro-average recall is 0.814, the macro-average precision is 0.857, the macro-average F1 score is 0.825, and the micro-average area under the curve is 0.919. Similar experiments with some common network structures and other advanced electrocardiogram classification algorithms show that the proposed model performs favourably against its counterparts in F1 score. We also conducted ablation studies to verify the effect of the complementary information from the 12 leads and the auxiliary information of domain-specific features on the classification performance of the model. We demonstrated the feasibility and effectiveness of the XGBoost-based fusion model in classifying 12-lead electrocardiogram records into nine common heart rhythms. These findings may have clinical importance for the early diagnosis of arrhythmia and may stimulate further research. In addition, the proposed multichannel feature-fusion algorithm can be applied to other similar physiological signal analysis and processing tasks.
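A minimal sketch of the final fusion stage: deep per-lead embeddings concatenated with domain-specific features and fed to XGBoost. All shapes and data below are hypothetical placeholders, not the authors' configuration.

```python
import numpy as np
from xgboost import XGBClassifier

# Hypothetical shapes: per-lead CNN-BiLSTM embeddings (12 leads x 64 dims)
# concatenated with hand-crafted domain features (e.g., 20 dims).
n = 500
deep_feats = np.random.randn(n, 12 * 64)
domain_feats = np.random.randn(n, 20)
X = np.concatenate([deep_feats, domain_feats], axis=1)
y = np.random.randint(0, 9, size=n)          # nine rhythm classes

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X, y)                                # multiclass handled automatically
print(clf.predict_proba(X[:2]).shape)        # (2, 9)
```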
Affiliation(s)
- Xiaohong Ye: Chengyi University College, Jimei University, Xiamen, China
- Yuanqi Huang: School of Physical Education and Sport Science, Fujian Normal University, Fuzhou, China
- Qiang Lu (corresponding author): School of Science, Jimei University, Xiamen, China
21. de Santana Correia A, Colombini EL. Attention, please! A survey of neural attention models in deep learning. Artif Intell Rev 2022. DOI: 10.1007/s10462-022-10148-x.
22. Liu M, Hu H, Li L, Yu Y, Guan W. Chinese Image Caption Generation via Visual Attention and Topic Modeling. IEEE Trans Cybern 2022;52:1247-1257. PMID: 32568717. DOI: 10.1109/tcyb.2020.2997034.
Abstract
Automatic image captioning is to conduct the cross-modal conversion from image visual content to natural language text. Involving computer vision (CV) and natural language processing (NLP), it has become one of the most sophisticated research issues in the artificial-intelligence area. Based on the deep neural network, the neural image caption (NIC) model has achieved remarkable performance in image captioning, yet there still remain some essential challenges, such as the deviation between descriptive sentences generated by the model and the intrinsic content expressed by the image, the low accuracy of the image scene description, and the monotony of generated sentences. In addition, most of the current datasets and methods for image captioning are in English. However, considering the distinction between Chinese and English in syntax and semantics, it is necessary to develop specialized Chinese image caption generation methods to accommodate the difference. To solve the aforementioned problems, we design the NICVATP2L model via visual attention and topic modeling, in which the visual attention mechanism reduces the deviation and the topic model improves the accuracy and diversity of generated sentences. Specifically, in the encoding phase, convolutional neural network (CNN) and topic model are used to extract visual and topic features of the input images, respectively. In the decoding phase, an attention mechanism is applied to processing image visual features for obtaining image visual region features. Finally, the topic features and the visual region features are combined to guide the two-layer long short-term memory (LSTM) network for generating Chinese image captions. To justify our model, we have conducted experiments over the Chinese AIC-ICC image dataset. The experimental results show that our model can automatically generate more informative and descriptive captions in Chinese in a more natural way, and it outperforms the existing image captioning NIC model.
23. Tang P, Tan Y, Luo W. Visual and language semantic hybrid enhancement and complementary for video description. Neural Comput Appl 2022. DOI: 10.1007/s00521-021-06733-w.
24. Perez-Martin J, Bustos B, Guimarães SJF, Sipiran I, Pérez J, Said GC. A comprehensive review of the video-to-text problem. Artif Intell Rev 2022. DOI: 10.1007/s10462-021-10104-1.
25. Peng L, Yang Y, Wang Z, Huang Z, Shen HT. MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network. IEEE Trans Pattern Anal Mach Intell 2022;44:318-329. PMID: 32750794. DOI: 10.1109/tpami.2020.3004830.
Abstract
Visual Question Answering (VQA) is the task of answering natural language questions tied to the content of visual images. Most recent VQA approaches apply an attention mechanism to focus on the relevant visual objects and/or consider the relations between objects via off-the-shelf methods in visual relation reasoning. However, they still suffer from several drawbacks. First, they mostly model only simple relations between objects, and as a result many complicated questions cannot be answered correctly, because sufficient knowledge is not provided. Second, they seldom leverage the harmonious cooperation of visual appearance features and relation features. To solve these problems, we propose a novel end-to-end VQA model, termed multi-modal relation attention network (MRA-Net). The proposed model explores both textual and visual relations to improve performance and interpretability. Specifically, we devise 1) a self-guided word-relation attention scheme, which explores the latent semantic relations between words, and 2) two question-adaptive visual-relation attention modules that can extract not only fine-grained and precise binary relations between objects but also more sophisticated ternary relations. Both kinds of question-related visual relations provide more and deeper visual semantics, thereby improving the visual reasoning ability of question answering. Furthermore, the proposed model combines appearance features with relation features to reconcile the two types of features effectively. Extensive experiments on five large benchmark datasets, VQA-1.0, VQA-2.0, COCO-QA, VQA-CP v2, and TDIUC, demonstrate that our proposed model outperforms state-of-the-art approaches.
26. Wu F, Cheng J, Wang X, Wang L, Tao D. Image Hallucination From Attribute Pairs. IEEE Trans Cybern 2022;52:568-581. PMID: 32275630. DOI: 10.1109/tcyb.2020.2979258.
Abstract
Recent image-generation methods have demonstrated that realistic images can be produced from captions. Despite the promising results achieved, existing caption-based generation methods confront a dilemma. On the one hand, the image generator should be provided with sufficient details for realistic hallucination, meaning that longer sentences with rich content are preferred, but on the other hand, the generator is meanwhile fragile to long sentences due to their complex semantics and syntax like long-range dependencies and the combinatorial explosion of object visual features. Toward alleviating this dilemma, a novel approach is proposed in this article to hallucinate images from attribute pairs, which can be extracted from natural language processing (NLP) toolsets in the presence of complex semantics and syntax. Attribute pairs, therefore, enable our image generator to tackle long sentences handily and alleviate the combinatorial explosion, and at the same time, allow us to enlarge the training dataset and to produce hallucinations from randomly combined attribute pairs at ease. Experiments on widely used datasets demonstrate that the proposed approach yields results superior to the state of the art.
27. Auto-encoded Latent Representations of White Matter Streamlines for Quantitative Distance Analysis. Neuroinformatics 2022;20:1105-1120. PMID: 35731372. PMCID: PMC9588484. DOI: 10.1007/s12021-022-09593-4.
Abstract
Parcellation of whole brain tractograms is a critical step in studying brain white matter structures and connectivity patterns. The existing methods, based on supervised classification of streamlines into predefined bundle types, are not designed to explore sub-bundle structures, and methods with manually designed features are expensive when computing streamline-wise similarities. To resolve these issues, we propose a novel atlas-free method that learns a latent space using a deep recurrent auto-encoder trained in an unsupervised manner. The method efficiently embeds streamlines of any length into fixed-size feature vectors, named streamline embeddings, for tractogram parcellation using non-parametric clustering in the latent space. The method was evaluated on the ISMRM 2015 tractography challenge dataset, with discrimination of major bundles using clustering algorithms and streamline querying based on similarity, as well as on real tractograms of 102 subjects from the Human Connectome Project. The learnt latent streamline and bundle representations open the possibility of quantitative studies of sub-bundle structures of arbitrary granularity using generic data-mining techniques.
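A minimal sketch of a recurrent auto-encoder that embeds point sequences into fixed-size vectors; the authors' architecture and training details differ, and all dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class StreamlineAutoEncoder(nn.Module):
    """GRU encoder maps a streamline (sequence of 3-D points) to a fixed
    latent vector; a GRU decoder reconstructs the points from it."""
    def __init__(self, latent: int = 32, hidden: int = 64):
        super().__init__()
        self.enc = nn.GRU(3, hidden, batch_first=True)
        self.to_latent = nn.Linear(hidden, latent)
        self.dec = nn.GRU(latent, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 3)

    def forward(self, pts):                          # (batch, n_points, 3)
        _, h = self.enc(pts)
        z = self.to_latent(h[-1])                    # streamline embedding
        z_seq = z.unsqueeze(1).repeat(1, pts.size(1), 1)
        recon, _ = self.dec(z_seq)
        return self.out(recon), z                    # train with MSE(recon, pts)

ae = StreamlineAutoEncoder()
recon, emb = ae(torch.randn(8, 100, 3))
print(recon.shape, emb.shape)  # torch.Size([8, 100, 3]) torch.Size([8, 32])
```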
28. Asad M, Jiang H, Yang J, Tu E, Malik AA. Multi-Level Two-Stream Fusion-Based Spatio-Temporal Attention Model for Violence Detection and Localization. Int J Pattern Recogn 2021. DOI: 10.1142/s0218001422550023.
Abstract
Detection of violent human behavior is necessary for public safety and monitoring. However, it demands constant human observation and attention in human-based surveillance systems, which is a challenging task. Autonomous detection of violent human behavior is therefore essential for continuous, uninterrupted video surveillance. In this paper, we propose a novel method for violence detection and localization in videos using the fusion of spatio-temporal features and an attention model. The model consists of a Fusion Convolutional Neural Network (Fusion-CNN), spatio-temporal attention modules, and Bi-directional Convolutional LSTMs (BiConvLSTM). The Fusion-CNN learns both spatial and temporal features by combining multi-level inter-layer features from both RGB and optical-flow input frames. The spatial attention module is used to generate an importance mask to focus on the most important areas of the image frame. The temporal attention part, which is based on BiConvLSTM, identifies the most significant video frames related to violent activity. The proposed model can also localize and discriminate prominent regions in both the spatial and temporal domains, given weakly supervised training with only video-level classification labels. Experimental results evaluated on several publicly available benchmark datasets show the superior performance of the proposed model in comparison with existing methods. Our model achieves improved accuracies (ACC) of 89.1%, 99.1%, and 98.15% on the RWF-2000, HockeyFight, and Crowd-Violence datasets, respectively. For the CCTV-FIGHTS dataset, we adopt the mean average precision (mAP) metric, and our model obtains 80.7% mAP.
Affiliation(s)
- Mujtaba Asad: Institute of Image Processing & Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, P. R. China
- He Jiang: Institute of Image Processing & Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, P. R. China
- Jie Yang: Institute of Image Processing & Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, P. R. China
- Enmei Tu: Institute of Image Processing & Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, P. R. China
- Aftab A. Malik: Department of Software Engineering, Lahore Garrison University, Lahore 54810, Pakistan
29. Yang H(F), Ke R, Cui Z, Wang Y, Murthy K. Toward a real-time Smart Parking Data Management and Prediction (SPDMP) system by attributes representation learning. Int J Intell Syst 2021. DOI: 10.1002/int.22725.
Affiliation(s)
- Hao (Frank) Yang: Department of Civil and Environmental Engineering, University of Washington, Seattle, Washington, USA
- Ruimin Ke: Civil Engineering (Smart Cities), University of Texas at El Paso, El Paso, Texas, USA
- Zhiyong Cui: Department of Civil and Environmental Engineering, University of Washington, Seattle, Washington, USA; eScience Institute, University of Washington, Seattle, Washington, USA
- Yinhai Wang: Department of Civil and Environmental Engineering, University of Washington, Seattle, Washington, USA; Department of Electrical and Computer Engineering, University of Washington, Seattle, Washington, USA
- Karthik Murthy: Washington State Department of Transportation (WSDOT), Olympia, Washington, USA
30. Sahoo SP, Ari S, Mahapatra K, Mohanty SP. HAR-Depth: A Novel Framework for Human Action Recognition Using Sequential Learning and Depth Estimated History Images. IEEE Trans Emerg Top Comput Intell 2021. DOI: 10.1109/tetci.2020.3014367.
31. Liu Z, Liu Y, Lyu C, Ye J. Building Personalized Transportation Model for Online Taxi-Hailing Demand Prediction. IEEE Trans Cybern 2021;51:4602-4610. PMID: 32628608. DOI: 10.1109/tcyb.2020.3000929.
Abstract
The accurate prediction of online taxi-hailing demand is challenging but of significant value in the development of the intelligent transportation system. This article focuses on large-scale online taxi-hailing demand prediction and proposes a personalized demand prediction model. A model with two attention blocks is proposed to capture both spatial and temporal perspectives. We also explored the impact of network architecture on taxi-hailing demand prediction accuracy. The proposed method is universal in the sense that it is applicable to problems associated with large-scale spatiotemporal prediction. The experimental results on city-wide online taxi-hailing demand dataset demonstrate that the proposed personalized demand prediction model achieves superior prediction accuracy.
32. Malektaji S, Ebrahimzadeh A, Elbiaze H, Glitho RH, Kianpisheh S. Deep Reinforcement Learning-Based Content Migration for Edge Content Delivery Networks With Vehicular Nodes. IEEE Trans Netw Serv Manag 2021. DOI: 10.1109/tnsm.2021.3086721.
33. Rahman MM, Abedin T, Prottoy KS, Moshruba A, Siddiqui FH. Video captioning with stacked attention and semantic hard pull. PeerJ Comput Sci 2021;7:e664. PMID: 34435104. PMCID: PMC8356660. DOI: 10.7717/peerj-cs.664.
Abstract
Video captioning, i.e., the task of generating captions from video sequences, bridges the Natural Language Processing and Computer Vision domains of computer science. The task of generating a semantically accurate description of a video is quite complex. Considering the complexity of the problem, the results obtained in recent research works are praiseworthy. However, there is plenty of scope for further investigation. This paper addresses this scope and proposes a novel solution. Most video captioning models comprise two sequential/recurrent layers: one as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, namely Semantically Sensible Video Captioning (SSVC), which modifies the context generation mechanism by using two novel approaches, "stacked attention" and "spatial hard pull". As there are no exclusive metrics for evaluating video captioning models, we emphasize both quantitative and qualitative analysis of our model. Hence, we have used the BLEU scoring metric for quantitative analysis and have proposed a human evaluation metric for qualitative analysis, namely the Semantic Sensibility (SS) scoring metric. The SS score overcomes the shortcomings of common automated scoring metrics. This paper reports that the use of the aforementioned novelties improves the performance of state-of-the-art architectures.
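The "stacked attention" mechanism is not spelled out in the abstract; as a hedged sketch of the general stacked-attention idea (all names and dimensions here are assumptions, not the SSVC implementation), each layer attends over the frame features with respect to a context vector and then refines that context with the attended summary:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def stacked_attention(frames, query, num_layers=2):
        # frames: (N, d) per-frame features; query: (d,) evolving context.
        # Each layer re-weights frames by relevance to the current query,
        # then refines the query with the attended summary (the "stack").
        for _ in range(num_layers):
            weights = softmax(frames @ query / np.sqrt(len(query)))  # (N,)
            query = query + weights @ frames                         # refined context
        return query

    frames = np.random.rand(30, 64)        # 30 frames, 64-d features (hypothetical)
    context = stacked_attention(frames, np.random.rand(64))

A caption decoder would then condition its generation on the returned context vector.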
Collapse
Affiliation(s)
- Md. Mushfiqur Rahman
- Department of Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh
| | - Thasin Abedin
- Department of Electrical and Electronic Engineering, Islamic University of Technology, Gazipur, Bangladesh
| | - Khondokar S.S. Prottoy
- Department of Electrical and Electronic Engineering, Islamic University of Technology, Gazipur, Bangladesh
| | - Ayana Moshruba
- Department of Electrical and Electronic Engineering, Islamic University of Technology, Gazipur, Bangladesh
| | - Fazlul Hasan Siddiqui
- Department of Computer Science & Engineering, Dhaka University of Engineering and Technology, Gazipur, Bangladesh
| |
Collapse
|
34
|
Automated cardiac segmentation of cross-modal medical images using unsupervised multi-domain adaptation and spatial neural attention structure. Med Image Anal 2021; 72:102135. [PMID: 34182202 DOI: 10.1016/j.media.2021.102135] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2021] [Revised: 06/10/2021] [Accepted: 06/13/2021] [Indexed: 01/01/2023]
Abstract
Accurate cardiac segmentation of multimodal images, e.g., magnetic resonance (MR) and computed tomography (CT) images, plays a pivotal role in auxiliary diagnoses, treatments, and postoperative assessments of cardiovascular diseases. However, training a well-behaved segmentation model for cross-modal cardiac image analysis is challenging due to the diverse appearances/distributions arising from different devices and acquisition conditions. For instance, a well-trained segmentation model based on the source domain of MR images often fails in the segmentation of CT images. In this work, a cross-modal image-oriented cardiac segmentation scheme is proposed using a symmetric full convolutional neural network (SFCNN) with unsupervised multi-domain adaptation (UMDA) and a spatial neural attention (SNA) structure, termed UMDA-SNA-SFCNN, which has the merit of requiring no annotation on the test domain. Specifically, UMDA-SNA-SFCNN incorporates SNA into the classic adversarial domain adaptation network to highlight the relevant regions while restraining the irrelevant areas in the cross-modal images, so as to suppress negative transfer in the process of unsupervised domain adaptation. In addition, multi-layer feature discriminators and a predictive segmentation-mask discriminator are established to connect the multi-layer features and segmentation mask of the backbone network, SFCNN, to realize the fine-grained alignment of unsupervised cross-modal feature domains. Extensive confirmative and comparative experiments on the benchmark Multi-Modality Whole Heart Challenge dataset show that the proposed model is superior to the state-of-the-art cross-modal segmentation methods.
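The SNA structure can be caricatured as a learned soft mask that re-weights the feature map before adversarial alignment; the following minimal sketch (a plain sigmoid-gated projection, with all shapes and weights hypothetical rather than taken from the paper) shows only the gating step:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def spatial_attention_gate(feats, w):
        # feats: (H, W, C) feature map; w: (C,) projection producing one
        # attention logit per location. The sigmoid mask highlights relevant
        # regions and suppresses irrelevant ones before later layers.
        mask = sigmoid(feats @ w)          # (H, W), one weight per location
        return feats * mask[..., None]     # re-weighted feature map

    feats = np.random.randn(32, 32, 8)     # hypothetical backbone features
    gated = spatial_attention_gate(feats, np.random.randn(8))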
Collapse
|
35
|
Abstract
After the September 11 attacks, security and surveillance measures changed across the globe. Surveillance cameras are now installed almost everywhere to monitor video footage. Though quite handy, these cameras produce video of massive size and volume. The major challenge faced by security agencies is the effort required to analyze the surveillance video data collected and generated daily. Problems related to these videos are twofold: (1) understanding the contents of video streams, and (2) conversion of the video contents to condensed formats, such as textual interpretations and summaries, to save storage space. In this paper, we propose a video description framework on a surveillance dataset. This framework is based on the multitask learning of high-level features (HLFs) using a convolutional neural network (CNN) and natural language generation (NLG) through bidirectional recurrent networks. For each specific task, a parallel pipeline is derived from the base visual geometry group (VGG)-16 model. Tasks include scene recognition, action recognition, object recognition, and human face-specific feature recognition. Experimental results on the TRECViD, UET Video Surveillance (UETVS), and AGRIINTRUSION datasets show that the model outperforms state-of-the-art methods, achieving METEOR (Metric for Evaluation of Translation with Explicit ORdering) scores of 33.9%, 34.3%, and 31.2%, respectively. Our results show that our framework has distinct advantages over traditional rule-based models for the recognition and generation of natural language descriptions.
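The multitask design amounts to several parallel heads reading one shared backbone feature; a minimal sketch of that pattern (the weights, class counts, and task names below are illustrative assumptions, not the trained model) is:

    import numpy as np

    def multitask_heads(shared_feat, heads):
        # one shared backbone feature vector feeds parallel task heads
        # (scene, action, object, face), echoing the parallel pipelines
        # the framework derives from a shared VGG-16 base
        return {task: int(np.argmax(shared_feat @ w)) for task, w in heads.items()}

    rng = np.random.default_rng(1)
    feat = rng.standard_normal(128)        # hypothetical backbone output
    heads = {"scene": rng.standard_normal((128, 10)),
             "action": rng.standard_normal((128, 20)),
             "object": rng.standard_normal((128, 30)),
             "face": rng.standard_normal((128, 4))}
    print(multitask_heads(feat, heads))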
Collapse
|
36
|
Gao L, Li H, Liu Z, Liu Z, Wan L, Feng W. RNN-Transducer based Chinese Sign Language Recognition. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.12.006] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
37
|
Xu X, Wang T, Yang Y, Hanjalic A, Shen HT. Radial Graph Convolutional Network for Visual Question Generation. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2021; 32:1654-1667. [PMID: 32340964 DOI: 10.1109/tnnls.2020.2986029] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
In this article, we address the problem of visual question generation (VQG), a challenge in which a computer is required to generate meaningful questions about an image targeting a given answer. Existing approaches typically treat the VQG task as a reversed visual question answering (VQA) task, requiring exhaustive matching between all the image regions and the given answer. To reduce the complexity, we propose an innovative answer-centric approach termed radial graph convolutional network (Radial-GCN) that focuses on the relevant image regions only. Our Radial-GCN method can quickly find the core answer area in an image by matching the latent answer with the semantic labels learned from all image regions. Then, a novel sparse graph with a radial structure is naturally built to capture the associations between the core node (i.e., the answer area) and peripheral nodes (i.e., other areas); graph attention is subsequently adopted to steer the convolutional propagation toward potentially more relevant nodes for final question generation. Extensive experiments on three benchmark data sets show the superiority of our approach compared with the reference methods. Even in the unexplored challenging zero-shot VQA task, the questions synthesized by our method remarkably boost the performance of several state-of-the-art VQA methods from 0% to over 40%. The implementation code of our proposed method and the successfully generated questions are available at https://github.com/Wangt-CN/VQG-GCN.
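One propagation step over such a radial (star-shaped) graph can be sketched as a standard normalized GCN update; the sketch below builds the star adjacency around an assumed core node and illustrates the general idea, not the released implementation:

    import numpy as np

    def radial_gcn_layer(h, w, core=0):
        # h: (N, d) features for one core node (the answer area) and N-1
        # peripheral region nodes; w: (d, d_out) layer weights
        n = h.shape[0]
        a = np.eye(n)                       # self-loops
        a[core, :] = 1.0                    # core connects to every node
        a[:, core] = 1.0
        deg = a.sum(axis=1)
        a_norm = a / np.sqrt(np.outer(deg, deg))   # symmetric normalization
        return np.maximum(a_norm @ h @ w, 0.0)     # ReLU activation

    h = np.random.randn(10, 16)             # 1 core + 9 peripheral nodes (hypothetical)
    h_next = radial_gcn_layer(h, np.random.randn(16, 16))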
Collapse
|
38
|
Xu X, Wang T, Yang Y, Zuo L, Shen F, Shen HT. Cross-Modal Attention With Semantic Consistence for Image-Text Matching. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:5412-5425. [PMID: 32071004 DOI: 10.1109/tnnls.2020.2967597] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
The task of image-text matching refers to measuring the visual-semantic similarity between an image and a sentence. Recently, fine-grained matching methods that explore the local alignment between image regions and sentence words have shown advances in inferring the image-text correspondence by aggregating pairwise region-word similarity. However, the local alignment is hard to achieve, as some important image regions may be inaccurately detected or even missing. Meanwhile, some words with high-level semantics cannot be strictly corresponded to a single image region. To tackle these problems, we address the importance of exploiting the global semantic consistency between image regions and sentence words as a complement to the local alignment. In this article, we propose a novel hybrid matching approach named Cross-modal Attention with Semantic Consistency (CASC) for image-text matching. The proposed CASC is a joint framework that performs cross-modal attention for local alignment and multilabel prediction for global semantic consistency. It directly extracts semantic labels from the available sentence corpus without additional labor cost, which further provides a global similarity constraint for the aggregated region-word similarity obtained by the local alignment. Extensive experiments on the Flickr30k and Microsoft COCO (MSCOCO) data sets demonstrate the effectiveness of the proposed CASC in preserving global semantic consistency along with the local alignment and further show its superior image-text matching performance compared with more than 15 state-of-the-art methods.
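The aggregation of pairwise region-word similarity that CASC constrains can be sketched as follows; the cosine-similarity attention below is a generic stand-in (the temperature and pooling choices are guesses, not the paper's exact formulation):

    import numpy as np

    def image_text_score(regions, words, temperature=4.0):
        # regions: (R, d) region embeddings; words: (W, d) word embeddings.
        # Each word attends to its most relevant regions; per-word attended
        # similarities are pooled into one image-text score.
        r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
        w = words / np.linalg.norm(words, axis=1, keepdims=True)
        sim = w @ r.T                              # (W, R) cosine similarities
        attn = np.exp(temperature * sim)
        attn /= attn.sum(axis=1, keepdims=True)    # attention over regions
        per_word = (attn * sim).sum(axis=1)        # attended similarity per word
        return per_word.mean()                     # pooled image-text score

    score = image_text_score(np.random.randn(36, 256), np.random.randn(12, 256))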
Collapse
|
39
|
Multi-Sentence Video Captioning using Content-oriented Beam Searching and Multi-stage Refining Algorithm. Inf Process Manag 2020. [DOI: 10.1016/j.ipm.2020.102302] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
40
|
Souza CM, Meireles MRG, Almeida PEM. A comparative study of abstractive and extractive summarization techniques to label subgroups on patent dataset. Scientometrics 2020. [DOI: 10.1007/s11192-020-03732-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
41
|
Adaptively Converting Auxiliary Attributes and Textual Embedding for Video Captioning Based on BiLSTM. Neural Process Lett 2020. [DOI: 10.1007/s11063-020-10352-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
42
|
|
43
|
Zhang D, Wang L, Zhang L, Dai BT, Shen HT. The Gap of Semantic Parsing: A Survey on Automatic Math Word Problem Solvers. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2020; 42:2287-2305. [PMID: 31056490 DOI: 10.1109/tpami.2019.2914054] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Solving mathematical word problems (MWPs) automatically is challenging, primarily due to the semantic gap between human-readable words and machine-understandable logic. Despite a long history dating back to the 1960s, MWPs have regained intensive attention in the past few years with the advancement of Artificial Intelligence (AI). Solving MWPs successfully is considered a milestone toward general AI. Many systems have claimed promising results on self-crafted and small-scale datasets. However, when applied to large and diverse datasets, none of the proposed methods in the literature achieves high precision, revealing that current MWP solvers still have much room for improvement. This motivated us to present a comprehensive survey delivering a clear and complete picture of automatic math problem solvers. In this survey, we focus on algebraic word problems, summarize the extracted features and proposed techniques that bridge the semantic gap, and compare their performance on the publicly accessible datasets. We also cover automatic solvers for other types of math problems, such as geometric problems that require the understanding of diagrams. Finally, we identify several emerging research directions for readers interested in MWPs.
Collapse
|
44
|
Denis J, Dard RF, Quiroli E, Cossart R, Picardo MA. DeepCINAC: A Deep-Learning-Based Python Toolbox for Inferring Calcium Imaging Neuronal Activity Based on Movie Visualization. eNeuro 2020; 7:ENEURO.0038-20.2020. [PMID: 32699072 PMCID: PMC7438055 DOI: 10.1523/eneuro.0038-20.2020] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2020] [Revised: 07/06/2020] [Accepted: 07/10/2020] [Indexed: 11/21/2022] Open
Abstract
Two-photon calcium imaging is now widely used to infer neuronal dynamics from changes in the fluorescence of an indicator. However, state-of-the-art computational tools are not optimized for the reliable detection of fluorescence transients from highly synchronous neurons located in densely packed regions, such as the CA1 pyramidal layer of the hippocampus during early postnatal stages of development. Indeed, the latest analytical tools often lack proper benchmark measurements. To meet this challenge, we first developed a graphical user interface (GUI) allowing for precise manual detection of all calcium transients from imaged neurons based on visualization of the calcium imaging movie. Then, we analyzed movies from mouse pups using a convolutional neural network (CNN) with an attention process and a bidirectional long short-term memory (LSTM) network. This method is able to reach human performance and offers a better F1 score (the harmonic mean of sensitivity and precision) than CaImAn for inferring neural activity in the developing CA1 without any user intervention. It also enables the automatic identification of activity originating from GABAergic neurons. Overall, DeepCINAC offers a simple, fast, and flexible open-source toolbox for processing a wide variety of calcium imaging datasets while providing the tools to evaluate its performance.
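The F1 score the toolbox is benchmarked with is simply the harmonic mean named in the abstract; for reference (the counts below are made up):

    def f1_score(tp, fp, fn):
        # F1 = harmonic mean of precision and sensitivity (recall),
        # the metric used to compare transient detection against CaImAn
        precision = tp / (tp + fp)
        sensitivity = tp / (tp + fn)
        return 2 * precision * sensitivity / (precision + sensitivity)

    # e.g. 90 correctly detected transients, 10 false detections, 20 missed:
    print(round(f1_score(tp=90, fp=10, fn=20), 3))   # 0.857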
Collapse
Affiliation(s)
| | | | | | - Rosa Cossart
- Aix Marseille Univ, INSERM, INMED, Marseille 13273, France
| | | |
Collapse
|
45
|
A novel privacy-preserving speech recognition framework using bidirectional LSTM. JOURNAL OF CLOUD COMPUTING: ADVANCES, SYSTEMS AND APPLICATIONS 2020. [DOI: 10.1186/s13677-020-00186-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Utilizing speech as the transmission medium in the Internet of Things (IoT) is an effective way to reduce latency while improving the efficiency of human-machine interaction. In the field of speech recognition, the Recurrent Neural Network (RNN) has significant advantages in achieving accuracy improvements. However, some RNN-based intelligent speech recognition applications provide insufficient privacy protection for speech data, and others that do preserve privacy are time-consuming, especially in model training and speech recognition. Therefore, in this paper we propose a novel Privacy-preserving Speech Recognition framework using a Bidirectional Long short-term memory neural network, namely PSRBL. On the one hand, PSRBL designs new functions to construct secure activation functions by combining them with an additive secret-sharing protocol, namely a secure piecewise-linear Sigmoid and a secure piecewise-linear Tanh, to preserve the privacy of speech data during the speech recognition process running on edge servers. On the other hand, in order to reduce the time spent on both the training and the recognition of the speech model while keeping high accuracy, PSRBL first utilizes the secure activation functions to refit the original activation functions in the bidirectional Long Short-Term Memory neural network (LSTM), and then makes full use of the left and right context information of speech data by employing the bidirectional LSTM. Experiments conducted on the speech dataset TIMIT show that our framework PSRBL performs well. Specifically, compared with state-of-the-art frameworks, PSRBL significantly reduces the time consumed by both the training and the recognition of the speech model, under the premise that PSRBL and the comparison methods are consistent in preserving the privacy of speech data.
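The appeal of piecewise-linear activations under additive secret sharing is that the linear segments can be evaluated on shares locally; a hedged sketch follows (the three-piece approximation below is a common choice and may differ from the paper's exact refitting):

    import numpy as np

    def share(x):
        # additive secret sharing: split x into two random shares, x = s1 + s2
        s1 = np.random.randn(*np.shape(x))
        return s1, x - s1

    def piecewise_linear_sigmoid(x):
        # common 3-piece linear stand-in for sigmoid:
        # 0 for x < -2, 1 for x > 2, 0.25*x + 0.5 in between
        return np.clip(0.25 * x + 0.5, 0.0, 1.0)

    x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
    s1, s2 = share(x)
    mid1, mid2 = 0.25 * s1 + 0.5, 0.25 * s2          # each party computes locally
    assert np.allclose(mid1 + mid2, 0.25 * x + 0.5)  # shares reconstruct the segment
    # the clipping (comparison) step requires an interactive secure protocol,
    # which is what the secret-sharing protocol in the paper provides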
Collapse
|
46
|
Zheng N, Du S, Wang J, Zhang H, Cui W, Kang Z, Yang T, Lou B, Chi Y, Long H, Ma M, Yuan Q, Zhang S, Zhang D, Ye F, Xin J. Predicting COVID-19 in China Using Hybrid AI Model. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:2891-2904. [PMID: 32396126 DOI: 10.1109/tcyb.2020.2990162] [Citation(s) in RCA: 112] [Impact Index Per Article: 22.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/22/2023]
Abstract
The coronavirus disease 2019 (COVID-19), which broke out in late December 2019, is gradually being controlled in China, but it is still spreading rapidly in many other countries and regions worldwide. It is urgent to conduct prediction research on the development and spread of the epidemic. In this article, a hybrid artificial-intelligence (AI) model is proposed for COVID-19 prediction. First, as traditional epidemic models treat all individuals with coronavirus as having the same infection rate, an improved susceptible-infected (ISI) model is proposed to estimate the variation of infection rates for analyzing the transmission laws and development trend. Second, considering the effects of prevention and control measures and the increase in the public's prevention awareness, the natural language processing (NLP) module and the long short-term memory (LSTM) network are embedded into the ISI model to build the hybrid AI model for COVID-19 prediction. The experimental results on the epidemic data of several typical provinces and cities in China show that individuals with coronavirus have a higher infection rate within the third to eighth days after they were infected, which is more in line with the actual transmission laws of the epidemic. Moreover, compared with the traditional epidemic models, the proposed hybrid AI model can significantly reduce the errors of the prediction results, obtaining mean absolute percentage errors (MAPEs) of 0.52%, 0.38%, 0.05%, and 0.86% for the next six days in Wuhan, Beijing, Shanghai, and countrywide, respectively.
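The ISI idea of a non-constant infection rate can be caricatured with a discrete susceptible-infected update whose rate varies over time; the step and metric below are illustrative sketches (the parameters are invented, and the actual ISI model conditions the rate on days since infection rather than calendar time):

    import numpy as np

    def si_step(s, i, beta_t, n):
        # one discrete susceptible-infected step with time-varying rate beta_t
        new_infections = beta_t * s * i / n
        return s - new_infections, i + new_infections

    def mape(y_true, y_pred):
        # mean absolute percentage error, the metric reported in the abstract
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

    s, i, n = 9990.0, 10.0, 10000.0
    for beta in [0.5, 0.45, 0.4, 0.35, 0.3, 0.25]:   # hypothetical declining rates
        s, i = si_step(s, i, beta, n)
    print(round(i, 1))                                # infected after six steps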
Collapse
|
47
|
Zuo L, He P, Zhang C, Zhang Z. A robust approach to reading recognition of pointer meters based on improved mask-RCNN. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.01.032] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
48
|
|
49
|
Jain DK, Jain R, Upadhyay Y, Kathuria A, Lan X. Deep Refinement: capsule network with attention mechanism-based system for text classification. Neural Comput Appl 2019. [DOI: 10.1007/s00521-019-04620-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
|
50
|
Ji Y, Zhan Y, Yang Y, Xu X, Shen F, Shen HT. A Context Knowledge Map Guided Coarse-to-fine Action Recognition. IEEE TRANSACTIONS ON IMAGE PROCESSING: A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2019; 29:2742-2752. [PMID: 31725381 DOI: 10.1109/tip.2019.2952088] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Human actions span a wide variety and a large number of categories, which poses a big challenge for action recognition. However, according to similarities in human body poses, scenes, and interactive objects, human actions can be grouped into semantic groups, e.g., sports, cooking, etc. Therefore, in this paper, we propose a novel approach that recognizes human actions from coarse to fine. Taking full advantage of contributions from high-level semantic contexts, a context knowledge map guided recognition method is designed to realize the coarse-to-fine procedure. In the approach, we define semantic contexts with interactive objects, scenes, and body motions in action videos, and build a context knowledge map to automatically define coarse-grained groups. Then, fine-grained classifiers are proposed to realize accurate action recognition. The coarse-to-fine procedure narrows the action categories in the target classifiers, which is beneficial to improving recognition performance. We evaluate the proposed approach on the CCV, the HMDB-51, and the UCF-101 databases. Experiments verify its significant effectiveness, improving recognition precision by more than 5% on average compared with current approaches. It also obtains outstanding performance compared with the state-of-the-art, achieving accuracies of 93.1%, 95.4%, and 74.5% on the CCV, the UCF-101, and the HMDB-51 databases, respectively.
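The coarse-to-fine procedure reduces each decision to two smaller ones; a minimal sketch of that control flow (the linear classifiers and group/action counts are stand-ins, not the paper's networks):

    import numpy as np

    def coarse_to_fine(x, coarse_clf, fine_clfs):
        # first assign the sample to a semantic group (coarse stage), then
        # run only that group's fine-grained classifier on its narrowed labels
        group = coarse_clf(x)
        return group, fine_clfs[group](x)

    rng = np.random.default_rng(0)
    w_coarse = rng.standard_normal((16, 3))                    # 3 semantic groups
    w_fine = [rng.standard_normal((16, 5)) for _ in range(3)]  # 5 actions per group

    coarse = lambda x: int(np.argmax(x @ w_coarse))
    fines = [lambda x, w=w: int(np.argmax(x @ w)) for w in w_fine]
    print(coarse_to_fine(rng.standard_normal(16), coarse, fines))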
Collapse
|