1. Devillers B, Maytie L, VanRullen R. Semi-Supervised Multimodal Representation Learning Through a Global Workspace. IEEE Transactions on Neural Networks and Learning Systems 2025;36:7843-7857. PMID: 38954575. DOI: 10.1109/tnnls.2024.3416701.
Abstract
Recent deep learning models can efficiently combine inputs from different modalities (e.g., images and text) and learn to align their latent representations or to translate signals from one domain to another (as in image captioning or text-to-image generation). However, current approaches mainly rely on brute-force supervised training over large multimodal datasets. In contrast, humans (and other animals) can learn useful multimodal representations from only sparse experience with matched cross-modal data. Here, we evaluate the capabilities of a neural network architecture inspired by the cognitive notion of a "global workspace" (GW): a shared representation for two (or more) input modalities. Each modality is processed by a specialized system (pretrained on unimodal data and subsequently frozen). The corresponding latent representations are then encoded to and decoded from a single shared workspace. Importantly, this architecture is amenable to self-supervised training via cycle-consistency: encoding-decoding sequences should approximate the identity function. For various pairings of vision-language modalities and across two datasets of varying complexity, we show that such an architecture can be trained to align and translate between two modalities with very little need for matched data (from four to seven times less than a fully supervised approach). The GW representation can be used advantageously for downstream classification and cross-modal retrieval tasks and for robust transfer learning. Ablation studies reveal that both the shared workspace and the self-supervised cycle-consistency training are critical to the system's performance.
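For readers who want to see the training signal concretely, here is a minimal PyTorch sketch of cycle-consistency through a shared workspace sitting between two frozen unimodal latent spaces. The module names, dimensions, and loss combination are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalWorkspace(nn.Module):
    """Toy bridge between two frozen unimodal latent spaces (illustrative)."""
    def __init__(self, dim_a: int, dim_b: int, dim_gw: int = 128):
        super().__init__()
        self.enc_a, self.dec_a = nn.Linear(dim_a, dim_gw), nn.Linear(dim_gw, dim_a)
        self.enc_b, self.dec_b = nn.Linear(dim_b, dim_gw), nn.Linear(dim_gw, dim_b)

def cycle_loss(gw: GlobalWorkspace, z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Self-supervised terms needing no paired data: each latent must survive a
    round trip through the workspace (demi-cycle) and a round trip through the
    other modality (full cycle)."""
    demi = F.mse_loss(gw.dec_a(gw.enc_a(z_a)), z_a) + F.mse_loss(gw.dec_b(gw.enc_b(z_b)), z_b)
    full = (F.mse_loss(gw.dec_a(gw.enc_b(gw.dec_b(gw.enc_a(z_a)))), z_a) +
            F.mse_loss(gw.dec_b(gw.enc_a(gw.dec_a(gw.enc_b(z_b)))), z_b))
    return demi + full

def alignment_loss(gw: GlobalWorkspace, z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Supervised term used only on the small matched subset: paired latents
    should meet at the same workspace point and translate into each other."""
    return (F.mse_loss(gw.enc_a(z_a), gw.enc_b(z_b)) +
            F.mse_loss(gw.dec_b(gw.enc_a(z_a)), z_b))

gw = GlobalWorkspace(dim_a=512, dim_b=768)
loss = cycle_loss(gw, torch.randn(16, 512), torch.randn(16, 768)) \
       + alignment_loss(gw, torch.randn(4, 512), torch.randn(4, 768))
```

In the described architecture the unimodal encoders are pretrained and frozen, so only the workspace encoders/decoders above would receive gradients.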
2. Lai L, Chen J, Zhang Z, Lin G, Wu Q. CMFAN: Cross-Modal Feature Alignment Network for Few-Shot Single-View 3D Reconstruction. IEEE Transactions on Neural Networks and Learning Systems 2025;36:5522-5534. PMID: 38593016. DOI: 10.1109/tnnls.2024.3383039.
Abstract
Few-shot single-view 3D reconstruction learns to reconstruct objects from novel categories based on a query image and a few support shapes. However, because the query image and the support shapes belong to different modalities, there is an inherent feature misalignment problem that damages the reconstruction, and previous works do not consider it. To this end, we propose the cross-modal feature alignment network (CMFAN) with two novel techniques. The first is a pretraining strategy, cross-modal contrastive learning (CMCL), in which the 2D images and 3D shapes of the same object compose the positives and those from different objects form the negatives. With CMCL, the model learns to embed the 2D and 3D modalities of the same object into a tight region of the feature space and to push away those from different objects, effectively aligning the global cross-modal features. The second is cross-modal feature fusion (CMFF), which further aligns and fuses the local features. Specifically, it first re-represents the local features with a cross-attention operation so that they share more information. CMFF then generates a descriptor for the support features and attaches it to each local feature vector of the query image via dense concatenation. Moreover, CMFF can be applied to multilevel local features, bringing further gains. Extensive experiments validate our designs, and CMFAN sets new state-of-the-art performance on all of the 1-/10-/25-shot tasks of the ShapeNet and ModelNet datasets.
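As context for the CMCL pretraining step, the sketch below shows a standard symmetric InfoNCE-style contrastive loss between 2D image features and 3D shape features of the same batch; the feature dimension and temperature are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def cmcl_loss(img_feat: torch.Tensor, shape_feat: torch.Tensor, tau: float = 0.07):
    """Symmetric contrastive loss: the 2D image and 3D shape of the same
    object (row i of each batch) are positives; all other pairs are negatives."""
    img = F.normalize(img_feat, dim=-1)
    shp = F.normalize(shape_feat, dim=-1)
    logits = img @ shp.t() / tau                      # (B, B) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with random features standing in for encoder outputs.
loss = cmcl_loss(torch.randn(8, 256), torch.randn(8, 256))
```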
3. Li Y, Zhang L, Shao L. LR Aerial Photo Categorization by Cross-Resolution Perceptual Knowledge Propagation. IEEE Transactions on Neural Networks and Learning Systems 2025;36:3384-3395. PMID: 38252579. DOI: 10.1109/tnnls.2024.3349515.
Abstract
There are hundreds of high- and low-altitude earth observation satellites that asynchronously capture massive-scale aerial photographs every day. Generally, high-altitude satellites take low-resolution (LR) aerial pictures, each covering a considerably large area. In contrast, low-altitude satellites capture high-resolution (HR) aerial photos, each depicting a relatively small area. Accurately discovering the semantics of LR aerial photos is an indispensable technique in computer vision. Nevertheless, it is also a challenging task due to: 1) the difficulty of characterizing human hierarchical visual perception and 2) the prohibitive human effort required to label sufficient training data. To handle these problems, a novel cross-resolution perceptual knowledge propagation (CPKP) framework is proposed, focusing on adapting the visual perceptual experiences deeply learned from HR aerial photos to categorize LR ones. Specifically, by mimicking the human vision system, a novel low-rank model is designed to decompose each LR aerial photo into multiple visually/semantically salient foreground regions coupled with the nonsalient background regions. This model can: 1) produce a gaze-shifting path (GSP) simulating human gaze behavior and 2) engineer a deep feature for each GSP. Afterward, a kernel-induced feature selection (FS) algorithm is formulated to obtain a succinct set of deep GSP features that are discriminative across LR and HR aerial photos. Based on the selected features, the labels from LR and HR aerial photos are collaboratively utilized to train a linear classifier for categorizing LR ones. It is worth emphasizing that such a CPKP mechanism can effectively optimize the linear classifier training, as labels of HR aerial photos are acquired more conveniently in practice. Comprehensive visualization results and comparative studies validate the superiority of our approach.
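The last stage of such a pipeline (select features that are discriminative in both resolutions, then train one linear classifier on HR and LR labels jointly) can be illustrated with generic tools. The scoring rule, feature dimensions, and random stand-in data below are assumptions; the low-rank gaze-shifting-path model itself is not reproduced.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Stand-ins for deep gaze-shifting-path (GSP) features; in the paper these
# come from the low-rank decomposition, which is not reproduced here.
rng = np.random.default_rng(0)
X_hr, y_hr = rng.normal(size=(200, 512)), rng.integers(0, 5, 200)  # HR photos (cheap labels)
X_lr, y_lr = rng.normal(size=(50, 512)), rng.integers(0, 5, 50)    # few labeled LR photos

# Keep features that are informative in *both* resolutions (a simple proxy
# for the cross-resolution feature selection described above).
score = np.minimum(mutual_info_classif(X_hr, y_hr), mutual_info_classif(X_lr, y_lr))
keep = np.argsort(score)[-128:]

# Train one linear classifier collaboratively on HR + LR labels, apply to LR photos.
clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_hr[:, keep], X_lr[:, keep]]), np.concatenate([y_hr, y_lr]))
print(clf.score(X_lr[:, keep], y_lr))
```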
4. Jiang Y, Hua C, Feng Y, Gao Y. Hierarchical Set-to-Set Representation for 3-D Cross-Modal Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2025;36:1302-1314. PMID: 37962998. DOI: 10.1109/tnnls.2023.3326581.
Abstract
Three-dimensional in-domain retrieval has recently achieved significant success, but 3-D cross-modal retrieval still faces challenges. Existing methods rely only on a simple global feature (GF), which overlooks the local information of complex 3-D objects and the connections between similar local features across complex multimodal instances. To tackle this issue, we propose a hierarchical set-to-set representation (HSR) and a corresponding hierarchical similarity that incorporates global-to-global and local-to-local similarity metrics. Specifically, we employ feature extractors for each modality to learn both GFs and local feature sets. We then project these features into a common space and use bilinear pooling to generate compact-set features that remain invariant to the ordering of local features, enabling set-to-set similarity measurement. To facilitate effective hierarchical similarity measurement, we design an operation that combines the GF and the compact-set feature into a hierarchical representation for 3-D cross-modal retrieval, which preserves the hierarchical similarity measurement. To optimize the framework, we adopt joint loss functions, including a cross-modal center loss (CMCL), a mean-square loss, and a cross-entropy loss, to reduce the cross-modal discrepancy for each instance and to minimize the distances between instances of the same category. Experimental results demonstrate that our method outperforms state-of-the-art methods on the 3-D cross-modal retrieval task on both the ModelNet10 and ModelNet40 datasets.
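A rough sketch of the two ingredients named above: bilinear pooling of a local feature set into a compact, order-invariant descriptor, and a hierarchical similarity that mixes global and set-level terms. The signed-square-root normalization and the mixing weight are common choices assumed here, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def compact_set_feature(local_feats: torch.Tensor) -> torch.Tensor:
    """Bilinear-pool a set of local features (N, D) into a permutation-
    invariant compact-set descriptor (stand-in for the paper's module)."""
    x = F.normalize(local_feats, dim=-1)
    pooled = (x.t() @ x) / x.size(0)                               # (D, D) second-order statistics
    pooled = torch.sign(pooled) * torch.sqrt(pooled.abs() + 1e-8)  # signed square root
    return F.normalize(pooled.flatten(), dim=0)

def hierarchical_similarity(gf_q, gf_t, set_q, set_t, alpha: float = 0.5):
    """Global-to-global plus local-to-local (set-to-set) similarity."""
    g = F.cosine_similarity(gf_q, gf_t, dim=0)
    l = torch.dot(compact_set_feature(set_q), compact_set_feature(set_t))
    return alpha * g + (1 - alpha) * l

sim = hierarchical_similarity(torch.randn(256), torch.randn(256),
                              torch.randn(32, 64), torch.randn(40, 64))
```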
5. Yan X, Mao Y, Ye Y, Yu H. Cross-Modal Clustering With Deep Correlated Information Bottleneck Method. IEEE Transactions on Neural Networks and Learning Systems 2024;35:13508-13522. PMID: 37220062. DOI: 10.1109/tnnls.2023.3269789.
Abstract
Cross-modal clustering (CMC) aims to improve clustering accuracy (ACC) by exploiting the correlations across modalities. Although recent research has made impressive advances, it remains challenging to sufficiently capture these correlations due to the high-dimensional nonlinear characteristics of individual modalities and the conflicts between heterogeneous modalities. In addition, meaningless modality-private information in each modality can become dominant during correlation mining, which also interferes with clustering performance. To tackle these challenges, we devise a novel deep correlated information bottleneck (DCIB) method, which explores the correlation information between multiple modalities while eliminating the modality-private information in each modality in an end-to-end manner. Specifically, DCIB treats the CMC task as a two-stage data compression procedure, in which the modality-private information in each modality is eliminated under the guidance of the shared representation of multiple modalities. Meanwhile, the correlations between multiple modalities are preserved in terms of both feature distributions and clustering assignments. Finally, the DCIB objective is formulated as a loss based on a mutual information measurement, and a variational optimization approach is proposed to ensure its convergence. Experimental results on four cross-modal datasets validate the superiority of DCIB. Code is released at https://github.com/Xiaoqiang-Yan/DCIB.
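For intuition, here is a generic variational information-bottleneck-flavoured loss for two modalities: KL terms to a standard normal prior play the role of discarding modality-private detail, while a consistency term keeps the two modalities' soft cluster assignments aligned. It is a simplified stand-in for, not a reproduction of, the DCIB objective.

```python
import torch
import torch.nn.functional as F

def ib_style_loss(mu1, logvar1, mu2, logvar2, logits1, logits2, beta: float = 1e-3):
    """Generic IB-flavoured objective: compress each modality's encoding
    (KL to a standard normal prior) while keeping the two modalities'
    cluster assignments in agreement."""
    def kl_to_prior(mu, logvar):
        return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    p1 = F.softmax(logits1, dim=-1)                      # target assignment distribution
    log_p2 = F.log_softmax(logits2, dim=-1)              # other modality's assignments
    agreement = F.kl_div(log_p2, p1, reduction="batchmean")  # cross-modal consistency
    compression = kl_to_prior(mu1, logvar1) + kl_to_prior(mu2, logvar2)
    return agreement + beta * compression

loss = ib_style_loss(torch.randn(8, 32), torch.zeros(8, 32),
                     torch.randn(8, 32), torch.zeros(8, 32),
                     torch.randn(8, 10), torch.randn(8, 10))
```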
6. Sun X, Yao F, Ding C. Modeling High-Order Relationships: Brain-Inspired Hypergraph-Induced Multimodal-Multitask Framework for Semantic Comprehension. IEEE Transactions on Neural Networks and Learning Systems 2024;35:12142-12156. PMID: 37028292. DOI: 10.1109/tnnls.2023.3252359.
Abstract
Semantic comprehension aims to reasonably reproduce people's real intentions or thoughts, e.g., sentiment, humor, sarcasm, motivation, and offensiveness, from multiple modalities. It can be instantiated as a multimodal-oriented multitask classification problem and applied to scenarios such as online public opinion supervision and political stance analysis. Previous methods generally employ multimodal learning alone to deal with varied modalities or exploit multitask learning alone to solve various tasks, with few unifying both in an integrated framework. Moreover, multimodal-multitask cooperative learning inevitably encounters the challenge of modeling high-order relationships, i.e., intramodal, intermodal, and intertask relationships. Research in brain science shows that the human brain achieves multimodal perception and multitask cognition for semantic comprehension via decomposing, associating, and synthesizing processes. Establishing a brain-inspired semantic comprehension framework that bridges the gap between multimodal and multitask learning is therefore the primary motivation of this work. Motivated by the superiority of the hypergraph in modeling high-order relations, in this article we propose a hypergraph-induced multimodal-multitask (HIMM) network for semantic comprehension. HIMM incorporates monomodal, multimodal, and multitask hypergraph networks that mimic the decomposing, associating, and synthesizing processes to tackle the intramodal, intermodal, and intertask relationships, respectively. Furthermore, temporal and spatial hypergraph constructions are designed to model relationships in modalities with sequential and spatial structures, respectively. We also design an alternating hypergraph updating algorithm in which vertices aggregate to update hyperedges and hyperedges in turn update their connected vertices. Experiments on a dataset with two modalities and five tasks verify the effectiveness of HIMM for semantic comprehension.
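The alternating vertex/hyperedge update can be written compactly with an incidence matrix. The sketch below uses simple degree-normalized averaging and is only meant to illustrate the message-passing pattern, not the paper's exact operators.

```python
import torch

def hypergraph_update(X: torch.Tensor, H: torch.Tensor, steps: int = 2) -> torch.Tensor:
    """Alternating message passing on a hypergraph: vertices aggregate into
    hyperedges, then hyperedges propagate back to their member vertices.
    X: (num_vertices, dim) vertex features; H: (num_vertices, num_edges)
    binary incidence matrix."""
    Dv = H.sum(dim=1, keepdim=True).clamp(min=1)   # vertex degrees
    De = H.sum(dim=0, keepdim=True).clamp(min=1)   # hyperedge degrees
    for _ in range(steps):
        E = (H.t() @ X) / De.t()                   # vertices -> hyperedges
        X = (H @ E) / Dv                           # hyperedges -> vertices
    return X

X = hypergraph_update(torch.randn(6, 16), (torch.rand(6, 3) > 0.5).float())
```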
7. Wang Y, Zhen L, Tan TE, Fu H, Feng Y, Wang Z, Xu X, Goh RSM, Ng Y, Calhoun C, Tan GSW, Sun JK, Liu Y, Ting DSW. Geometric Correspondence-Based Multimodal Learning for Ophthalmic Image Analysis. IEEE Transactions on Medical Imaging 2024;43:1945-1957. PMID: 38206778. DOI: 10.1109/tmi.2024.3352602.
Abstract
Color fundus photography (CFP) and optical coherence tomography (OCT) images are two of the most widely used modalities in the clinical diagnosis and management of retinal diseases. Despite the widespread use of multimodal imaging in clinical practice, few methods for automated diagnosis of eye diseases effectively utilize the correlated and complementary information from multiple modalities. This paper explores how to leverage the information from CFP and OCT images to improve the automated diagnosis of retinal diseases. We propose a novel multimodal learning method, named geometric correspondence-based multimodal learning network (GeCoM-Net), to achieve the fusion of CFP and OCT images. Specifically, inspired by clinical observations, we consider the geometric correspondence between an OCT slice and the corresponding CFP region to learn correlated features of the two modalities for robust fusion. Furthermore, we design a new feature selection strategy to extract discriminative OCT representations by automatically selecting the important feature maps from OCT slices. Unlike existing multimodal learning methods, GeCoM-Net is the first to explicitly formulate the geometric relationship between an OCT slice and the corresponding region of the CFP image for CFP-OCT fusion. Experiments have been conducted on a large-scale private dataset and a publicly available dataset to evaluate the effectiveness of GeCoM-Net for diagnosing diabetic macular edema (DME), impaired visual acuity (VA), and glaucoma. The empirical results show that our method outperforms current state-of-the-art multimodal learning methods, improving the AUROC by 0.4%, 1.9%, and 2.9% for DME, impaired VA, and glaucoma detection, respectively.
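To make the geometric-correspondence idea concrete, the toy function below maps an OCT B-scan index to a horizontal strip of the CFP image before the two modalities' features are fused. The sweep geometry, strip width, and fusion head are invented for illustration and are not taken from GeCoM-Net.

```python
import torch
import torch.nn as nn

def cfp_strip_for_oct_slice(cfp: torch.Tensor, slice_idx: int, num_slices: int,
                            strip_frac: float = 0.1) -> torch.Tensor:
    """Toy geometric correspondence: assume the OCT B-scans sweep the fundus
    top-to-bottom, so slice i maps to a horizontal strip of the CFP image.
    cfp: (C, H, W); resizing of the returned strip is omitted for brevity."""
    C, H, W = cfp.shape
    center = int((slice_idx + 0.5) / num_slices * H)
    half = max(1, int(strip_frac * H / 2))
    top, bottom = max(0, center - half), min(H, center + half)
    return cfp[:, top:bottom, :]

strip = cfp_strip_for_oct_slice(torch.randn(3, 512, 512), slice_idx=12, num_slices=49)

# Fusion of per-modality features from the corresponded regions (illustrative head).
fuse = nn.Sequential(nn.Linear(512 + 512, 512), nn.ReLU(), nn.Linear(512, 3))
logits = fuse(torch.cat([torch.randn(1, 512), torch.randn(1, 512)], dim=-1))
```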
8. Camarena F, Gonzalez-Mendoza M, Chang L. Knowledge Distillation in Video-Based Human Action Recognition: An Intuitive Approach to Efficient and Flexible Model Training. J Imaging 2024;10:85. PMID: 38667983. PMCID: PMC11051277. DOI: 10.3390/jimaging10040085.
Abstract
Training a model to recognize human actions in videos is computationally intensive. While modern strategies employ transfer learning to make the process more efficient, they still face challenges regarding flexibility and efficiency. Existing solutions are limited in functionality and rely heavily on pretrained architectures, which can restrict their applicability to diverse scenarios. Our work explores knowledge distillation (KD) for enhancing the training of self-supervised video models in three aspects: improving classification accuracy, accelerating model convergence, and increasing model flexibility under regular and limited-data scenarios. We tested our method on the UCF101 dataset using different proportions of the training data: 100%, 50%, 25%, and 2%. We found that using knowledge distillation to guide the model's training outperforms traditional training, preserving classification accuracy while reducing the time needed for convergence, in both standard settings and a data-scarce environment. Additionally, knowledge distillation enables cross-architecture flexibility, allowing model customization for various applications, from resource-limited to high-performance scenarios.
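The distillation signal referred to here is typically the temperature-softened KL term below, mixed with the ordinary cross-entropy on labels; the temperature, mixing weight, and class count are placeholders rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.7):
    """Response-based knowledge distillation: soften both distributions with a
    temperature T and mix the KL term with the usual cross-entropy on labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(4, 101), torch.randn(4, 101),
                         torch.randint(0, 101, (4,)))
```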
Affiliation(s)
- Fernando Camarena: School of Engineering and Science, Tecnologico de Monterrey, Nuevo León 64700, Mexico
9. Kang M, Zhu R, Chen D, Liu X, Yu W. CM-GAN: A Cross-Modal Generative Adversarial Network for Imputing Completely Missing Data in Digital Industry. IEEE Transactions on Neural Networks and Learning Systems 2024;35:2917-2926. PMID: 37352083. DOI: 10.1109/tnnls.2023.3284666.
Abstract
Multimodal data fusion analysis is essential to model the uncertainty of environment awareness in digital industry. However, due to communication failures and cyberattacks, sampled time-series data often suffer from missing values. In some extreme cases, some units are unobservable for a long time, which results in completely missing data (CDM). Many models have been proposed to impute missing data, but they cannot address the CDM issue because, in this case, no observations of the unobservable units are available. Thus, to address the CDM issue, a novel cross-modal generative adversarial network (CM-GAN) is proposed in this article. It combines cross-modal data fusion and deep adversarial generation to construct a cross-modal data generator. This generator produces long-term time-series data from the spatio-temporal modal data that are widely available in modern industrial systems and then imputes the missing values by replacing them with the generated data. To test the performance of CM-GAN, extensive experiments are conducted on a photovoltaic (PV) power output dataset. Compared with baseline models, CM-GAN generally performs better and reaches the state-of-the-art level. Moreover, ablation studies demonstrate the contribution of the cross-modal data fusion technique and justify the parameter settings of CM-GAN. In addition, prediction experiments show that the PV data recovered by CM-GAN provide additional predictive information that improves the accuracy of downstream deep learning models.
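A minimal adversarial-imputation loop in the spirit of the description above: a generator maps cross-modal context features of observable units to a missing unit's time-series window, and a discriminator judges real versus generated windows. All shapes, architectures, and hyperparameters are invented for the sketch.

```python
import torch
import torch.nn as nn

# Toy setup: context -> 24-step window generator, window -> real/fake discriminator.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 24))
D = nn.Sequential(nn.Linear(24, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

context = torch.randn(32, 64)   # cross-modal features from observable units
real = torch.randn(32, 24)      # historical windows from similar, observable units

# Discriminator step: distinguish real windows from generated ones.
fake = G(context).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool the discriminator; generated windows then stand in for the missing data.
loss_g = bce(D(G(context)), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```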
10. Jiang L, Lu W. Sports competition tactical analysis model of cross-modal transfer learning intelligent robot based on Swin Transformer and CLIP. Front Neurorobot 2023;17:1275645. PMID: 37965071. PMCID: PMC10642548. DOI: 10.3389/fnbot.2023.1275645.
Abstract
Introduction: This paper presents an Intelligent Robot Sports Competition Tactical Analysis Model that leverages multimodal perception to tackle the pressing challenge of analyzing opponent tactics in sports competitions. Sports competition analysis requires a comprehensive understanding of opponent strategies, yet traditional methods are often constrained to a single data source or modality, limiting their ability to capture the intricate details of opponent tactics.

Methods: Our system integrates the Swin Transformer and CLIP models, harnessing cross-modal transfer learning to enable holistic observation and analysis of opponent tactics. The Swin Transformer acquires knowledge about opponent action postures and behavioral patterns in basketball or football games, while the CLIP model enhances the system's comprehension of opponent tactical information by establishing semantic associations between images and text. To address potential imbalances and biases between these models, we introduce a cross-modal transfer learning technique that mitigates modal bias issues, thereby improving the model's generalization on multimodal data.

Results: Through cross-modal transfer learning, tactical information learned from images by the Swin Transformer is effectively transferred to the CLIP model, providing coaches and athletes with comprehensive tactical insights. Our method is tested and validated on the Sport UV, Sports-1M, HMDB51, and NPU RGB+D datasets. Experimental results demonstrate strong performance in terms of prediction accuracy, stability, training time, inference time, number of parameters, and computational complexity. Notably, the system outperforms other models, with an 8.47% lower prediction error (MAE) on the Kinetics dataset and a 72.86-second reduction in training time.

Discussion: The presented system is well suited for real-time sports competition assistance and analysis, offering a novel and effective approach to tactical analysis that exploits multimodal perception. By harnessing the synergies between the Swin Transformer and CLIP models, we address the limitations of traditional methods and advance the field of sports competition analysis, benefiting coaches, athletes, and sports enthusiasts alike.
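As a small illustration of the CLIP side of such a system, the snippet below scores a single frame against textual tactic descriptions with an off-the-shelf CLIP checkpoint via Hugging Face Transformers. The tactic labels are made up, and the Swin-based action features and the cross-modal transfer step of the paper are omitted.

```python
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative zero-shot scoring of a frame against candidate tactic descriptions.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.fromarray(np.uint8(np.random.rand(224, 224, 3) * 255))  # stand-in video frame
tactics = ["pick and roll", "zone defense", "fast break", "full-court press"]

inputs = processor(text=tactics, images=frame, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(tactics, probs[0].tolist())))
```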
Affiliation(s)
- Li Jiang: School of Physical Education of Yantai University, Yantai, China
11. EDMH: Efficient discrete matrix factorization hashing for multi-modal similarity retrieval. Inf Process Manag 2023. DOI: 10.1016/j.ipm.2023.103301.
12. Li D, Zhang L, Zhang J, Xie X. Convolutional Feature Descriptor Selection for Mammogram Classification. IEEE J Biomed Health Inform 2023;27:1467-1476. PMID: 37018253. DOI: 10.1109/jbhi.2022.3233535.
Abstract
Breast cancer was the most commonly diagnosed cancer among women worldwide in 2020. Recently, several deep learning-based classification approaches have been proposed to screen for breast cancer in mammograms. However, most of these approaches require additional detection or segmentation annotations, while other image-level label-based methods often pay insufficient attention to lesion areas, which are critical for diagnosis. This study designs a novel deep learning method for automatically diagnosing breast cancer in mammography that focuses on local lesion areas and utilizes only image-level classification labels. Rather than identifying lesion areas with precise annotations, we propose to select discriminative feature descriptors from feature maps. We design a novel adaptive convolutional feature descriptor selection (AFDS) structure based on the distribution of the deep activation map. Specifically, we adopt a triangle-threshold strategy to calculate a specific threshold that guides the activation map in determining which feature descriptors (local areas) are discriminative. Ablation experiments and visualization analysis indicate that the AFDS structure makes it easier for the model to learn the difference between malignant and benign/normal lesions. Furthermore, since the AFDS structure can be regarded as a highly efficient pooling structure, it can be easily plugged into most existing convolutional neural networks with negligible effort and time consumption. Experimental results on the two publicly available INbreast and CBIS-DDSM datasets indicate that the proposed method performs satisfactorily compared with state-of-the-art methods.
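The triangle-threshold selection step can be sketched with scikit-image: average the activation map over channels, threshold it, and keep only the local descriptors at positions above the threshold. The channel-averaging choice and shapes are assumptions; the paper's full AFDS pooling is not reproduced.

```python
import numpy as np
from skimage.filters import threshold_triangle

def select_descriptors(feature_map: np.ndarray) -> np.ndarray:
    """AFDS-style selection (sketch): compute a channel-averaged activation map,
    apply the triangle threshold to decide which spatial positions are
    discriminative, and keep only those local feature descriptors.
    feature_map: (C, H, W) activations from a CNN backbone."""
    activation = feature_map.mean(axis=0)                  # (H, W) deep activation map
    mask = activation > threshold_triangle(activation)     # triangle-threshold strategy
    descriptors = feature_map.reshape(feature_map.shape[0], -1).T  # (H*W, C)
    return descriptors[mask.ravel()]                       # selected local descriptors

selected = select_descriptors(np.random.rand(256, 14, 14).astype(np.float32))
print(selected.shape)   # (num_selected_positions, 256)
```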
13. Zhang X, Yang Y, Shen YW, Zhang KR, Jiang ZK, Ma LT, Ding C, Wang BY, Meng Y, Liu H. Diagnostic accuracy and potential covariates of artificial intelligence for diagnosing orthopedic fractures: a systematic literature review and meta-analysis. Eur Radiol 2022;32:7196-7216. PMID: 35754091. DOI: 10.1007/s00330-022-08956-4.
Abstract
OBJECTIVES: To systematically quantify the diagnostic accuracy and identify potential covariates affecting the performance of artificial intelligence (AI) in diagnosing orthopedic fractures.

METHODS: PubMed, Embase, Web of Science, and the Cochrane Library were systematically searched for studies on AI applications in diagnosing orthopedic fractures from inception to September 29, 2021. Pooled sensitivity and specificity and the area under the receiver operating characteristic curve (AUC) were obtained. This study was registered in the PROSPERO database prior to initiation (CRD 42021254618).

RESULTS: Thirty-nine studies were eligible for quantitative analysis. The overall pooled AUC, sensitivity, and specificity were 0.96 (95% CI 0.94-0.98), 90% (95% CI 87-92%), and 92% (95% CI 90-94%), respectively. In subgroup analyses, multicenter studies yielded higher sensitivity (92% vs. 88%) and specificity (94% vs. 91%) than single-center studies. AI demonstrated higher sensitivity with transfer learning (92% vs. 87% without) or data augmentation (92% vs. 87% without). Utilizing plain X-rays as input images for AI achieved results comparable to CT (AUC 0.96 vs. 0.96). Moreover, AI achieved results comparable to human readers (AUC 0.97 vs. 0.97) and better results than non-expert human readers (AUC 0.98 vs. 0.96; sensitivity 95% vs. 88%).

CONCLUSIONS: AI demonstrated high accuracy in diagnosing orthopedic fractures from medical images. Larger-scale studies with higher design quality are needed to validate our findings.

KEY POINTS:
• Multicenter study design, application of transfer learning, and data augmentation are closely related to improved performance of artificial intelligence models in diagnosing orthopedic fractures.
• Utilizing plain X-rays as input images for AI to diagnose fractures achieved results comparable to CT (AUC 0.96 vs. 0.96).
• AI achieved results comparable to humans (AUC 0.97 vs. 0.97) but was superior to non-expert human readers (AUC 0.98 vs. 0.96; sensitivity 95% vs. 88%) in diagnosing fractures.
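For readers unfamiliar with how per-study estimates are combined, here is the simplest fixed-effect pooling of proportions (e.g., sensitivities) on the logit scale. Diagnostic-accuracy meta-analyses such as this one normally fit bivariate random-effects models, so treat this only as a toy illustration with hypothetical study counts, not the review's actual analysis or data.

```python
import numpy as np

def pooled_proportion(events: np.ndarray, totals: np.ndarray) -> float:
    """Inverse-variance pooling of per-study proportions on the logit scale
    (fixed-effect sketch with a simple continuity correction)."""
    p = (events + 0.5) / (totals + 1.0)
    logit = np.log(p / (1 - p))
    var = 1.0 / (events + 0.5) + 1.0 / (totals - events + 0.5)
    w = 1.0 / var
    pooled_logit = np.sum(w * logit) / np.sum(w)
    return 1.0 / (1.0 + np.exp(-pooled_logit))

# Hypothetical per-study true positives and disease-positive counts.
print(pooled_proportion(np.array([90, 45, 180]), np.array([100, 50, 200])))
```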
Affiliation(s)
- Xiang Zhang, Yi Yang, Yi-Wei Shen, Ke-Rui Zhang, Li-Tai Ma, Chen Ding, Bei-Yu Wang, Yang Meng, Hao Liu: Department of Orthopedics, Orthopedic Research Institute, West China Hospital, Sichuan University, No. 37 Guo Xue Rd, Chengdu, 610041, China
- Ze-Kun Jiang: West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, 610000, China
14. Zhou P, Ying K, Wang Z, Guo D, Bai C. Self-Supervised Enhancement for Named Entity Disambiguation via Multimodal Graph Convolution. IEEE Transactions on Neural Networks and Learning Systems 2022;PP:231-245. PMID: 35560079. DOI: 10.1109/tnnls.2022.3173179.
Abstract
Named entity disambiguation (NED) finds the specific meaning of an entity mention in a particular context and links it to a target entity. With the emergence of multimedia, the modalities of content on the Internet have become more diverse, which poses difficulties for traditional NED. Moreover, the vast amount of information makes it impossible to manually label every kind of ambiguous data to train a practical NED model. In response to this situation, we present MMGraph, which uses multimodal graph convolution to aggregate visual and contextual language information for accurate entity disambiguation on short texts, and a self-supervised simple triplet network (SimTri) that learns useful representations from multimodal unlabeled data to enhance the effectiveness of NED models. We evaluated these approaches on a new dataset, MMFi, which contains multimodal supervised data and large amounts of unlabeled data. Our experiments confirm the state-of-the-art performance of MMGraph on two widely used benchmarks and on MMFi. SimTri further improves the performance of NED methods. The dataset and code are available at https://github.com/LanceZPF/NNED_MMGraph.
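The self-supervised triplet idea can be sketched as follows: a mention's text embedding is the anchor, its co-occurring image the positive, and an image from another mention the negative. Encoders, dimensions, and the margin are placeholders and do not come from the released MMGraph code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimTriSketch(nn.Module):
    """Illustrative triplet setup for self-supervised multimodal enhancement."""
    def __init__(self, d_text: int = 300, d_img: int = 512, d: int = 128):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d)
        self.img_proj = nn.Linear(d_img, d)

    def forward(self, text, pos_img, neg_img, margin: float = 0.3):
        a = F.normalize(self.text_proj(text), dim=-1)   # anchor: mention context
        p = F.normalize(self.img_proj(pos_img), dim=-1) # positive: co-occurring image
        n = F.normalize(self.img_proj(neg_img), dim=-1) # negative: image of another mention
        return F.triplet_margin_loss(a, p, n, margin=margin)

loss = SimTriSketch()(torch.randn(16, 300), torch.randn(16, 512), torch.randn(16, 512))
```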
15. Wang Z, Luo T, Li M, Zhou JT, Goh RSM, Zhen L. Evolutionary Multi-Objective Model Compression for Deep Neural Networks. IEEE Comput Intell M 2021. DOI: 10.1109/mci.2021.3084393.