1. Pan W, Zhao Z, Huang W, Zhang Z, Fu L, Pan Z, Yu J, Wu F. Video Moment Retrieval With Noisy Labels. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:6779-6791. [PMID: 36315534] [DOI: 10.1109/tnnls.2022.3212900]
Abstract
Video moment retrieval (VMR) aims to localize the target moment in an untrimmed video according to a given natural language query. Existing algorithms typically rely on clean annotations to train their models; however, annotations produced by human labelers may contain considerable noise, so VMR models are often not well trained in practice. In this article, we present a simple yet effective video moment retrieval framework with a bottom-up schema, which is trained end-to-end and is robust to noisy labels. Specifically, we extract multimodal features with syntactic graph convolutional networks and multihead attention layers, which are fused by cross gates and a bilinear approach. Feature pyramid networks are then constructed to encode rich scene relationships and capture high-level semantics. Furthermore, to mitigate the effects of noisy annotations, we devise multilevel losses at two levels: a frame-level loss that improves noise tolerance and an instance-level loss that reduces the adverse effects of negative instances. At the frame level, we adopt Gaussian smoothing to treat noisy labels as soft labels through partial fitting. At the instance level, we exploit a pair of structurally identical models that teach each other over the training iterations. The resulting robust video moment retrieval model significantly outperforms state-of-the-art approaches on the standard public datasets ActivityCaption and textually annotated cooking scene (TACoS). We also evaluate the proposed approach under different levels of manual annotation noise to further demonstrate its effectiveness.
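To make the frame-level idea concrete, the following minimal Python sketch turns a possibly noisy [start, end] moment annotation into Gaussian-smoothed soft labels over frames, so that boundary errors are only partially fitted. This illustrates the general technique rather than the authors' implementation; the frame count, annotation, and bandwidth rule are hypothetical.

```python
import numpy as np

def gaussian_soft_labels(num_frames, start, end, sigma_scale=0.25):
    """Turn a (possibly noisy) [start, end] frame annotation into soft labels.

    Frames near the annotated moment's centre get targets close to 1,
    frames far away get targets close to 0, so small boundary errors in
    the annotation are only partially fitted.
    """
    t = np.arange(num_frames, dtype=np.float64)
    center = 0.5 * (start + end)
    sigma = max(sigma_scale * (end - start), 1.0)  # bandwidth tied to moment length
    return np.exp(-0.5 * ((t - center) / sigma) ** 2)

# Hypothetical example: 120-frame video, annotated moment spans frames 40-60.
soft = gaussian_soft_labels(120, start=40, end=60)
print(soft[50], soft[40], soft[0])  # ~1.0 at the centre, lower at the boundary, ~0 far away
```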
2. Wu X, Zhang X, Feng X, Bordallo Lopez M, Liu L. Audio-Visual Kinship Verification: A New Dataset and a Unified Adaptive Adversarial Multimodal Learning Approach. IEEE Transactions on Cybernetics 2024; 54:1523-1536. [PMID: 36417714] [DOI: 10.1109/tcyb.2022.3220040]
Abstract
Facial kinship verification refers to automatically determining whether two people have a kin relation from their faces. It has become a popular research topic due to its potential practical applications. Over the past decade, many efforts have been devoted to improving verification performance from human faces alone, without other biometric information such as the speaking voice. In this article, to interpret and benefit from multiple modalities, we propose for the first time to combine human faces and voices to verify kinship, which we refer to as audio-visual kinship verification. We first establish a comprehensive audio-visual kinship dataset, called TALKIN-Family, which consists of familial talking facial videos under various scenarios. Based on this dataset, we present an extensive evaluation of kinship verification from faces and voices. In particular, we propose a deep-learning-based fusion method, called unified adaptive adversarial multimodal learning (UAAML), which consists of an adversarial network and an attention module built on unified multimodal features. Experiments show that audio (voice) information is complementary to facial features and useful for the kinship verification problem. Furthermore, the proposed fusion method outperforms baseline methods. In addition, we evaluate human verification ability on a subset of TALKIN-Family; humans achieve higher accuracy when they have access to both faces and voices, yet machine-learning methods still outperform them effectively and efficiently. Finally, we outline future work and research opportunities with the TALKIN-Family dataset.
3. Huang X, Gong H. A Dual-Attention Learning Network With Word and Sentence Embedding for Medical Visual Question Answering. IEEE Transactions on Medical Imaging 2024; 43:832-845. [PMID: 37812550] [DOI: 10.1109/tmi.2023.3322868]
Abstract
Research in medical visual question answering (MVQA) can contribute to the development of computer-aided diagnosis. MVQA aims to predict accurate and convincing answers based on given medical images and associated natural language questions. This task requires extracting medical-knowledge-rich feature content and developing a fine-grained understanding of it, so constructing an effective feature extraction and understanding scheme is key to modeling. Existing MVQA question extraction schemes mainly focus on word information, ignoring medical information in the text, such as medical concepts and domain-specific terms. Meanwhile, some visual and textual feature understanding schemes cannot effectively capture the correlation between regions and keywords for reasonable visual reasoning. In this study, a dual-attention learning network with word and sentence embedding (DALNet-WSE) is proposed. We design a module, transformer with sentence embedding (TSE), to extract a double embedding representation of questions containing keywords and medical information. A dual-attention learning (DAL) module consisting of self-attention and guided attention is proposed to model intensive intramodal and intermodal interactions. With multiple DAL modules (DALs), learning visual and textual co-attention increases the granularity of understanding and improves visual reasoning. Experimental results on the ImageCLEF 2019 VQA-MED (VQA-MED 2019) and VQA-RAD datasets demonstrate that the proposed method outperforms previous state-of-the-art methods. According to the ablation studies and Grad-CAM maps, DALNet-WSE extracts rich textual information and has strong visual reasoning ability.
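As an illustration of the guided-attention component described above, the following sketch shows question tokens attending over image-region features with scaled dot-product attention. It is a minimal numpy sketch under assumed shapes and random projection matrices, not the DALNet-WSE code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def guided_attention(question_feats, region_feats, d_k=64, rng=np.random.default_rng(0)):
    """Question tokens (queries) attend over image regions (keys/values)."""
    dq, dv = question_feats.shape[1], region_feats.shape[1]
    Wq = rng.standard_normal((dq, d_k)) / np.sqrt(dq)   # projection matrices
    Wk = rng.standard_normal((dv, d_k)) / np.sqrt(dv)   # (learned in practice,
    Wv = rng.standard_normal((dv, d_k)) / np.sqrt(dv)   #  random here for illustration)
    Q, K, V = question_feats @ Wq, region_feats @ Wk, region_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))              # (n_tokens, n_regions)
    return attn @ V                                      # question features enriched by image regions

# Hypothetical shapes: 12 question tokens (300-d), 36 image regions (2048-d).
out = guided_attention(np.random.rand(12, 300), np.random.rand(36, 2048))
print(out.shape)  # (12, 64)
```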
4. Text-based person search via local-relational-global fine grained alignment. Knowl Based Syst 2023. [DOI: 10.1016/j.knosys.2023.110253]
5. Zhang L, Wu X. Latent Space Semantic Supervision Based on Knowledge Distillation for Cross-Modal Retrieval. IEEE Transactions on Image Processing 2022; 31:7154-7164. [PMID: 36355734] [DOI: 10.1109/tip.2022.3220051]
Abstract
As an important field in information retrieval, fine-grained cross-modal retrieval has received great attention from researchers. Existing fine-grained cross-modal retrieval methods have improved the capture of the fine-grained interplay between vision and language, but they fail to consider the fine-grained correspondences between features in the image latent space and the text latent space, which may lead to inaccurate inference of intra-modal relations or false alignment of cross-modal information. Considering that object detection can provide fine-grained correspondences between image region features and the corresponding semantic features, this paper proposes a novel latent space semantic supervision model based on knowledge distillation (L3S-KD). It trains classifiers supervised by the fine-grained correspondences obtained from an object detection model, using knowledge distillation for fine-grained alignment in the image latent space and the labels of objects and attributes for fine-grained alignment in the text latent space. Compared with existing fine-grained correspondence matching methods, L3S-KD can learn more accurate semantic similarities for local fragments in image-text pairs. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate that L3S-KD consistently outperforms state-of-the-art methods for image-text matching.
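The knowledge-distillation supervision mentioned above can be illustrated with the standard temperature-scaled soft-label loss, in which the detector's class distribution acts as the teacher for the latent-space classifier. This generic formulation is an assumption for illustration and not necessarily the exact loss used in L3S-KD.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0, eps=1e-12):
    """Cross-entropy between the teacher's softened distribution (e.g. from an
    object detector's classifier) and the student's softened prediction."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return float(-(p_teacher * np.log(p_student + eps)).sum(axis=-1).mean() * T * T)

# Hypothetical logits for 8 image regions over 100 object/attribute classes.
rng = np.random.default_rng(0)
print(distillation_loss(rng.standard_normal((8, 100)), rng.standard_normal((8, 100))))
```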
6. Li Z, Lu H, Fu H, Gu G. Parallel Learned Generative Adversarial Network with Multi-path Subspaces for Cross-Modal Retrieval. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.11.087]
7. Design of Neural Network Model for Cross-Media Audio and Video Score Recognition Based on Convolutional Neural Network Model. Computational Intelligence and Neuroscience 2022; 2022:4626867. [PMID: 35733575] [PMCID: PMC9208963] [DOI: 10.1155/2022/4626867]
Abstract
In this paper, a residual convolutional neural network is used to extract note features from music score images, alleviating model degradation; multiscale feature fusion then combines feature information from different levels of the same feature map to strengthen the model's feature representation ability. A network composed of a bidirectional simple recurrent unit and a connectionist temporal classification loss is used to recognize notes; it parallelizes a large amount of computation, which speeds up training convergence and removes the need for strict label alignment, thus lowering the dataset requirements. To address the insufficiency of existing common-subspace cross-modal retrieval methods in mining local consistency within modalities, a cross-modal retrieval method incorporating graph convolution is proposed. The K-nearest neighbor algorithm is used to construct modal graphs for samples of different modalities, the original features of samples from different modalities are encoded by a symmetric graph convolutional encoding network and a symmetric multilayer fully connected encoding network, and the encoded features are fused as input. Intramodal semantic constraints and intermodal modality-invariant constraints are jointly optimized in the common subspace to learn highly locally consistent and semantically consistent common representations for samples from different modalities. The error values of the experimental results illustrate the effect of parameters such as the number of iterations and the number of neurons on the network. To show more precisely that the generated music sequences closely resemble the original ones, the generated sequences are also framed and their spectra and spectrograms are produced; the accuracy of the experiment is illustrated by comparing these plots, and genre classification is performed on the generated music to show that the network can generate music of different genres.
8. Zhang L, Gao X. Transfer Adaptation Learning: A Decade Survey. IEEE Transactions on Neural Networks and Learning Systems 2022; PP:23-44. [PMID: 35727786] [DOI: 10.1109/tnnls.2022.3183326]
Abstract
The world we see is ever-changing, and it always changes with people, things, and the environment. A domain refers to the state of the world at a certain moment. A research problem is characterized as transfer adaptation learning (TAL) when it requires knowledge correspondence between different moments/domains. TAL aims to build models that can perform tasks in a target domain by learning knowledge from a semantically related source domain with a different distribution. It is an energetic research field of increasing influence and importance, exhibiting an explosive publication trend. This article surveys the advances in TAL methodologies over the past decade, and the technical challenges and essential problems of TAL are observed and discussed with deep insights and new perspectives. Broad families of TAL solutions created by researchers are identified, that is, instance reweighting adaptation, feature adaptation, classifier adaptation, deep network adaptation, and adversarial adaptation, which go beyond the early semisupervised and unsupervised split. The survey helps researchers rapidly but comprehensively understand and identify the research foundation, research status, theoretical limitations, future challenges, and understudied issues (universality, interpretability, and credibility) to be addressed in the field toward generalizable representation in open-world scenarios.
9. He P, Wang M, Tu D, Wang Z. Dual discriminant adversarial cross-modal retrieval. Appl Intell 2022. [DOI: 10.1007/s10489-022-03653-7]
10. Xu X, Lin K, Yang Y, Hanjalic A, Shen HT. Joint Feature Synthesis and Embedding: Adversarial Cross-Modal Retrieval Revisited. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:3030-3047. [PMID: 33332264] [DOI: 10.1109/tpami.2020.3045530]
Abstract
Recently, the generative adversarial network (GAN) has shown a strong ability to model data distributions via adversarial learning. Cross-modal GANs, which attempt to use this power to model the cross-modal joint distribution and learn compatible cross-modal features, are becoming a research hotspot. However, existing cross-modal GAN approaches typically 1) require labeled multimodal data at massive labor cost to establish cross-modal correlation; 2) rely on the vanilla GAN model, which results in an unstable training procedure and meaningless synthetic features; and 3) lack extensibility for retrieving cross-modal data of new classes. In this article, we revisit adversarial learning in existing cross-modal GAN methods and propose Joint Feature Synthesis and Embedding (JFSE), a novel method that jointly performs multimodal feature synthesis and common embedding space learning to overcome the above three shortcomings. Specifically, JFSE deploys two coupled conditional Wasserstein GAN modules for the input data of the two modalities to synthesize meaningful and correlated multimodal features under the guidance of the word embeddings of class labels. Moreover, three distribution alignment schemes with advanced cycle-consistency constraints are proposed to preserve semantic compatibility and enable knowledge transfer in the common embedding space for both the true and the synthetic cross-modal features. These components not only help to learn a more effective common embedding space that captures the cross-modal correlation but also facilitate knowledge transfer to multimodal data of new classes. Extensive experiments are conducted on four widely used cross-modal datasets, and comparisons with more than ten state-of-the-art approaches show that JFSE achieves remarkable accuracy improvements on both standard retrieval and the newly explored zero-shot and generalized zero-shot retrieval tasks.
11. Application of Radar Solutions for the Purpose of Bird Tracking Systems Based on Video Observation. Sensors 2022; 22:3660. [PMID: 35632076] [PMCID: PMC9146798] [DOI: 10.3390/s22103660]
Abstract
Wildlife hazard management is nowadays a very serious problem, mostly at airports and wind farms. If ignored, it may lead to repercussions for human safety, ecology, and economics. One approach that is widely implemented at small and medium-size airports, as well as on wind turbines, is based on stereo-vision. However, to provide long-term observations that allow the determination of hot spots of bird activity and the forecasting of future events, a robust tracking algorithm is required. The aim of this paper is to review tracking algorithms widely used in radar science and assess the possibility of applying them to tracking birds with a stereo-vision system. Through a survey of related works and simulations, we identified state-of-the-art algorithms with the potential for implementation in a stereo-vision system: the Kalman filter, nearest neighbour, joint probabilistic data association, and interacting multiple model. These algorithms have been implemented and simulated in the proposed case study.
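Among the algorithms listed above, the constant-velocity Kalman filter is the basic building block; the sketch below runs one predict/update cycle per frame on 2-D image-plane detections such as those a stereo-vision system might produce. The motion model, noise levels, and measurements are hypothetical.

```python
import numpy as np

class ConstantVelocityKalman:
    """2-D constant-velocity Kalman filter: state = [x, y, vx, vy]."""

    def __init__(self, dt=1.0, q=1e-2, r=1.0):
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)   # motion model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)    # we only observe position
        self.Q = q * np.eye(4)                            # process noise
        self.R = r * np.eye(2)                            # measurement noise
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def step(self, z):
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # update with measurement z = [x, y]
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]

# Hypothetical noisy detections of a bird moving diagonally across the image.
kf = ConstantVelocityKalman()
for t in range(5):
    print(kf.step([10.0 * t + np.random.randn(), 5.0 * t + np.random.randn()]))
```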
12. Xu X, Lin K, Gao L, Lu H, Shen HT, Li X. Learning Cross-Modal Common Representations by Private-Shared Subspaces Separation. IEEE Transactions on Cybernetics 2022; 52:3261-3275. [PMID: 32780706] [DOI: 10.1109/tcyb.2020.3009004]
Abstract
Due to the inconsistent distributions and representations of different modalities (e.g., images and texts), it is very challenging to correlate such heterogeneous data. A standard solution is to construct one common subspace, where the common representations of different modalities are generated to bridge the heterogeneity gap. Existing methods based on common representation learning mostly adopt a less effective two-stage paradigm: first, generating separate representations for each modality by exploiting the modality-specific properties as the complementary information, and then capturing the cross-modal correlation in the separate representations for common representation learning. Moreover, these methods usually neglect that there may exist interference in the modality-specific properties, that is, the unrelated objects and background regions in images or the noisy words and incorrect sentences in the text. In this article, we hypothesize that explicitly modeling the interference within each modality can improve the quality of common representation learning. To this end, we propose a novel model private-shared subspaces separation (P3S) to explicitly learn different representations that are partitioned into two kinds of subspaces: 1) the common representations that capture the cross-modal correlation in a shared subspace and 2) the private representations that model the interference within each modality in two private subspaces. By employing the orthogonality constraints between the shared subspace and the private subspaces during the one-stage joint learning procedure, our model is able to learn more effective common representations for different modalities in the shared subspace by fully excluding the interference within each modality. Extensive experiments conducted on cross-modal retrieval verify the advantages of our P3S method compared with 15 state-of-the-art methods on four widely used cross-modal datasets.
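A common way to encode the orthogonality constraint between the shared and private subspaces is to penalize the squared Frobenius norm of the cross-covariance between the two sets of representations. The sketch below illustrates that penalty; it is an assumption for illustration rather than the exact P3S loss.

```python
import numpy as np

def orthogonality_penalty(shared, private):
    """Squared Frobenius norm of shared^T private: zero when every shared
    dimension is orthogonal to every private dimension (batch-wise)."""
    shared = shared - shared.mean(axis=0)     # centre each representation
    private = private - private.mean(axis=0)
    return float(np.linalg.norm(shared.T @ private, ord="fro") ** 2)

# Hypothetical batch of 32 samples, 128-d shared and 128-d private codes.
rng = np.random.default_rng(0)
s, p = rng.standard_normal((32, 128)), rng.standard_normal((32, 128))
print(orthogonality_penalty(s, p))        # penalty to be minimised during training
print(orthogonality_penalty(s, s))        # much larger when the codes overlap
```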
13. Dual Modality Collaborative Learning for Cross-Source Remote Sensing Retrieval. Remote Sensing 2022. [DOI: 10.3390/rs14061319]
Abstract
Content-based remote sensing (RS) image retrieval (CBRSIR) is a critical way to organize high-resolution RS (HRRS) images in the current big data era. The increasing volume of HRRS images from different satellites and sensors has drawn more attention to the cross-source CBRSIR (CS-CBRSIR) problem. Due to data drift, one crucial problem in CS-CBRSIR is the modality discrepancy. Most existing methods address this issue by finding a common feature space for the various HRRS images, in which their similarity relations can be measured directly to obtain cross-source retrieval results. This is feasible and reasonable; however, the specific information corresponding to HRRS images from different sources is usually ignored, which limits retrieval performance. To overcome this limitation, we develop a new model for CS-CBRSIR named dual modality collaborative learning (DMCL). To fully explore the specific information in diverse HRRS images, DMCL first introduces ResNet50 as the feature extractor. A common space mutual learning module then maps the specific features into a common space, where the modality discrepancy is reduced in terms of both the features and their distributions. Finally, to supplement the common features with specific knowledge, we develop modality transformation and dual-modality feature learning modules, whose function is to transmit the specific knowledge from different sources mutually and to fuse the specific and common features adaptively. Comprehensive experiments are conducted on a public dataset. Compared with many existing methods, our DMCL performs more strongly; these encouraging results indicate that the proposed DMCL is useful for CS-CBRSIR tasks.
14. Zhen L, Hu P, Peng X, Goh RSM, Zhou JT. Deep Multimodal Transfer Learning for Cross-Modal Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:798-810. [PMID: 33090960] [DOI: 10.1109/tnnls.2020.3029181]
Abstract
Cross-modal retrieval (CMR) enables a flexible retrieval experience across different modalities (e.g., texts versus images), allowing us to benefit maximally from the abundance of multimedia data. Existing deep CMR approaches commonly require a large amount of labeled data for training to achieve high performance. However, it is time-consuming and expensive to annotate multimedia data manually. Thus, how to transfer valuable knowledge from existing annotated data to new data, especially from known categories to new categories, becomes attractive for real-world applications. To this end, we propose a deep multimodal transfer learning (DMTL) approach that transfers knowledge from previously labeled categories (source domain) to improve retrieval performance on unlabeled new categories (target domain). Specifically, we employ a joint learning paradigm to transfer knowledge by assigning a pseudolabel to each target sample. During training, the pseudolabel is iteratively updated and passed through our model in a self-supervised manner. At the same time, to reduce the domain discrepancy of different modalities, we construct multiple modality-specific neural networks to learn a shared semantic space for different modalities by enforcing the compactness of homoinstance samples and the scatter of heteroinstance samples. Our method differs markedly from most existing transfer learning approaches: previous works usually assume that the source and target domains have the same label set, whereas our method considers a more challenging multimodal learning situation where the label sets of the two domains are different or even disjoint. Experimental studies on four widely used benchmarks validate the effectiveness of the proposed method in multimodal transfer learning and demonstrate its superior performance in CMR compared with 11 state-of-the-art methods.
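The pseudolabeling step described above can be illustrated by assigning each unlabeled target sample the class of its nearest source prototype in the shared semantic space. The cosine-similarity rule and prototype construction below are assumptions for illustration, not the DMTL procedure itself.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def assign_pseudolabels(target_embeddings, source_embeddings, source_labels):
    """Label each target sample with the class whose source prototype
    (mean embedding) is most cosine-similar in the shared space."""
    classes = np.unique(source_labels)
    prototypes = np.stack([source_embeddings[source_labels == c].mean(axis=0)
                           for c in classes])
    sims = l2_normalize(target_embeddings) @ l2_normalize(prototypes).T
    return classes[np.argmax(sims, axis=1)]

# Hypothetical 64-d shared-space embeddings.
rng = np.random.default_rng(0)
src, src_y = rng.standard_normal((100, 64)), rng.integers(0, 5, 100)
tgt = rng.standard_normal((20, 64))
print(assign_pseudolabels(tgt, src, src_y))   # pseudolabels re-estimated iteratively in practice
```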
15. Ji Z, Wang H, Han J, Pang Y. SMAN: Stacked Multimodal Attention Network for Cross-Modal Image-Text Retrieval. IEEE Transactions on Cybernetics 2022; 52:1086-1097. [PMID: 32386178] [DOI: 10.1109/tcyb.2020.2985716]
Abstract
This article focuses on the task of cross-modal image-text retrieval, which has been an interdisciplinary topic in both the computer vision and natural language processing communities. Existing global representation alignment-based methods fail to pinpoint the semantically meaningful portions of images and texts, while local representation alignment schemes suffer from the huge computational burden of exhaustively aggregating the similarity of visual fragments and textual words. In this article, we propose a stacked multimodal attention network (SMAN) that makes use of a stacked multimodal attention mechanism to exploit the fine-grained interdependencies between image and text, thereby mapping the aggregation of attentive fragments into a common space for measuring cross-modal similarity. Specifically, we sequentially employ intramodal information and multimodal information as guidance to perform multiple-step attention reasoning so that the fine-grained correlation between image and text can be modeled. As a consequence, we are capable of discovering the semantically meaningful visual regions or words in a sentence, which contributes to measuring the cross-modal similarity more precisely. Moreover, we present a novel bidirectional ranking loss that pulls matched multimodal instances closer together, allowing us to make full use of pairwise supervised information to preserve the manifold structure of heterogeneous pairwise data. Extensive experiments on two benchmark datasets demonstrate that SMAN consistently yields competitive performance compared to state-of-the-art methods.
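A bidirectional ranking loss of the kind mentioned above can be written as a hinge loss over an image-text similarity matrix whose diagonal holds the matched pairs, penalizing negatives in both retrieval directions. The margin and the exact formulation below are illustrative assumptions rather than SMAN's loss.

```python
import numpy as np

def bidirectional_ranking_loss(sim, margin=0.2):
    """sim[i, j] = similarity of image i and text j; matched pairs lie on the
    diagonal. Penalises negatives that come within `margin` of the positive,
    in both image-to-text and text-to-image directions."""
    n = sim.shape[0]
    pos = np.diag(sim)
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])  # rank texts for each image
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])  # rank images for each text
    mask = 1.0 - np.eye(n)                                   # ignore the positives themselves
    return float(((cost_i2t + cost_t2i) * mask).sum() / n)

# Hypothetical 4x4 similarity matrix for a batch of matched image-text pairs.
rng = np.random.default_rng(0)
sim = rng.uniform(0, 1, (4, 4))
np.fill_diagonal(sim, 0.9)            # make matched pairs the most similar
print(bidirectional_ranking_loss(sim))
```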
16.
Abstract
Cross-modal retrieval aims to search samples of one modality via queries from other modalities, a hot issue in the multimedia community. However, two main challenges, the heterogeneity gap and semantic interaction across different modalities, have not been solved effectively. Reducing the heterogeneity gap can improve cross-modal similarity measurement, while modeling cross-modal semantic interaction can capture semantic correlations more accurately. To this end, this paper presents a novel end-to-end framework called Dual Attention Generative Adversarial Network (DA-GAN). This technique is an adversarial semantic representation model with a dual attention mechanism, i.e., intra-modal attention and inter-modal attention. Intra-modal attention is used to focus on the important semantic features within a modality, while inter-modal attention explores the semantic interaction between different modalities and then represents the high-level semantic correlation more precisely. A dual adversarial learning strategy is designed to generate modality-invariant representations, which can reduce the cross-modal heterogeneity efficiently. Experiments on three commonly used benchmarks show that DA-GAN performs better than these competitors.
17. Semantic-guided autoencoder adversarial hashing for large-scale cross-modal retrieval. Complex Intell Syst 2022. [DOI: 10.1007/s40747-021-00615-3]
Abstract
With the vigorous development of mobile Internet technology and the popularization of smart devices, the amount of multimedia data has exploded and its forms have become more and more diversified. People's demand for information is no longer satisfied by single-modal data retrieval, and cross-modal retrieval has become a research hotspot in recent years. Due to the strong feature learning ability of deep learning, cross-modal deep hashing has been studied extensively. However, the similarity of different modalities is difficult to measure directly because of their different distributions and representations, so it is urgent to eliminate the modality gap and improve retrieval accuracy. Some previous work has introduced GANs into cross-modal hashing to reduce semantic differences between modalities, but most existing GAN-based cross-modal hashing methods suffer from issues such as unstable network training and vanishing gradients, which hamper the elimination of modality differences. To address this, this paper proposes a novel semantic-guided autoencoder adversarial hashing method for cross-modal retrieval (SAAH). First, two kinds of adversarial autoencoder networks, under the guidance of semantic multi-labels, maximize the semantic relevance of instances and maintain cross-modal invariance. Second, under the supervision of semantics, the adversarial module guides the feature learning process and maintains the modality relations. In addition, to maintain the inter-modal correlation of all similar pairs, two types of loss functions are used to maintain the similarity. To verify the effectiveness of the proposed method, extensive experiments were conducted on three widely used cross-modal datasets (MIRFLICKR, NUS-WIDE, and MS COCO); compared with several representative advanced cross-modal retrieval methods, SAAH achieved leading retrieval performance.
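On the retrieval side, cross-modal hashing methods like the one above typically binarize the learned real-valued codes and rank candidates by Hamming distance. The sketch below shows that step only; the code length and data are hypothetical, and the SAAH training procedure is not reproduced.

```python
import numpy as np

def binarize(codes):
    """Map real-valued codes (e.g. tanh outputs) to {0, 1} hash bits."""
    return (codes > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Return database indices sorted by Hamming distance to the query."""
    dists = (binarize(db_codes) != binarize(query_code)).sum(axis=1)
    return np.argsort(dists), dists

# Hypothetical 32-bit codes: one text query against an image database of 1000 items.
rng = np.random.default_rng(0)
text_query = np.tanh(rng.standard_normal(32))
image_db = np.tanh(rng.standard_normal((1000, 32)))
order, dists = hamming_rank(text_query, image_db)
print(order[:5], dists[order[:5]])    # top-5 retrieved images and their distances
```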
18. Liu H, Ko YC. Cross-Media Intelligent Perception and Retrieval Analysis Application Technology Based on Deep Learning Education. Int J Pattern Recogn 2021. [DOI: 10.1142/s0218001421520236]
Abstract
This paper studies cross-media intelligent perception and retrieval analysis technology based on deep learning for education, motivated by the inability of traditional cross-media retrieval technology to obtain retrieval information in a timely and accurate manner in the complex knowledge-learning environments of artificial intelligence. Based on cross-media theory, the paper analyzes the preprocessing of cross-media data, moves from single media forms such as text, voice, image, and video to a cross-media integration covering network space and physical space, and designs a cross-media intelligent perception platform system. Using a multi-kernel canonical correlation analysis algorithm, it develops a new joint learning framework for cross-modal retrieval based on joint feature selection and subspace learning, and the algorithm is tested experimentally. The results show that the retrieval analysis technology is significantly better than traditional media retrieval technology: it can effectively identify text semantics and classify visual images, and it better maintains the relevance of data content and the consistency of semantic information, which has important reference value for cross-media applications.
Affiliation(s)
- Hean Liu: College of Science, Hunan City University, Yiyang Hunan 413000, P. R. China; Graduate School of Sehan University, Chonnam 58447, Korea
- Young Chun Ko: Department of Teaching Profession, Sehan University, Chonnam 58447, Korea
19. Xu J, Yao Y, Xu B, Li Y, Su Z. Unsupervised learning of cross-modal mappings in multi-omics data for survival stratification of gastric cancer. Future Oncol 2021; 18:215-230. [PMID: 34854737] [DOI: 10.2217/fon-2021-1059]
Abstract
Aims: This study presents a survival stratification model based on multi-omics integration using bidirectional deep neural networks (BiDNNs) in gastric cancer. Methods: Based on the survival-related representation features yielded by BiDNNs through integrating transcriptomics and epigenomics data, K-means clustering analysis was performed to cluster tumor samples into different survival subgroups. The BiDNNs-based model was validated using tenfold cross-validation and in two independent confirmation cohorts. Results: Using the BiDNNs-based survival stratification model, patients were grouped into two survival subgroups with log-rank p-value = 9.05E-05. The subgroup classification was robustly validated in tenfold cross-validation (C-index = 0.65 ± 0.02) and in two confirmation cohorts (E-GEOD-26253, C-index = 0.609; E-GEOD-62254, C-index = 0.706). Conclusion: We propose and validate a robust and stable BiDNN-based survival stratification model in gastric cancer.
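The clustering step can be sketched as plain K-means with K = 2 on the integrated representation features; the BiDNN feature extraction and the log-rank/C-index evaluation are outside the scope of this illustration, and the feature matrix below is synthetic.

```python
import numpy as np

def kmeans(X, k=2, iters=100, seed=0):
    """Plain Lloyd's algorithm: cluster samples (rows of X) into k groups."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.stack([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Hypothetical 100-patient matrix of integrated representation features (50-d).
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 1, (60, 50)), rng.normal(2, 1, (40, 50))])
subgroup = kmeans(features, k=2)      # 0/1 survival-subgroup assignment per patient
print(np.bincount(subgroup))
```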
Affiliation(s)
- Jianmin Xu: Department of Gastrointestinal Surgery, Affiliated Hospital of Jiangnan University, Wuxi City, Jiangsu Province, 214122, China
- Yueping Yao: Department of Liver Disease, Wuxi No. 5 People's Hospital Affiliated to Jiangnan University, 1215 Guangrui Road, Wuxi Liangxi District, Wuxi City, Jiangsu Province, 214011, China
- Binghua Xu: Department of Gastrointestinal Surgery, Affiliated Hospital of Jiangnan University, Wuxi City, Jiangsu Province, 214122, China
- Yipeng Li: PerMed Biomedicine Institute, Shanghai 201318, China
- Zhijian Su: Department of Gastrointestinal Surgery, Affiliated Hospital of Jiangnan University, Wuxi City, Jiangsu Province, 214122, China
20. Huang X, Peng Y, Wen Z. Visual-Textual Hybrid Sequence Matching for Joint Reasoning. IEEE Transactions on Cybernetics 2021; 51:5692-5705. [PMID: 31905158] [DOI: 10.1109/tcyb.2019.2956975]
Abstract
Reasoning is one of the central topics in artificial intelligence. As an important reasoning paradigm, entailment recognition, which judges whether a hypothesis can be inferred from given premises, has attracted much research interest. However, existing research mainly focuses on text-based analysis, that is, recognizing textual entailment (RTE), which limits its depth and breadth. Actually, human knowledge and inference span different sensory channels such as language and vision, each offering a unique perspective with complementary reasoning cues. It is therefore significant to extend entailment recognition research to cross-media scenarios, that is, recognizing cross-media entailment (RCE). This article focuses on one representative RCE task, visual-textual reasoning, and proposes the visual-textual hybrid sequence matching (VHSM) approach. VHSM can reason from image-text premises to text hypotheses, and its contributions are: 1) visual-textual hybrid multicontext inference is proposed to address RCE by matching with hybrid context embeddings, along with adaptive gated aggregation to obtain the final prediction results, fully exploiting complementary visual-textual cue interaction during joint reasoning; 2) memory attention-based context embedding is proposed to sequentially encode hybrid context embeddings, with memory attention networks comparing neighboring time steps; this captures the important memory dimensions by coefficient assignment and fully exploits the visual-textual context correlation; and 3) a cross-task and visual-textual transfer strategy is further proposed to enrich correlation training information for boosting reasoning accuracy, transferring knowledge not only from the cross-media retrieval task to RCE but also between corresponding text and image premises. Experimental results on the visual-textual entailment recognition task on the SNLI dataset verify the effectiveness of VHSM.
21. Liu Y, Zhang X, Huang F, Cheng L, Li Z. Adversarial Learning With Multi-Modal Attention for Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:3894-3908. [PMID: 32833656] [DOI: 10.1109/tnnls.2020.3016083]
Abstract
Visual question answering (VQA) has been proposed as a challenging task and has attracted extensive research attention. It aims to learn a joint representation of the question-image pair for answer inference. Most existing methods focus on exploring the multi-modal correlation between the question and the image to learn the joint representation. However, these methods do not fully capture the answer-related information, with the result that the learned representation is ineffective at reflecting the answer to the question. To tackle this problem, we propose a novel model, adversarial learning with multi-modal attention (ALMA), for VQA. An adversarial learning-based framework is proposed to learn a joint representation that effectively reflects the answer-related information. Specifically, multi-modal attention with a Siamese similarity learning method is designed to build two embedding generators, i.e., question-image embedding and question-answer embedding. Adversarial learning is then conducted as an interplay between the two embedding generators and an embedding discriminator. The generators aim to produce two modality-invariant representations for the question-image and question-answer pairs, whereas the embedding discriminator aims to discriminate between the two representations. Both the multi-modal attention module and the adversarial networks are integrated into an end-to-end unified framework to infer the answer. Experiments performed on three benchmark datasets confirm the favorable performance of ALMA compared with state-of-the-art approaches.
22. Li W, Huan W, Hou B, Tian Y, Zhang Z, Song A. Can Emotion be Transferred? – A Review on Transfer Learning for EEG-Based Emotion Recognition. IEEE Trans Cogn Dev Syst 2021. [DOI: 10.1109/tcds.2021.3098842]
23. Huang Z, Zhou JT, Zhu H, Zhang C, Lv J, Peng X. Deep Spectral Representation Learning From Multi-View Data. IEEE Transactions on Image Processing 2021; 30:5352-5362. [PMID: 34081580] [DOI: 10.1109/tip.2021.3083072]
Abstract
Multi-view representation learning (MvRL) aims to learn a consensus representation from diverse sources or domains to facilitate downstream tasks such as clustering, retrieval, and classification. Due to the limited representative capacity of the adopted shallow models, most existing MvRL methods may yield unsatisfactory results, especially when the labels of the data are unavailable. To enjoy the representative capacity of deep learning, this paper proposes a novel multi-view unsupervised representation learning method, termed Multi-view Laplacian Network (MvLNet), which could be the first deep version of multi-view spectral representation learning. Note that such an attempt is nontrivial because simply combining Laplacian embedding (i.e., spectral representation) with neural networks leads to trivial solutions. To solve this problem, MvLNet enforces an orthogonal constraint and reformulates it as a layer with the help of the Cholesky decomposition. The orthogonal layer is stacked on the embedding network so that a common space can be learned for the consensus representation. Compared with numerous recently proposed approaches, extensive experiments on seven challenging datasets demonstrate the effectiveness of our method in three multi-view tasks, including clustering, recognition, and retrieval. The source code can be found at www.pengxi.me.
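The Cholesky-based orthogonal layer mentioned above can be illustrated as follows: given a batch of embeddings, factor the batch Gram matrix with a Cholesky decomposition and right-multiply by the inverse transpose so that the output dimensions are orthonormal over the batch. This follows the general spectral-embedding recipe and is not the MvLNet source code.

```python
import numpy as np

def orthogonalize(Y, eps=1e-6):
    """Map batch embeddings Y (n x d) to Y_tilde with (1/n) Y_tilde^T Y_tilde = I,
    using the Cholesky factor of the batch Gram matrix."""
    n, d = Y.shape
    gram = (Y.T @ Y) / n + eps * np.eye(d)   # regularise for numerical stability
    L = np.linalg.cholesky(gram)             # gram = L L^T
    return Y @ np.linalg.inv(L).T            # now (1/n) * out^T out = I

# Hypothetical 256-sample batch of 16-d embeddings from the embedding network.
rng = np.random.default_rng(0)
Y = rng.standard_normal((256, 16)) @ rng.standard_normal((16, 16))  # correlated dims
Y_orth = orthogonalize(Y)
print(np.round(Y_orth.T @ Y_orth / len(Y_orth), 2))   # approximately the identity matrix
```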
24. Bridging multimedia heterogeneity gap via Graph Representation Learning for cross-modal retrieval. Neural Netw 2020; 134:143-162. [PMID: 33310483] [DOI: 10.1016/j.neunet.2020.11.011]
Abstract
Information retrieval across different modalities is becoming a significant issue with many promising applications. However, the inconsistent feature representations of various multimedia data cause the "heterogeneity gap" among modalities, which is a challenge for cross-modal retrieval. To bridge the heterogeneity gap, popular methods attempt to project the original data into a common representation space, which requires great fitting ability from the model. To address this issue, we propose a novel Graph Representation Learning (GRL) method for bridging the heterogeneity gap, which does not project the original features into an aligned representation space but adopts a cross-modal graph to link the different modalities. The GRL approach consists of two subnetworks, a Feature Transfer Learning Network (FTLN) and a Graph Representation Learning Network (GRLN). First, the FTLN model finds a latent space for each modality in which cosine similarity is suitable for describing their similarity. Then, we build a cross-modal graph to reconstruct the original data and their relationships. Finally, we abandon the features in the latent space and directly embed the graph vertices into a common representation space. In this way, the proposed method bypasses the most challenging issue by using a cross-modal graph as an intermediary bridge across the heterogeneity gap between modalities, which is simple but effective. Extensive experimental results on six widely used datasets indicate that the proposed GRL outperforms other state-of-the-art cross-modal retrieval methods.
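The cross-modal graph that GRL uses as a bridge can be sketched as a k-nearest-neighbour graph whose vertices are all image and text samples in their latent spaces and whose edges connect the most cosine-similar vertices across and within modalities. The choice of k, the similarity, and the symmetrization below are illustrative assumptions.

```python
import numpy as np

def l2norm(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def cross_modal_knn_graph(img_latent, txt_latent, k=5):
    """Adjacency over all image+text vertices, k nearest neighbours by cosine."""
    X = l2norm(np.vstack([img_latent, txt_latent]))   # both already mapped to d dims
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)                    # no self-loops
    n = len(X)
    A = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    cols = np.argsort(-sim, axis=1)[:, :k].ravel()
    A[rows, cols] = 1.0
    return np.maximum(A, A.T)                         # symmetrize

# Hypothetical latent features: 40 images and 40 texts, both mapped to 64 dims.
rng = np.random.default_rng(0)
A = cross_modal_knn_graph(rng.standard_normal((40, 64)), rng.standard_normal((40, 64)))
print(A.shape, int(A.sum()))   # (80, 80) adjacency linking the two modalities
```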
25. Yu B, Zhou L, Wang L, Shi Y, Fripp J, Bourgeat P. Sample-Adaptive GANs: Linking Global and Local Mappings for Cross-Modality MR Image Synthesis. IEEE Transactions on Medical Imaging 2020; 39:2339-2350. [PMID: 31995478] [DOI: 10.1109/tmi.2020.2969630]
Abstract
Generative adversarial network (GAN) has been widely explored for cross-modality medical image synthesis. The existing GAN models usually adversarially learn a global sample space mapping from the source-modality to the target-modality and then indiscriminately apply this mapping to all samples in the whole space for prediction. However, due to the scarcity of training samples in contrast to the complicated nature of medical image synthesis, learning a single global sample space mapping that is "optimal" to all samples is very challenging, if not intractable. To address this issue, this paper proposes sample-adaptive GAN models, which not only cater for the global sample space mapping between the source- and the target-modalities but also explore the local space around each given sample to extract its unique characteristic. Specifically, the proposed sample-adaptive GANs decompose the entire learning model into two cooperative paths. The baseline path learns a common GAN model by fitting all the training samples as usual for the global sample space mapping. The new sample-adaptive path additionally models each sample by learning its relationship with its neighboring training samples and using the target-modality features of these training samples as auxiliary information for synthesis. Enhanced by this sample-adaptive path, the proposed sample-adaptive GANs are able to flexibly adjust themselves to different samples, and therefore optimize the synthesis performance. Our models have been verified on three cross-modality MR image synthesis tasks from two public datasets, and they significantly outperform state-of-the-art methods. Moreover, the experiment also indicates that our sample-adaptive strategy could be utilized to improve various backbone GAN models; it complements existing GAN models and can be readily integrated when needed.
26. Xu X, Lu H, Song J, Yang Y, Shen HT, Li X. Ternary Adversarial Networks With Self-Supervision for Zero-Shot Cross-Modal Retrieval. IEEE Transactions on Cybernetics 2020; 50:2400-2413. [PMID: 31352361] [DOI: 10.1109/tcyb.2019.2928180]
Abstract
Given a query instance from one modality (e.g., image), cross-modal retrieval aims to find semantically similar instances from another modality (e.g., text). To perform cross-modal retrieval, existing approaches typically learn a common semantic space from a labeled source set and directly produce common representations in the learned space for the instances in a target set. These methods commonly require that the instances of both sets share the same classes. Intuitively, they may not generalize well to a more practical scenario of zero-shot cross-modal retrieval, that is, when the instances of the target set contain unseen classes whose semantics are inconsistent with the seen classes in the source set. Inspired by zero-shot learning, we propose a novel model called ternary adversarial networks with self-supervision (TANSS) in this paper, to overcome the limitation of the existing methods on this challenging task. Our TANSS approach consists of three parallel subnetworks: 1) two semantic feature learning subnetworks that capture the intrinsic data structures of different modalities and preserve the modality relationships via semantic features in the common semantic space; 2) a self-supervised semantic subnetwork that leverages the word vectors of both seen and unseen labels as guidance to supervise the semantic feature learning and enhances the knowledge transfer to unseen labels; and 3) an adversarial learning scheme that maximizes the consistency and correlation of the semantic features between different modalities. The three subnetworks are integrated in TANSS to form an end-to-end network architecture that enables efficient iterative parameter optimization. Comprehensive experiments on three cross-modal datasets show the effectiveness of our TANSS approach compared with the state-of-the-art methods for zero-shot cross-modal retrieval.