1. Yan W, Zhang Y, Tang C, Zhou W, Lin W. Anchor-Sharing and Cluster-Wise Contrastive Network for Multiview Representation Learning. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:3797-3807. [PMID: 38335084 DOI: 10.1109/tnnls.2024.3357087]
Abstract
Multiview clustering (MVC) has gained significant attention as it enables the partitioning of samples into their respective categories through unsupervised learning. However, several issues remain: 1) many existing deep clustering methods use the same latent features to pursue two conflicting objectives, namely, reconstruction and view consistency. The reconstruction objective aims to preserve view-specific features for each individual view, while the view-consistency objective strives to obtain common features across all views; 2) some deep embedded clustering (DEC) approaches adopt view-wise fusion to obtain a consensus feature representation. However, these approaches overlook the correlation between samples, making it challenging to derive discriminative consensus representations; and 3) many methods use contrastive learning (CL) to align representations across views; however, they do not take cluster information into account when constructing sample pairs, which can lead to false negative pairs. To address these issues, we propose a novel multiview representation learning network, called the anchor-sharing and cluster-wise CL (CwCL) network. Specifically, we first separate view-specific learning and view-common learning into different network branches, which resolves the conflict between reconstruction and consistency. Second, we design an anchor-sharing feature aggregation (ASFA) module, which learns shared anchors from different batches of data samples, establishes the bipartite relationship between anchors and samples, and further leverages it to improve the samples' representations. This module enhances the discriminative power of the common representation across samples. Third, we design the CwCL module, which incorporates the learned transition probability into CL, allowing us to focus on minimizing the similarity between representations from negative pairs with a low transition probability. It alleviates the conflict in previous sample-level contrastive alignment. Experimental results demonstrate that our method outperforms state-of-the-art methods.
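The cluster-wise contrastive idea can be pictured as an InfoNCE-style loss whose negative pairs are down-weighted by a soft probability that the two samples share a cluster. The sketch below is a minimal illustration of that weighting under assumed inputs (a precomputed `transition` matrix); it is not the authors' exact CwCL formulation.

```python
# Hedged sketch: cluster-aware contrastive alignment between two views.
# `transition[i, j]` is an assumed soft probability that samples i and j share
# a cluster; large values suppress that pair's contribution as a negative.
import torch
import torch.nn.functional as F

def cluster_weighted_contrastive(z1, z2, transition, temperature=0.5):
    """z1, z2: (N, d) view embeddings; transition: (N, N) same-cluster probabilities."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                  # cross-view cosine similarities
    pos = torch.diag(logits)                            # matched sample in the other view
    eye = torch.eye(z1.size(0), device=z1.device)
    neg_weight = (1.0 - transition).clamp(min=0.0) * (1.0 - eye)
    denom = torch.exp(pos) + (neg_weight * torch.exp(logits)).sum(dim=1)
    return -(pos - torch.log(denom)).mean()
```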
2. Fan W, Zhang C, Li H, Jia X, Wang G. Three-Stage Semisupervised Cross-Modal Hashing With Pairwise Relations Exploitation. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:260-273. [PMID: 37023166 DOI: 10.1109/tnnls.2023.3263221]
Abstract
Hashing methods have sparked a great revolution in cross-modal retrieval due to their low cost of storage and computation. Benefiting from the sufficient semantic information of labeled data, supervised hashing methods have shown better performance than unsupervised ones. Nevertheless, it is expensive and labor intensive to annotate the training samples, which restricts the feasibility of supervised methods in real applications. To deal with this limitation, a novel semisupervised hashing method, three-stage semisupervised hashing (TS3H), is proposed in this article, in which both labeled and unlabeled data are seamlessly handled. Different from other semisupervised approaches that learn the pseudolabels, hash codes, and hash functions simultaneously, the new approach is decomposed into three stages, as the name implies, all of which are conducted individually to make the optimization cost-effective and precise. Specifically, classifiers for the different modalities are first learned from the provided supervised information to predict the labels of the unlabeled data. Then, hash code learning is achieved with a simple but efficient scheme by unifying the provided and the newly predicted labels. To capture the discriminative information and preserve the semantic similarities, we leverage pairwise relations to supervise both classifier learning and hash code learning. Finally, the modality-specific hash functions are obtained by transforming the training samples to the generated hash codes. The new approach is compared with state-of-the-art shallow and deep cross-modal hashing (DCMH) methods on several widely used benchmark databases, and the experimental results verify its efficiency and superiority.
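The final stage, fitting modality-specific hash functions to already-learned binary codes, is commonly implemented as a regularized linear regression followed by sign binarization. The sketch below shows that generic step under assumed shapes; it is not the TS3H solver itself.

```python
# Hedged sketch of modality-specific hash-function learning: fit a linear map
# from features X to pre-learned binary codes B by ridge regression, then
# binarize projections of new samples with sign().
import numpy as np

def learn_hash_function(X: np.ndarray, B: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """X: (n, d) training features; B: (n, k) codes in {-1, +1}. Returns W: (d, k)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ B)

def hash_codes(X_new: np.ndarray, W: np.ndarray) -> np.ndarray:
    # sign() may return 0 for exact zeros; such rare entries can be mapped to +1.
    return np.sign(X_new @ W).astype(np.int8)
```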
3. Jin L, Li Z, Pan Y, Tang J. Relational Consistency Induced Self-Supervised Hashing for Image Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:1482-1494. [PMID: 37995167 DOI: 10.1109/tnnls.2023.3333294]
Abstract
This article proposes a new hashing framework named relational consistency induced self-supervised hashing (RCSH) for large-scale image retrieval. To capture the potential semantic structure of data, RCSH explores the relational consistency between data samples in different spaces, which learns reliable data relationships in the latent feature space and then preserves the learned relationships in the Hamming space. The data relationships are uncovered by learning a set of prototypes that group similar data samples in the latent feature space. By uncovering the semantic structure of the data, meaningful data-to-prototype and data-to-data relationships are jointly constructed. The data-to-prototype relationships are captured by constraining the prototype assignments generated from different augmented views of an image to be the same. Meanwhile, these data-to-prototype relationships are preserved to learn informative compact hash codes by matching them with these reliable prototypes. To accomplish this, a novel dual prototype contrastive loss is proposed to maximize the agreement of prototype assignments in the latent feature space and Hamming space. The data-to-data relationships are captured by enforcing the distribution of pairwise similarities in the latent feature space and Hamming space to be consistent, which makes the learned hash codes preserve meaningful similarity relationships. Extensive experimental results on four widely used image retrieval datasets demonstrate that the proposed method significantly outperforms the state-of-the-art methods. Besides, the proposed method achieves promising performance in out-of-domain retrieval tasks, which shows its good generalization ability. The source code and models are available at https://github.com/IMAG-LuJin/RCSH.
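A common way to enforce the data-to-prototype consistency described above is a swapped-prediction objective between the soft prototype assignments of two augmented views. The sketch below illustrates that generic mechanism with an assumed learnable `prototypes` matrix; it is not RCSH's exact dual prototype contrastive loss.

```python
# Hedged sketch: prototype-assignment consistency between two augmented views.
import torch
import torch.nn.functional as F

def prototype_consistency_loss(z1, z2, prototypes, temperature=0.1):
    """z1, z2: (N, d) embeddings of two views; prototypes: (K, d) learnable vectors."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    p = F.normalize(prototypes, dim=1)
    logits1, logits2 = z1 @ p.t() / temperature, z2 @ p.t() / temperature
    # Each view predicts the other view's (detached) assignment distribution.
    q1 = F.softmax(logits1, dim=1).detach()
    q2 = F.softmax(logits2, dim=1).detach()
    loss = -(q2 * F.log_softmax(logits1, dim=1)).sum(1).mean() \
           - (q1 * F.log_softmax(logits2, dim=1)).sum(1).mean()
    return 0.5 * loss
```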
4. Yuan L, Wang T, Zhang X, Tay FEH, Jie Z, Tian Y, Liu W, Feng J. Learnable Central Similarity Quantization for Efficient Image and Video Retrieval. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:18717-18730. [PMID: 38090871 DOI: 10.1109/tnnls.2023.3321148]
Abstract
Data-dependent hashing methods aim to learn hash functions from the pairwise or triplet relationships among the data, which often leads to low efficiency and a low collision rate because only the local distribution of the data is captured. To address this limitation, we propose central similarity, in which the hash codes of similar data pairs are encouraged to approach a common center and those of dissimilar pairs to converge to different centers. As a new global similarity metric, central similarity can improve the efficiency and retrieval accuracy of hash learning. By introducing a new concept, hash centers, we principally formulate the computation of the proposed central similarity metric, in which the hash centers refer to a set of points scattered in the Hamming space with sufficient mutual distance between each other. To construct well-separated hash centers, we provide two efficient methods: 1) leveraging the Hadamard matrix and Bernoulli distributions to generate data-independent hash centers and 2) learning data-dependent hash centers from data representations. Based on the proposed similarity metric and hash centers, we propose central similarity quantization (CSQ), which optimizes the central similarity between data points with respect to their hash centers instead of optimizing the local similarity, to generate a high-quality deep hash function. We further improve CSQ with data-dependent hash centers, dubbed CSQ with learnable centers (CSQLC). The proposed CSQ and CSQLC are generic and applicable to image and video hashing scenarios. We conduct extensive experiments on large-scale image and video retrieval tasks, and the proposed CSQ yields noticeably boosted retrieval performance, i.e., 3%-20% in mean average precision (mAP) over the previous state-of-the-art methods, which also demonstrates that our methods can generate cohesive hash codes for similar data pairs and dispersed hash codes for dissimilar pairs.
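The Hadamard-based construction mentioned in the abstract yields data-independent hash centers with large mutual Hamming distances. The sketch below is an illustrative reading of that construction plus a central-similarity (BCE-to-center) loss; it assumes the code length is a power of two with at most twice as many classes as bits, and it is not the authors' released implementation.

```python
# Hedged sketch: Hadamard-based hash centers and a central-similarity loss.
import torch
import torch.nn.functional as F
from scipy.linalg import hadamard

def make_hash_centers(num_classes: int, n_bits: int) -> torch.Tensor:
    """Rows of a Hadamard matrix and their negations are mutually far apart in
    Hamming space (n_bits must be a power of two, num_classes <= 2 * n_bits)."""
    H = torch.from_numpy(hadamard(n_bits)).float()
    centers = torch.cat([H, -H], dim=0)[:num_classes]   # entries in {-1, +1}
    return (centers + 1) / 2                             # map to {0, 1}

def central_similarity_loss(logits, labels, centers, quant_weight=1e-4):
    """Pull each relaxed code toward its class center; push entries toward 0/1."""
    codes = torch.sigmoid(logits)                        # relaxed hash codes in (0, 1)
    target = centers[labels]                             # (batch, n_bits)
    bce = F.binary_cross_entropy(codes, target)
    quant = ((codes - 0.5).abs() - 0.5).pow(2).mean()    # zero only at exactly 0 or 1
    return bce + quant_weight * quant
```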
5. Zheng Q, Yang X, Wang S, An X, Liu Q. Asymmetric double-winged multi-view clustering network for exploring diverse and consistent information. Neural Netw 2024; 179:106563. [PMID: 39111164 DOI: 10.1016/j.neunet.2024.106563]
Abstract
In unsupervised scenarios, deep contrastive multi-view clustering (DCMVC) is becoming a hot research spot, which aims to mine the potential relationships between different views. Most existing DCMVC algorithms focus on exploring the consistency information in deep semantic features while ignoring the diverse information in shallow features. To fill this gap, in this paper we propose a novel multi-view clustering network termed CodingNet to explore diverse and consistent information simultaneously. Specifically, instead of utilizing a conventional auto-encoder, we design an asymmetric structure network to extract shallow and deep features separately. Then, by approximating the similarity matrix of the shallow features to the zero matrix, we ensure the diversity of the shallow features, thus offering a better description of the multi-view data. Moreover, we propose a dual contrastive mechanism that maintains consistency for deep features at both the view-feature and pseudo-label levels. Our framework's efficacy is validated through extensive experiments on six widely used benchmark datasets, outperforming most state-of-the-art multi-view clustering algorithms.
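The diversity constraint, driving the cross-view similarity matrix of shallow features toward the zero matrix, can be written as a simple Frobenius-norm penalty. A minimal sketch under assumed shapes follows; it is not the authors' implementation.

```python
# Hedged sketch: encourage shallow features from two views to be diverse by
# driving their cross-view similarity matrix toward the zero matrix.
import torch
import torch.nn.functional as F

def diversity_loss(h_v1: torch.Tensor, h_v2: torch.Tensor) -> torch.Tensor:
    """h_v1, h_v2: (N, d) shallow features from two views."""
    s = F.normalize(h_v1, dim=1) @ F.normalize(h_v2, dim=1).t()   # (N, N) similarities
    return (s ** 2).mean()                                        # ||S - 0||_F^2 up to scale
```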
Affiliation(s)
- Qun Zheng: School of Earth and Space Sciences, CMA-USTC Laboratory of Fengyun Remote Sensing, University of Science and Technology of China, Hefei 230026, China
- Xihong Yang: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
- Siwei Wang: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
- Xinru An: School of Earth and Space Sciences, CMA-USTC Laboratory of Fengyun Remote Sensing, University of Science and Technology of China, Hefei 230026, China
- Qi Liu: School of Earth and Space Sciences, CMA-USTC Laboratory of Fengyun Remote Sensing, University of Science and Technology of China, Hefei 230026, China
6. Du T, Zheng W, Xu X. Composite attention mechanism network for deep contrastive multi-view clustering. Neural Netw 2024; 176:106361. [PMID: 38723307 DOI: 10.1016/j.neunet.2024.106361]
Abstract
Contrastive learning-based deep multi-view clustering methods have become a mainstream solution for unlabeled multi-view data. These methods usually utilize a basic structure that combines an autoencoder, contrastive learning, or/and MLP projectors to generate more representative latent representations for the final clustering stage. However, existing deep contrastive multi-view clustering ignores two key points: (i) the latent representations projected from one or more MLP layers, or the new representations obtained directly from an autoencoder, fail to mine the inherent relationships within a view or across views; (ii) most existing frameworks employ only a single or dual contrastive learning module, i.e., view- or/and category-oriented, which may result in a lack of communication between latent representations and clustering assignments. This paper proposes a new composite attention framework for contrastive multi-view clustering to address the above two challenges. Our method learns latent representations using a composite attention structure, i.e., a Hierarchical Transformer for each view and Shared Attention across all views, rather than a simple MLP. As a result, the learned representations can simultaneously preserve important features inside each view and balance the contributions across views. In addition, we add a new communication loss to our dual contrastive framework. Common semantics are brought into the clustering assignments by pushing the clustering assignments closer to the fused latent representations. Therefore, our method provides higher-quality clustering assignments for the segmentation of unlabeled multi-view data. Extensive experiments on several real datasets demonstrate that the proposed method achieves superior performance over many state-of-the-art clustering algorithms, including an average accuracy improvement of 10% on the Caltech dataset and its subsets.
Affiliation(s)
- Tingting Du: School of Computer Science, Guangdong University of Science and Technology, Dongguan 523083, China
- Wei Zheng: School of Computer Science, Wuhan University, Wuhan 430072, China
- Xingang Xu: School of Computer Science, Guangdong University of Science and Technology, Dongguan 523083, China
7. Peng L, Hu R, Kong F, Gan J, Mo Y, Shi X, Zhu X. Reverse Graph Learning for Graph Neural Network. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:4530-4541. [PMID: 35380973 DOI: 10.1109/tnnls.2022.3161030]
Abstract
Graph neural networks (GNNs) conduct feature learning by taking into account the local structure preservation of the data to produce discriminative features, but they need to address the following issues: 1) an initial graph containing faulty and missing edges often affects feature learning, and 2) most GNN methods suffer from the out-of-sample issue, since their training processes do not directly generate a prediction model to predict unseen data points. In this work, we propose a reverse GNN model to learn the graph from the intrinsic space of the original data points, as well as to investigate a new out-of-sample extension method. As a result, the proposed method can output a high-quality graph to improve the quality of feature learning, while the new out-of-sample extension method makes our reverse GNN available for supervised and semi-supervised learning. Experimental results on real-world datasets show that our method achieves competitive classification performance, compared to state-of-the-art methods, in terms of semi-supervised node classification, out-of-sample extension, random edge attack, link prediction, and image retrieval.
8. Shen X, Zhou Y, Yuan YH, Yang X, Lan L, Zheng Y. Contrastive Transformer Hashing for Compact Video Representation. IEEE Transactions on Image Processing 2023; 32:5992-6003. [PMID: 37903046 DOI: 10.1109/tip.2023.3326994]
Abstract
Video hashing learns compact representations by mapping videos into a low-dimensional Hamming space and has achieved promising performance in large-scale video retrieval. It is challenging to effectively exploit temporal and spatial structure in an unsupervised setting. To fill this gap, this paper proposes Contrastive Transformer Hashing (CTH) for effective video retrieval. Specifically, CTH develops a bidirectional transformer autoencoder, based on which a visual reconstruction loss is proposed. CTH captures bidirectional correlations among frames more effectively than conventional unidirectional models. In addition, CTH devises a multi-modality contrastive loss to reveal the intrinsic structure among videos. CTH constructs inter-modality and intra-modality triplet sets and uses the multi-modality contrastive loss to exploit inter-modality and intra-modality similarities simultaneously. We perform video retrieval tasks on four benchmark datasets, i.e., UCF101, HMDB51, SVW30, and FCVID, using the learned compact hash representations, and extensive empirical results demonstrate that the proposed CTH outperforms several state-of-the-art video hashing methods.
9. Shen W, Song J, Zhu X, Li G, Shen HT. End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval. IEEE Transactions on Image Processing 2023; 32:5017-5030. [PMID: 37186535 DOI: 10.1109/tip.2023.3275071]
Abstract
Lately, video-language pre-training and text-video retrieval have attracted significant attention with the explosion of multimedia data on the Internet. However, existing approaches for video-language pre-training typically limit the exploitation of the hierarchical semantic information in videos, such as frame-level semantic information and global video semantic information. In this work, we present an end-to-end pre-training network with Hierarchical Matching and Momentum Contrast named HMMC. The key idea is to explore the hierarchical semantic information in videos via multilevel semantic matching between videos and texts. This design is motivated by the observation that if a video semantically matches a text (which can be a title, tag or caption), the frames in this video usually have semantic connections with the text and show higher similarity than frames in other videos. Hierarchical matching is mainly realized by two proxy tasks: Video-Text Matching (VTM) and Frame-Text Matching (FTM). Another proxy task, Frame Adjacency Matching (FAM), is proposed to enhance the single visual modality representations while training from scratch. Furthermore, a momentum contrast framework is introduced into HMMC to form a multimodal momentum contrast framework, enabling HMMC to incorporate more negative samples for contrastive learning, which contributes to the generalization of the representations. We also collected a large-scale Chinese video-language dataset (over 763k unique videos) named CHVTT to explore the multilevel semantic connections between videos and texts. Experimental results on two major text-video retrieval benchmark datasets demonstrate the advantages of our methods. We release our code at https://github.com/cheetah003/HMMC.
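The momentum-contrast component follows the general MoCo recipe: a slowly updated key encoder plus a queue of negatives feeding an InfoNCE loss. The sketch below shows that generic mechanism with assumed encoder and queue objects; it is not HMMC's full multimodal pipeline.

```python
# Hedged sketch of a momentum-contrast step (MoCo-style) for paired samples.
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(q_encoder, k_encoder, m=0.999):
    # Key encoder trails the query encoder as an exponential moving average.
    for pq, pk in zip(q_encoder.parameters(), k_encoder.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1.0 - m)

def info_nce(q, k, queue, temperature=0.07):
    """q: (N, d) queries, k: (N, d) positive keys, queue: (K, d) stored negatives."""
    q, k, queue = F.normalize(q, dim=1), F.normalize(k, dim=1), F.normalize(queue, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)          # (N, 1)
    l_neg = q @ queue.t()                              # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)            # positives sit at index 0
```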
10. Yu T, Mascagni P, Verde J, Marescaux J, Mutter D, Padoy N. Live laparoscopic video retrieval with compressed uncertainty. Med Image Anal 2023; 88:102866. [PMID: 37356320 DOI: 10.1016/j.media.2023.102866]
Abstract
Searching through large volumes of medical data to retrieve relevant information is a challenging yet crucial task for clinical care. However, the most primitive and common approach to retrieval, involving text in the form of keywords, is severely limited when dealing with complex media formats. Content-based retrieval offers a way to overcome this limitation by using rich media as the query itself. Surgical video-to-video retrieval in particular is a new and largely unexplored research problem with high clinical value, especially in the real-time case: using real-time video hashing, search can be achieved directly inside the operating room. Indeed, the process of hashing converts large data entries into compact binary arrays, or hashes, enabling large-scale search operations at a very fast rate. However, due to fluctuations over the course of a video, not all bits in a given hash are equally reliable. In this work, we propose a method capable of mitigating this uncertainty while maintaining a light computational footprint. We present superior retrieval results (3%-4% top-10 mean average precision) on a multi-task evaluation protocol for surgery, using cholecystectomy phases, bypass phases, and, from an entirely new dataset introduced here, surgical events across six different surgery types. Success on this multi-task benchmark shows the generalizability of our approach for surgical video retrieval.
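The speed of hash-based search comes from storing codes as packed bits and ranking candidates by Hamming distance with XOR and a popcount table. The NumPy sketch below illustrates that generic mechanism; it is independent of the paper's uncertainty-compression scheme.

```python
# Hedged sketch: brute-force Hamming ranking over bit-packed hash codes.
import numpy as np

def pack(codes01: np.ndarray) -> np.ndarray:
    """codes01: (n, n_bits) array of 0/1 -> packed uint8 array (n, ceil(n_bits/8))."""
    return np.packbits(codes01.astype(np.uint8), axis=1)

_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_rank(query_packed: np.ndarray, db_packed: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Return indices of the top_k database codes closest to the query in Hamming distance."""
    xor = np.bitwise_xor(db_packed, query_packed[None, :])   # (n, bytes)
    dist = _POPCOUNT[xor].sum(axis=1)                         # per-item Hamming distance
    return np.argsort(dist)[:top_k]
```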
Affiliation(s)
- Tong Yu: ICube, University of Strasbourg, CNRS, France; IHU Strasbourg, France
- Pietro Mascagni: ICube, University of Strasbourg, CNRS, France; IHU Strasbourg, France; Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
- Didier Mutter: IHU Strasbourg, France; University Hospital of Strasbourg, France
- Nicolas Padoy: ICube, University of Strasbourg, CNRS, France; IHU Strasbourg, France
11. Dilek E, Dener M. Computer Vision Applications in Intelligent Transportation Systems: A Survey. Sensors (Basel) 2023; 23:2938. [PMID: 36991649 PMCID: PMC10051529 DOI: 10.3390/s23062938]
Abstract
As technology continues to develop, computer vision (CV) applications are becoming increasingly widespread in the intelligent transportation systems (ITS) context. These applications are developed to improve the efficiency of transportation systems, increase their level of intelligence, and enhance traffic safety. Advances in CV play an important role in solving problems in the fields of traffic monitoring and control, incident detection and management, road usage pricing, and road condition monitoring, among many others, by providing more effective methods. This survey examines CV applications in the literature, the machine learning and deep learning methods used in ITS applications, the applicability of computer vision applications in ITS contexts, the advantages these technologies offer and the difficulties they present, and future research areas and trends, with the goal of increasing the effectiveness, efficiency, and safety level of ITS. The present review, which brings together research from various sources, aims to show how computer vision techniques can help transportation systems to become smarter by presenting a holistic picture of the literature on different CV applications in the ITS context.
12. Kordopatis-Zilos G, Tzelepis C, Papadopoulos S, Kompatsiaris I, Patras I. DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval. Int J Comput Vis 2022. [DOI: 10.1007/s11263-022-01651-3]
Abstract
In this paper, we address the problem of high-performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) coarse-grained approaches representing/indexing videos as global vectors, where the spatio-temporal structure is lost, providing low performance but also having low computational cost. In this work, we propose a Knowledge Distillation framework, called Distill-and-Select (DnS), that, starting from a well-performing fine-grained Teacher Network, learns: (a) Student Networks at different retrieval performance and computational efficiency trade-offs and (b) a Selector Network that at test time rapidly directs samples to the appropriate student to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store/index videos using binary representations. Importantly, the proposed scheme allows Knowledge Distillation in large, unlabelled datasets, which leads to good students. We evaluate DnS on five public datasets on three different video retrieval tasks and demonstrate (a) that our students achieve state-of-the-art performance in several cases and (b) that the DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, the proposed method achieves similar mAP to the teacher but is 20 times faster and requires 240 times less storage space. The collected dataset and implementation are publicly available: https://github.com/mever-team/distill-and-select.
13. Wang G, Hu Q, Yang Y, Cheng J, Hou ZG. Adversarial Binary Mutual Learning for Semi-Supervised Deep Hashing. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:4110-4124. [PMID: 33684043 DOI: 10.1109/tnnls.2021.3055834]
Abstract
Hashing is a popular search algorithm for its compact binary representation and efficient Hamming distance calculation. Benefiting from advances in deep learning, deep hashing methods have achieved promising performance. However, these methods usually learn from expensive labeled data but fail to utilize unlabeled data. Furthermore, the traditional pairwise loss used by these methods cannot explicitly force similar/dissimilar pairs to small/large distances. Both weaknesses limit existing methods' performance. To solve the first problem, we propose a novel semi-supervised deep hashing model named adversarial binary mutual learning (ABML). Specifically, our ABML consists of a generative model GH and a discriminative model DH, where DH learns labeled data in a supervised way and GH learns unlabeled data by synthesizing real images. We adopt an adversarial learning (AL) strategy to transfer the knowledge of unlabeled data to DH by making GH and DH mutually learn from each other. To solve the second problem, we propose a novel Weibull cross-entropy loss (WCE) based on the Weibull distribution, which can distinguish tiny differences in distances and explicitly force similar/dissimilar distances to be as small/large as possible. Thus, the learned features are more discriminative. Finally, by incorporating ABML with the WCE loss, our model can acquire more semantic and discriminative features. Extensive experiments on four common datasets (CIFAR-10, the large database of handwritten digits (MNIST), ImageNet-10, and NUS-WIDE) and a large-scale dataset, ImageNet, demonstrate that our approach successfully overcomes the two difficulties above and significantly outperforms state-of-the-art hashing methods.
14. Shi W, Gong Y, Chen B, Hei X. Transductive Semisupervised Deep Hashing. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:3713-3726. [PMID: 33544678 DOI: 10.1109/tnnls.2021.3054386]
Abstract
Deep hashing methods have shown their superiority over traditional ones. However, they usually require a large amount of labeled training data to achieve high retrieval accuracy. We propose a novel transductive semisupervised deep hashing (TSSDH) method that can effectively train deep convolutional neural network (DCNN) models with both labeled and unlabeled training samples. The TSSDH method consists of the following four main ingredients. First, we extend the traditional transductive learning (TL) principle to make it applicable to DCNN-based deep hashing. Second, we introduce confidence levels for unlabeled samples to reduce adverse effects from uncertain samples. Third, we employ a Gaussian likelihood loss for hash code learning to sufficiently penalize large Hamming distances for similar sample pairs. Fourth, we design a large-margin feature (LMF) regularization to make the learned features satisfy the condition that the distances of similar sample pairs are minimized and the distances of dissimilar sample pairs are larger than a predefined margin. Comprehensive experiments show that the TSSDH method produces superior image retrieval accuracy compared to representative semisupervised deep hashing methods under the same number of labeled training samples.
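The large-margin feature regularization described above is, in spirit, a pairwise contrastive-margin loss: similar pairs are pulled together while dissimilar pairs are pushed beyond a margin. The sketch below shows that generic form under assumed inputs, not the exact TSSDH regularizer.

```python
# Hedged sketch: pairwise large-margin regularization on learned features.
import torch
import torch.nn.functional as F

def large_margin_loss(f1, f2, same_label, margin=2.0):
    """f1, f2: (N, d) feature pairs; same_label: (N,) with 1 for similar, 0 for dissimilar."""
    d = F.pairwise_distance(f1, f2)
    pull = same_label * d.pow(2)                              # shrink similar-pair distances
    push = (1 - same_label) * F.relu(margin - d).pow(2)       # enforce a margin for dissimilar pairs
    return (pull + push).mean()
```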
15. Xie T, Deng Y, Zhang J, Zhang Z, Hu Z, Wu T. DNA circuits compatible encoder and demultiplexer based on a single biomolecular platform with DNA strands as outputs. Nucleic Acids Res 2022; 50:8431-8440. [PMID: 35904810 PMCID: PMC9410916 DOI: 10.1093/nar/gkac650]
Abstract
A series of multiple logic circuits based on a single biomolecular platform is constructed to perform nonarithmetic and arithmetic functions, including a 4-to-2 encoder, a 1-to-2 demultiplexer, a 1-to-4 demultiplexer, and a multi-input OR gate. The encoder is to a DNA circuit what a sensory receptor is to a reflex arc: both function to encode information from outside the pathway (DNA circuit or reflex arc) into a form that subsequent pathways can recognize and utilize. Current molecular encoders are based on optical or electrical signals as outputs, while DNA circuits use DNA strands as transmission signals. The outputs of existing encoders therefore cannot be recognized by subsequent DNA circuits. This is the first time a DNA-based encoder with DNA strands as outputs can be truly applied to a DNA circuit, enabling the application of DNA circuits in non-binary biological environments. Another novel feature of the designed system is that the developed nanodevices all have a simple structure, low leakage, and low crosstalk, which allows them to implement higher-level encoders and demultiplexers easily. Our work is based on the idea of complex functionality in a simple form, which will also provide a new route for developing advanced molecular logic circuits.
Affiliation(s)
- Tianci Xie: School of Pharmacy, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
- Yuhan Deng: School of Pharmacy, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
- Jiarui Zhang: School of Pharmacy, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
- Zhen Zhang: School of Pharmacy, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
- Zhe Hu: School of Pharmacy, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
- Tongbo Wu: School of Pharmacy, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
16. Guo Y, Gao L, Song J, Wang P, Sebe N, Shen HT, Li X. Relation Regularized Scene Graph Generation. IEEE Transactions on Cybernetics 2022; 52:5961-5972. [PMID: 33710964 DOI: 10.1109/tcyb.2021.3052522]
Abstract
Scene graph generation (SGG) is built on top of detected objects to predict pairwise visual relations between objects for describing the image content abstraction. Existing works have revealed that if the links between objects are given as prior knowledge, the performance of SGG is significantly improved. Inspired by this observation, in this article, we propose a relation regularized network (R2-Net), which can predict whether there is a relationship between two objects and encode this relation into object feature refinement for better SGG. Specifically, we first construct an affinity matrix among detected objects to represent the probability of a relationship between two objects. Graph convolution networks (GCNs) over this relation affinity matrix are then used as object encoders, producing relation-regularized representations of objects. With these relation-regularized features, our R2-Net can effectively refine object labels and generate scene graphs. Extensive experiments are conducted on the Visual Genome dataset for three SGG tasks (i.e., predicate classification, scene graph classification, and scene graph detection), demonstrating the effectiveness of our proposed method. Ablation studies also verify the key roles of our proposed components in performance improvement.
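The relation regularization amounts to propagating object features over a predicted affinity matrix with graph convolutions. The sketch below shows one such propagation layer under assumed shapes; it is a simplified stand-in, not the full R2-Net.

```python
# Hedged sketch: one graph-convolution step over a soft relation affinity matrix.
import torch
import torch.nn as nn

class AffinityGCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, feats: torch.Tensor, affinity: torch.Tensor) -> torch.Tensor:
        """feats: (n_objects, in_dim); affinity: (n_objects, n_objects) with entries in [0, 1]."""
        a = affinity + torch.eye(affinity.size(0), device=affinity.device)  # add self-loops
        a = a / a.sum(dim=1, keepdim=True)                                  # row-normalize
        return torch.relu(self.proj(a @ feats))                             # propagate, then project
```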
17. Janarthanan R, Refaee EA, Selvakumar K, Hossain MA, Soundrapandiyan R, Karuppiah M. Biomedical image retrieval using adaptive neuro-fuzzy optimized classifier system. Mathematical Biosciences and Engineering 2022; 19:8132-8151. [PMID: 35801460 DOI: 10.3934/mbe.2022380]
Abstract
The quantity of scientific images associated with patient care has increased markedly in recent years due to the rapid development of hospitals and research facilities. Every hospital generates more medical photographs, resulting in more than 10 GB of data per day being produced by a single imaging appliance. Software is used extensively to scan and locate diagnostic photographs to identify a patient's precise information, which can be valuable for medical science research and advancement. An image retrieval system is used to meet this need. This paper suggests an optimized classifier framework based on a hybrid adaptive neuro-fuzzy approach to accomplish this goal. Fuzzy sets represent the vagueness that occurs in such data sets in the user query, the similarity measurement, and the image content. The optimized classification method, hybrid adaptive neuro-fuzzy, is enhanced with an improved cuckoo search optimization. Score values are determined by utilizing linear discriminant analysis (LDA) of the classified images. The preliminary findings indicate that the proposed approach can be more reliable and effective at estimation than existing approaches.
Affiliation(s)
- Janarthanan R: Centre for Artificial Intelligence, Department of Computer Science and Engineering, Chennai Institute of Technology, Chennai 600069, India
- Eshrag A Refaee: Department of Computer Science, College of Computer Science & Information Technology, Jazan University, Jazan, Kingdom of Saudi Arabia
- Selvakumar K: Department of Computer Applications, National Institute of Technology (NIT), Tiruchirappalli 620015, India
- Mohammad Alamgir Hossain: Department of Computer Science, College of Computer Science & Information Technology, Jazan University, Jazan, Kingdom of Saudi Arabia
- Rajkumar Soundrapandiyan: School of Computer Science and Engineering, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India
- Marimuthu Karuppiah: Department of Computer Science and Engineering, SRM Institute of Science and Technology, NCR Campus, Ghaziabad 201204, Uttar Pradesh, India
18. Lin M, Ji R, Sun X, Zhang B, Huang F, Tian Y, Tao D. Fast Class-Wise Updating for Online Hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:2453-2467. [PMID: 33270558 DOI: 10.1109/tpami.2020.3042193]
Abstract
Online image hashing, which processes large-scale data in a streaming fashion to update the hash functions on-the-fly, has received increasing research attention recently. To this end, most existing works explore this problem under a supervised setting, i.e., using class labels to boost the hashing performance, and suffer from defects in both adaptivity and efficiency: First, large amounts of training batches are required to learn up-to-date hash functions, which leads to poor online adaptivity. Second, the training is time-consuming, which contradicts the core need of online learning. In this paper, a novel supervised online hashing scheme, termed Fast Class-wise Updating for Online Hashing (FCOH), is proposed to address the above two challenges by introducing a novel and efficient inner product operation. To achieve fast online adaptivity, a class-wise updating method is developed to decompose the binary code learning and alternately renew the hash functions in a class-wise fashion, which well addresses the burden of large amounts of training batches. Quantitatively, such a decomposition further leads to at least 75 percent storage saving. To further achieve online efficiency, we propose a semi-relaxation optimization, which accelerates the online training by treating different binary constraints independently. Without additional constraints and variables, the time complexity is significantly reduced. Such a scheme is also quantitatively shown to well preserve past information while updating the hashing functions. We quantitatively demonstrate that the collective effort of class-wise updating and semi-relaxation optimization provides superior performance compared to various state-of-the-art methods, which is verified through extensive experiments on three widely used datasets.
19. Zahra A, Ghafoor M, Munir K, Ullah A, Ul Abideen Z. Application of region-based video surveillance in smart cities using deep learning. Multimedia Tools and Applications 2021; 83:1-26. [PMID: 34975282 PMCID: PMC8710820 DOI: 10.1007/s11042-021-11468-w]
Abstract
Smart video surveillance helps to build a more robust smart city environment. Varied-angle cameras act as smart sensors, collect visual data from the smart city environment, and transmit it for further visual analysis. The transmitted visual data must be of high quality for efficient analysis, which is a challenging task when transmitting videos over low-bandwidth communication channels. In the latest smart surveillance cameras, high-quality video transmission is maintained through various video encoding techniques such as high-efficiency video coding. However, these video coding techniques still provide limited capabilities, and the demand for high-quality encoding of salient regions such as pedestrians, vehicles, cyclists/motorcyclists, and roads in video surveillance systems is still not met. This work is a contribution towards building an efficient salient-region-based surveillance framework for smart cities. The proposed framework integrates a deep learning-based video surveillance technique that extracts salient regions from a video frame without information loss and then encodes them at a reduced size. We have applied this approach in diverse smart city case-study environments to test the applicability of the framework. Successful results in terms of bitrate (56.92%), peak signal-to-noise ratio (5.35 dB), and salient-region-based segmentation accuracy (92% and 96% on two different benchmark datasets) are the outcome of the proposed work. Consequently, the generation of less computationally demanding region-based video data makes the framework adaptable for improving surveillance solutions in smart cities.
Affiliation(s)
- Asma Zahra: Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan; Department of Computer Science, National University of Modern Languages, Islamabad, Pakistan
- Mubeen Ghafoor: School of Computer Science, University of Lincoln, Lincoln, UK
- Kamran Munir: Department of Computer Science and Creative Technologies (CSCT), University of the West of England (UWE), Bristol, UK
- Ata Ullah: Department of Computer Science, National University of Modern Languages, Islamabad, Pakistan
- Zain Ul Abideen: Department of Computer Science, National University of Modern Languages, Islamabad, Pakistan
20. Unsupervised feature disentanglement for video retrieval in minimally invasive surgery. Med Image Anal 2021; 75:102296. [PMID: 34781159 DOI: 10.1016/j.media.2021.102296]
Abstract
In this paper, we propose a novel method of Unsupervised Disentanglement of Scene and Motion (UDSM) representations for minimally invasive surgery video retrieval within large databases, which has the potential to advance intelligent and efficient surgical teaching systems. To extract more discriminative video representations, two designed encoders with a triplet ranking loss and an adversarial learning mechanism are established to respectively capture the spatial and temporal information for achieving disentangled features from each frame with promising interpretability. In addition, the long-range temporal dependencies are improved in an integrated video level using a temporal aggregation module and then a set of compact binary codes that carries representative features is yielded to realize fast retrieval. The entire framework is trained in an unsupervised scheme, i.e., purely learning from raw surgical videos without using any annotation. We construct two large-scale minimally invasive surgery video datasets based on the public dataset Cholec80 and our in-house dataset of laparoscopic hysterectomy, to establish the learning process and validate the effectiveness of our proposed method qualitatively and quantitatively on the surgical video retrieval task. Extensive experiments show that our approach significantly outperforms the state-of-the-art video retrieval methods on both datasets, revealing a promising future for injecting intelligence in the next generation of surgical teaching systems.
21. Decomposing normal and abnormal features of medical images for content-based image retrieval of glioma imaging. Med Image Anal 2021; 74:102227. [PMID: 34543911 DOI: 10.1016/j.media.2021.102227]
Abstract
In medical imaging, the characteristics purely derived from a disease should reflect the extent to which abnormal findings deviate from normal features. Indeed, physicians often need corresponding images without the abnormal findings of interest or, conversely, images that contain similar abnormal findings regardless of the normal anatomical context. This is called comparative diagnostic reading of medical images, which is essential for a correct diagnosis. To support comparative diagnostic reading, content-based image retrieval (CBIR) that can selectively utilize normal and abnormal features in medical images as two separable semantic components will be useful. In this study, we propose a neural network architecture that decomposes the semantic components of medical images into two latent codes: a normal anatomy code and an abnormal anatomy code. The normal anatomy code represents the counterfactual normal anatomy that would have existed if the sample were healthy, whereas the abnormal anatomy code captures abnormal changes that reflect deviation from the normal baseline. By calculating similarity based on either the normal or abnormal anatomy code, or the combination of the two, our algorithm can retrieve images according to the selected semantic component from a dataset consisting of brain magnetic resonance images of gliomas. Moreover, it can utilize a synthetic query vector combining the normal and abnormal anatomy codes from two different query images. To evaluate whether the retrieved images are acquired according to the targeted semantic component, the overlap of the ground-truth labels is calculated as a metric of semantic consistency. Our algorithm provides a flexible CBIR framework by handling the decomposed features, with qualitatively and quantitatively remarkable results.
22. Wang Y, Nie X, Shi Y, Zhou X, Yin Y. Attention-Based Video Hashing for Large-Scale Video Retrieval. IEEE Trans Cogn Dev Syst 2021. [DOI: 10.1109/tcds.2019.2963339]
23. Nie X, Zhou X, Shi Y, Sun J, Yin Y. Classification-enhancement deep hashing for large-scale video retrieval. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107467]
24. Wang S, Li C, Shen HL. Equivalent Continuous Formulation of General Hashing Problem. IEEE Transactions on Cybernetics 2021; 51:4089-4099. [PMID: 30714940 DOI: 10.1109/tcyb.2019.2894020]
Abstract
Hashing-based approximate nearest neighbor search has attracted broad research interest due to its low computational cost and fast retrieval speed. The hashing technique maps the data points into binary codes and, meanwhile, preserves the similarity in the original space. Generally, we need to solve a discrete optimization problem to learn the binary codes and hash functions, which is NP-hard. In the literature, most hashing methods choose to solve a relaxed problem by discarding the discrete constraints. However, such a relaxation scheme causes large quantization error, which makes the learned binary codes less effective. In this paper, we present an equivalent continuous formulation of the discrete hashing problem. Specifically, we show that the discrete hashing problem can be transformed into a continuous optimization problem without any relaxation, while the transformed continuous optimization problem has the same optimal solutions and the same optimal value as the original discrete hashing problem. After the transformation, continuous optimization methods can be applied. We devise algorithms based on the idea of DC (difference of convex functions) programming to solve this problem. The proposed continuous hashing scheme can be easily applied to existing hashing models, including both supervised and unsupervised hashing. We evaluate the proposed method on several benchmarks and the results show the superiority of the proposed method compared with state-of-the-art hashing methods.
25. Chen H, Hu C, Lee F, Lin C, Yao W, Chen L, Chen Q. A Supervised Video Hashing Method Based on a Deep 3D Convolutional Neural Network for Large-Scale Video Retrieval. Sensors (Basel) 2021; 21:s21093094. [PMID: 33946745 PMCID: PMC8124307 DOI: 10.3390/s21093094]
Abstract
Recently, with the popularization of camera tools such as mobile phones and the rise of various short-video platforms, a large number of videos are uploaded to the Internet at all times, so a video retrieval system with fast retrieval speed and high precision is very necessary. Therefore, content-based video retrieval (CBVR) has aroused the interest of many researchers. A typical CBVR system mainly contains two essential parts: video feature extraction and similarity comparison. Feature extraction from video is very challenging; previous video retrieval methods are mostly based on extracting features from single video frames, resulting in the loss of temporal information in the videos. Hashing methods are extensively used in multimedia information retrieval due to their retrieval efficiency, but most of them are currently applied only to image retrieval. In order to solve these problems in video retrieval, we build an end-to-end framework called deep supervised video hashing (DSVH), which employs a 3D convolutional neural network (CNN) to obtain spatial-temporal features of videos and then trains a set of hash functions by supervised hashing to transfer the video features into binary space and obtain compact binary codes of the videos. Finally, we use a triplet loss for network training. We conduct extensive experiments on three public video datasets, UCF-101, JHMDB and HMDB-51, and the results show that the proposed method has advantages over many state-of-the-art video retrieval methods. Compared with the DVH method, the mAP value on the UCF-101 dataset is improved by 9.3%, and even the minimum improvement, on the JHMDB dataset, is 0.3%. At the same time, we also demonstrate the stability of the algorithm on the HMDB-51 dataset.
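The triplet objective used for training can be written directly on the relaxed (pre-binarization) hash codes: the anchor should be closer to a same-class video than to a different-class one by a margin. The sketch below is a generic triplet ranking loss under assumed shapes, not the released DSVH code.

```python
# Hedged sketch: triplet ranking loss on relaxed (real-valued) hash codes.
import torch
import torch.nn.functional as F

def triplet_hash_loss(anchor, positive, negative, margin=1.0):
    """anchor/positive/negative: (N, n_bits) network outputs before binarization."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()   # positives closer than negatives by a margin
```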
Affiliation(s)
- Hanqing Chen: Shanghai Engineering Research Center of Assistive Devices, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
- Chunyan Hu: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
- Feifei Lee: Shanghai Engineering Research Center of Assistive Devices, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China (corresponding author)
- Chaowei Lin: Shanghai Engineering Research Center of Assistive Devices, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
- Wei Yao: Shanghai Engineering Research Center of Assistive Devices, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
- Lu Chen: Shanghai Engineering Research Center of Assistive Devices, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
- Qiu Chen: Major of Electrical Engineering and Electronics, Graduate School of Engineering, Kogakuin University, Tokyo 163-8677, Japan (corresponding author)
26. Shao H, Zhong D, Du X. A deep biometric hash learning framework for three advanced hand-based biometrics. IET Biometrics 2021. [DOI: 10.1049/bme2.12014]
Affiliation(s)
- Huikai Shao: School of Automation Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi, China
- Dexing Zhong: School of Automation Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi, China; State Key Lab. for Novel Software Technology, Nanjing University, Nanjing, China; Pazhou Lab, Guangzhou, China
- Xuefeng Du: School of Automation Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi, China
27. Rossi A, Hosseinzadeh M, Bianchini M, Scarselli F, Huisman H. Multi-Modal Siamese Network for Diagnostically Similar Lesion Retrieval in Prostate MRI. IEEE Transactions on Medical Imaging 2021; 40:986-995. [PMID: 33296302 DOI: 10.1109/tmi.2020.3043641]
Abstract
Multi-parametric prostate MRI (mpMRI) is a powerful tool for diagnosing prostate cancer, though it is difficult to interpret even for experienced radiologists. A common radiological procedure is to compare a magnetic resonance image with similarly diagnosed cases. To assist the radiological image interpretation process, computerized content-based image retrieval systems (CBIRs) can therefore be employed to improve the reporting workflow and increase its accuracy. In this article, we propose a new, supervised siamese deep learning architecture able to handle multi-modal and multi-view MR images with similar PIRADS scores. An experimental comparison with well-established deep learning-based CBIRs (namely standard siamese networks and autoencoders) showed significantly improved performance with respect to both diagnostic (ROC-AUC) and information retrieval metrics (Precision-Recall, Discounted Cumulative Gain and Mean Average Precision). Finally, the newly proposed multi-view siamese network is general in design, facilitating broad use in diagnostic medical imaging retrieval.
28. Qi M, Qin J, Yang Y, Wang Y, Luo J. Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval. IEEE Transactions on Image Processing 2021; 30:2989-3004. [PMID: 33560984 DOI: 10.1109/tip.2020.3048680]
Abstract
With the current exponential growth of video-based social networks, video retrieval using natural language is receiving ever-increasing attention. Most existing approaches tackle this task by extracting individual frame-level spatial features to represent the whole video, while ignoring visual pattern consistencies and intrinsic temporal relationships across different frames. Furthermore, the semantic correspondence between natural language queries and person-centric actions in videos has not been fully explored. To address these problems, we propose a novel binary representation learning framework, named Semantics-aware Spatial-temporal Binaries ( [Formula: see text]Bin), which simultaneously considers spatial-temporal context and semantic relationships for cross-modal video retrieval. By exploiting the semantic relationships between two modalities, [Formula: see text]Bin can efficiently and effectively generate binary codes for both videos and texts. In addition, we adopt an iterative optimization scheme to learn deep encoding functions with attribute-guided stochastic training. We evaluate our model on three video datasets and the experimental results demonstrate that [Formula: see text]Bin outperforms the state-of-the-art methods in terms of various cross-modal video retrieval tasks.
Collapse
|
29
|
An Overview of Image Caption Generation Methods. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2021; 2020:3062706. [PMID: 32377178 PMCID: PMC7199544 DOI: 10.1155/2020/3062706] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/05/2018] [Revised: 12/10/2019] [Accepted: 12/11/2019] [Indexed: 11/26/2022]
Abstract
In recent years, with the rapid development of artificial intelligence, image captioning has gradually attracted the attention of many researchers in the field and has become an interesting and arduous task. Image captioning, which automatically generates natural language descriptions of the content observed in an image, is an important part of scene understanding and combines knowledge from computer vision and natural language processing. Its applications are extensive and significant, for example in human-computer interaction. This paper summarizes the related methods and focuses on the attention mechanism, which plays an important role in computer vision and has recently been widely used in image caption generation tasks. Furthermore, the advantages and shortcomings of these methods are discussed, and the commonly used datasets and evaluation criteria in this field are provided. Finally, this paper highlights some open challenges in the image captioning task.
Collapse
|
30
|
Kim D, Pathak S, Moro A, Yamashita A, Asama H. Self-supervised optical flow derotation network for rotation estimation of a spherical camera. Adv Robot 2020. [DOI: 10.1080/01691864.2020.1857305] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
Affiliation(s)
- Dabae Kim
- Department of Precision Engineering, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
- Sarthak Pathak
- Department of Precision Engineering, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
- Alessandro Moro
- Department of Precision Engineering, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
- Atsushi Yamashita
- Department of Precision Engineering, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
- Hajime Asama
- Department of Precision Engineering, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
Collapse
|
31
|
Yan X, Gilani SZ, Feng M, Zhang L, Qin H, Mian A. Self-Supervised Learning to Detect Key Frames in Videos. SENSORS 2020; 20:s20236941. [PMID: 33291759 PMCID: PMC7731244 DOI: 10.3390/s20236941] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/20/2020] [Revised: 11/23/2020] [Accepted: 11/30/2020] [Indexed: 11/16/2022]
Abstract
Detecting key frames in videos is a common problem in many applications such as video classification, action recognition and video summarization. These tasks can be performed more efficiently using only a handful of key frames rather than the full video. Existing key frame detection approaches are mostly designed for supervised learning and require manual labelling of key frames in a large corpus of training data to train the models. Labelling requires human annotators from different backgrounds to annotate key frames in videos, which is not only expensive and time-consuming but also prone to subjective errors and inconsistencies between labelers. To overcome these problems, we propose an automatic self-supervised method for detecting key frames in a video. Our method comprises a two-stream ConvNet and a novel automatic annotation architecture able to reliably annotate key frames in a video for self-supervised learning of the ConvNet. The proposed ConvNet learns deep appearance and motion features to detect frames that are unique. The trained network is then able to detect key frames in test videos. Extensive experiments on the UCF101 human action and VSUMM video summarization datasets demonstrate the effectiveness of our proposed method.
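A minimal, hypothetical sketch of the general idea, not the paper's trained two-stream pipeline: score each frame by combining an appearance-change signal with a motion-magnitude signal and keep frames whose combined score stands out, subject to a minimum temporal gap.

```python
import numpy as np

def key_frame_scores(appearance_feats, motion_mags, w_app=0.5, w_mot=0.5):
    # appearance_feats: (T, D) per-frame CNN features; motion_mags: (T,) mean flow magnitude
    app_change = np.zeros(len(appearance_feats))
    app_change[1:] = np.linalg.norm(np.diff(appearance_feats, axis=0), axis=1)
    def norm(x):
        return (x - x.min()) / (np.ptp(x) + 1e-8)
    return w_app * norm(app_change) + w_mot * norm(np.asarray(motion_mags, dtype=float))

def pick_key_frames(scores, min_gap=10):
    # greedily keep the highest-scoring frames that are at least min_gap frames apart
    picked = []
    for idx in np.argsort(scores)[::-1]:
        if all(abs(int(idx) - p) >= min_gap for p in picked):
            picked.append(int(idx))
    return sorted(picked)
```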
Collapse
Affiliation(s)
- Xiang Yan
- School of Physics and Optoelectronic Engineering, Xidian University, Xi’an 710071, China; (X.Y.); (H.Q.)
- Mingtao Feng
- School of Computer Science and Technology, Xidian University, Xi’an 710071, China; (M.F.); (L.Z.)
- Liang Zhang
- School of Computer Science and Technology, Xidian University, Xi’an 710071, China; (M.F.); (L.Z.)
- Hanlin Qin
- School of Physics and Optoelectronic Engineering, Xidian University, Xi’an 710071, China; (X.Y.); (H.Q.)
- Ajmal Mian
- Computer Science and Software Engineering, University of Western Australia, Crawley 6009, Australia
Collapse
|
32
|
Xu X, Wang T, Yang Y, Zuo L, Shen F, Shen HT. Cross-Modal Attention With Semantic Consistence for Image-Text Matching. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:5412-5425. [PMID: 32071004 DOI: 10.1109/tnnls.2020.2967597] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
The task of image-text matching refers to measuring the visual-semantic similarity between an image and a sentence. Recently, fine-grained matching methods that explore the local alignment between image regions and sentence words have shown advances in inferring the image-text correspondence by aggregating pairwise region-word similarity. However, the local alignment is hard to achieve, as some important image regions may be inaccurately detected or even missing. Meanwhile, some words with high-level semantics cannot be strictly matched to a single image region. To tackle these problems, we stress the importance of exploiting the global semantic consistence between image regions and sentence words as a complement to the local alignment. In this article, we propose a novel hybrid matching approach named Cross-modal Attention with Semantic Consistency (CASC) for image-text matching. The proposed CASC is a joint framework that performs cross-modal attention for local alignment and multilabel prediction for global semantic consistence. It directly extracts semantic labels from the available sentence corpus without additional labor cost, which further provides a global similarity constraint for the aggregated region-word similarity obtained by the local alignment. Extensive experiments on the Flickr30k and Microsoft COCO (MSCOCO) data sets demonstrate the effectiveness of the proposed CASC in preserving global semantic consistence along with the local alignment, and further show its superior image-text matching performance compared with more than 15 state-of-the-art methods.
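A hedged sketch of the two ingredients the abstract names, local region-word attention and a global multilabel consistency term; the aggregation and layer names below are assumptions rather than the published CASC architecture.

```python
import torch
import torch.nn.functional as F

def local_alignment_score(regions, words):
    # regions: (R, D) image-region features, words: (W, D) word features
    r = F.normalize(regions, dim=1)
    w = F.normalize(words, dim=1)
    sim = w @ r.t()                          # (W, R) word-to-region cosine similarity
    attn = F.softmax(sim * 10.0, dim=1)      # each word attends over the regions
    attended = attn @ r                      # (W, D) region context per word
    word_scores = (F.normalize(attended, dim=1) * w).sum(dim=1)
    return word_scores.mean()                # aggregated image-text similarity

def semantic_consistency_loss(region_label_logits, word_label_logits, labels):
    # both modalities predict the same multilabel concepts mined from the sentence corpus
    return (F.binary_cross_entropy_with_logits(region_label_logits, labels) +
            F.binary_cross_entropy_with_logits(word_label_logits, labels))
```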
Collapse
|
33
|
DSRPH: Deep semantic-aware ranking preserving hashing for efficient multi-label image retrieval. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2020.05.114] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
34
|
|
35
|
Li R, Zhang X, Chen G, Mao Y, Wang X. Multi-negative samples with Generative Adversarial Networks for image retrieval. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2018.10.110] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|
36
|
Gao L, Li X, Song J, Shen HT. Hierarchical LSTMs with Adaptive Attention for Visual Captioning. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2020; 42:1112-1131. [PMID: 30668467 DOI: 10.1109/tpami.2019.2894139] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Recent progress has been made in using attention-based encoder-decoder frameworks for image and video captioning. Most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). However, these non-visual words can be easily predicted using a natural language model without considering visual signals or attention. Imposing the attention mechanism on non-visual words could mislead and decrease the overall performance of visual captioning. Furthermore, a hierarchy of LSTMs enables more complex representation of visual data, capturing information at different scales. Considering these issues, we propose a hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning. Specifically, the proposed framework utilizes spatial or temporal attention for selecting specific regions or frames to predict the related words, while the adaptive attention decides whether to depend on the visual information or the language context information. Also, hierarchical LSTMs are designed to simultaneously consider both low-level visual information and high-level language context information to support caption generation. We design the hLSTMat model as a general framework, and we first instantiate it for the task of video captioning. Then, we further refine it and apply it to the image captioning task. To demonstrate the effectiveness of our proposed framework, we test our method on both video and image captioning tasks. Experimental results show that our approach achieves state-of-the-art performance on most of the evaluation metrics for both tasks. The effect of important components is also explored in an ablation study.
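The adaptive-attention step can be pictured with the following simplified sketch (assumed tensor shapes, not the published hLSTMat equations): a learned gate decides how much the current word prediction relies on attended visual features versus the decoder's language state.

```python
import torch
import torch.nn as nn

class AdaptiveAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.att = nn.Linear(2 * dim, 1)   # scores each visual feature against the decoder state
        self.gate = nn.Linear(dim, 1)      # adaptive gate computed from the LSTM hidden state

    def forward(self, visual_feats, hidden):
        # visual_feats: (N, dim) region/frame features, hidden: (dim,) decoder LSTM state
        h = hidden.unsqueeze(0).expand(visual_feats.size(0), -1)
        alpha = torch.softmax(self.att(torch.cat([visual_feats, h], dim=1)), dim=0)
        context = (alpha * visual_feats).sum(dim=0)    # attended visual context
        beta = torch.sigmoid(self.gate(hidden))        # 0 -> rely on language, 1 -> rely on vision
        return beta * context + (1 - beta) * hidden
```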
Collapse
|
37
|
Gao L, Cao L, Xu X, Shao J, Song J. Question-Led object attention for visual question answering. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2018.11.102] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
38
|
|
39
|
Shi X, Guo Z, Xing F, Liang Y, Yang L. Anchor-Based Self-Ensembling for Semi-Supervised Deep Pairwise Hashing. Int J Comput Vis 2020. [DOI: 10.1007/s11263-020-01299-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
|
40
|
|
41
|
|
42
|
Song J, Guo Y, Gao L, Li X, Hanjalic A, Shen HT. From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2019; 30:3047-3058. [PMID: 30130235 DOI: 10.1109/tnnls.2018.2851077] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Video captioning is, in essence, a complex natural process that is affected by various uncertainties stemming from video content, subjective judgment, and so on. In this paper, we build on the recent progress in using the encoder-decoder framework for video captioning and address what we find to be a critical deficiency of existing methods: most of the decoders propagate deterministic hidden states. Such complex uncertainty cannot be modeled efficiently by deterministic models. We therefore propose a generative approach, referred to as multimodal stochastic recurrent neural networks (MS-RNNs), which models the uncertainty observed in the data using latent stochastic variables. MS-RNN can thus improve the performance of video captioning and generate multiple sentences to describe a video under different random factors. Specifically, a multimodal long short-term memory (LSTM) is first proposed to interact with both visual and textual features to capture a high-level representation. Then, a backward stochastic LSTM is proposed to support uncertainty propagation by introducing latent variables. Experimental results on the challenging Microsoft Video Description and Microsoft Research Video-to-Text data sets show that our proposed MS-RNN approach outperforms state-of-the-art video captioning benchmarks.
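A simplified sketch of injecting a latent stochastic variable into a decoder step via the reparameterization trick, which illustrates the general mechanism the abstract describes rather than the exact MS-RNN formulation:

```python
import torch
import torch.nn as nn

class StochasticStep(nn.Module):
    def __init__(self, hid_dim, z_dim):
        super().__init__()
        self.to_mu = nn.Linear(hid_dim, z_dim)
        self.to_logvar = nn.Linear(hid_dim, z_dim)
        self.merge = nn.Linear(hid_dim + z_dim, hid_dim)

    def forward(self, hidden):
        mu, logvar = self.to_mu(hidden), self.to_logvar(hidden)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # sampled latent variable
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum()  # KL(q || N(0, I)) regularizer
        new_hidden = torch.tanh(self.merge(torch.cat([hidden, z], dim=-1)))
        return new_hidden, kl
```

Sampling z afresh at generation time is what lets such a decoder emit several different sentences for the same video.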
Collapse
|
43
|
Xu Y, Yang J, Mao K. Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2019.05.027] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
44
|
Liu Y, Song J, Zhou K, Yan L, Liu L, Zou F, Shao L. Deep Self-Taught Hashing for Image Retrieval. IEEE TRANSACTIONS ON CYBERNETICS 2019; 49:2229-2241. [PMID: 29994014 DOI: 10.1109/tcyb.2018.2822781] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Hashing algorithms have been widely used to speed up image retrieval due to their compact binary codes and fast distance calculation. The combination with deep learning boosts the performance of hashing by learning accurate representations and complicated hashing functions. So far, the most striking successes in deep hashing have mostly involved discriminative models, which require labels. To apply deep hashing to datasets without labels, we propose a deep self-taught hashing algorithm (DSTH), which generates a set of pseudo labels by analyzing the data itself and then learns the hash functions for novel data using discriminative deep models. Furthermore, we generalize DSTH to support both supervised and unsupervised cases by adaptively incorporating label information. We use two different deep learning frameworks to train the hash functions, to deal with the out-of-sample problem and reduce the time complexity without loss of accuracy. We have conducted extensive experiments to investigate different settings of DSTH and compared it with state-of-the-art counterparts on six publicly available datasets. The experimental results show that DSTH outperforms the others on all datasets.
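The self-taught idea, pseudo-labels first and hash-function learning second, can be sketched as follows; k-means clustering and a random sign projection are stand-ins for DSTH's own pseudo-label and code-generation steps, used here only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_labels_and_codes(features, n_clusters=32, code_len=48, seed=0):
    # features: (N, D) deep descriptors of unlabeled images
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
    pseudo = km.labels_                                  # pseudo class labels from the data itself
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((features.shape[1], code_len))
    centered = features - features.mean(axis=0)
    codes = np.sign(centered @ proj)                     # (N, code_len) ±1 target codes
    return pseudo, codes                                 # targets for training a deep hash function
```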
Collapse
|
45
|
A Novel Tri-Training Technique for the Semi-Supervised Classification of Hyperspectral Images Based on Regularized Local Discriminant Embedding Feature Extraction. REMOTE SENSING 2019. [DOI: 10.3390/rs11060654] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
This paper introduces a novel semi-supervised tri-training classification algorithm based on regularized local discriminant embedding (RLDE) for hyperspectral imagery. In this algorithm, the RLDE method is used for optimal feature information extraction, to solve the problems of singular values and over-fitting, which are the main problems of the local discriminant embedding (LDE) and local Fisher discriminant analysis (LFDA) methods. An active learning method is then used to select the most useful and informative samples from the candidate set. In the experiments undertaken in this study, the three base classifiers were multinomial logistic regression (MLR), k-nearest neighbor (KNN), and random forest (RF). To confirm the effectiveness of the proposed RLDE method, experiments were conducted on two real hyperspectral datasets (Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) and Reflective Optics System Imaging Spectrometer (ROSIS)), and the proposed RLDE tri-training algorithm was compared with its counterparts: tri-training alone, LDE, and LFDA. The experiments confirmed that the proposed approach can effectively improve the classification accuracy for hyperspectral imagery.
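For reference, the generic tri-training loop that the proposed pipeline builds on can be sketched as below, using the three base classifiers named in the abstract; the simple agreement rule is the textbook version, not the paper's full algorithm with RLDE features and active learning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

def tri_training(X_l, y_l, X_u, rounds=3):
    clfs = [LogisticRegression(max_iter=1000),        # multinomial logistic regression
            KNeighborsClassifier(n_neighbors=5),      # KNN
            RandomForestClassifier(n_estimators=100)] # random forest
    for c in clfs:
        c.fit(X_l, y_l)                               # initial fit on labeled data only
    for _ in range(rounds):
        preds = [c.predict(X_u) for c in clfs]
        for i in range(3):
            j, k = (i + 1) % 3, (i + 2) % 3
            agree = preds[j] == preds[k]              # the other two classifiers agree
            if agree.any():
                X_aug = np.vstack([X_l, X_u[agree]])
                y_aug = np.concatenate([y_l, preds[j][agree]])
                clfs[i].fit(X_aug, y_aug)             # retrain on labeled + pseudo-labeled samples
    return clfs
```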
Collapse
|
46
|
Wu G, Han J, Guo Y, Liu L, Ding G, Ni Q, Shao L. Unsupervised Deep Video Hashing via Balanced Code for Large-Scale Video Retrieval. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2018; 28:1993-2007. [PMID: 30452370 DOI: 10.1109/tip.2018.2882155] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
This paper proposes a deep hashing framework, namely Unsupervised Deep Video Hashing (UDVH), for large-scale video similarity search with the aim of learning compact yet effective binary codes. Our UDVH produces the hash codes in a self-taught manner by jointly integrating discriminative video representation with optimal code learning, where an efficient alternating approach is adopted to optimize the objective function. The key differences from most existing video hashing methods lie in: 1) UDVH is an unsupervised hashing method that generates hash codes by cooperatively utilizing feature clustering and a specifically designed binarization, with the original neighborhood structure preserved in the binary space; 2) a specific rotation is developed and applied to the video features such that the variance of each dimension can be balanced, thus facilitating the subsequent quantization step. Extensive experiments performed on three popular video datasets show that UDVH is substantially better than the state-of-the-art methods in terms of various evaluation metrics, which makes it practical in real-world applications.
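The variance-balancing rotation can be illustrated with an ITQ-style alternating procedure (a stand-in for UDVH's full clustering-based objective, with assumed dimensions):

```python
import numpy as np

def balanced_rotation_codes(feats, code_len=64, iters=20, seed=0):
    # feats: (N, D) video features with D >= code_len; PCA to code_len, then
    # alternate between binarization and an orthogonal rotation update
    feats = feats - feats.mean(axis=0)
    _, _, vt = np.linalg.svd(feats, full_matrices=False)
    V = feats @ vt[:code_len].T                      # (N, code_len) PCA projection
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((code_len, code_len)))
    for _ in range(iters):
        B = np.sign(V @ R)                           # current binary codes
        U, _, Wt = np.linalg.svd(B.T @ V)            # orthogonal Procrustes update of R
        R = (U @ Wt).T
    return np.sign(V @ R), R                         # balanced ±1 codes and the learned rotation
```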
Collapse
|
47
|
Hong W, Yuan J. Fried Binary Embedding: From High-Dimensional Visual Features to High-Dimensional Binary Codes. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2018; 27:4825-4837. [PMID: 29969394 DOI: 10.1109/tip.2018.2846670] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Most existing binary embedding methods prefer compact binary codes (b-dimensional) to avoid the high computational and memory cost of projecting high-dimensional visual features (d-dimensional, b ≪ d). We argue that long binary codes (b ≈ d) are critical to fully utilize the discriminative power of high-dimensional visual features, and can achieve better results in various tasks such as approximate nearest neighbor search. Generating long binary codes involves a large projection matrix and high-dimensional matrix-vector multiplication, and is thus memory- and compute-intensive. We propose Fried Binary Embedding (FBE) and Supervised Fried Binary Embedding (SuFBE) to tackle these problems. FBE is suitable for most practical applications in which the labels of training data are not given, while SuFBE can significantly boost the accuracy when training labels are available. The core idea is to decompose the projection matrix using the adaptive Fastfood transform, which is the multiplication of several structured matrices. As a result, FBE and SuFBE reduce the computational complexity from O(d²) to O(d log d), and the memory cost from O(d²) to O(d), respectively. More importantly, by using structured matrices, FBE and SuFBE can well regularize the projection matrix by reducing its tunable parameters and lead to even better accuracy than using either an unconstrained projection matrix (as in ITQ) or sparse matrices such as SP and SSP with the same long code length. Experimental comparisons with state-of-the-art methods over various visual applications demonstrate both the efficiency and performance advantages of FBE and SuFBE.
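A Fastfood-style structured projection, the building block that FBE decomposes its projection matrix into, can be sketched as follows; the diagonal and permutation matrices are drawn at random purely for illustration, and the learned scaling matrix is omitted for brevity.

```python
import numpy as np

def fwht(x):
    # fast Walsh-Hadamard transform of a length-d vector, d a power of two, O(d log d)
    y = x.astype(float)
    h, d = 1, len(y)
    while h < d:
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    return y

def fastfood_project(x, seed=0):
    # one structured block: roughly H G Pi H B x, using only O(d) stored parameters
    d = len(x)
    assert d & (d - 1) == 0, "dimension must be a power of two"
    rng = np.random.default_rng(seed)
    B = rng.choice([-1.0, 1.0], size=d)    # random sign diagonal
    G = rng.standard_normal(d)             # random Gaussian diagonal
    Pi = rng.permutation(d)                # random permutation
    y = fwht(B * x)                        # H B x via the fast transform
    y = fwht(G * y[Pi])                    # H G Pi (H B x)
    return y / np.sqrt(d)

def long_binary_code(x, seed=0):
    return np.sign(fastfood_project(x, seed))   # d-bit code without storing a d x d matrix
```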
Collapse
|