1. Lai L, Chen J, Zhang Z, Lin G, Wu Q. CMFAN: Cross-Modal Feature Alignment Network for Few-Shot Single-View 3D Reconstruction. IEEE Transactions on Neural Networks and Learning Systems 2025;36:5522-5534. [PMID: 38593016; DOI: 10.1109/tnnls.2024.3383039]
Abstract
Few-shot single-view 3D reconstruction learns to reconstruct objects of novel categories from a query image and a few support shapes. However, since the query image and the support shapes belong to different modalities, there is an inherent feature misalignment problem that degrades reconstruction; previous works in the literature do not consider this problem. To this end, we propose the cross-modal feature alignment network (CMFAN) with two novel techniques. The first is a model pretraining strategy, cross-modal contrastive learning (CMCL), in which the 2D images and 3D shapes of the same object compose the positives and those from different objects form the negatives. With CMCL, the model learns to embed the 2D and 3D modalities of the same object into a tight region of the feature space and push away those from different objects, effectively aligning the global cross-modal features. The second is cross-modal feature fusion (CMFF), which further aligns and fuses the local features. Specifically, it first re-represents the local features with a cross-attention operation, making the local features share more information. Then, CMFF generates a descriptor for the support features and attaches it to each local feature vector of the query image via dense concatenation. Moreover, CMFF can be applied to multilevel local features for further gains. We conduct extensive experiments to evaluate the effectiveness of our designs, and CMFAN sets new state-of-the-art performance on all of the 1-/10-/25-shot tasks of the ShapeNet and ModelNet datasets.
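To make the CMCL idea concrete, here is a minimal PyTorch sketch of a symmetric cross-modal InfoNCE loss over paired 2D/3D embeddings; the function name, shapes, and temperature are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def cmcl_loss(img_emb, shape_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired 2D/3D embeddings.

    img_emb, shape_emb: (B, D) features of the same B objects, where
    row i of both tensors comes from object i (the positive pair).
    """
    img = F.normalize(img_emb, dim=1)
    shp = F.normalize(shape_emb, dim=1)
    logits = img @ shp.t() / temperature          # (B, B) cross-modal similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Pull matching image/shape pairs together and push mismatched pairs apart,
    # in both image-to-shape and shape-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```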
2. Liu X, Liu X, Li G, Bi S. Pose and Color-Gamut Guided Generative Adversarial Network for Pedestrian Image Synthesis. IEEE Transactions on Neural Networks and Learning Systems 2023;34:10724-10736. [PMID: 35584072; DOI: 10.1109/tnnls.2022.3171245]
Abstract
The extensive transfer requirements of pedestrian re-identification (Re-ID) tasks have driven remarkable progress in pedestrian image synthesis, which aims to relieve inconsistencies in pose and lighting. However, existing approaches are confined to transferring within a single domain and are difficult to combine, since pose and color variables lie in two independent domains. To facilitate research toward conquering this issue, we propose a pose and color-gamut guided generative adversarial network (PC-GAN) that performs joint-domain pedestrian image synthesis conditioned on a given pose and color gamut through a delicate supervision design. The generator of the network comprises a sequence of cross-domain conversion subnets, in which a local displacement estimator, a color-gamut transformer, and a pose transporter coordinate their learning pace to progressively synthesize images in the desired pose and color gamut. Ablation studies demonstrate the efficacy and efficiency of the proposed network both qualitatively and quantitatively on Market-1501 and DukeMTMC. Furthermore, the proposed architecture can generate training images for person Re-ID, alleviating the data-insufficiency problem.
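As a rough illustration of conditioning a generator on both pose and color gamut, the sketch below concatenates a source image with target pose heatmaps and a broadcast gamut code; the layer sizes, the 18-channel pose convention, and the single-stage design are our assumptions (the paper uses a sequence of progressive cross-domain subnets).

```python
import torch
import torch.nn as nn

class PoseGamutGenerator(nn.Module):
    """Toy conditional generator: synthesizes a person image from a source
    image, a target pose map, and a target color-gamut code (assumed shapes)."""
    def __init__(self, gamut_dim=8):
        super().__init__()
        # 3 (RGB) + 18 (assumed pose keypoint heatmaps) + gamut_dim broadcast channels
        self.net = nn.Sequential(
            nn.Conv2d(3 + 18 + gamut_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, src_img, pose_maps, gamut_code):
        b, _, h, w = src_img.shape
        # Broadcast the gamut code to a per-pixel conditioning map.
        gamut = gamut_code.view(b, -1, 1, 1).expand(b, gamut_code.size(1), h, w)
        return self.net(torch.cat([src_img, pose_maps, gamut], dim=1))
```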
3. Liu D, Wu L, Zheng F, Liu L, Wang M. Verbal-Person Nets: Pose-Guided Multi-Granularity Language-to-Person Generation. IEEE Transactions on Neural Networks and Learning Systems 2023;34:8589-8601. [PMID: 35263259; DOI: 10.1109/tnnls.2022.3151631]
Abstract
Person image generation conditioned on natural language allows us to personalize image editing in a user-friendly manner. This task, however, involves different granularities of semantic relevance between texts and visual content. Given a sentence describing an unknown person, we propose a novel pose-guided multi-granularity attention architecture that synthesizes the person image in an end-to-end manner. To determine what content to draw at the global outline, the sentence-level description and pose feature maps are incorporated into a U-Net architecture to generate a coarse person image. To further enhance fine-grained details, we propose to draw the human body parts from highly correlated textual nouns and to determine their spatial positions with respect to target pose points. Our model is premised on a conditional generative adversarial network (GAN) that translates a language description into a realistic person image. The proposed model is coupled with two-stream discriminators: 1) text-relevant local discriminators that improve fine-grained appearance by identifying region-text correspondences at the finer manipulation level and 2) a global full-body discriminator that regulates generation via pose-weighted feature selection. Extensive experiments conducted on benchmarks validate the superiority of our method for person image generation.
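The noun-to-region matching behind the local discriminators can be illustrated with a small dot-product attention routine; the names and shapes here are illustrative, not the authors' implementation.

```python
import torch

def word_region_attention(word_emb, region_feat):
    """Attend each textual noun to image regions (dot-product attention).

    word_emb:    (B, T, D) embeddings of the nouns in the description.
    region_feat: (B, R, D) local visual features (e.g. a flattened conv map).
    Returns (B, T, D): a visual context vector per word, usable by a
    local discriminator to check region-text correspondence.
    """
    attn = torch.softmax(word_emb @ region_feat.transpose(1, 2), dim=-1)  # (B, T, R)
    return attn @ region_feat
```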
4. Zhu K, Guo H, Liu S, Wang J, Tang M. Learning Semantics-Consistent Stripes With Self-Refinement for Person Re-Identification. IEEE Transactions on Neural Networks and Learning Systems 2023;34:8531-8542. [PMID: 35298384; DOI: 10.1109/tnnls.2022.3151487]
Abstract
Aligning human parts automatically is one of the most challenging problems for person re-identification (re-ID). Recently, stripe-based methods, which equally partition person images into fixed stripes for aligned representation learning, have achieved great success. However, stripes with fixed height and position cannot handle the misalignment caused by inaccurate detection and occlusion, and may introduce considerable background noise. In this article, we aim to learn adaptive stripes with foreground refinement to achieve pixel-level part alignment using only person identity labels, and we make two contributions. 1) A semantics-consistent stripe learning method (SCS). Given an image, SCS partitions it into adaptive horizontal stripes such that each stripe corresponds to a specific semantic part. Specifically, SCS iterates between two processes: i) clustering rows into human parts or background to generate pseudo-part labels for rows and ii) learning a row classifier to partition a person image, supervised by the latest pseudo-labels. This iterative scheme guarantees the accuracy of the learned image partition. 2) A self-refinement method (SCS+) to remove background noise from the stripes. We employ the row classifier to generate the probabilities of pixels belonging to human parts (foreground) or background, that is, a class activation map (CAM). Only the most confident areas of the CAM are assigned foreground/background labels to guide human part refinement. Finally, by intersecting the semantics-consistent stripes with the foreground areas, SCS+ locates human parts at the pixel level, obtaining a more robust part-aligned representation. Extensive experiments validate that SCS+ sets new state-of-the-art performance on three widely used datasets: Market-1501, DukeMTMC-reID, and CUHK03-NP.
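A minimal sketch of the cluster-then-classify alternation, assuming row features are pooled over the training set; the k-means step and training loop are simplifications of SCS, not the authors' code.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def scs_iteration(row_feats, n_parts, classifier, optimizer, steps=100):
    """One SCS-style iteration: (i) cluster rows into part/background
    pseudo-labels, (ii) fit a row classifier on those labels.

    row_feats: (N, D) features of horizontal rows pooled over all images.
    """
    # (i) pseudo-label every row by k-means over row features
    labels = KMeans(n_clusters=n_parts, n_init=10).fit_predict(
        row_feats.detach().cpu().numpy())
    labels = torch.as_tensor(labels, dtype=torch.long)
    # (ii) supervise the row classifier with the latest pseudo-labels
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(classifier(row_feats), labels)
        loss.backward()
        optimizer.step()
    return labels
```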
5. Fan C, Hu J, Huang J. Few-Shot Multi-Agent Perception With Ranking-Based Feature Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023;45:11810-11823. [PMID: 37310844; DOI: 10.1109/tpami.2023.3285755]
Abstract
In this article, we focus on few-shot learning (FSL) under multi-agent scenarios in which participating agents have only scarce labeled data and must collaborate to predict the labels of query observations. We aim to design a coordination and learning framework in which multiple agents, such as drones and robots, can collectively perceive the environment accurately and efficiently under limited communication and computation budgets. We propose a metric-based multi-agent FSL framework with three main components: an efficient communication mechanism that propagates compact, fine-grained query feature maps from query agents to support agents; an asymmetric attention mechanism that computes region-level attention weights between query and support feature maps; and a metric-learning module that calculates the image-level relevance between query and support data quickly and accurately. Furthermore, we propose a specially designed ranking-based feature learning module that fully exploits the order information of training data by explicitly maximizing the inter-class distance while minimizing the intra-class distance. We perform extensive numerical studies and demonstrate that our approach achieves significantly improved accuracy in visual and acoustic perception tasks such as face identification, semantic segmentation, and sound genre recognition, consistently outperforming state-of-the-art baselines by 5%-20%.
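The explicit inter-/intra-class distance objective can be written as a simple contrastive-style loss; this is a generic sketch under our own naming, not the paper's exact ranking module.

```python
import torch

def ranking_feature_loss(feats, labels, margin=1.0):
    """Pull same-class embeddings together and push different-class
    embeddings at least `margin` apart.

    feats: (N, D) embeddings; labels: (N,) integer class ids.
    """
    dist = torch.cdist(feats, feats)                      # (N, N) pairwise L2
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    intra = dist[same & ~eye]                             # same class, not self
    inter = dist[~same]                                   # different classes
    # Minimize intra-class distance; penalize inter-class distance under margin.
    return intra.mean() + torch.clamp(margin - inter, min=0).mean()
```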
6. Wu LY, Liu L, Wang Y, Zhang Z, Boussaid F, Bennamoun M, Xie X. Learning Resolution-Adaptive Representations for Cross-Resolution Person Re-Identification. IEEE Transactions on Image Processing 2023;32:4800-4811. [PMID: 37610890; DOI: 10.1109/tip.2023.3305817]
Abstract
Cross-resolution person re-identification (CRReID) is a challenging and practical problem that involves matching low-resolution (LR) query identity images against high-resolution (HR) gallery images. Query images often suffer from resolution degradation due to the varying capture conditions of real-world cameras. State-of-the-art solutions for CRReID either learn a resolution-invariant representation or adopt a super-resolution (SR) module to recover the missing information from the LR query. In this paper, we propose an alternative SR-free paradigm that directly compares HR and LR images via a dynamic metric adaptive to the resolution of the query image. We realize this idea by learning resolution-adaptive representations for cross-resolution comparison, via two mechanisms. The first encodes resolution specifics into different subvectors in the penultimate layer of the deep neural network, creating a varying-length representation. To better extract resolution-dependent information, the second learns resolution-adaptive masks for intermediate residual feature blocks, trained with a novel progressive learning strategy. The two mechanisms are combined to boost CRReID performance. Experimental results show that the proposed method outperforms existing approaches and achieves state-of-the-art performance on multiple CRReID benchmarks.
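A minimal sketch of the varying-length subvector idea, assuming a discrete resolution level per query; the mapping from resolution to subvector count is our assumption, not the paper's formulation.

```python
import torch

def resolution_adaptive_mask(feat, res_level, n_sub=4):
    """Varying-length representation: keep the first `res_level` of `n_sub`
    equal subvectors of each feature and zero out the rest.

    feat:      (B, D) penultimate-layer features, D divisible by n_sub.
    res_level: (B,) ints in [1, n_sub]; a higher-resolution query keeps
               more resolution-specific subvectors.
    """
    d = feat.size(1)
    sub_idx = torch.arange(d, device=feat.device) // (d // n_sub)   # (D,)
    mask = (sub_idx.unsqueeze(0) < res_level.unsqueeze(1)).float()  # (B, D)
    return feat * mask
```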
7. Liu H, Ma S, Xia D, Li S. SFANet: A Spectrum-Aware Feature Augmentation Network for Visible-Infrared Person Reidentification. IEEE Transactions on Neural Networks and Learning Systems 2023;34:1958-1971. [PMID: 34464275; DOI: 10.1109/tnnls.2021.3105702]
Abstract
Visible-infrared person re-identification (VI-ReID) is a challenging matching problem due to large modality variations between visible and infrared images. Existing approaches usually bridge the modality gap with feature-level constraints only, ignoring pixel-level variations. Some methods employ a generative adversarial network (GAN) to generate style-consistent images, but this destroys structure information and introduces considerable noise. In this article, we explicitly consider these challenges and formulate a novel spectrum-aware feature augmentation network, SFANet, for the cross-modality matching problem. Specifically, we propose to employ grayscale-spectrum images to fully replace RGB images for feature learning. Learning with grayscale-spectrum images, our model can markedly reduce the modality discrepancy and detect inner structural relations across modalities, making it robust to color variations. At the feature level, we improve the conventional two-stream network by balancing the number of modality-specific and shared convolutional blocks, which preserves the spatial structure information of features. Additionally, a bidirectional tri-constrained top-push ranking loss (BTTR) is embedded in the proposed network to improve discriminability and further boost matching accuracy. We also introduce an effective dual-linear identification (ID) embedding with batch normalization to model identity-specific information and assist the BTTR loss in stabilizing magnitudes. Extensive experiments on the SYSU-MM01 and RegDB datasets demonstrate that each component of the proposed framework contributes and that SFANet achieves highly competitive VI-ReID performance.
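A simplified, single-direction top-push term gives the flavor of the ranking constraint (the paper's BTTR adds bidirectional and tri-constrained components); this sketch assumes each identity appears at least twice in the batch.

```python
import torch

def top_push_loss(feats, labels, margin=0.3):
    """Top-push ranking: for each anchor, the hardest (farthest) positive
    must end up closer than the hardest (nearest) negative by `margin`."""
    dist = torch.cdist(feats, feats)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    # Farthest same-identity sample (excluding self) and nearest other-identity sample.
    pos = dist.masked_fill(~(same & ~eye), float('-inf')).amax(dim=1)
    neg = dist.masked_fill(same, float('inf')).amin(dim=1)
    return torch.clamp(pos - neg + margin, min=0).mean()
```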
8. Yang Z, Zhang C, Li R, Xu Y, Lin G. Efficient Few-Shot Object Detection via Knowledge Inheritance. IEEE Transactions on Image Processing 2022;32:321-334. [PMID: 37015553; DOI: 10.1109/tip.2022.3228162]
Abstract
Few-shot object detection (FSOD), which aims at learning a generic detector that can adapt to unseen tasks with scarce training samples, has witnessed consistent improvement recently. However, most existing methods ignore efficiency issues such as high computational complexity and slow adaptation speed. Notably, efficiency has become an increasingly important evaluation metric for few-shot techniques due to the emerging trend toward embedded AI. To this end, we present an efficient pretrain-transfer framework (PTF) baseline with no computational increment, which achieves results comparable with previous state-of-the-art (SOTA) methods. Upon this baseline, we devise an initializer named knowledge inheritance (KI) to reliably initialize the novel weights for the box classifier, which effectively facilitates knowledge transfer and boosts adaptation speed. Within the KI initializer, we propose an adaptive length re-scaling (ALR) strategy to alleviate the vector-length inconsistency between the predicted novel weights and the pretrained base weights. Finally, our approach not only achieves SOTA results across three public benchmarks, i.e., PASCAL VOC, COCO, and LVIS, but also exhibits 1.8-100× faster adaptation than other methods on the COCO/LVIS benchmarks during few-shot transfer. To the best of our knowledge, this is the first work to consider the efficiency problem in FSOD. We hope to motivate a trend toward powerful yet efficient few-shot technique development. The code is publicly available at https://github.com/Ze-Yang/Efficient-FSOD.
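One plausible reading of length re-scaling is norm matching between novel and base classifier weights, sketched below; the exact ALR formulation in the paper may differ.

```python
import torch

def adaptive_length_rescale(novel_w, base_w):
    """Rescale predicted novel-class weights so their vector lengths match
    the average length of the pretrained base-class weights.

    novel_w: (C_novel, D) predicted classifier weights for novel classes.
    base_w:  (C_base, D) pretrained base-class classifier weights.
    """
    target_norm = base_w.norm(dim=1).mean()
    return novel_w * (target_norm / novel_w.norm(dim=1, keepdim=True))
```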
9. Wu L, Liu D, Zhang W, Chen D, Ge Z, Boussaid F, Bennamoun M, Shen J. Pseudo-Pair Based Self-Similarity Learning for Unsupervised Person Re-Identification. IEEE Transactions on Image Processing 2022;31:4803-4816. [PMID: 35830405; DOI: 10.1109/tip.2022.3186746]
Abstract
Person re-identification (re-ID) is of great importance to video surveillance systems, estimating the similarity between pairs of cross-camera person shots. Current methods for estimating such similarity require a large number of labeled samples for supervised training. In this paper, we present a pseudo-pair based self-similarity learning approach for unsupervised person re-ID without human annotations. Unlike conventional unsupervised re-ID methods that derive pseudo-labels from global clustering, we construct patch surrogate classes as initial supervision and propose to assign pseudo-labels to images through pairwise gradient-guided similarity separation. This clusters images into pseudo pairs, which can be updated during training. Based on the pseudo pairs, we improve the generalization of the similarity function via a novel self-similarity learning: it learns local discriminative features from individual images via intra-similarity, and discovers patch correspondences across images via inter-similarity. The intra-similarity learning is based on channel attention to detect diverse local features within an image. The inter-similarity learning employs a deformable convolution with a non-local block to align patches for cross-image similarity. Experimental results on several re-ID benchmark datasets demonstrate the superiority of the proposed method over state-of-the-art methods.
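The gradient-guided similarity separation is approximated below by a nearest-neighbor similarity-gap rule; this is a crude stand-in of our own devising, intended only to show how pseudo pairs could be formed from a similarity matrix.

```python
import torch

def assign_pseudo_pairs(feats, gap=0.1):
    """Pair each image with its nearest neighbor when the similarity gap
    between the 1st and 2nd neighbor is large enough.

    feats: (N, D) L2-normalized features, N >= 2. Returns list of (i, j) pairs.
    """
    sim = feats @ feats.t()
    sim.fill_diagonal_(float('-inf'))          # exclude self-matches
    top2, idx = sim.topk(2, dim=1)             # best and second-best match
    return [(i, idx[i, 0].item())
            for i in range(len(feats))
            if (top2[i, 0] - top2[i, 1]) > gap]
```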
10. Shao H, Zhong D. Towards open-set touchless palmprint recognition via weight-based meta metric learning. Pattern Recognition 2022;121:108247. [PMID: 34400847; PMCID: PMC8359644; DOI: 10.1016/j.patcog.2021.108247]
Abstract
Touchless biometrics has become significant in the wake of the 2019 novel coronavirus (COVID-19). Owing to its convenience, user-friendliness, and high accuracy, touchless palmprint recognition shows great potential when hygiene is a concern during COVID-19. However, previous palmprint recognition methods focus mainly on the closed-set scenario. In this paper, a novel Weight-based Meta Metric Learning (W2ML) method is proposed for accurate open-set touchless palmprint recognition, where only a subset of categories is seen during training. A deep metric learning-based feature extractor is learned in a meta fashion to improve generalization. Multiple sets are sampled randomly to define support and query sets, which are further combined into meta sets to constrain set-based distances. In particular, hard sample mining and weighting are adopted to select informative meta sets and improve efficiency. Finally, embeddings with clear inter-class and intra-class separation are obtained as features for palmprint identification and verification. Experiments are conducted on four palmprint benchmarks comprising fourteen constrained and unconstrained palmprint datasets. The results show that our W2ML method is more robust and efficient in open-set palmprint recognition than the state of the art, increasing accuracy by up to 9.11% and decreasing the Equal Error Rate (EER) by up to 2.97%.
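The episodic support/query sampling underlying such meta metric learning can be sketched generically; the way/shot parameters and data layout are assumptions, and the paper's hard mining and weighting are omitted.

```python
import random

def sample_episode(index_by_class, n_way=5, k_shot=1, q_per_class=5):
    """Sample one open-set episode: n_way classes, with k_shot support and
    q_per_class query samples each.

    index_by_class: dict mapping class id -> list of sample indices.
    Returns (support, query) as lists of (sample_index, episode_label) pairs.
    """
    classes = random.sample(list(index_by_class), n_way)
    support, query = [], []
    for label, c in enumerate(classes):
        picks = random.sample(index_by_class[c], k_shot + q_per_class)
        support += [(i, label) for i in picks[:k_shot]]
        query += [(i, label) for i in picks[k_shot:]]
    return support, query
```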
Affiliation(s)
- Huikai Shao: School of Automation Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
- Dexing Zhong: School of Automation Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; Pazhou Lab, Guangzhou 510335, China; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
11. A Vision-Based Approach for the Analysis of Core Characteristics of Volcanic Ash. Sensors 2021;21:7180. [PMID: 34770486; PMCID: PMC8588176; DOI: 10.3390/s21217180]
Abstract
Volcanic ash fall-out represents a serious hazard for air and road traffic. The forecasting models used to predict its spatio-temporal evolution require information about core characteristics of the volcanic particles, such as their granulometry. Typically, such information is gained by spot direct observation of the ash collected on the ground or by using expensive instrumentation. In this paper, a vision-based methodology for estimating ash granulometry is presented. A dedicated image processing paradigm was developed and implemented in LabVIEW™. The methodology was validated experimentally using digital reference images resembling different operating conditions. The outcome of the assessment procedure was very encouraging, showing an accuracy of the image processing algorithm within 1.76%.
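The paper's pipeline is implemented in LabVIEW™ and is not reproduced here; purely to illustrate the general threshold-label-measure approach to granulometry, here is an OpenCV/NumPy sketch. The fixed threshold, the bright-particles-on-dark-background assumption, and the equivalent-circle diameter are our assumptions.

```python
import cv2
import numpy as np

def ash_granulometry(gray, px_per_mm, thresh=128):
    """Estimate a particle size distribution from a grayscale ash image:
    threshold, label connected components, convert areas to diameters.

    gray: uint8 image with bright particles on a dark background
          (use THRESH_BINARY_INV for the opposite polarity).
    Returns equivalent-circle diameters in millimetres.
    """
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    areas_px = stats[1:, cv2.CC_STAT_AREA]        # skip background label 0
    diam_px = 2.0 * np.sqrt(areas_px / np.pi)     # equivalent circle diameter
    return diam_px / px_per_mm
```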
12. Miao Y, Lin Z, Ma X, Ding G, Han J. Learning Transformation-Invariant Local Descriptors With Low-Coupling Binary Codes. IEEE Transactions on Image Processing 2021;30:7554-7566. [PMID: 34449360; DOI: 10.1109/tip.2021.3106805]
Abstract
Despite the great success achieved by prevailing binary local descriptors, they still suffer from two problems: 1) vulnerability to geometric transformations and 2) the lack of an effective treatment of the highly correlated bits generated by directly applying image-hashing schemes. To tackle both limitations, we propose an unsupervised Transformation-invariant Binary Local Descriptor learning method (TBLD). Specifically, the transformation invariance of binary local descriptors is ensured by projecting the original patches and their transformed counterparts into an identical high-dimensional feature space and an identical low-dimensional descriptor space simultaneously, while dissimilar image patches are enforced to have distinctive binary local descriptors. Moreover, to reduce the correlation between bits, we propose a bottom-up learning strategy, termed the Adversarial Constraint Module, in which low-coupling binary codes are introduced externally to guide the learning of binary local descriptors. With the aid of the Wasserstein loss, the framework is optimized to encourage the distribution of the generated binary local descriptors to mimic that of the introduced low-coupling binary codes, eventually making the former more low-coupling. Experimental results on three benchmark datasets demonstrate the superiority of the proposed method over state-of-the-art methods. The project page is available at https://github.com/yoqim/TBLD.
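The paper drives low coupling adversarially via a Wasserstein loss; as a simpler stand-in that conveys the goal, the sketch below penalizes near-binary violations and off-diagonal bit covariance directly. The loss terms and their weighting are our assumptions.

```python
import torch

def binary_descriptor_loss(z):
    """Relaxed binary-code regularizers: push activations toward ±1 and
    penalize correlation between bits (low-coupling codes).

    z: (N, K) real-valued descriptor activations (e.g. after tanh).
    """
    quantization = (z.abs() - 1.0).pow(2).mean()        # near-binary values
    zc = z - z.mean(dim=0, keepdim=True)
    corr = (zc.t() @ zc) / len(z)                       # (K, K) bit covariance
    off_diag = corr - torch.diag(torch.diagonal(corr))
    decorrelation = off_diag.pow(2).mean()              # low-coupling bits
    return quantization + decorrelation
```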
13. Liu D, Liang C, Chen S, Tie Y, Qi L. Auto-encoder based structured dictionary learning for visual classification. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.09.088]
14. Zhu L, Fan H, Luo Y, Xu M, Yang Y. Few-Shot Common-Object Reasoning Using Common-Centric Localization Network. IEEE Transactions on Image Processing 2021;30:4253-4262. [PMID: 33830923; DOI: 10.1109/tip.2021.3070733]
Abstract
In the few-shot common-localization task, given a few support images without bounding-box annotations at each episode, the goal is to localize the common object in a query image of unseen categories. The task requires reasoning about the common object across the given images and predicting its spatial location under varying shapes, sizes, and orientations. In this work, we propose a common-centric localization (CCL) network for few-shot common localization. The motivation is to learn common object features by dynamic feature relation reasoning via a graph convolutional network with conditional feature aggregation. First, we propose a local common-object region generation pipeline to reduce background noise caused by feature misalignment: each support image predicts more accurate object locations by replacing the query with images from the support set. Second, we introduce a graph convolutional network with dynamic feature transformation to enforce common-object reasoning. To enhance discriminability during feature matching and enable better generalization to unseen scenarios, we leverage a conditional feature encoding function that adaptively alters visual features according to the input query. Third, we introduce a common-centric relation structure to model the correlation between the common features and the query image feature; the generated common features guide the query image feature toward a more common-object-related representation. We evaluate our network on four datasets, i.e., CL-VOC-07, CL-VOC-12, CL-COCO, and CL-VID, and obtain significant improvements over the state of the art. Our quantitative results confirm the effectiveness of the network.
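One plausible form of the conditional feature encoding is FiLM-style modulation, where scale and shift parameters are predicted from the query feature; this is our guess at the mechanism, not the paper's code.

```python
import torch
import torch.nn as nn

class ConditionalFeatureEncoder(nn.Module):
    """FiLM-style conditional encoding: scale/shift support features by
    parameters predicted from the query feature."""
    def __init__(self, dim):
        super().__init__()
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)

    def forward(self, support_feat, query_feat):
        # support_feat: (N, D) support node features; query_feat: (D,)
        scale = 1 + self.to_scale(query_feat)   # residual scaling around 1
        return support_feat * scale + self.to_shift(query_feat)
```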
15. Wu L, Wang Y, Gao J, Wang M, Zha ZJ, Tao D. Deep Coattention-Based Comparator for Relative Representation Learning in Person Re-Identification. IEEE Transactions on Neural Networks and Learning Systems 2021;32:722-735. [PMID: 32275611; DOI: 10.1109/tnnls.2020.2979190]
Abstract
Person re-identification (re-ID) requires representations that remain discriminative for unseen shots when recognizing identities across disjoint camera views. Effective methods have been developed via pair-wise similarity learning to detect a fixed set of region features that can be mapped to compute a similarity value. However, the relevant parts of each image are detected independently, without reference to the correlated regions of the other image, and region-based methods rely on spatially positioned local features for aligned similarity. In this article, we introduce the deep coattention-based comparator (DCC), which fuses codependent representations of paired images so as to correlate their most relevant parts and produce relative representations accordingly. The proposed approach mimics human foveation by detecting distinct regions concurrently across the two images and alternately attending to fuse them into the similarity learning. Our comparator learns representations relative to a test shot and is well suited to re-identifying pedestrians in surveillance. We perform extensive experiments that provide insights and demonstrate state-of-the-art results on benchmark datasets: gains of 1.2 and 2.5 points in mean average precision (mAP) on DukeMTMC-reID and Market-1501, respectively.
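The core of coattention, re-describing each image's regions through the other image, can be sketched with a single affinity matrix and two softmaxes; shapes and names are illustrative assumptions.

```python
import torch

def coattention(fa, fb):
    """Cross-image coattention: re-weight each image's regions by their
    affinity with the other image's regions.

    fa, fb: (Ra, D) and (Rb, D) region features of images A and B.
    Returns codependent context representations for A and B.
    """
    aff = fa @ fb.t()                          # (Ra, Rb) region affinities
    a_ctx = torch.softmax(aff, dim=1) @ fb     # A's regions described by B
    b_ctx = torch.softmax(aff.t(), dim=1) @ fa # B's regions described by A
    return a_ctx, b_ctx
```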
16. Liu M, Qu L, Nie L, Liu M, Duan L, Chen B. Iterative Local-Global Collaboration Learning towards One-Shot Video Person Re-Identification. IEEE Transactions on Image Processing 2020;29:9360-9372. [PMID: 33006929; DOI: 10.1109/tip.2020.3026625]
Abstract
Video person re-identification (video Re-ID) plays an important role in surveillance video analysis and has gained increasing attention recently. However, existing supervised methods require vast numbers of labeled identities across cameras, resulting in poor scalability in practice. Although some unsupervised approaches have been explored for video Re-ID, they remain in their infancy due to the difficulty of learning discriminative features on unlabeled data. In this paper, we focus on one-shot video Re-ID and present an iterative local-global collaboration learning approach to learn robust and discriminative person representations. Specifically, it jointly considers global video information and local frame-sequence information to better capture the diverse appearance of a person for feature learning and pseudo-label estimation. Moreover, as the cross-entropy loss may induce the model to focus on identity-irrelevant factors, we introduce the variational information bottleneck as a regularization term during training; it helps filter undesirable information and characterize subtle differences among persons. Since pseudo-label accuracy cannot always be guaranteed, we adopt a dynamic selection strategy that chooses the pseudo-labeled data with higher confidence to update the training set and re-train the model. During training, our method iteratively executes feature learning, pseudo-label estimation, and dynamic sample selection until all the unlabeled data have been seen. Extensive experiments on two public datasets, DukeMTMC-VideoReID and MARS, verify the superiority of our model over several cutting-edge competitors.
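The variational information bottleneck regularizer reduces, for a diagonal-Gaussian encoder, to a closed-form KL term added to the task loss; the sketch below gives that term under our own naming (sample the bottleneck with z = mu + randn_like(mu) * (0.5 * logvar).exp()).

```python
import torch

def vib_kl(mu, logvar):
    """KL(q(z|x) || N(0, I)) for a diagonal-Gaussian bottleneck. Weighted
    and added to the task loss, it suppresses identity-irrelevant detail.

    mu, logvar: (B, D) predicted posterior mean and log-variance.
    """
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()
```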
17. Avola D, Cinque L, Fagioli A, Foresti GL, Pannone D, Piciarelli C. Bodyprint-A Meta-Feature Based LSTM Hashing Model for Person Re-Identification. Sensors 2020;20:5365. [PMID: 32962168; PMCID: PMC7570836; DOI: 10.3390/s20185365]
Abstract
Person re-identification is concerned with matching people across disjoint camera views at different places and times. The task is of great interest in computer vision, especially in video surveillance applications where the re-identification and tracking of persons are required in uncontrolled crowded spaces and over long time periods. These aspects are responsible for most of the currently unsolved problems of person re-identification: the presence of many people in a location, as well as the passage of hours or days, gives rise to substantial changes in the visual appearance of people, e.g., in clothes, lighting, and occlusions, making person re-identification a very hard task. In this paper, for the first time in the literature, a meta-feature based Long Short-Term Memory (LSTM) hashing model for person re-identification is presented. Starting from 2D skeletons extracted from RGB video streams, the proposed method computes a set of novel meta-features based on movement, gait, and bone proportions. These features are analyzed by a network composed of a single LSTM layer and two dense layers: the first creates a pattern of the person's identity, while the latter two generate a bodyprint hash through binary coding. The effectiveness of the proposed method is tested on three challenging datasets, iLIDS-VID, PRID 2011, and MARS. The reported results show that the proposed method, although not based on the visual appearance of people, is fully competitive with methods based on visual features. In addition, thanks to its skeleton-based abstraction, the method makes a concrete contribution to open problems, such as long-term re-identification and severe illumination changes, that tend to heavily affect the visual appearance of persons.
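A minimal sketch of the LSTM-plus-dense hashing layout described above; the layer widths, hash length, and sign(tanh) binarization are our assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BodyprintHasher(nn.Module):
    """LSTM over per-frame skeleton meta-features, followed by two dense
    layers that emit a binary 'bodyprint' hash."""
    def __init__(self, feat_dim, hidden=128, hash_bits=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hash_bits), nn.Tanh(),
        )

    def forward(self, seq):
        # seq: (B, T, feat_dim) movement/gait/bone-proportion features per frame
        _, (h, _) = self.lstm(seq)            # h: (1, B, hidden) final state
        codes = self.head(h.squeeze(0))       # (B, hash_bits) in (-1, 1)
        # Train on the relaxed codes; take the sign only at inference time,
        # since sign() is non-differentiable.
        return torch.sign(codes)
```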
Affiliation(s)
- Danilo Avola: Department of Computer Science, Sapienza University, 00198 Rome, Italy; Department of Communication and Social Research, Sapienza University, 00198 Rome, Italy
- Luigi Cinque: Department of Computer Science, Sapienza University, 00198 Rome, Italy
- Alessio Fagioli: Department of Computer Science, Sapienza University, 00198 Rome, Italy
- Gian Luca Foresti: Department of Mathematics, Computer Science and Physics, University of Udine, 33100 Udine, Italy
- Daniele Pannone: Department of Computer Science, Sapienza University, 00198 Rome, Italy
- Claudio Piciarelli: Department of Mathematics, Computer Science and Physics, University of Udine, 33100 Udine, Italy