1. Yu X, Pan Z, Zhao Y, Gao Y. Self-Supervised Lie Algebra Representation Learning via Optimal Canonical Metric. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:3547-3558. PMID: 38329862. DOI: 10.1109/tnnls.2024.3355492.
Abstract
Learning discriminative representations from limited training samples is an important yet challenging visual categorization task. While prior work has shown that incorporating self-supervised learning can improve performance, we found that directly using a canonical metric on a Lie group is theoretically incorrect. In this article, we prove that a valid optimization measure should instead be a canonical metric on the Lie algebra. Based on this theoretical finding, the article introduces a novel self-supervised Lie algebra network (SLA-Net) representation learning framework. By minimizing the canonical metric distance between target and predicted Lie algebra representations within a computationally convenient vector space, SLA-Net avoids computing a nontrivial geodesic (locally length-minimizing curve) metric on a manifold (curved space). By simultaneously optimizing a single set of parameters shared between self-supervised learning and supervised classification, SLA-Net gains improved generalization capability. Comprehensive evaluation on eight public datasets shows the effectiveness of SLA-Net for visual categorization with limited samples.
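The core idea of measuring distances in the flat Lie algebra rather than on the curved group manifold can be illustrated with rotations. The sketch below (not the SLA-Net implementation; the example rotations are hypothetical) maps elements of SO(3) to so(3) via the matrix logarithm and compares them with an ordinary Euclidean norm in that vector space.

```python
# Minimal sketch: compare two rotations by mapping them to the Lie algebra so(3)
# and measuring a plain Euclidean (canonical-metric) distance in that flat vector
# space, instead of a geodesic distance on the curved SO(3) manifold.
import numpy as np

def so3_log(R):
    """Matrix logarithm of a rotation R in SO(3), returned as an axis-angle 3-vector."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-8:                      # near-identity: log(R) ~ 0
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2],
                  R[0, 2] - R[2, 0],
                  R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    return theta * w                      # element of the Lie algebra so(3)

def lie_algebra_distance(R1, R2):
    """Euclidean distance between the so(3) representations of two rotations."""
    return np.linalg.norm(so3_log(R1) - so3_log(R2))

# Toy usage: two small rotations about the z-axis.
def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

print(lie_algebra_distance(rot_z(0.1), rot_z(0.3)))   # ~0.2
```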
2. Asperti A, Naldi L, Fiorilla S. An Investigation of the Domain Gap in CLIP-Based Person Re-Identification. Sensors (Basel, Switzerland) 2025; 25:363. PMID: 39860732. PMCID: PMC11769178. DOI: 10.3390/s25020363.
Abstract
Person re-identification (re-id) is a critical computer vision task aimed at identifying individuals across multiple non-overlapping cameras, with wide-ranging applications in intelligent surveillance systems. Despite recent advances, the domain gap, i.e., the performance degradation observed when models encounter unseen datasets, remains a critical challenge. CLIP-based models, leveraging multimodal pre-training, offer potential for mitigating this issue by aligning visual and textual representations. In this study, we provide a comprehensive quantitative analysis of the domain gap in CLIP-based re-id systems across standard benchmarks, including Market-1501, DukeMTMC-reID, MSMT17, and Airport, simulating real-world deployment conditions. We systematically measure the performance of these models in terms of mean average precision (mAP) and Rank-1 accuracy, offering insights into the challenges faced during dataset transitions. Our analysis highlights the specific advantages introduced by CLIP's visual-textual alignment and evaluates its contribution relative to strong image encoder baselines. Additionally, we evaluate the impact of extending training sets with non-domain-specific data and incorporating random erasing augmentation, achieving an average improvement of +4.3% in mAP and +4.0% in Rank-1 accuracy. Our findings underscore the importance of standardized benchmarks and systematic evaluations for enhancing reproducibility and guiding future research. This work contributes to a deeper understanding of the domain gap in re-id, while highlighting pathways for improving model robustness and generalization in diverse, real-world scenarios.
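For readers unfamiliar with the two metrics reported here, the simplified sketch below shows how Rank-1 accuracy and mAP are typically computed from a query-gallery distance matrix. It is an illustration only (it omits the usual same-camera/same-index filtering used in re-id benchmarks), and the toy labels are hypothetical.

```python
# Illustrative sketch of Rank-1 accuracy and mean average precision (mAP)
# computed from a query-gallery distance matrix. Simplified: no camera filtering.
import numpy as np

def evaluate(dist, q_ids, g_ids):
    """dist: (num_query, num_gallery) distances; q_ids/g_ids: identity labels."""
    rank1_hits, aps = 0.0, []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                 # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(float)
        if matches.sum() == 0:
            continue                                # no true match in the gallery
        rank1_hits += matches[0]
        # average precision: precision at each position of a true match
        hit_positions = np.where(matches == 1)[0]
        precisions = np.arange(1, len(hit_positions) + 1) / (hit_positions + 1)
        aps.append(precisions.mean())
    return rank1_hits / dist.shape[0], float(np.mean(aps))

# Toy example: 2 queries, 4 gallery images.
dist = np.array([[0.1, 0.5, 0.9, 0.3],
                 [0.8, 0.2, 0.4, 0.7]])
rank1, mAP = evaluate(dist, np.array([0, 1]), np.array([0, 1, 1, 0]))
print(rank1, mAP)
```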
3. Bi Q, Zhou B, Ji W, Xia GS. Universal Fine-grained Visual Categorization by Concept Guided Learning. IEEE Transactions on Image Processing 2025; PP:394-409. PMID: 40030876. DOI: 10.1109/tip.2024.3523802.
Abstract
Existing fine-grained visual categorization (FGVC) methods assume that the fine-grained semantics rest in the informative parts of an image. This assumption works well on favorable front-view, object-centric images, but faces great challenges in many real-world scenarios, such as scene-centric images (e.g., street view) and adverse viewpoints (e.g., object re-identification, remote sensing). In such scenarios, mis- or over-activated features are likely to confuse the part selection and degrade the fine-grained representation. In this paper, we are motivated to design a universal FGVC framework for real-world scenarios. More precisely, we propose concept guided learning (CGL), which models the concepts of a certain fine-grained category as a combination of concepts inherited from its subordinate coarse-grained category and discriminative concepts of its own. The discriminative concepts are utilized to guide the fine-grained representation learning. Specifically, three key steps are designed, namely, concept mining, concept fusion, and concept constraint. On the other hand, to bridge the FGVC dataset gap under scene-centric and adverse-viewpoint scenarios, a Fine-grained Land-cover Categorization Dataset (FGLCD) with 59,994 fine-grained samples is proposed. Extensive experiments show that the proposed CGL: 1) achieves competitive performance on conventional FGVC; 2) achieves state-of-the-art performance on fine-grained aerial scenes and scene-centric street scenes; and 3) generalizes well to object re-identification and fine-grained aerial object detection. The dataset and source code will be available at https://github.com/BiQiWHU/CGL.
4. Boned C, Talarmain M, Ghanmi N, Chiron G, Biswas S, Awal AM, Ramos Terrades O. Synthetic dataset of ID and Travel Documents. Sci Data 2024; 11:1356. PMID: 39695172. DOI: 10.1038/s41597-024-04160-9.
Abstract
This paper presents a new synthetic dataset of ID and travel documents, called SIDTD. The SIDTD dataset is created to help train and evaluate forged ID document detection systems. Such a dataset has become a necessity because ID documents contain personal information and a public dataset of real documents cannot be released. Moreover, forged documents are scarce compared to legitimate ones, and the way they are generated varies from one fraudster to another, resulting in a class with high intra-class variability. In this paper we introduce a synthetically generated dataset that simulates the most common and easiest forgeries made by ordinary users of ID and travel documents. The creation of this dataset will help the document image analysis community to progress in the task of automatic ID document verification in online onboarding systems.
Affiliation(s)
- Carlos Boned: Computer Vision Centre, Bellaterra, 08193, Spain
- Nabil Ghanmi: IDNow, 122 Rue Robert Keller, 35220, Cesson-Sévigné, France
- Oriol Ramos Terrades: Computer Vision Centre, Bellaterra, 08193, Spain; Universitat Autònoma de Barcelona, Dep. Computer Science, Bellaterra, 08193, Spain
5. Tan B, Xiao Y, Wang Y, Li S, Yang J, Cao Z, Zhou JT, Yuan J. Beyond Pattern Variance: Unsupervised 3-D Action Representation Learning With Point Cloud Sequence. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:18186-18199. PMID: 37729565. DOI: 10.1109/tnnls.2023.3312673.
Abstract
This work makes the first research effort to address unsupervised 3-D action representation learning with point cloud sequences, in contrast to existing unsupervised methods that rely on 3-D skeleton information. Our proposition builds on the state-of-the-art 3-D action descriptor, the 3-D dynamic voxel (3DV), with contrastive learning (CL). 3DV compresses a point cloud sequence into a compact point cloud of 3-D motion information, and spatiotemporal data augmentations are applied to it to drive CL. However, we find that existing CL methods (e.g., SimCLR or MoCo v2) often suffer from high pattern variance across the augmented 3DV samples from the same action instance: the augmented 3DV samples remain highly feature-complementary after CL, yet the complementary discriminative clues within them have not been well exploited. To address this, a feature augmentation adapted CL (FACL) approach is proposed, which facilitates 3-D action representation by considering the features from all augmented 3DV samples jointly, in the spirit of feature augmentation. FACL runs in a global-local way: one branch learns a global feature that involves the discriminative clues from the raw and augmented 3DV samples, and the other focuses on enhancing the discriminative power of the local feature learned from each augmented 3DV sample. The global and local features are fused via concatenation to characterize a 3-D action jointly. To fit FACL, a series of spatiotemporal data augmentation approaches is also studied on 3DV. Extensive experiments verify the superiority of our unsupervised learning method for 3-D action feature learning: it outperforms state-of-the-art skeleton-based counterparts by 6.4% and 3.6% under the cross-setup and cross-subject test settings on NTU RGB+D 120, respectively. The source code is available at https://github.com/tangent-T/FACL.
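Two generic ingredients mentioned above, a contrastive objective between augmented views and fusion of global and local features by concatenation, can be sketched as follows. This is not the FACL code; the InfoNCE form and the random embeddings are illustrative stand-ins.

```python
# Minimal numpy sketch: an InfoNCE-style contrastive loss between two augmented
# views, and fusion of a global and a local feature by concatenation.
import numpy as np

def l2_normalize(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two augmentations of the same samples."""
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    logits = z1 @ z2.T / temperature            # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True) # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))          # positives lie on the diagonal

def fuse(global_feat, local_feat):
    """Characterize an action by concatenating global and local descriptors."""
    return np.concatenate([global_feat, local_feat], axis=-1)

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(8, 128)), rng.normal(size=(8, 128))
print(info_nce(z1, z2), fuse(z1, z2).shape)     # loss scalar, (8, 256)
```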
6. Li S, Li F, Li J, Li H, Zhang B, Tao D, Gao X. Logical Relation Inference and Multiview Information Interaction for Domain Adaptation Person Re-Identification. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:14770-14782. PMID: 37307174. DOI: 10.1109/tnnls.2023.3281504.
Abstract
Domain adaptation person re-identification (Re-ID) is a challenging task that aims to transfer the knowledge learned from a labeled source domain to an unlabeled target domain. Recently, some clustering-based domain adaptation Re-ID methods have achieved great success. However, these methods ignore the negative influence of different camera styles on pseudo-label prediction. The reliability of the pseudo-labels plays a key role in domain adaptation Re-ID, and different camera styles bring great challenges for pseudo-label prediction. To this end, a novel method is proposed that bridges the gap between different cameras and extracts more discriminative features from an image. Specifically, an intra-to-inter mechanism is introduced, in which samples from each camera are first grouped and then aligned at the class level across different cameras, followed by our logical relation inference (LRI). Thanks to these strategies, the logical relationship between simple classes and hard classes is established, preventing the sample loss caused by discarding hard samples. Furthermore, we also present a multiview information interaction (MvII) module that takes features of different images of the same pedestrian as patch tokens, capturing the global consistency of a pedestrian and contributing to discriminative feature extraction. Unlike existing clustering-based methods, our method employs a two-stage framework that generates reliable pseudo-labels from the intra-camera and inter-camera views, respectively, to account for camera styles and thereby increase robustness. Extensive experiments on several benchmark datasets show that the proposed method outperforms a wide range of state-of-the-art methods. The source code has been released at https://github.com/lhf12278/LRIMV.
7. Zheng Z, Wang X, Zheng N, Yang Y. Parameter-Efficient Person Re-Identification in the 3D Space. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:7534-7547. PMID: 36315532. DOI: 10.1109/tnnls.2022.3214834.
Abstract
People live in a 3D world. However, existing works on person re-identification (re-id) mostly consider semantic representation learning in a 2D space, intrinsically limiting the understanding of people. In this work, we address this limitation by exploring the prior knowledge of the 3D body structure. Specifically, we project 2D images to a 3D space and introduce a novel parameter-efficient omni-scale graph network (OG-Net) to learn the pedestrian representation directly from 3D point clouds. OG-Net effectively exploits the local information provided by sparse 3D points and takes advantage of the structure and appearance information in a coherent manner. With the help of 3D geometry information, we can learn a new type of deep re-id feature that is free from noisy variants such as scale and viewpoint. To our knowledge, this is among the first attempts to conduct person re-id in the 3D space. We demonstrate through extensive experiments that the proposed method: (1) eases the matching difficulty in the traditional 2D space; (2) exploits the complementary information of 2D appearance and 3D structure; (3) achieves competitive results with limited parameters on four large-scale person re-id datasets; and (4) has good scalability to unseen datasets. Our code, models, and generated 3D human data are publicly available at https://github.com/layumi/person-reid-3d.
8. Zhu K, Guo H, Liu S, Wang J, Tang M. Learning Semantics-Consistent Stripes With Self-Refinement for Person Re-Identification. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:8531-8542. PMID: 35298384. DOI: 10.1109/tnnls.2022.3151487.
Abstract
Aligning human parts automatically is one of the most challenging problems for person re-identification (re-ID). Recently, stripe-based methods, which equally partition person images into fixed stripes for aligned representation learning, have achieved great success. However, stripes with fixed height and position cannot properly handle the misalignment caused by inaccurate detection and occlusion, and may introduce much background noise. In this article, we aim at learning adaptive stripes with foreground refinement to achieve pixel-level part alignment using only person identity labels for person re-ID, and we make two contributions. 1) A semantics-consistent stripe learning method (SCS). Given an image, SCS partitions it into adaptive horizontal stripes, and each stripe corresponds to a specific semantic part. Specifically, SCS iterates between two processes: i) clustering the rows into human parts or background to generate pseudo-part labels of rows and ii) learning a row classifier to partition a person image, supervised by the latest pseudo-labels. This iterative scheme guarantees the accuracy of the learned image partition. 2) A self-refinement method (SCS+) to remove the background noise in stripes. We employ the above row classifier to generate the probabilities of pixels belonging to human parts (foreground) or background, called the class activation map (CAM). Only the most confident areas from the CAM are assigned foreground/background labels to guide the human part refinement. Finally, by intersecting the semantics-consistent stripes with the foreground areas, SCS+ locates the human parts at the pixel level, obtaining a more robust part-aligned representation. Extensive experiments validate that SCS+ sets the new state-of-the-art performance on three widely used datasets, including Market-1501, DukeMTMC-reID, and CUHK03-NP.
9. Wu LY, Liu L, Wang Y, Zhang Z, Boussaid F, Bennamoun M, Xie X. Learning Resolution-Adaptive Representations for Cross-Resolution Person Re-Identification. IEEE Transactions on Image Processing 2023; 32:4800-4811. PMID: 37610890. DOI: 10.1109/tip.2023.3305817.
Abstract
Cross-resolution person re-identification (CRReID) is a challenging and practical problem that involves matching low-resolution (LR) query identity images against high-resolution (HR) gallery images. Query images often suffer from resolution degradation due to the different capturing conditions from real-world cameras. State-of-the-art solutions for CRReID either learn a resolution-invariant representation or adopt a super-resolution (SR) module to recover the missing information from the LR query. In this paper, we propose an alternative SR-free paradigm to directly compare HR and LR images via a dynamic metric that is adaptive to the resolution of a query image. We realize this idea by learning resolution-adaptive representations for cross-resolution comparison. We propose two resolution-adaptive mechanisms to achieve this. The first mechanism encodes the resolution specifics into different subvectors in the penultimate layer of the deep neural network, creating a varying-length representation. To better extract resolution-dependent information, we further propose to learn resolution-adaptive masks for intermediate residual feature blocks. A novel progressive learning strategy is proposed to train those masks properly. These two mechanisms are combined to boost the performance of CRReID. Experimental results show that the proposed method outperforms existing approaches and achieves state-of-the-art performance on multiple CRReID benchmarks.
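One plausible reading of the "varying-length representation" idea, sketched below, is to compare an HR gallery feature and an LR query feature only on the sub-vector that the query's resolution can reliably support. The resolution-to-dimension mapping here is purely illustrative, not the paper's configuration.

```python
# Hedged sketch: lower-resolution queries use only a prefix of the penultimate
# feature vector, so HR and LR images are compared on the dimensions the LR
# image can reliably support. Sub-vector sizes are illustrative.
import numpy as np

SUBVECTOR_DIMS = {64: 128, 128: 256, 256: 512}   # query resolution -> usable dims

def resolution_adaptive_distance(query_feat, gallery_feat, query_resolution):
    d = SUBVECTOR_DIMS[query_resolution]
    q = query_feat[:d] / (np.linalg.norm(query_feat[:d]) + 1e-12)
    g = gallery_feat[:d] / (np.linalg.norm(gallery_feat[:d]) + 1e-12)
    return 1.0 - q @ g                            # cosine distance on the shared prefix

rng = np.random.default_rng(0)
q, g = rng.normal(size=512), rng.normal(size=512)
print(resolution_adaptive_distance(q, g, 64), resolution_adaptive_distance(q, g, 256))
```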
10. Liu H, Ma S, Xia D, Li S. SFANet: A Spectrum-Aware Feature Augmentation Network for Visible-Infrared Person Reidentification. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:1958-1971. PMID: 34464275. DOI: 10.1109/tnnls.2021.3105702.
Abstract
Visible-infrared person re-identification (VI-ReID) is a challenging matching problem due to large modality variations between visible and infrared images. Existing approaches usually bridge the modality gap with only feature-level constraints, ignoring pixel-level variations. Some methods employ a generative adversarial network (GAN) to generate style-consistent images, but this destroys the structure information and incurs a considerable level of noise. In this article, we explicitly consider these challenges and formulate a novel spectrum-aware feature augmentation network, named SFANet, for the cross-modality matching problem. Specifically, we propose to employ grayscale-spectrum images to fully replace RGB images for feature learning. Learning with the grayscale-spectrum images, our model can markedly reduce the modality discrepancy and detect inner structure relations across the different modalities, making it robust to color variations. At the feature level, we improve the conventional two-stream network by balancing the number of specific and sharable convolutional blocks, which preserves the spatial structure information of features. Additionally, a bidirectional tri-constrained top-push ranking loss (BTTR) is embedded in the proposed network to improve discriminability, which further boosts the matching accuracy. Meanwhile, we introduce an effective dual-linear with batch normalization identification (ID) embedding method to model the identity-specific information and assist the BTTR loss in stabilizing magnitudes. On the SYSU-MM01 and RegDB datasets, we conducted extensive experiments to demonstrate that our proposed framework contributes indispensably and achieves very competitive VI-ReID performance.
11. Mugruza-Vassallo CA, Granados-Domínguez JL, Flores-Benites V, Córdova-Berríos L. Different Markov chains modulate visual stimuli processing in a Go-Go experiment in 2D, 3D, and augmented reality. Front Hum Neurosci 2022; 16:955534. PMID: 36569471. PMCID: PMC9769205. DOI: 10.3389/fnhum.2022.955534.
Abstract
The introduction of Augmented Reality (AR) has attracted several developments, although people's experience of AR has not been clearly studied or contrasted with the human experience in 2D and 3D environments. Here, a directional task was applied in 2D, 3D, and AR using simplified stimuli in video games to determine whether human reaction times can be predicted from contextual stimuli; the adapted directional task was also tested. Research question: can the main differences between 2D, 3D, and AR be predicted using Markov chains? Methods: a computer was fitted with a digital acquisition card in order to record, test, and validate participants' reaction times (RTs) against the RTs predicted by Markov chain probability theory. A Markov chain analysis was performed on the participants' data. Subsequently, the influence of several factors on participants' RT and accuracy across the three tasks (environments) was tested statistically using ANOVA. Results: Markov chains of order 1 and 2 successfully reproduced the participants' average reaction times in the 3D and AR tasks, with only the 2D task having its variance predicted from the current state. Moreover, a clear explanation of delayed RTs in every environment was provided. Mood and coffee did not show significant differences in RTs on the simplified video game. Gender differences were found in 3D, where directional goals are endogenous, but no gender differences appeared in AR, where exogenous AR buttons can explain the larger RTs that compensate for the gender difference. Our results suggest that unconscious preparation of selective choices is not restricted to current motor preparation. Instead, decisions in different environments and across genders evolve from the dynamics of preceding cognitive activity, which can inform and improve neurocomputational models.
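The order-1 analysis described above can be illustrated with a small synthetic example (not the study's analysis code; condition labels and RTs below are fabricated for demonstration): estimate a first-order transition matrix over stimulus conditions and the mean RT conditioned on the previous trial's condition.

```python
# Illustrative sketch: first-order Markov transition matrix over conditions and
# mean reaction time conditioned on the previous condition (one step of context).
import numpy as np

conditions = ["2D", "3D", "AR"]                            # hypothetical labels
rng = np.random.default_rng(1)
trial_cond = rng.choice(3, size=200)                       # condition index per trial
trial_rt = 400 + 30 * trial_cond + rng.normal(0, 20, 200)  # synthetic RTs in ms

# Transition matrix P[i, j] = P(next condition = j | current condition = i)
P = np.zeros((3, 3))
for prev, nxt in zip(trial_cond[:-1], trial_cond[1:]):
    P[prev, nxt] += 1
P /= P.sum(axis=1, keepdims=True)

# Mean RT conditioned on the condition of the *previous* trial (order-1 context)
rt_given_prev = np.array([trial_rt[1:][trial_cond[:-1] == i].mean() for i in range(3)])
for name, row, rt in zip(conditions, P, rt_given_prev):
    print(name, np.round(row, 2), round(float(rt), 1))
```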
Affiliation(s)
- Victor Flores-Benites: Facultad de Ingeniería y Arquitectura, Universidad de Lima, Lima, Peru; Universidad de Ingeniería y Tecnología (UTEC), Lima, Peru
12. Yu F, Jiang X, Gong Y, Zheng WS, Zheng F, Sun X. Conditional Feature Embedding by Visual Clue Correspondence Graph for Person Re-Identification. IEEE Transactions on Image Processing 2022; 31:6188-6199. PMID: 36126030. DOI: 10.1109/tip.2022.3206617.
Abstract
Although person re-identification has made impressive progress, difficult cases such as occlusion, viewpoint changes, and similar clothing still pose great challenges. To tackle these challenges, extracting discriminative feature representations is crucial. Most existing methods focus on extracting ReID features from individual images separately. However, when matching two images, we propose that the ReID features of a query image should be dynamically adjusted based on the contextual information from the gallery image it matches. We call this type of ReID feature a conditional feature embedding. In this paper, we propose a novel ReID framework that extracts conditional feature embeddings based on the aligned visual clues between image pairs, called Clue Alignment based Conditional Embedding (CACE-Net). CACE-Net applies an attention module to build a detailed correspondence graph between crucial visual clues in image pairs and uses a discrepancy-based GCN to embed the obtained complex correspondence information into the conditional features. The experiments show that CACE-Net achieves state-of-the-art performance on three public datasets.
13. Miao J, Wu Y, Yang Y. Identifying Visible Parts via Pose Estimation for Occluded Person Re-Identification. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:4624-4634. PMID: 33651698. DOI: 10.1109/tnnls.2021.3059515.
Abstract
We focus on the occlusion problem in person re-identification (re-id), which is one of the main challenges in real-world person retrieval scenarios. Previous methods on the occluded re-id problem usually assume that only the probes are occluded, thereby removing occlusions by manual cropping. However, this may not always hold in practice. This article relaxes this assumption and investigates a more general occlusion problem, where both the probe and gallery images can be occluded. The key to this challenging problem is suppressing the noise information by identifying bodies and occlusions. We propose to incorporate pose information into the re-id framework, which benefits the model in three aspects. First, it provides the location of the body. We then design a Pose-Masked Feature Branch to make our model focus on the body region only and filter out the noise features brought by occlusions. Second, the estimated pose reveals which body parts are visible, giving us a hint to construct more informative person features. We propose a Pose-Embedded Feature Branch to adaptively re-calibrate channel-wise feature responses based on the visible body parts. Third, in testing, the estimated pose indicates which regions are informative and reliable for both probe and gallery images. We then explicitly split the extracted spatial feature into parts, and only part features from the commonly visible parts are used in retrieval. To better evaluate the performance of occluded re-id, we also propose a large-scale dataset for occluded re-id with more than 35,000 images, namely Occluded-DukeMTMC. Extensive experiments show that our approach surpasses previous methods on the occluded, partial, and non-occluded re-id datasets.
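The retrieval rule in the third point can be illustrated as below: compare only the part features whose pose-based visibility is shared by probe and gallery. This is an illustration of the idea, not the paper's code; the part split and masks are hypothetical.

```python
# Sketch: split each feature into part vectors and compare only the parts that
# the pose estimator marks visible in *both* the probe and the gallery image.
import numpy as np

def common_visible_distance(probe_parts, gallery_parts, probe_vis, gallery_vis):
    """probe_parts, gallery_parts: (num_parts, dim); *_vis: boolean visibility masks."""
    common = probe_vis & gallery_vis
    if not common.any():
        return np.inf                                  # no shared evidence to compare
    d = np.linalg.norm(probe_parts[common] - gallery_parts[common], axis=1)
    return float(d.mean())                             # average over shared parts only

rng = np.random.default_rng(0)
probe, gallery = rng.normal(size=(6, 64)), rng.normal(size=(6, 64))
probe_vis = np.array([1, 1, 1, 0, 0, 1], bool)         # e.g. legs occluded in probe
gallery_vis = np.array([1, 1, 0, 0, 1, 1], bool)
print(common_visible_distance(probe, gallery, probe_vis, gallery_vis))
```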
14. Zhou S, Wang J, Shu J, Meng D, Wang L, Zheng N. Multinetwork Collaborative Feature Learning for Semisupervised Person Reidentification. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:4826-4839. PMID: 33729954. DOI: 10.1109/tnnls.2021.3061164.
Abstract
Person reidentification (Re-ID) aims at matching images of the same identity captured from the disjoint camera views, which remains a very challenging problem due to the large cross-view appearance variations. In practice, the mainstream methods usually learn a discriminative feature representation using a deep neural network, which needs a large number of labeled samples in the training process. In this article, we design a simple yet effective multinetwork collaborative feature learning (MCFL) framework to alleviate the data annotation requirement for person Re-ID, which can confidently estimate the pseudolabels of unlabeled sample pairs and consistently learn the discriminative features of input images. To keep the precision of pseudolabels, we further build a novel self-paced collaborative regularizer to extensively exchange the weight information of unlabeled sample pairs between different networks. Once the pseudolabels are correctly estimated, we take the corresponding sample pairs into the training process, which is beneficial to learn more discriminative features for person Re-ID. Extensive experimental results on the Market1501, DukeMTMC, and CUHK03 data sets have shown that our method outperforms most of the state-of-the-art approaches.
15. Wu L, Liu D, Zhang W, Chen D, Ge Z, Boussaid F, Bennamoun M, Shen J. Pseudo-Pair Based Self-Similarity Learning for Unsupervised Person Re-Identification. IEEE Transactions on Image Processing 2022; 31:4803-4816. PMID: 35830405. DOI: 10.1109/tip.2022.3186746.
Abstract
Person re-identification (re-ID) is of great importance to video surveillance systems, as it estimates the similarity between a pair of cross-camera person shots. Current methods for estimating such similarity require a large number of labeled samples for supervised training. In this paper, we present a pseudo-pair based self-similarity learning approach for unsupervised person re-ID without human annotations. Unlike conventional unsupervised re-ID methods that use pseudo labels based on global clustering, we construct patch surrogate classes as initial supervision, and propose to assign pseudo labels to images through pairwise gradient-guided similarity separation. This clusters images into pseudo pairs, and the pseudo pairs can be updated during training. Based on pseudo pairs, we propose to improve the generalization of the similarity function via a novel self-similarity learning: it learns local discriminative features from individual images via intra-similarity, and discovers the patch correspondence across images via inter-similarity. The intra-similarity learning is based on channel attention to detect diverse local features in an image. The inter-similarity learning employs a deformable convolution with a non-local block to align patches for cross-image similarity. Experimental results on several re-ID benchmark datasets demonstrate the superiority of the proposed method over state-of-the-art methods.
16. Swin Transformer Based on Two-Fold Loss and Background Adaptation Re-Ranking for Person Re-Identification. Electronics 2022. DOI: 10.3390/electronics11131941.
Abstract
Person re-identification (Re-ID) aims to identify the same pedestrian from surveillance video in various scenarios. Existing Re-ID models are biased toward learning background appearance when there are many background variations in the pedestrian training set; pedestrians with the same identity then appear against different backgrounds, which interferes with Re-ID performance. This paper proposes a Swin Transformer based on a two-fold loss (TL-TransNet) to pay more attention to the semantic information of a pedestrian's body and preserve valuable background information, thereby reducing the interference of the corresponding background appearance. TL-TransNet is supervised by two types of losses (i.e., circle loss and instance loss) during the training phase. In the retrieval phase, DeepLabV3+ is applied as a pedestrian background segmentation model to generate body masks for the query and gallery sets. The background removal results are generated according to the masks and are used to filter out interfering background information. Subsequently, a background adaptation re-ranking is designed to combine the original information with the background-removed information, which recovers more positive samples with large background deviation. Extensive experiments on two public person Re-ID datasets verify that the proposed method achieves competitive robustness to the background variation problem.
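Of the two training losses named above, the circle loss is the less standard one; a minimal numpy sketch of its usual formulation is given below. The margin and scale values are illustrative, not the paper's settings, and the similarity scores are made up.

```python
# Hedged sketch of the circle loss on cosine similarities to positive (sp) and
# negative (sn) samples; gamma and margin are illustrative hyperparameters.
import numpy as np

def circle_loss(sp, sn, margin=0.25, gamma=64.0):
    op, on = 1.0 + margin, -margin          # optimal positive / negative similarities
    dp, dn = 1.0 - margin, margin           # decision margins
    ap = np.clip(op - sp, 0.0, None)        # adaptive positive weighting
    an = np.clip(sn - on, 0.0, None)        # adaptive negative weighting
    logit_p = -gamma * ap * (sp - dp)
    logit_n = gamma * an * (sn - dn)
    return float(np.log1p(np.exp(logit_n).sum() * np.exp(logit_p).sum()))

sp = np.array([0.85, 0.7])                  # similarities to same-identity samples
sn = np.array([0.3, 0.5, 0.2])              # similarities to other identities
print(circle_loss(sp, sn))
```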
17.
Abstract
Cross-modal retrieval aims to search samples of one modality via queries of other modalities and is an active topic in the multimedia community. However, two main challenges, i.e., the heterogeneity gap and semantic interaction across different modalities, have not been solved effectively. Reducing the heterogeneity gap can improve cross-modal similarity measurement, while modeling cross-modal semantic interaction can capture semantic correlations more accurately. To this end, this paper presents a novel end-to-end framework, called Dual Attention Generative Adversarial Network (DA-GAN). This technique is an adversarial semantic representation model with a dual attention mechanism, i.e., intra-modal attention and inter-modal attention. Intra-modal attention is used to focus on the important semantic features within a modality, while inter-modal attention explores the semantic interaction between different modalities and then represents the high-level semantic correlation more precisely. A dual adversarial learning strategy is designed to generate modality-invariant representations, which can reduce the cross-modal heterogeneity efficiently. Experiments on three commonly used benchmarks show that DA-GAN outperforms its competitors.
18. Graph Representation-Based Deep Multi-View Semantic Similarity Learning Model for Recommendation. Future Internet 2022. DOI: 10.3390/fi14020032.
Abstract
With the rapid development of Internet technology, how to mine and analyze massive amounts of network information to provide users with accurate and fast recommendations has become a challenging topic of joint research in industry and academia in recent years. One of the most widely used social-network recommendation methods is collaborative filtering. However, traditional social-network-based collaborative filtering algorithms encounter problems such as low recommendation performance and cold start due to high data sparsity and uneven distribution. In addition, these collaborative filtering algorithms do not effectively consider the implicit trust relationships between users. To this end, this paper proposes a collaborative filtering recommendation algorithm based on GraphSAGE (GraphSAGE-CF). The algorithm first uses GraphSAGE to learn low-dimensional feature representations of the global and local structures of user nodes in social networks, and then calculates the implicit trust relationships between users from the feature representations learned by GraphSAGE. Finally, the ratings of users and of their implicitly trusted users on related items are combined to predict the users' ratings on target items. Experimental results on four open standard datasets show that the proposed GraphSAGE-CF algorithm outperforms existing algorithms in terms of RMSE and MAE.
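The last two steps can be sketched as follows. The user embeddings here are random stand-ins for what GraphSAGE would produce, and the trust-weighted rating rule is one plausible reading of the prediction step, not the paper's exact formulation.

```python
# Simplified sketch: implicit trust from embedding similarity, then a user's
# rating on a target item predicted as the trust-weighted average of ratings
# given by similar users.
import numpy as np

rng = np.random.default_rng(0)
user_emb = rng.normal(size=(5, 16))                     # stand-in GraphSAGE embeddings
ratings = np.array([[5, 0, 3],                          # users x items, 0 = unrated
                    [4, 2, 0],
                    [0, 1, 4],
                    [5, 0, 4],
                    [3, 2, 0]], dtype=float)

def implicit_trust(emb):
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return np.clip(e @ e.T, 0.0, None)                  # keep only positive similarity

def predict(user, item, emb, ratings):
    trust = implicit_trust(emb)[user]
    rated = ratings[:, item] > 0                        # users who rated this item
    rated[user] = False
    if trust[rated].sum() == 0:
        return ratings[ratings > 0].mean()              # fall back to the global mean
    return float(trust[rated] @ ratings[rated, item] / trust[rated].sum())

print(predict(user=1, item=2, emb=user_emb, ratings=ratings))
```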
19. Chen H, He X, Yang H, Qing L, Teng Q. A Feature-Enriched Deep Convolutional Neural Network for JPEG Image Compression Artifacts Reduction and its Applications. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:430-444. PMID: 34793307. DOI: 10.1109/tnnls.2021.3124370.
Abstract
The amount of multimedia data, such as images and videos, has been increasing rapidly with the development of various imaging devices and the Internet, bringing more stress and challenges to information storage and transmission. The redundancy in images can be reduced to decrease data size via lossy compression, such as the most widely used standard, Joint Photographic Experts Group (JPEG). However, the decompressed images generally suffer from various artifacts (e.g., blocking, banding, ringing, and blurring) due to the loss of information, especially at high compression ratios. This article presents a feature-enriched deep convolutional neural network for compression artifacts reduction (FeCarNet, for short). Taking the dense network as the backbone, FeCarNet enriches features to gain valuable information via multi-scale dilated convolutions, along with efficient 1×1 convolutions for lowering both parameter complexity and computation cost. Meanwhile, to make full use of different levels of features in FeCarNet, a fusion block that consists of attention-based channel recalibration and dimension reduction is developed for local and global feature fusion. Furthermore, short and long residual connections in both the feature and pixel domains are combined to build a multi-level residual structure, thereby benefiting network training and performance. In addition, aiming to further reduce computation complexity, pixel-shuffle-based image downsampling and upsampling layers are arranged at the head and tail of FeCarNet, respectively, which also enlarges the receptive field of the whole network. Experimental results show the superiority of FeCarNet over state-of-the-art compression artifacts reduction approaches in terms of both restoration capacity and model complexity. The applications of FeCarNet to several computer vision tasks, including image deblurring, edge detection, image segmentation, and object detection, further demonstrate its effectiveness.
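The pixel-shuffle layers at the head and tail correspond to the standard space-to-depth and depth-to-space rearrangements, illustrated below with plain numpy. This is a generic sketch of the operation, not the FeCarNet code.

```python
# Illustrative sketch: space-to-depth shrinks spatial size losslessly (enlarging
# the receptive field and cutting computation), depth-to-space restores the
# original resolution. Shapes are (channels, height, width).
import numpy as np

def space_to_depth(x, r):
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)

def depth_to_space(x, r):
    c, h, w = x.shape
    x = x.reshape(c // (r * r), r, r, h, w)
    return x.transpose(0, 3, 1, 4, 2).reshape(c // (r * r), h * r, w * r)

x = np.arange(1 * 4 * 4, dtype=float).reshape(1, 4, 4)
down = space_to_depth(x, 2)              # (4, 2, 2): smaller feature map, more channels
up = depth_to_space(down, 2)             # back to (1, 4, 4)
print(down.shape, np.allclose(up, x))    # (4, 2, 2) True
```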
20.
21. Mugruza-Vassallo CA, Potter DD, Tsiora S, Macfarlane JA, Maxwell A. Prior context influences motor brain areas in an auditory oddball task and prefrontal cortex multitasking modelling. Brain Inform 2021; 8:5. PMID: 33745089. PMCID: PMC7982371. DOI: 10.1186/s40708-021-00124-6.
Abstract
In this study, the relationship between orienting of attention, motor control, and the Stimulus-Driven (SDN) and Goal-Driven Networks (GDN) was explored through an innovative method for fMRI analysis considering all voxels in four experimental conditions: standard target (Goal; G), novel (N), neutral (Z), and noisy target (NG). First, average reaction times (RTs) for each condition were calculated. In the second-level analysis, 'distracted' participants, as indicated by slower RTs, evoked brain activations and differences in both hemispheres' neural networks for selective attention, while the participants as a whole demonstrated mainly left cortical and subcortical activations. A context analysis was run in the behaviourally distracted participant group, contrasting the trials immediately prior to the G trials, namely one of the Z, N, or NG conditions, i.e., Z.G, N.G, NG.G. Results showed different prefrontal activations dependent on prior context in the auditory modality, recruiting between 1 and 10 prefrontal areas. The higher the motor response and the influence of the previous novel stimulus, the more prefrontal areas were engaged, which extends the findings of hierarchical studies of prefrontal control of attention and better explains how auditory processing interferes with movement. The current study also addressed how subcortical loops and models of the previous motor response affected the signal processing of the novel stimulus when it was presented laterally or simultaneously with the target. This multitasking model could enhance our understanding of how an auditory stimulus affects motor responses in a self-induced way, by taking into account prior context, as demonstrated in the standard condition and as supported by Pulvinar activations complementing visual findings. Moreover, current BCI works address some multimodal stimulus-driven systems.
Affiliation(s)
- Carlos A Mugruza-Vassallo: Grupo de Investigación de Computación Y Neurociencia Cognitiva, Facultad de Ingeniería Y Gestión, Universidad Nacional Tecnológica de Lima Sur - UNTELS, Lima, Perú
- Douglas D Potter: Neuroscience and Development Group, Arts and Science, University of Dundee, Dundee, UK
- Stamatina Tsiora: School of Psychology, University of Lincoln, Lincoln, United Kingdom
- Adele Maxwell: Neuroscience and Development Group, Arts and Science, University of Dundee, Dundee, UK
22. Zhao W, Guan Z, Luo H, Peng J, Fan J. Deep Multiple Instance Hashing for Fast Multi-Object Image Search. IEEE Transactions on Image Processing 2021; 30:7995-8007. PMID: 34554911. DOI: 10.1109/tip.2021.3112011.
Abstract
Multi-keyword query is widely supported in text search engines. However, an analogue in image retrieval systems, the multi-object query, is rarely studied. Meanwhile, traditional object-based image retrieval methods often involve multiple separate steps. In this work, we propose a weakly-supervised Deep Multiple Instance Hashing (DMIH) approach for multi-object image retrieval. Our DMIH approach, which leverages a popular CNN model to build the end-to-end relation between a raw image and the binary hash codes of its multiple objects, can support multi-object queries effectively and integrate object detection with hashing learning seamlessly. We treat object detection as a binary multiple instance learning (MIL) problem, and such instances are automatically extracted from multi-scale convolutional feature maps. We also design a conditional random field (CRF) module to capture both the semantic and spatial relations among different class labels. For hashing training, we sample image pairs to learn their semantic relationships in terms of the hash codes of the most probable proposals for their labels, as guided by the object predictors. The two objectives benefit each other in a multi-task learning scheme. Finally, a two-level inverted index method is proposed to further speed up the retrieval of multi-object queries. Our DMIH approach outperforms state-of-the-art methods on public benchmarks for object-based image retrieval and achieves promising results for multi-object queries.
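The abstract does not detail the index structure, so the sketch below is only one plausible reading of a two-level inverted index for multi-object queries (not the DMIH implementation): level 1 maps an object label to images, level 2 maps (label, hash code) to images, and a query intersects the per-object candidate sets. All image ids, labels, and codes are hypothetical.

```python
# Hedged sketch of a two-level inverted index for multi-object queries.
from collections import defaultdict

index_by_label = defaultdict(set)                     # level 1: label -> image ids
index_by_label_code = defaultdict(set)                # level 2: (label, code) -> image ids

def add_image(image_id, objects):
    """objects: dict mapping object label -> binary hash code string."""
    for label, code in objects.items():
        index_by_label[label].add(image_id)
        index_by_label_code[(label, code)].add(image_id)

def query(objects):
    """Return image ids containing every queried object, preferring exact code hits."""
    candidates = None
    for label, code in objects.items():
        hits = index_by_label_code[(label, code)] or index_by_label[label]
        candidates = hits if candidates is None else candidates & hits
    return candidates or set()

add_image("img1", {"dog": "0110", "car": "1010"})
add_image("img2", {"dog": "0111"})
add_image("img3", {"dog": "0110", "car": "1011"})
print(query({"dog": "0110", "car": "1010"}))          # {'img1'}
```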
23. Ahn SS, Ta K, Thorn S, Langdon J, Sinusas AJ, Duncan JS. Multi-frame Attention Network for Left Ventricle Segmentation in 3D Echocardiography. Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2021; 12901:348-357. PMID: 34729554. PMCID: PMC8560213. DOI: 10.1007/978-3-030-87193-2_33.
Abstract
Echocardiography is one of the main imaging modalities used to assess the cardiovascular health of patients. Among the many analyses performed on echocardiography, segmentation of the left ventricle is crucial for quantifying clinical measurements such as ejection fraction. However, segmentation of the left ventricle in 3D echocardiography remains a challenging and tedious task. In this paper, we propose a multi-frame attention network to improve the performance of left ventricle segmentation in 3D echocardiography. The multi-frame attention mechanism allows highly correlated spatiotemporal features from the sequence of images that follows a target image to be used to augment segmentation performance. Experimental results on 51 in vivo porcine 3D+time echocardiography images show that utilizing correlated spatiotemporal features significantly improves left ventricle segmentation compared with other standard deep-learning-based medical image segmentation models.
Affiliation(s)
- Shawn S. Ahn: Department of Biomedical Engineering, Yale University, New Haven, CT, USA
- Kevinminh Ta: Department of Biomedical Engineering, Yale University, New Haven, CT, USA
- Stephanie Thorn: Section of Cardiovascular Medicine, Department of Internal Medicine, Yale University, New Haven, CT, USA
- Jonathan Langdon: Department of Radiology and Biomedical Imaging, Yale University, New Haven, CT, USA
- Albert J. Sinusas: Section of Cardiovascular Medicine, Department of Internal Medicine; Department of Electrical Engineering; Department of Radiology and Biomedical Imaging, Yale University, New Haven, CT, USA
- James S. Duncan: Department of Biomedical Engineering; Department of Electrical Engineering; Department of Radiology and Biomedical Imaging, Yale University, New Haven, CT, USA