1. Kasantikul R, Kusakunniran W, Wu Q, Wang Z. Channel-shuffled transformers for cross-modality person re-identification in video. Sci Rep 2025; 15:15009. PMID: 40301413; PMCID: PMC12041324; DOI: 10.1038/s41598-025-00063-w.
Abstract
Effective person re-identification (Re-ID) across different modalities (such as daylight vs. night-vision imagery) is crucial for surveillance applications. Information from multiple frames is essential for effective re-identification when visual cues in individual frames become unreliable. While transformers can enhance temporal information extraction, the large number of channels required for effective feature encoding introduces scaling challenges, which can lead to overfitting and instability during training. We therefore propose a novel Channel-Shuffled Temporal Transformer (CSTT) that processes multi-frame sequences in conjunction with a ResNet backbone, forming the Hybrid Channel-Shuffled Transformer Net (HCSTNET). Replacing the fully connected layers in standard multi-head attention with ShuffleNet-like structures is key to integrating transformer attention with a ResNet backbone: channel grouping reduces overfitting through parameter reduction, and channel shuffling further improves the learned attention. In our tests on the SYSU-MM01 dataset, compared against simple averaging of multiple frames, only the temporal transformer with channel shuffling achieved a measurable improvement over the baseline. We also investigated the optimal partitioning of feature maps within this design.
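As a rough illustration of the channel-grouping and channel-shuffling idea described above, the following PyTorch sketch shows one way grouped projections and a shuffle could replace the fully connected Q/K/V layers of multi-head attention. The module name, group count, and layer layout are assumptions for illustration only; this is not the authors' released HCSTNET code.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Interleave channels across groups (ShuffleNet-style); x: (B, T, C)."""
    b, t, c = x.shape
    x = x.view(b, t, groups, c // groups).transpose(2, 3)
    return x.reshape(b, t, c)

class GroupedShuffledAttention(nn.Module):
    """Multi-head self-attention over a frame sequence in which the fully
    connected Q/K/V and output projections are replaced by channel-grouped
    projections followed by a channel shuffle (illustrative only)."""
    def __init__(self, dim, num_heads=8, groups=4):
        super().__init__()
        assert dim % num_heads == 0 and dim % groups == 0
        self.num_heads, self.groups = num_heads, groups
        self.scale = (dim // num_heads) ** -0.5
        # a grouped 1x1 convolution acts as a group-wise linear layer
        self.qkv = nn.Conv1d(dim, dim * 3, 1, groups=groups)
        self.proj = nn.Conv1d(dim, dim, 1, groups=groups)

    def forward(self, x):                                   # x: (B, T, C)
        B, T, C = x.shape
        qkv = self.qkv(x.transpose(1, 2)).transpose(1, 2)   # (B, T, 3C)
        q, k, v = qkv.chunk(3, dim=-1)
        # shuffle so downstream grouped layers see channels from all groups
        q, k, v = (channel_shuffle(z, self.groups) for z in (q, k, v))
        heads = lambda z: z.view(B, T, self.num_heads, -1).transpose(1, 2)
        q, k, v = heads(q), heads(k), heads(v)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out.transpose(1, 2)).transpose(1, 2)
```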
Affiliation(s)
- Rangwan Kasantikul
- Faculty of Information and Communication Technology, Mahidol University, 999 Phuttamonthon 4 Road, Salaya, 73170, Nakhon Pathom, Thailand
- School of Computer Science, The University of Sydney, Camperdown, 2006, New South Wales, Australia
- Worapan Kusakunniran
- Faculty of Information and Communication Technology, Mahidol University, 999 Phuttamonthon 4 Road, Salaya, 73170, Nakhon Pathom, Thailand
- Qiang Wu
- School of Electrical and Data Engineering, University of Technology Sydney, 15 Broadway, Ultimo, 2007, New South Wales, Australia
- Zhiyong Wang
- School of Computer Science, The University of Sydney, Camperdown, 2006, New South Wales, Australia
2. Wu Q, Xia J, Dai P, Zhou Y, Wu Y, Ji R. CycleTrans: Learning Neutral Yet Discriminative Features via Cycle Construction for Visible-Infrared Person Re-Identification. IEEE Trans Neural Netw Learn Syst 2025; 36:5469-5479. PMID: 38593014; DOI: 10.1109/tnnls.2024.3382937.
Abstract
Visible-infrared person re-identification (VI-ReID) is the task of matching the same individuals across the visible and infrared modalities. Its main challenge lies in the modality gap caused by cameras operating on different spectra. Existing VI-ReID methods mainly focus on learning features that are general across modalities, often at the expense of feature discriminability. To address this issue, we present a novel cycle-construction-based network for neutral yet discriminative feature learning, termed CycleTrans. Specifically, CycleTrans uses a lightweight knowledge capturing module (KCM) to capture rich semantics from the modality-relevant feature maps according to pseudo anchors. Afterward, a discrepancy modeling module (DMM) is deployed to transform these features into neutral ones according to the modality-irrelevant prototypes. To ensure feature discriminability, another two KCMs are further deployed for feature cycle construction. With cycle construction, our method can learn effective neutral features for visible and infrared images while preserving their salient semantics. Extensive experiments on the SYSU-MM01 and RegDB datasets validate the merits of CycleTrans in rank-1 accuracy against a range of state-of-the-art (SOTA) methods. Our code is available at https://github.com/DoubtedSteam/CycleTrans.
3. Zhu A, Wang Z, Xue J, Wan X, Jin J, Wang T, Snoussi H. Improving Text-Based Person Retrieval by Excavating All-Round Information Beyond Color. IEEE Trans Neural Netw Learn Syst 2025; 36:5097-5111. PMID: 38416620; DOI: 10.1109/tnnls.2024.3368217.
Abstract
Text-based person retrieval is the process of searching a massive visual resource library for images of a particular pedestrian, based on a textual query. Existing approaches often suffer from a problem of color (CLR) over-reliance, which can result in a suboptimal person retrieval performance by distracting the model from other important visual cues such as texture and structure information. To handle this problem, we propose a novel framework to Excavate All-round Information Beyond Color for the task of text-based person retrieval, which is therefore termed EAIBC. The EAIBC architecture includes four branches, namely an RGB branch, a grayscale (GRS) branch, a high-frequency (HFQ) branch, and a CLR branch. Furthermore, we introduce a mutual learning (ML) mechanism to facilitate communication and learning among the branches, enabling them to take full advantage of all-round information in an effective and balanced manner. We evaluate the proposed method on three benchmark datasets, including CUHK-PEDES, ICFG-PEDES, and RSTPReid. The experimental results demonstrate that EAIBC significantly outperforms existing methods and achieves state-of-the-art (SOTA) performance in supervised, weakly supervised, and cross-domain settings.
4. Lu Z, Lin R, Hu H. Disentangling Modality and Posture Factors: Memory-Attention and Orthogonal Decomposition for Visible-Infrared Person Re-Identification. IEEE Trans Neural Netw Learn Syst 2025; 36:5494-5508. PMID: 38619964; DOI: 10.1109/tnnls.2024.3384023.
Abstract
Striving to match person identities between visible (VIS) and near-infrared (NIR) images, VIS-NIR re-identification (Re-ID) has attracted increasing attention due to its wide applications in low-light scenes. However, owing to the modality and pose discrepancies exhibited in heterogeneous images, the extracted representations inevitably comprise various modality and posture factors, impairing the matching of cross-modality person identities. To solve this problem, we propose a disentangling modality and posture factors (DMPF) model that disentangles modality and posture factors by fusing information from a feature memory and the pedestrian skeleton. Specifically, DMPF comprises three modules: a three-stream feature extraction network (TFENet), modality factor disentanglement (MFD), and posture factor disentanglement (PFD). First, to provide memory and skeleton information for the disentanglement of modality and posture factors, TFENet is designed as a three-stream network to extract VIS-NIR image features and skeleton features. Second, to eliminate modality discrepancy across different batches, we maintain memory queues of previous batch features through a momentum updating mechanism and propose MFD to integrate features across the whole training set via memory-attention layers. These layers explore intramodality and intermodality relationships between features from the current batch and the memory queues under the optimization of the optimal transport (OT) method, which encourages heterogeneous features with the same identity to present higher similarity. Third, to decouple posture factors from the representations, we introduce the PFD module to learn posture-unrelated features with the assistance of the skeleton features. Besides, we perform subspace orthogonal decomposition on both image and skeleton features to separate posture-related and identity-related information. The posture-related features are used to disentangle posture factors from the representations via a designed posture-feature consistency (PfC) loss, while the identity-related features are concatenated to obtain more discriminative identity representations. The effectiveness of DMPF is validated through comprehensive experiments on two VIS-NIR pedestrian Re-ID datasets.
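The momentum-updated feature memory mentioned in the abstract can be illustrated, in very simplified form, as a per-identity, per-modality bank that is refreshed with a momentum average of batch features. The class name, dictionary layout, and momentum value below are assumptions; the paper's actual memory queues and optimal-transport matching are not reproduced here.

```python
import torch
import torch.nn.functional as F

class ModalityMemory:
    """Toy per-identity, per-modality feature memory with momentum updates
    (an illustration of the general mechanism, not DMPF's actual design)."""
    def __init__(self, num_ids, dim, momentum=0.9):
        self.momentum = momentum
        self.bank = {m: torch.zeros(num_ids, dim) for m in ("vis", "nir")}

    @torch.no_grad()
    def update(self, feats, labels, modality):
        feats = F.normalize(feats, dim=1)
        for f, y in zip(feats, labels.tolist()):
            old = self.bank[modality][y]
            new = self.momentum * old + (1.0 - self.momentum) * f
            self.bank[modality][y] = F.normalize(new, dim=0)

    def cross_modal_similarity(self, feats, modality):
        # similarity of current-batch features to the *other* modality's memory
        other = "nir" if modality == "vis" else "vis"
        return F.normalize(feats, dim=1) @ self.bank[other].t()
```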
5. Geng H, Peng J, Yang W, Chen D, Lv H, Li G, Shao Y. ReMamba: a hybrid CNN-Mamba aggregation network for visible-infrared person re-identification. Sci Rep 2024; 14:29362. PMID: 39592691; PMCID: PMC11599763; DOI: 10.1038/s41598-024-80766-8.
Abstract
Visible-Infrared Person Re-identification (VI-ReID) is consistently challenged by significant intra-class variations and cross-modality differences between cameras, so the key lies in how to extract discriminative modality-shared features. Existing VI-ReID methods based on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have shortcomings in capturing global features and controlling computational complexity, respectively. To tackle these challenges, we propose a hybrid network framework called ReMamba. Specifically, we first use a CNN as the backbone network to extract multi-level features. Then, we introduce the Visual State Space (VSS) model, which integrates the local features output by the CNN from lower to higher levels; these local features complement the global information, thereby sharpening the local details of the global features. Considering the potential redundancy and semantic differences between local and global features, we design an adaptive feature aggregation module that automatically filters and effectively aggregates both types of features, incorporating an auxiliary aggregation loss to optimize the aggregation process. Furthermore, to better constrain cross-modality and intra-modality features, we design a modal consistency identity constraint loss to alleviate cross-modality differences and extract modality-shared information. Extensive experiments conducted on the SYSU-MM01, RegDB, and LLCM datasets demonstrate that our proposed ReMamba outperforms state-of-the-art VI-ReID methods.
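The adaptive filtering and aggregation of local and global features could, in one generic form, be realized with a learned per-channel gate. The short sketch below is only such a generic gated fusion under an assumed module name; it is not the aggregation module or auxiliary loss proposed in the paper.

```python
import torch
import torch.nn as nn

class AdaptiveAggregation(nn.Module):
    """Illustrative gated fusion of local and global feature vectors: a learned
    per-channel gate decides how much of each source to keep, which is one
    generic way to 'filter and aggregate' two feature types."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, local_feat, global_feat):      # both (B, D)
        g = self.gate(torch.cat([local_feat, global_feat], dim=1))
        return g * local_feat + (1.0 - g) * global_feat
```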
Affiliation(s)
- Haokun Geng
- School of Computer Science and Technology (School of Cyberspace Security), Xinjiang University, Urumqi, 830046, China
- Jiaren Peng
- School of Computer Science and Technology (School of Cyberspace Security), Xinjiang University, Urumqi, 830046, China
- Xinjiang Key Laboratory of Multilingual Information Technology, Xinjiang University, Urumqi, 830046, China
- Wenzhong Yang
- School of Computer Science and Technology (School of Cyberspace Security), Xinjiang University, Urumqi, 830046, China
- Xinjiang Key Laboratory of Multilingual Information Technology, Xinjiang University, Urumqi, 830046, China
- Danny Chen
- School of Computer Science and Technology (School of Cyberspace Security), Xinjiang University, Urumqi, 830046, China
- Hongzhen Lv
- School of Computer Science and Technology (School of Cyberspace Security), Xinjiang University, Urumqi, 830046, China
- Xinjiang Key Laboratory of Multilingual Information Technology, Xinjiang University, Urumqi, 830046, China
- Guanghan Li
- School of Computer Science and Technology (School of Cyberspace Security), Xinjiang University, Urumqi, 830046, China
- Xinjiang Key Laboratory of Multilingual Information Technology, Xinjiang University, Urumqi, 830046, China
- Yi Shao
- School of Computer Science and Technology (School of Cyberspace Security), Xinjiang University, Urumqi, 830046, China
6. Zhu J, Wu H, Chen Y, Xu H, Fu Y, Zeng H, Liu L, Lei Z. Cross-modal group-relation optimization for visible-infrared person re-identification. Neural Netw 2024; 179:106576. PMID: 39121790; DOI: 10.1016/j.neunet.2024.106576.
Abstract
Visible-infrared person re-identification (VIPR) plays an important role in intelligent transportation systems. Modal discrepancies between visible and infrared images seriously confuse person appearance discrimination, e.g., the similarity of the same class of different modalities is lower than the similarity between different classes of the same modality. Worse still, the modal discrepancies and appearance discrepancies are coupled with each other. The prevailing practice is to disentangle modal and appearance discrepancies, but it usually requires complex decoupling networks. In this paper, rather than disentanglement, we propose to measure and optimize modal discrepancies. We explore a cross-modal group-relation (CMGR) to describe the relationship between the same group of people in two different modalities. The CMGR has great potential in modal invariance because it considers more stable groups rather than individuals, so it is a good measurement for modal discrepancies. Furthermore, we design a group-relation correlation (GRC) loss function based on Pearson correlations to optimize CMGR, which can be easily integrated with the learning of VIPR's appearance features. Consequently, our CMGR model serves as a pivotal constraint to minimize modal discrepancies, operating in a manner similar to a loss function. It is applied solely during the training phase, thereby obviating the need for any execution during the inference phase. Experimental results on two public datasets (i.e., RegDB and SYSU-MM01) demonstrate that our CMGR method is superior to state-of-the-art approaches. In particular, on the RegDB dataset, with the help of CMGR, the rank-1 identification rate has improved by more than 7% compared to the case of not using CMGR.
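One plausible way to realize a Pearson-correlation-based group-relation constraint is sketched below: pairwise relation matrices are built for the same group of identities in each modality, and the loss rewards a high Pearson correlation between them. The function names and the use of cosine similarity as the relation measure are assumptions; the paper's exact CMGR construction may differ.

```python
import torch
import torch.nn.functional as F

def pearson_corr(a, b, eps=1e-8):
    """Pearson correlation between two flattened tensors."""
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + eps)

def group_relation_correlation_loss(vis_feats, ir_feats):
    """Hypothetical GRC-style loss: encourage the pairwise-relation structure
    of the same group of identities to agree across modalities.
    vis_feats, ir_feats: (N, D) features of the same N identities, row-aligned."""
    vis = F.normalize(vis_feats, dim=1)
    ir = F.normalize(ir_feats, dim=1)
    rel_vis = vis @ vis.t()          # (N, N) within-group relations, visible
    rel_ir = ir @ ir.t()             # (N, N) within-group relations, infrared
    # compare off-diagonal entries only
    mask = ~torch.eye(vis.size(0), dtype=torch.bool, device=vis.device)
    corr = pearson_corr(rel_vis[mask], rel_ir[mask])
    return 1.0 - corr                # high correlation -> low loss
```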
Affiliation(s)
- Jianqing Zhu
- College of Engineering, Huaqiao University, Quanzhou, China
- Hanxiao Wu
- College of Information Science and Engineering, Huaqiao University, Xiamen, China; School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China
- Yutao Chen
- College of Engineering, Huaqiao University, Quanzhou, China
- Heng Xu
- College of Engineering, Huaqiao University, Quanzhou, China
- Yuqing Fu
- College of Engineering, Huaqiao University, Quanzhou, China
- Huanqiang Zeng
- College of Engineering, Huaqiao University, Quanzhou, China
- Liu Liu
- School of Artificial Intelligence and State Key Lab of Software Development Environment, Beihang University, Beijing, China
- Zhen Lei
- State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences, Hong Kong, China
7. Li S, Li F, Li J, Li H, Zhang B, Tao D, Gao X. Logical Relation Inference and Multiview Information Interaction for Domain Adaptation Person Re-Identification. IEEE Trans Neural Netw Learn Syst 2024; 35:14770-14782. PMID: 37307174; DOI: 10.1109/tnnls.2023.3281504.
Abstract
Domain adaptation person re-identification (Re-ID) is a challenging task which aims to transfer knowledge learned from a labeled source domain to an unlabeled target domain. Recently, some clustering-based domain adaptation Re-ID methods have achieved great success. However, these methods ignore the adverse influence of differing camera styles on pseudo-label prediction. The reliability of pseudo-labels plays a key role in domain adaptation Re-ID, and differing camera styles make pseudo-label prediction considerably harder. To this end, a novel method is proposed that bridges the gap between cameras and extracts more discriminative features from an image. Specifically, an intra-to-inter mechanism is introduced, in which samples from the same camera are first grouped and then aligned at the class level across different cameras, followed by our logical relation inference (LRI). Thanks to these strategies, the logical relationship between simple classes and hard classes is established, preventing the sample loss caused by discarding hard samples. Furthermore, we also present a multiview information interaction (MvII) module that takes features of different images of the same pedestrian as patch tokens, capturing the global consistency of a pedestrian, which contributes to discriminative feature extraction. Unlike existing clustering-based methods, our method employs a two-stage framework that generates reliable pseudo-labels from the intra-camera and inter-camera views, respectively, to account for camera-style differences and thereby increase robustness. Extensive experiments on several benchmark datasets show that the proposed method outperforms a wide range of state-of-the-art methods. The source code has been released at https://github.com/lhf12278/LRIMV.
8. Zhang Y, Lin Y, Yang X. AA-RGTCN: reciprocal global temporal convolution network with adaptive alignment for video-based person re-identification. Front Neurosci 2024; 18:1329884. PMID: 38591067; PMCID: PMC10999627; DOI: 10.3389/fnins.2024.1329884.
Abstract
Person re-identification (Re-ID) aims to retrieve pedestrians across different cameras. Compared with image-based Re-ID, video-based Re-ID extracts features from video sequences that contain both spatial and temporal information. Existing methods usually focus on the most attractive image parts, which leads to redundant spatial descriptions and insufficient temporal descriptions. Other methods that take temporal clues into consideration usually ignore misalignment between frames and only consider a fixed length of a given sequence. In this study, we propose a Reciprocal Global Temporal Convolution Network with Adaptive Alignment (AA-RGTCN). The structure addresses the drawback of misalignment between frames and models a discriminative temporal representation. Specifically, the Adaptive Alignment block is designed to shift each frame adaptively to its best position for temporal modeling. We then propose the Reciprocal Global Temporal Convolution Network to model robust temporal features across different time intervals along both the normal and the inverted time order. The experimental results show that AA-RGTCN achieves 85.9% mAP and 91.0% Rank-1 on MARS, 90.6% Rank-1 on iLIDS-VID, and 96.6% Rank-1 on PRID-2011, outperforming other state-of-the-art approaches.
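The reciprocal (forward plus inverted time order) temporal modeling can be illustrated with a pair of shared dilated 1-D convolutions applied to the frame sequence in both directions. The sketch below uses assumed names, omits the adaptive alignment step, and is not the authors' AA-RGTCN implementation.

```python
import torch
import torch.nn as nn

class ReciprocalTemporalConv(nn.Module):
    """Toy reciprocal temporal convolution: apply the same dilated 1-D
    convolutions to a frame-feature sequence in forward and reversed time
    order, then average the two temporal descriptions."""
    def __init__(self, dim, kernel_size=3, dilations=(1, 2, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size,
                      padding=d * (kernel_size - 1) // 2, dilation=d)
            for d in dilations
        )

    def encode(self, x):                       # x: (B, D, T)
        for conv in self.convs:
            x = torch.relu(conv(x))
        return x

    def forward(self, seq):                    # seq: (B, T, D) per-frame features
        x = seq.transpose(1, 2)                # (B, D, T)
        fwd = self.encode(x)
        bwd = self.encode(torch.flip(x, dims=[2]))
        out = 0.5 * (fwd + torch.flip(bwd, dims=[2]))
        return out.mean(dim=2)                 # (B, D) sequence descriptor
```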
Affiliation(s)
- Yanjun Zhang
- School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing, China
- Yanru Lin
- School of Integrated Circuits and Electronics, Beijing Institute of Technology, Beijing, China
- Xu Yang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
9. Wu LY, Liu L, Wang Y, Zhang Z, Boussaid F, Bennamoun M, Xie X. Learning Resolution-Adaptive Representations for Cross-Resolution Person Re-Identification. IEEE Trans Image Process 2023; 32:4800-4811. PMID: 37610890; DOI: 10.1109/tip.2023.3305817.
Abstract
Cross-resolution person re-identification (CRReID) is a challenging and practical problem that involves matching low-resolution (LR) query identity images against high-resolution (HR) gallery images. Query images often suffer from resolution degradation due to the different capturing conditions from real-world cameras. State-of-the-art solutions for CRReID either learn a resolution-invariant representation or adopt a super-resolution (SR) module to recover the missing information from the LR query. In this paper, we propose an alternative SR-free paradigm to directly compare HR and LR images via a dynamic metric that is adaptive to the resolution of a query image. We realize this idea by learning resolution-adaptive representations for cross-resolution comparison. We propose two resolution-adaptive mechanisms to achieve this. The first mechanism encodes the resolution specifics into different subvectors in the penultimate layer of the deep neural network, creating a varying-length representation. To better extract resolution-dependent information, we further propose to learn resolution-adaptive masks for intermediate residual feature blocks. A novel progressive learning strategy is proposed to train those masks properly. These two mechanisms are combined to boost the performance of CRReID. Experimental results show that the proposed method outperforms existing approaches and achieves state-of-the-art performance on multiple CRReID benchmarks.
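The varying-length, resolution-conditioned representation can be pictured as selecting only the sub-vectors that a query's resolution level supports and comparing query and gallery on that shared prefix. The following toy functions illustrate this idea under assumed names; they are not the paper's actual mechanism (the learned resolution-adaptive masks are omitted).

```python
import torch
import torch.nn.functional as F

def resolution_adaptive_embedding(feat, res_level, num_levels=4):
    """Illustrative sub-vector selection: keep only the sub-vectors up to the
    query's resolution level, giving a varying-length representation.
    feat: (B, D) penultimate-layer features; res_level: int in [1, num_levels]."""
    B, D = feat.shape
    chunk = D // num_levels
    return F.normalize(feat[:, : chunk * res_level], dim=1)

def cross_resolution_score(query_feat, query_level, gallery_feat, num_levels=4):
    """Compare an LR query against HR gallery images on the sub-vectors that the
    query's resolution supports (a toy stand-in for the adaptive metric)."""
    q = resolution_adaptive_embedding(query_feat, query_level, num_levels)
    g = resolution_adaptive_embedding(gallery_feat, query_level, num_levels)
    return q @ g.t()                  # (num_query, num_gallery) similarity
```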
10. Zhou J, Dong Q, Zhang Z, Liu S, Durrani TS. Cross-Modality Person Re-Identification via Local Paired Graph Attention Network. Sensors (Basel) 2023; 23:4011. PMID: 37112352; PMCID: PMC10146823; DOI: 10.3390/s23084011.
Abstract
Cross-modality person re-identification (ReID) aims at searching a pedestrian image of RGB modality from infrared (IR) pedestrian images and vice versa. Recently, some approaches have constructed a graph to learn the relevance of pedestrian images of distinct modalities to narrow the gap between IR modality and RGB modality, but they omit the correlation between IR image and RGB image pairs. In this paper, we propose a novel graph model called Local Paired Graph Attention Network (LPGAT). It uses the paired local features of pedestrian images from different modalities to build the nodes of the graph. For accurate propagation of information among the nodes of the graph, we propose a contextual attention coefficient that leverages distance information to regulate the process of updating the nodes of the graph. Furthermore, we put forward Cross-Center Contrastive Learning (C3L) to constrain how far local features are from their heterogeneous centers, which is beneficial for learning the completed distance metric. We conduct experiments on the RegDB and SYSU-MM01 datasets to validate the feasibility of the proposed approach.
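Cross-Center Contrastive Learning, as described, constrains features relative to class centers from the other modality; one simple way to express such a constraint is a softmax contrastive loss over the heterogeneous centers. The function name and parameters below (e.g. the temperature) are assumptions, not the paper's exact C3L formulation.

```python
import torch
import torch.nn.functional as F

def cross_center_contrastive_loss(feats, labels, other_modality_centers,
                                  temperature=0.1):
    """Toy cross-center contrastive loss: pull each local feature toward the
    center of its identity computed from the *other* modality and push it away
    from other identities' centers. feats: (N, D); centers: (num_ids, D);
    labels: (N,) identity indices into the center table."""
    feats = F.normalize(feats, dim=1)
    centers = F.normalize(other_modality_centers, dim=1)
    logits = feats @ centers.t() / temperature     # (N, num_ids) similarities
    return F.cross_entropy(logits, labels)
```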
Affiliation(s)
- Jianglin Zhou
- Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, Tianjin Normal University, Tianjin 300387, China
- Qing Dong
- Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, Tianjin Normal University, Tianjin 300387, China
- Zhong Zhang
- Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, Tianjin Normal University, Tianjin 300387, China
- Shuang Liu
- Tianjin Key Laboratory of Wireless Mobile Communications and Power Transmission, Tianjin Normal University, Tianjin 300387, China
- Tariq S. Durrani
- Department of Electronic and Electrical Engineering, University of Strathclyde, Glasgow G1 1QE, UK
11. Uddin MK, Bhuiyan A, Bappee FK, Islam MM, Hasan M. Person Re-Identification with RGB-D and RGB-IR Sensors: A Comprehensive Survey. Sensors (Basel) 2023; 23:1504. PMID: 36772548; PMCID: PMC9919319; DOI: 10.3390/s23031504.
Abstract
Learning appearance embeddings is of great importance for a variety of computer-vision applications, which has prompted a surge in person re-identification (Re-ID) papers. The aim of these papers is to identify an individual over a set of non-overlapping cameras. Despite recent advances in RGB-RGB Re-ID approaches with deep-learning architectures, such approaches fail to work consistently well at low resolutions and in dark conditions. The introduction of different sensors (i.e., RGB-D and infrared (IR)) enables the capture of appearance even in dark conditions. Recently, a great deal of research has been dedicated to addressing the issue of finding appearance embeddings in dark conditions using such advanced camera sensors. In this paper, we give a comprehensive overview of existing Re-ID approaches that utilize the additional information from different sensor-based methods to address the constraints faced by RGB camera-based person Re-ID systems. Although there are a number of survey papers that consider either the RGB-RGB or visible-IR scenario, there is none that considers both RGB-D and RGB-IR. In this paper, we present a detailed taxonomy of the existing approaches along with the existing RGB-D and RGB-IR person Re-ID datasets. Then, we summarize the performance of state-of-the-art methods on several representative RGB-D and RGB-IR datasets. Finally, future directions and current issues are considered for improving the different sensor-based person Re-ID systems.
Affiliation(s)
- Md Kamal Uddin
- Interactive Systems Lab, Graduate School of Science and Engineering, Saitama University, Saitama 338-8570, Japan
- Department of Computer Science and Telecommunication Engineering, Noakhali Science and Technology University, Noakhali 3814, Bangladesh
- Amran Bhuiyan
- Information Retrieval and Knowledge Management Research Laboratory, York University, Toronto, ON M3J 1P3, Canada
- Fateha Khanam Bappee
- Department of Computer Science and Telecommunication Engineering, Noakhali Science and Technology University, Noakhali 3814, Bangladesh
- Md Matiqul Islam
- Interactive Systems Lab, Graduate School of Science and Engineering, Saitama University, Saitama 338-8570, Japan
- Department of Information and Communication Engineering, University of Rajshahi, Rajshahi 6205, Bangladesh
- Mahmudul Hasan
- Interactive Systems Lab, Graduate School of Science and Engineering, Saitama University, Saitama 338-8570, Japan
- Department of Computer Science and Engineering, Comilla University, Kotbari 3506, Bangladesh
12. Zhao Q, Wu H, Zhu J. Margin-Based Modal Adaptive Learning for Visible-Infrared Person Re-Identification. Sensors (Basel) 2023; 23:1426. PMID: 36772466; PMCID: PMC9921303; DOI: 10.3390/s23031426.
Abstract
Visible-infrared person re-identification (VIPR) has great potential for the intelligent transportation systems of smart cities, but it is challenging due to the huge modal discrepancy between visible and infrared images. Although visible and infrared data can be viewed as two domains, VIPR is not simply a domain adaptation problem in which modal discrepancies should be massively eliminated. Because VIPR has complete identity information in both the visible and infrared modalities, once domain adaptation is overemphasized, the discriminative appearance information in the visible and infrared domains would drain away. For that reason, we propose a novel margin-based modal adaptive learning (MMAL) method for VIPR. On each domain, we apply triplet and label-smoothing cross-entropy functions to learn appearance-discriminative features. Between the two domains, we design a simple yet effective marginal maximum mean discrepancy (M3D) loss function to avoid an excessive suppression of modal discrepancies and thus protect the features' discriminative ability on each domain. As a result, our MMAL method can learn modal-invariant yet appearance-discriminative features for improving VIPR. The experimental results show that our MMAL method achieves state-of-the-art VIPR performance; for example, on the RegDB dataset in the visible-to-infrared retrieval mode, the rank-1 accuracy is 93.24% and the mean average precision is 83.77%.
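A margin-based MMD penalty of the kind described can be sketched as a hinge on a kernel MMD estimate between visible and infrared batch features, so that modal discrepancy is only penalized above a margin. The RBF kernel, bandwidth, and margin below are illustrative assumptions rather than the paper's exact M3D loss.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Biased RBF-kernel MMD estimate between feature sets x (n, d) and y (m, d)."""
    def kernel(a, b):
        dist2 = torch.cdist(a, b) ** 2
        return torch.exp(-dist2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def marginal_mmd_loss(vis_feats, ir_feats, margin=0.1, sigma=1.0):
    """Sketch of a margin-based MMD penalty: only modal discrepancy above the
    margin is penalized, so per-modality discriminative cues are not drained."""
    return torch.clamp(rbf_mmd(vis_feats, ir_feats, sigma) - margin, min=0.0)
```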
Affiliation(s)
- Qianqian Zhao
- College of Information Science and Engineering, Huaqiao University, Xiamen 361021, China
- Hanxiao Wu
- College of Information Science and Engineering, Huaqiao University, Xiamen 361021, China
- Jianqing Zhu
- College of Engineering, Huaqiao University, Quanzhou 362021, China
- Xiamen Yealink Network Technology Company Limited, Xiamen 361015, China
13. Zheng X, Chen X, Lu X. Visible-Infrared Person Re-Identification via Partially Interactive Collaboration. IEEE Trans Image Process 2022; 31:6951-6963. PMID: 36322494; DOI: 10.1109/tip.2022.3217697.
Abstract
The visible-infrared person re-identification (VI-ReID) task aims to retrieve the same person between visible and infrared images. VI-ReID is challenging because images captured on different spectra present a large cross-modality discrepancy. Many methods adopt a two-stream network and design additional constraints to extract shared features for the different modalities. However, the interaction between the feature extraction processes of the different modalities is rarely considered. In this paper, a partially interactive collaboration (PIC) method is proposed to exploit the complementary information of different modalities and reduce the modality gap for VI-ReID. Specifically, the proposed method is realized in a partially interactive-shared architecture: collaborative shallow layers and shared deep layers. The collaborative shallow layers consider the interaction between the modality-specific features of the different modalities, encouraging the feature extraction processes of the two modalities to constrain each other and thus enhance the feature representations. The shared deep layers further embed the modality-specific features into a common space to endow them with the same identity discriminability. To ensure that the interactive collaborative learning is effective, a conventional loss and a collaborative loss are used jointly to train the whole network. Extensive experiments on two publicly available VI-ReID datasets verify the superiority of the proposed PIC method. Specifically, the proposed method achieves rank-1 accuracies of 83.6% and 57.5% on the RegDB and SYSU-MM01 datasets, respectively.
14. Chen C, Ye M, Qi M, Wu J, Jiang J, Lin CW. Structure-Aware Positional Transformer for Visible-Infrared Person Re-Identification. IEEE Trans Image Process 2022; 31:2352-2364. PMID: 35235507; DOI: 10.1109/tip.2022.3141868.
Abstract
Visible-infrared person re-identification (VI-ReID) is a cross-modality retrieval problem, which aims at matching the same pedestrian between the visible and infrared cameras. Due to the existence of pose variation, occlusion, and huge visual differences between the two modalities, previous studies mainly focus on learning image-level shared features. Since they usually learn a global representation or extract uniformly divided part features, these methods are sensitive to misalignments. In this paper, we propose a structure-aware positional transformer (SPOT) network to learn semantic-aware sharable modality features by utilizing the structural and positional information. It consists of two main components: attended structure representation (ASR) and transformer-based part interaction (TPI). Specifically, ASR models the modality-invariant structure feature for each modality and dynamically selects the discriminative appearance regions under the guidance of the structure information. TPI mines the part-level appearance and position relations with a transformer to learn discriminative part-level modality features. With a weighted combination of ASR and TPI, the proposed SPOT explores the rich contextual and structural information, effectively reducing cross-modality difference and enhancing the robustness against misalignments. Extensive experiments indicate that SPOT is superior to the state-of-the-art methods on two cross-modal datasets. Notably, the Rank-1/mAP value on the SYSU-MM01 dataset has improved by 8.43%/6.80%.
15. MFCNet: Mining Features Context Network for RGB–IR Person Re-Identification. Future Internet 2021; 13:290. DOI: 10.3390/fi13110290.
Abstract
RGB–IR cross-modality person re-identification (RGB–IR Re-ID) is an important task for video surveillance in poorly illuminated or dark environments. In addition to the common challenges of Re-ID, the large cross-modality variations between RGB and IR images must be considered. Existing RGB–IR Re-ID methods use different network structures to learn the global shared features associated with the two modalities. However, most global shared-feature learning methods are sensitive to background clutter and do not consider the contextual relationships among the mined features. To solve these problems, this paper proposes a dual-path attention network architecture, MFCNet. The SGA (Spatial-Global Attention) module embedded in MFCNet includes spatial-attention and global-attention branches to mine discriminative features. First, the SGA module focuses on the key parts of the input image to obtain robust features. Next, the module mines the contextual relationships among features to obtain discriminative features and improve network performance. Finally, extensive experiments demonstrate that the performance of the proposed network architecture is better than that of state-of-the-art methods under various settings. In the all-search mode of the SYSU-MM01 and RegDB datasets, the rank-1 accuracy reaches 51.64% and 69.76%, respectively.