1. Hu Y, Jiang X, Liu X, Luo X, Hu Y, Cao X, Zhang B, Zhang J. Hierarchical Self-Distilled Feature Learning for Fine-Grained Visual Categorization. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:4005-4018. [PMID: 34780336] [DOI: 10.1109/tnnls.2021.3124135]
Abstract
Fine-grained visual categorization (FGVC) relies on hierarchical features extracted by deep convolutional neural networks (CNNs) to recognize highly similar objects. In particular, shallow-layer features containing rich spatial details are vital for specifying subtle differences between objects but are usually inadequately optimized due to gradient vanishing during backpropagation. In this article, hierarchical self-distillation (HSD) is introduced to generate well-optimized CNN features for accurate fine-grained categorization. HSD inherits from the widely applied deep supervision and implements multiple intermediate losses for reinforced gradients. Beyond that, we observe that the hard (one-hot) labels adopted for intermediate supervision hurt FGVC performance by enforcing overly strict supervision. As a solution, HSD performs self-distillation, where soft predictions generated by deeper layers of the network are hierarchically exploited to supervise shallower parts. Moreover, a self-information entropy loss (SIELoss) is designed in HSD to adaptively soften intermediate predictions and facilitate better convergence. In addition, a gradient-detached fusion (GDF) module is incorporated to produce an ensemble result from multiscale features via effective feature fusion. Extensive experiments on four challenging fine-grained datasets show that, with a negligible parameter increase, the proposed HSD framework and the GDF module both bring significant performance gains over different backbones and achieve state-of-the-art classification performance.
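As an illustration of the core idea, below is a minimal PyTorch sketch of hierarchical self-distillation in which each shallower auxiliary classifier is supervised by the softened (temperature-scaled) prediction of the next deeper one, while only the deepest head sees hard labels. The function names, temperature, and loss weighting are illustrative assumptions; the paper's SIELoss softening and GDF module are not reproduced here.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(shallow_logits, deep_logits, temperature=3.0):
    """KL divergence between a shallow classifier's prediction and the softened
    prediction of a deeper classifier (the teacher is detached so gradients
    only reinforce the shallow branch)."""
    teacher = F.softmax(deep_logits.detach() / temperature, dim=1)
    student = F.log_softmax(shallow_logits / temperature, dim=1)
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2

def hierarchical_hsd_loss(logits_per_stage, labels, temperature=3.0, alpha=0.5):
    """logits_per_stage: list of [B, C] logits ordered shallow -> deep.
    The deepest stage is trained with hard labels; each shallower stage is
    supervised by the softened prediction of the stage above it."""
    loss = F.cross_entropy(logits_per_stage[-1], labels)
    for i in range(len(logits_per_stage) - 1):
        loss = loss + alpha * self_distillation_loss(
            logits_per_stage[i], logits_per_stage[i + 1], temperature)
    return loss
```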
2. Tian G, Sun Y, Liu Y, Zeng X, Wang M, Liu Y, Zhang J, Chen J. Adding Before Pruning: Sparse Filter Fusion for Deep Convolutional Neural Networks via Auxiliary Attention. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:3930-3942. [PMID: 34487502] [DOI: 10.1109/tnnls.2021.3106917]
Abstract
Filter pruning is a significant feature selection technique for shrinking existing feature fusion schemes (especially convolution computation and model size), which helps to develop more efficient feature fusion models while maintaining state-of-the-art performance. In addition, it reduces the storage and computation requirements of deep neural networks (DNNs) and dramatically accelerates inference. Existing methods mainly rely on manual constraints such as normalization to select filters. A typical pipeline comprises two stages: first pruning the original neural network and then fine-tuning the pruned model. However, choosing a manual criterion can be somewhat tricky and stochastic. Moreover, directly regularizing and modifying filters in the pipeline is sensitive to the choice of hyperparameters, making the pruning procedure less robust. To address these challenges, we propose to handle filter pruning in a single stage: an attention-based architecture that adaptively fuses filter selection with filter learning in a unified network. Specifically, we present a pruning method named adding before pruning (ABP) that makes the model focus on the filters of higher significance through training rather than man-made criteria such as norm or rank. First, we add an auxiliary attention layer to the original model and constrain the significance scores in this layer to be binary. Furthermore, to propagate gradients through the auxiliary attention layer, we design a specific gradient estimator and prove its effectiveness for convergence in the computation graph via mathematical derivation. Finally, to relieve the dependence on complicated prior knowledge for designing a thresholding criterion, we simultaneously prune and train the filters to automatically eliminate network redundancy with recoverability. Extensive experimental results on two typical image classification benchmarks, CIFAR-10 and ILSVRC-2012, illustrate that the proposed approach performs favorably against previous state-of-the-art filter pruning algorithms.
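The binary significance scores and custom gradient estimator described above can be illustrated with a straight-through-style sketch: the gate is binarized in the forward pass and gradients are passed through unchanged. This is a generic surrogate, not the paper's exact estimator; the module and parameter names are assumptions.

```python
import torch
import torch.nn as nn

class BinaryGate(torch.autograd.Function):
    """Binarize significance scores in the forward pass; pass gradients straight
    through in the backward pass (a common surrogate for thresholding)."""
    @staticmethod
    def forward(ctx, scores):
        return (scores > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # straight-through gradient estimate

class FilterAttention(nn.Module):
    """Auxiliary per-filter gate inserted after a convolution: filters whose
    learned significance drops below zero are masked and can later be pruned."""
    def __init__(self, num_filters):
        super().__init__()
        self.significance = nn.Parameter(torch.ones(num_filters))  # start open

    def forward(self, feature_map):            # feature_map: [B, C, H, W]
        mask = BinaryGate.apply(self.significance)
        return feature_map * mask.view(1, -1, 1, 1)
```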
3. Hao S, Zhou Y, Guo Y, Hong R, Cheng J, Wang M. Real-Time Semantic Segmentation via Spatial-Detail Guided Context Propagation. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:4042-4053. [PMID: 35259119] [DOI: 10.1109/tnnls.2022.3154443]
Abstract
Nowadays, vision-based computing tasks play an important role in various real-world applications. However, many vision tasks, e.g., semantic segmentation, are computationally expensive, posing a challenge to computing systems that are resource-constrained but require fast response times. It is therefore valuable to develop accurate, real-time vision models that require only limited computational resources. To this end, we propose the spatial-detail guided context propagation network (SGCPNet) for real-time semantic segmentation. SGCPNet adopts a strategy of spatial-detail guided context propagation: it uses the spatial details of shallow layers to guide the propagation of low-resolution global contexts, so that the lost spatial information can be effectively reconstructed. In this way, the need to maintain high-resolution features throughout the network is removed, which largely improves model efficiency, while the effective reconstruction of spatial details preserves segmentation accuracy. In the experiments, we validate the effectiveness and efficiency of the proposed SGCPNet. On the Cityscapes dataset, for example, SGCPNet achieves 69.5% mIoU segmentation accuracy at 178.5 FPS on 768 × 1536 images on a GeForce GTX 1080 Ti GPU. In addition, SGCPNet is very lightweight, containing only 0.61 M parameters. The code will be released at https://github.com/zhouyuan888888/SGCPNet.
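A hedged sketch of what detail-guided context propagation can look like in PyTorch: low-resolution context features are upsampled and modulated by a guidance map predicted from high-resolution shallow details. This is a conceptual illustration only; SGCPNet's actual block design is not specified by the abstract, and the layer choices below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailGuidedFusion(nn.Module):
    """Upsample low-resolution context features and modulate them with a
    guidance map predicted from high-resolution shallow detail features."""
    def __init__(self, detail_ch, context_ch, out_ch):
        super().__init__()
        self.guide = nn.Sequential(
            nn.Conv2d(detail_ch, out_ch, 3, padding=1), nn.Sigmoid())
        self.proj_detail = nn.Conv2d(detail_ch, out_ch, 1)
        self.proj_context = nn.Conv2d(context_ch, out_ch, 1)

    def forward(self, detail, context):        # detail: high-res, context: low-res
        context = F.interpolate(self.proj_context(context),
                                size=detail.shape[-2:], mode="bilinear",
                                align_corners=False)
        g = self.guide(detail)                  # spatial-detail guidance in [0, 1]
        return self.proj_detail(detail) + g * context
```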
4. Zhang L, Liu Z, Zhu X, Song Z, Yang X, Lei Z, Qiao H. Weakly Aligned Feature Fusion for Multimodal Object Detection. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:4145-4159. [PMID: 34437075] [DOI: 10.1109/tnnls.2021.3105143]
Abstract
To achieve accurate and robust object detection in real-world scenarios, various forms of images are incorporated, such as color, thermal, and depth. However, multimodal data often suffer from the position shift problem, i.e., the image pair is not strictly aligned, so that one object has different positions in different modalities. For deep learning methods, this problem makes it difficult to fuse multimodal features and complicates convolutional neural network (CNN) training. In this article, we propose a general multimodal detector named aligned region CNN (AR-CNN) to tackle the position shift problem. First, a region feature (RF) alignment module with an adjacent similarity constraint is designed to consistently predict the position shift between two modalities and adaptively align the cross-modal RFs. Second, we propose a novel region of interest (RoI) jitter strategy to improve robustness to unexpected shift patterns. Third, we present a new multimodal feature fusion method that selects the more reliable feature and suppresses the less useful one via feature reweighting. In addition, by locating bounding boxes in both modalities and building their relationships, we provide novel multimodal labeling named KAIST-Paired. Extensive experiments on 2-D and 3-D object detection, RGB-T, and RGB-D datasets demonstrate the effectiveness and robustness of our method.
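The reweighting-based fusion mentioned above can be sketched as a small gating module that predicts per-modality reliability weights and fuses the features by a weighted sum. The gating design below is a generic assumption for illustration, not AR-CNN's exact module.

```python
import torch
import torch.nn as nn

class ModalityReweighting(nn.Module):
    """Predict per-modality reliability weights from the concatenated features
    and fuse the two modalities by a weighted sum."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2, kernel_size=1),
            nn.Softmax(dim=1))

    def forward(self, feat_rgb, feat_thermal):   # both [B, C, H, W]
        w = self.gate(torch.cat([feat_rgb, feat_thermal], dim=1))  # [B, 2, 1, 1]
        return w[:, 0:1] * feat_rgb + w[:, 1:2] * feat_thermal
```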
5. Gao F, Leng J, Gan J, Gao X. RC-DETR: Improving DETRs in crowded pedestrian detection via rank-based contrastive learning. Neural Netw 2025; 182:106911. [PMID: 39612687] [DOI: 10.1016/j.neunet.2024.106911]
Abstract
The variants of DEtection TRansformer (DETRs) have achieved impressive performance in general object detection. However, they suffer notable performance degradation in crowded pedestrian detection. This decline primarily arises during the training phase, where DETRs are constrained solely by pedestrian labels. This limitation leads to indistinguishable image features between visually similar pedestrians and background elements, resulting in incorrect detections. To address this issue, this paper introduces a rank-based contrastive learning method, which constructs an additional, specific constraint for each indistinguishable training sample to produce distinguishable image features. Unlike previous methods that rely solely on pedestrian labels to achieve a consistent confidence score, our approach relies on multiple constraints and aims to ensure the correct rank of detection results, with the confidence scores of pedestrians consistently surpassing those of background elements. Specifically, we first filter out training samples that could interfere with our delineation of indistinguishable and distinguishable training samples. Then, based on the confidence score rank, we divide the remaining training samples into distinguishable positive and negative samples and indistinguishable positive and negative samples. Finally, we combine these training samples into multiple positive and negative pairs and use these sample pairs to train DETRs via contrastive learning. Our method can be plugged into any DETR and adds no inference overhead. Extensive experiments on three DETRs show that our method achieves superior performance. In particular, on the CrowdHuman dataset, our method achieves a state-of-the-art 38.9% MR.
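A minimal sketch of the ranking idea: each pedestrian sample is paired with background samples and a pairwise softmax encourages the pedestrian confidence to win. The classifier head and pairing scheme here are assumptions; RC-DETR's actual sample filtering and contrastive pair construction are more elaborate.

```python
import torch
import torch.nn.functional as F

def rank_contrastive_loss(pos_emb, neg_emb, classifier, temperature=0.1):
    """For each (pedestrian, background) pair, encourage the pedestrian
    embedding to score higher than the background one via a two-way softmax.
    pos_emb: [P, D] pedestrian query features; neg_emb: [N, D] background
    query features; classifier: any module mapping D -> 1."""
    pos_scores = classifier(pos_emb).squeeze(-1) / temperature   # [P]
    neg_scores = classifier(neg_emb).squeeze(-1) / temperature   # [N]
    # Every pedestrian sample is paired with every background sample.
    pair_logits = torch.stack(
        [pos_scores.unsqueeze(1).expand(-1, neg_scores.numel()),
         neg_scores.unsqueeze(0).expand(pos_scores.numel(), -1)], dim=-1)  # [P, N, 2]
    target = torch.zeros(pair_logits.shape[:2], dtype=torch.long,
                         device=pair_logits.device)  # index 0 = pedestrian wins
    return F.cross_entropy(pair_logits.reshape(-1, 2), target.reshape(-1))
```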
Affiliation(s)
- Feng Gao
- Chongqing Key Laboratory of Image Cognition, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Jiaxu Leng
- Chongqing Key Laboratory of Image Cognition, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China.
- Ji Gan
- Chongqing Key Laboratory of Image Cognition, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
- Xinbo Gao
- Chongqing Key Laboratory of Image Cognition, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China.
6. He L, Li M, Wang X, Wu X, Yue G, Wang T, Zhou Y, Lei B, Zhou G. Morphology-based deep learning enables accurate detection of senescence in mesenchymal stem cell cultures. BMC Biol 2024; 22:1. [PMID: 38167069] [PMCID: PMC10762950] [DOI: 10.1186/s12915-023-01780-2]
Abstract
BACKGROUND Cell senescence is a sign of aging and plays a significant role in the pathogenesis of age-related disorders. For cell therapy, senescence may compromise the quality and efficacy of cells, posing potential safety risks. Mesenchymal stem cells (MSCs) are currently undergoing extensive research for cell therapy, necessitating the development of effective methods to evaluate senescence. Senescent MSCs exhibit a distinctive morphology that can be used for detection. However, morphological assessment during MSC production is often subjective and uncertain. New tools are required for the reliable evaluation of senescent single cells on a large scale in live imaging of MSCs. RESULTS We have developed a morphology-based Cascade region-based convolutional neural network (Cascade R-CNN) system for detecting senescent MSCs, which can automatically locate single cells of different sizes and shapes in multicellular images and assess their senescence state. Additionally, we tested the applicability of the Cascade R-CNN system to MSC senescence and examined the correlation between morphological changes and other senescence indicators. CONCLUSIONS This deep learning approach has been applied for the first time to detect senescent MSCs, showing promising performance in both chronic and acute MSC senescence. The system can be a labor-saving and cost-effective option for screening MSC culture conditions and anti-aging drugs, as well as a powerful tool for non-invasive and real-time morphological image analysis integrated into cell production.
Affiliation(s)
- Liangge He
- Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Shenzhen University Medical School, 1066 Xueyuan Avenue, Shenzhen, 518060, China
- Department of Medical Cell Biology and Genetics, Shenzhen Key Laboratory of Anti-Aging and Regenerative Medicine, Shenzhen Engineering Laboratory of Regenerative Technologies for Orthopedic Diseases, Shenzhen University Medical School, Shenzhen, 518060, China
- Mingzhu Li
- Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Shenzhen University Medical School, 1066 Xueyuan Avenue, Shenzhen, 518060, China
- Xinglie Wang
- Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Shenzhen University Medical School, 1066 Xueyuan Avenue, Shenzhen, 518060, China
- Xiaoyan Wu
- Department of Dermatology, Shenzhen Institute of Translational Medicine, Shenzhen Second People's Hospital, The First Affiliated Hospital of Shenzhen University, Shenzhen, 518035, China
- Guanghui Yue
- Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Shenzhen University Medical School, 1066 Xueyuan Avenue, Shenzhen, 518060, China
- Tianfu Wang
- Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Shenzhen University Medical School, 1066 Xueyuan Avenue, Shenzhen, 518060, China
- Yan Zhou
- Department of Medical Cell Biology and Genetics, Shenzhen Key Laboratory of Anti-Aging and Regenerative Medicine, Shenzhen Engineering Laboratory of Regenerative Technologies for Orthopedic Diseases, Shenzhen University Medical School, Shenzhen, 518060, China
- Lungene Biotech Ltd., Shenzhen, 18000, China
- Baiying Lei
- Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, School of Biomedical Engineering, Shenzhen University Medical School, 1066 Xueyuan Avenue, Shenzhen, 518060, China.
- Guangqian Zhou
- Department of Medical Cell Biology and Genetics, Shenzhen Key Laboratory of Anti-Aging and Regenerative Medicine, Shenzhen Engineering Laboratory of Regenerative Technologies for Orthopedic Diseases, Shenzhen University Medical School, Shenzhen, 518060, China.
7. Li T, Sun G, Yu L, Zhou K. HRBUST-LLPED: A Benchmark Dataset for Wearable Low-Light Pedestrian Detection. Micromachines 2023; 14:2164. [PMID: 38138333] [PMCID: PMC10745713] [DOI: 10.3390/mi14122164]
Abstract
Detecting pedestrians in low-light conditions is challenging, especially on wearable platforms. Infrared cameras have been employed to enhance detection capabilities, whereas low-light cameras capture more intricate pedestrian features. With this in mind, we introduce a low-light pedestrian detection dataset (HRBUST-LLPED) by capturing pedestrian data on campus with wearable low-light cameras. Most of the data were gathered under starlight-level illumination. The dataset annotates 32,148 pedestrian instances in 4269 keyframes, with a high pedestrian density of more than seven people per image. We provide four lightweight low-light pedestrian detection models based on the advanced YOLOv5 and YOLOv8. By training the models on public datasets and fine-tuning them on HRBUST-LLPED, our model obtains 69.90% AP@0.5:0.95 with an inference time of 1.6 ms. The experiments demonstrate that our work can help advance pedestrian detection research with low-light cameras on wearable devices.
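A hedged sketch of the pretrain-then-fine-tune recipe using the Ultralytics YOLOv8 API; the dataset YAML name and hyperparameters below are placeholders for illustration, not values from the paper.

```python
from ultralytics import YOLO

# Start from a model pretrained on a public dataset (COCO weights here),
# then fine-tune on the low-light pedestrian data.
model = YOLO("yolov8n.pt")

model.train(
    data="hrbust_llped.yaml",   # hypothetical dataset config: image paths + one "pedestrian" class
    epochs=100,
    imgsz=640,
    lr0=0.001,                  # lower initial learning rate for fine-tuning
)

metrics = model.val()           # reports mAP@0.5:0.95 among other metrics
print(metrics.box.map)
```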
Affiliation(s)
- Guanglu Sun
- School of Computer Science and Technology, Harbin University of Science and Technology, No. 52 Xuefu Road, Nangang District, Harbin 150080, China; (T.L.)
8. Li D, Tian Y, Li J. SODFormer: Streaming Object Detection With Transformer Using Events and Frames. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:14020-14037. [PMID: 37494161] [DOI: 10.1109/tpami.2023.3298925]
Abstract
The DAVIS camera, which streams two complementary sensing modalities (asynchronous events and frames), has increasingly been used to address major object detection challenges (e.g., fast motion blur and low light). However, how to effectively leverage rich temporal cues and fuse two heterogeneous visual streams remains a challenging endeavor. To address this challenge, we propose a novel streaming object detector with Transformer, namely SODFormer, which first integrates events and frames to continuously detect objects in an asynchronous manner. Technically, we first build a large-scale multimodal neuromorphic object detection dataset (i.e., PKU-DAVIS-SOD) with over 1080.1k manual labels. Then, we design a spatiotemporal Transformer architecture to detect objects via an end-to-end sequence prediction problem, where a novel temporal Transformer module leverages rich temporal cues from the two visual streams to improve detection performance. Finally, an asynchronous attention-based fusion module is proposed to integrate the two heterogeneous sensing modalities and exploit the complementary advantages of each; it can be queried at any time to locate objects, breaking through the limited output frequency of synchronized frame-based fusion strategies. The results show that the proposed SODFormer outperforms four state-of-the-art methods and our eight baselines by a significant margin. We also show that our unifying framework works well even in cases where the conventional frame-based camera fails, e.g., high-speed motion and low-light conditions. Our dataset and code are available at https://github.com/dianzl/SODFormer.
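The attention-based fusion of the two streams can be illustrated with a generic cross-attention block in which frame tokens query event tokens; this is a simplified stand-in for SODFormer's asynchronous fusion module, and the dimensions and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class EventFrameFusion(nn.Module):
    """Fuse frame tokens and event tokens with cross-attention: frame tokens
    act as queries, event tokens as keys/values, plus a residual connection."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens, event_tokens):   # [B, Nf, D], [B, Ne, D]
        fused, _ = self.attn(query=frame_tokens, key=event_tokens,
                             value=event_tokens)
        return self.norm(frame_tokens + fused)
```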
9. Yan J, Zhao J, Cai Y, Wang S, Qiu X, Yao X, Tian Y, Zhu Y, Cao W, Zhang X. Improving multi-scale detection layers in the deep learning network for wheat spike detection based on interpretive analysis. Plant Methods 2023; 19:46. [PMID: 37179312] [PMCID: PMC10183117] [DOI: 10.1186/s13007-023-01020-2]
Abstract
BACKGROUND Detecting and counting wheat spikes is essential for predicting and measuring wheat yield. However, current wheat spike detection research often directly applies new network structures, and few studies combine prior knowledge of wheat spike size characteristics to design a suitable detection model. It remains unclear whether the complex detection layers of the network play their intended role. RESULTS This study proposes an interpretive analysis method for quantitatively evaluating the role of the three-scale detection layers in a deep learning-based wheat spike detection model. The attention scores in each detection layer of the YOLOv5 network are calculated using the Gradient-weighted Class Activation Mapping (Grad-CAM) algorithm, which compares the labeled wheat spike bounding boxes with the attention areas of the network. By refining the multi-scale detection layers using the attention scores, a better wheat spike detection network is obtained. Experiments on the Global Wheat Head Detection (GWHD) dataset show that the large-scale detection layer performs poorly, while the medium-scale detection layer performs best among the three. Consequently, the large-scale detection layer is removed, a micro-scale detection layer is added, and the feature extraction ability of the medium-scale detection layer is enhanced. The refined model increases detection accuracy and reduces network complexity by decreasing the number of network parameters. CONCLUSION The proposed interpretive analysis method can evaluate the contribution of different detection layers in a wheat spike detection network and provide a sound network improvement scheme. The findings of this study offer a useful reference for future applications of deep network refinement in this field.
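A minimal sketch of the interpretive analysis: compute a Grad-CAM map for one detection layer and score how much of its attention mass falls inside the labeled spike boxes. The exact scoring formula used in the paper may differ; the helper names and normalization below are assumptions.

```python
import torch
import torch.nn.functional as F

def grad_cam(activations, gradients):
    """Grad-CAM for one detection layer: channel weights are the spatially
    averaged gradients; the map is a ReLU-ed weighted sum of activations."""
    weights = gradients.mean(dim=(2, 3), keepdim=True)         # [B, C, 1, 1]
    cam = F.relu((weights * activations).sum(dim=1))            # [B, H, W]
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-6)     # normalize to [0, 1]
    return cam

def layer_attention_score(cam, boxes, image_size):
    """Fraction of the attention mass that falls inside labeled spike boxes.
    cam: [H, W] map for one image; boxes: [N, 4] as (x1, y1, x2, y2) in image
    coordinates; image_size: (H_img, W_img)."""
    h, w = cam.shape
    mask = torch.zeros_like(cam)
    sx, sy = w / image_size[1], h / image_size[0]
    for x1, y1, x2, y2 in boxes:
        mask[int(y1 * sy):int(y2 * sy), int(x1 * sx):int(x2 * sx)] = 1.0
    return (cam * mask).sum() / (cam.sum() + 1e-6)
```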
Affiliation(s)
- Jiawei Yan
- National Engineering and Technology Center for Information Agriculture, Nanjing Agricultural University, Nanjing, 210095, China
- Key Laboratory for Crop System Analysis and Decision Making, Ministry of Agriculture and Rural Affairs, Nanjing, 210095, China
- Jianqing Zhao
- National Engineering and Technology Center for Information Agriculture, Nanjing Agricultural University, Nanjing, 210095, China
- Key Laboratory for Crop System Analysis and Decision Making, Ministry of Agriculture and Rural Affairs, Nanjing, 210095, China
- Yucheng Cai
- National Engineering and Technology Center for Information Agriculture, Nanjing Agricultural University, Nanjing, 210095, China
- Key Laboratory for Crop System Analysis and Decision Making, Ministry of Agriculture and Rural Affairs, Nanjing, 210095, China
- Suwan Wang
- National Engineering and Technology Center for Information Agriculture, Nanjing Agricultural University, Nanjing, 210095, China
- Key Laboratory for Crop System Analysis and Decision Making, Ministry of Agriculture and Rural Affairs, Nanjing, 210095, China
- Xiaolei Qiu
- National Engineering and Technology Center for Information Agriculture, Nanjing Agricultural University, Nanjing, 210095, China
- Key Laboratory for Crop System Analysis and Decision Making, Ministry of Agriculture and Rural Affairs, Nanjing, 210095, China
- Xia Yao
- National Engineering and Technology Center for Information Agriculture, Nanjing Agricultural University, Nanjing, 210095, China
- Key Laboratory for Crop System Analysis and Decision Making, Ministry of Agriculture and Rural Affairs, Nanjing, 210095, China
- Jiangsu Key Laboratory for Information Agriculture, Nanjing, 210095, China
- Yongchao Tian
- National Engineering and Technology Center for Information Agriculture, Nanjing Agricultural University, Nanjing, 210095, China
- Jiangsu Collaborative Innovation Center for Modern Crop Production, Nanjing, 210095, China
- Yan Zhu
- National Engineering and Technology Center for Information Agriculture, Nanjing Agricultural University, Nanjing, 210095, China
- Key Laboratory for Crop System Analysis and Decision Making, Ministry of Agriculture and Rural Affairs, Nanjing, 210095, China
- Weixing Cao
- National Engineering and Technology Center for Information Agriculture, Nanjing Agricultural University, Nanjing, 210095, China
- Key Laboratory for Crop System Analysis and Decision Making, Ministry of Agriculture and Rural Affairs, Nanjing, 210095, China
- Xiaohu Zhang
- National Engineering and Technology Center for Information Agriculture, Nanjing Agricultural University, Nanjing, 210095, China.
- Key Laboratory for Crop System Analysis and Decision Making, Ministry of Agriculture and Rural Affairs, Nanjing, 210095, China.
- Jiangsu Collaborative Innovation Center for Modern Crop Production, Nanjing, 210095, China.
10. Lin Z, Pei W, Chen F, Zhang D, Lu G. Pedestrian Detection by Exemplar-Guided Contrastive Learning. IEEE Transactions on Image Processing 2023; 32:2003-2016. [PMID: 35839180] [DOI: 10.1109/tip.2022.3189803]
Abstract
Typical methods for pedestrian detection focus on either tackling mutual occlusions between crowded pedestrians or dealing with the various scales of pedestrians. Detecting pedestrians with substantial appearance diversity, such as different silhouettes, viewpoints, or clothing, remains a crucial challenge. Instead of learning each of these diverse pedestrian appearance features individually, as most existing methods do, we propose to perform contrastive learning to guide feature learning such that the semantic distance between pedestrians with different appearances in the learned feature space is minimized to eliminate the appearance diversity, while the distance between pedestrians and background is maximized. To facilitate the efficiency and effectiveness of contrastive learning, we construct an exemplar dictionary with representative pedestrian appearances as prior knowledge to build effective contrastive training pairs and thus guide contrastive learning. Besides, the constructed exemplar dictionary is further leveraged to evaluate the quality of pedestrian proposals during inference by measuring the semantic distance between a proposal and the exemplar dictionary. Extensive experiments on both daytime and nighttime pedestrian detection validate the effectiveness of the proposed method.
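A hedged sketch of the inference-time use of an exemplar dictionary: each proposal is scored by its similarity to the nearest exemplar, which can then be combined with the detector confidence. The function is illustrative only; the paper's distance measure and blending scheme are not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def exemplar_similarity_score(proposal_feats, exemplar_dict):
    """Score each proposal by its maximum cosine similarity to the exemplar
    dictionary. proposal_feats: [N, D] RoI features; exemplar_dict: [K, D]
    representative pedestrian appearance features."""
    p = F.normalize(proposal_feats, dim=1)
    e = F.normalize(exemplar_dict, dim=1)
    sim = p @ e.t()                      # [N, K] cosine similarities
    return sim.max(dim=1).values         # nearest-exemplar similarity per proposal
```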
11. Er MJ, Chen J, Zhang Y, Gao W. Research Challenges, Recent Advances, and Popular Datasets in Deep Learning-Based Underwater Marine Object Detection: A Review. Sensors (Basel) 2023; 23:1990. [PMID: 36850584] [PMCID: PMC9966468] [DOI: 10.3390/s23041990]
Abstract
Underwater marine object detection, one of the most fundamental techniques in the marine science and engineering community, has shown tremendous potential for exploring the oceans in recent years. It has been widely applied in practical applications, such as monitoring of underwater ecosystems, exploration of natural resources, and management of commercial fisheries. However, due to the complexity of the underwater environment, the characteristics of marine objects, and the limitations imposed by exploration equipment, detection performance in terms of speed, accuracy, and robustness can degrade dramatically when conventional approaches are used. Deep learning has had a significant impact on a variety of applications, including marine engineering. In this context, we offer a review of deep learning-based underwater marine object detection techniques. Underwater object detection can be performed by different sensors, such as acoustic sonar or optical cameras. In this paper, we focus on vision-based object detection due to its several significant advantages. To facilitate a thorough understanding of this subject, we organize the research challenges of vision-based underwater object detection into four categories: image quality degradation, small object detection, poor generalization, and real-time detection. We review recent advances in underwater marine object detection and highlight the advantages and disadvantages of existing solutions for each challenge. In addition, we provide a detailed critical examination of the most extensively used datasets. Finally, we present comparative studies with previous reviews, notably those approaches that leverage artificial intelligence, as well as future trends related to this hot topic.
12. Lim J, Baskaran VM, Lim JMY, Wong K, See J, Tistarelli M. ERNet: An Efficient and Reliable Human-Object Interaction Detection Network. IEEE Transactions on Image Processing 2023; 32:964-979. [PMID: 37022006] [DOI: 10.1109/tip.2022.3231528]
Abstract
Human-Object Interaction (HOI) detection recognizes how persons interact with objects, which is advantageous in autonomous systems such as self-driving vehicles and collaborative robots. However, current HOI detectors are often plagued by model inefficiency and unreliability when making predictions, which limits their potential for real-world scenarios. In this paper, we address these challenges by proposing ERNet, an end-to-end trainable convolutional-transformer network for HOI detection. The proposed model employs efficient multi-scale deformable attention to effectively capture vital HOI features. We also put forward a novel detection attention module to adaptively generate semantically rich instance and interaction tokens. These tokens undergo pre-emptive detections to produce initial region and vector proposals that also serve as queries, which enhances the feature refinement process in the transformer decoders. Several impactful enhancements are also applied to improve the HOI representation learning. Additionally, we utilize a predictive uncertainty estimation framework in the instance and interaction classification heads to quantify the uncertainty behind each prediction. By doing so, we can accurately and reliably predict HOIs even under challenging scenarios. Experimental results on the HICO-Det, V-COCO, and HOI-A datasets demonstrate that the proposed model achieves state-of-the-art performance in detection accuracy and training efficiency. Code is publicly available at https://github.com/Monash-CyPhi-AI-Research-Lab/ernet.
13. An objective method for pedestrian occlusion level classification. Pattern Recognit Lett 2022. [DOI: 10.1016/j.patrec.2022.10.028]
14. Occluded pedestrian detection through bi-center prediction in anchor-free network. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.08.026]
15. Cao J, Pang Y, Xie J, Khan FS, Shao L. From Handcrafted to Deep Features for Pedestrian Detection: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:4913-4934. [PMID: 33929956] [DOI: 10.1109/tpami.2021.3076733]
Abstract
Pedestrian detection is an important but challenging problem in computer vision, especially in human-centric tasks. Over the past decade, significant improvement has been witnessed with the help of handcrafted features and deep features. Here we present a comprehensive survey of recent advances in pedestrian detection. First, we provide a detailed review of single-spectral pedestrian detection that covers handcrafted-feature-based methods and deep-feature-based approaches. For handcrafted-feature-based methods, we present an extensive review and find that handcrafted features with large degrees of freedom in shape and space achieve better performance. In the case of deep-feature-based approaches, we split them into pure CNN-based methods and those employing both handcrafted and CNN-based features. We give a statistical analysis and the tendencies of these methods, where feature-enhanced, part-aware, and post-processing methods have attracted the most attention. In addition to single-spectral pedestrian detection, we also review multi-spectral pedestrian detection, which provides more robust features under illumination variance. Furthermore, we introduce related datasets and evaluation metrics and provide an in-depth experimental analysis. We conclude this survey by emphasizing open problems that need to be addressed and highlighting various future directions. Researchers can track an up-to-date list at https://github.com/JialeCao001/PedSurvey.
16. Li W, Chen Z, Li B, Zhang D, Yuan Y. HTD: Heterogeneous Task Decoupling for Two-Stage Object Detection. IEEE Transactions on Image Processing 2021; 30:9456-9469. [PMID: 34780326] [DOI: 10.1109/tip.2021.3126423]
Abstract
Decoupling the sibling head has recently shown great potential in relieving the inherent task-misalignment problem in two-stage object detectors. However, existing works design similar structures for classification and regression, ignoring task-specific characteristics and feature demands. Besides, the shared knowledge that may benefit the two branches is neglected, leading to potential excessive decoupling and semantic inconsistency. To address these two issues, we propose the Heterogeneous Task Decoupling (HTD) framework for object detection, which utilizes a Progressive Graph (PGraph) module and a Border-aware Adaptation (BA) module for task decoupling. Specifically, we first devise a Semantic Feature Aggregation (SFA) module to aggregate global semantics with image-level supervision, serving as the shared knowledge for the task-decoupled framework. Then, the PGraph module performs progressive graph reasoning, including local spatial aggregation and global semantic interaction, to enhance the semantic representations of region proposals for classification. The proposed BA module integrates multi-level features adaptively, focusing on low-level border activation to obtain representations with spatial and border perception for regression. Finally, we utilize the aggregated knowledge from SFA to keep the instance-level semantic consistency (ISC) of the decoupled framework. Extensive experiments demonstrate that HTD outperforms existing detection works by a large margin and achieves single-model 50.4% AP and 33.2% APs on the COCO test-dev set using a ResNet-101-DCN backbone, which is the best entry among state-of-the-art methods under the same configuration. Our code is available at https://github.com/CityU-AIM-Group/HTD.
17. Zhao Z, Liu Y, Sun X, Liu J, Yang X, Zhou C. Composited FishNet: Fish Detection and Species Recognition From Low-Quality Underwater Videos. IEEE Transactions on Image Processing 2021; 30:4719-4734. [PMID: 33905330] [DOI: 10.1109/tip.2021.3074738]
Abstract
The automatic detection and identification of fish from underwater videos is of great significance for fishery resource assessment and ecological environment monitoring. However, due to the poor quality of underwater images and unconstrained fish movement, traditional hand-designed feature extraction methods and convolutional neural network (CNN)-based object detection algorithms cannot meet the detection requirements of real underwater scenes. Therefore, to realize fish recognition and localization in a complex underwater environment, this paper proposes a novel composite fish detection framework based on a composite backbone and an enhanced path aggregation network, called Composited FishNet. By improving the residual network (ResNet), a new composite backbone network (CBresnet) is designed to learn scene change information (source domain style) caused by differences in image brightness, fish orientation, seabed structure, aquatic plant movement, and fish species shape and texture. Thus, the interference of underwater environmental information with the object characteristics is reduced, and the output of the main network for the object information is strengthened. In addition, to better integrate the high- and low-level feature information output by CBresnet, an enhanced path aggregation network (EPANet) is designed to address the insufficient utilization of semantic information caused by linear upsampling. The experimental results show that the average precision AP0.5:0.95 and AP50 and the average recall ARmax=10 of the proposed Composited FishNet are 75.2%, 92.8%, and 81.1%, respectively. The composite backbone network enhances the characteristic information output of the detected object and improves the utilization of characteristic information. This method can be used for fish detection and identification in complex underwater environments such as oceans and aquaculture.
18. Li Y, Pang Y, Cao J, Shen J, Shao L. Improving Single Shot Object Detection With Feature Scale Unmixing. IEEE Transactions on Image Processing 2021; 30:2708-2721. [PMID: 33417552] [DOI: 10.1109/tip.2020.3048630]
Abstract
Due to the advantages of real-time detection and improved performance, single-shot detectors have gained great attention recently. To handle complex scale variations, single-shot detectors make scale-aware predictions based on multiple pyramid layers. Typically, small objects are detected on shallow layers while large objects are detected on deep layers. However, the features in the pyramid are not scale-aware enough, which limits detection performance. Two common problems in single-shot detectors caused by object scale variations can be observed: (1) the false negative problem, i.e., small objects are easily missed due to weak features; (2) the part false positive problem, i.e., the salient part of a large object is sometimes detected as an object. With this observation, a new Neighbor Erasing and Transferring (NET) mechanism is proposed in this paper for feature scale-unmixing to explore scale-aware features. In NET, a Neighbor Erasing Module (NEM) is designed to erase the salient features of large objects and emphasize the features of small objects in shallow layers. A Neighbor Transferring Module (NTM) is introduced to transfer the erased features and highlight large objects in deep layers. With this mechanism, a single-shot network called NETNet is constructed for scale-aware object detection. In addition, we propose to aggregate nearest neighboring pyramid features to enhance NET. Experiments on the MS COCO and UAVDT datasets demonstrate the effectiveness of our method. NETNet obtains 38.5% AP at 27 FPS and 32.0% AP at 55 FPS on the MS COCO dataset, achieving a better trade-off between real-time and accurate object detection.