1. Shi Y, Jia Y, Zhang X. FocusDet: an efficient object detector for small object. Sci Rep 2024;14:10697. [PMID: 38730236] [DOI: 10.1038/s41598-024-61136-w]
Abstract
The scale of objects in small-object scenes varies greatly, and the objects are easily disturbed by complex backgrounds, so generic object detectors perform poorly on small object detection tasks. This paper focuses on small object detection and proposes a detector named FocusDet, which consists of three parts: a backbone, a feature-fusion structure, and a detection head. The STCF-EANet backbone performs feature extraction, maintaining sufficient global context information while extracting multi-scale features. For feature fusion, whereas generic detectors use PAN to fuse the extracted feature maps and supplement feature information, FocusDet uses Bottom Focus-PAN to capture a wider range of positions and lower-level feature information of small objects. The detection head handles object localization and recognition. For the post-processing stage, SIOU-SoftNMS is proposed to remove redundant prediction boxes: SIoU locates prediction boxes accurately across multiple dimensions, and SoftNMS removes redundant boxes with a Gaussian weighting; together they address the missed detections common among dense tiny objects. Using the VisDrone2021-DET and CCTSDB2021 object detection datasets as benchmarks, tests on the VisDrone2021-DET-test-dev and CCTSDB-val sets show that FocusDet raises mAP@0.5 from 33.6% to 46.7% on VisDrone and from 81.6% to 87.8% on CCTSDB2021, demonstrating strong small object detection performance.
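For readers unfamiliar with the Gaussian variant of Soft-NMS that SIOU-SoftNMS builds on, here is a minimal NumPy sketch. It uses plain IoU in the decay term, whereas the paper's SIoU additionally accounts for angle, distance, and shape costs; the `sigma` and threshold values are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def soft_nms_gaussian(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay overlapping scores instead of discarding boxes."""
    scores = scores.copy()
    keep = []
    idxs = np.arange(len(scores))
    while idxs.size > 0:
        top = idxs[np.argmax(scores[idxs])]
        keep.append(top)
        idxs = idxs[idxs != top]
        if idxs.size == 0:
            break
        overlaps = iou(boxes[top], boxes[idxs])
        # Soft suppression: heavily overlapped boxes are down-weighted, not
        # removed outright, which reduces missed detections in dense scenes.
        scores[idxs] *= np.exp(-(overlaps ** 2) / sigma)
        idxs = idxs[scores[idxs] > score_thresh]
    return keep
```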
Affiliation(s)
- Yanli Shi
- College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 132000, China
- Yi Jia
- College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 132000, China
- Xianhe Zhang
- College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 132000, China
2. Zhang L, Qin L, Xu M, Chen W, Pu S, Zhang W. Randomized Spectrum Transformations for Adapting Object Detector in Unseen Domains. IEEE Trans Image Process 2023;32:4868-4879. [PMID: 37616139] [DOI: 10.1109/tip.2023.3306915]
Abstract
We propose Meta Learning on Randomized Transformations (MLRT) to learn domain-invariant object detectors. Domain generalization is the problem of learning, from multiple source domains, an invariant model that generalizes well to unseen target domains. This problem has been largely overlooked in object detection, where it is formally termed domain-generalizable object detection (DGOD). Moreover, existing domain generalization methods suffer from domain bias, easily overfitting to a specific (e.g., source) domain. To alleviate this bias, MLRT introduces a novel randomized spectrum transformation (RST) module that increases the diversity of source domains. Specifically, RST randomizes the domain-specific information of images in frequency space, transforming single or multiple source domains into various new domains. Besides, we observe that the degree of gradient imbalance among domains also reflects domain bias. We therefore further alleviate the bias from the perspective of gradient balancing, proposing a novel gradient weighting (GW) module that balances the gradients over all domains via a hand-crafted weight. Finally, we embed RST and GW into a general meta-learning framework, formalizing the MLRT model for the DGOD task. Extensive experiments on six benchmarks show that our method achieves state-of-the-art performance.
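The abstract does not spell out RST's exact form, but randomizing domain-specific information in frequency space commonly follows the Fourier recipe of perturbing amplitude spectra while keeping phase. The sketch below illustrates that general idea under those assumptions; the function name and mixing coefficient are hypothetical, not taken from the paper.

```python
import numpy as np

def spectrum_randomize(img_src, img_ref, max_lam=0.5):
    """Mix amplitude spectra of two images while keeping the source phase."""
    fft_src = np.fft.fft2(img_src, axes=(0, 1))
    fft_ref = np.fft.fft2(img_ref, axes=(0, 1))
    amp_src, phase_src = np.abs(fft_src), np.angle(fft_src)
    amp_ref = np.abs(fft_ref)
    # Amplitude mostly carries style/domain statistics while phase carries
    # structure, so interpolating amplitudes yields a "new domain" for the
    # same content.
    lam = np.random.uniform(0.0, max_lam)
    amp_mix = (1 - lam) * amp_src + lam * amp_ref
    mixed = np.fft.ifft2(amp_mix * np.exp(1j * phase_src), axes=(0, 1))
    return np.real(mixed).clip(0.0, 1.0)  # img_* assumed float (H, W, C) in [0, 1]
```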
3. Shang R, Li W, Zhu S, Jiao L, Li Y. Multi-teacher knowledge distillation based on joint Guidance of Probe and Adaptive Corrector. Neural Netw 2023;164:345-356. [PMID: 37163850] [DOI: 10.1016/j.neunet.2023.04.015]
Abstract
Knowledge distillation (KD) has been widely used in model compression. However, in current multi-teacher KD algorithms, the student can only passively acquire the knowledge of the teachers' middle layers in a single form, and all teachers apply an identical guiding scheme to the student. To solve these problems, this paper proposes a multi-teacher KD method based on the joint Guidance of a Probe and an Adaptive Corrector (GPAC). First, GPAC proposes a teacher-selection strategy guided by a Linear Classifier Probe (LCP), which lets the student select better teachers at the middle layer; teachers are evaluated by the classification accuracy measured with the LCP. Then, GPAC designs an adaptive multi-teacher instruction mechanism that uses instructional weights to emphasize the student's predicted direction and reduce the difficulty of learning from the teachers; at the same time, each teacher can formulate its own guiding scheme according to the Kullback-Leibler divergence loss between the student and itself. Finally, GPAC develops a multi-level mechanism for adjusting the spatial attention loss: a piecewise function of the training epoch classifies the student's learning of spatial attention into three levels, which efficiently exploits the teachers' spatial attention. GPAC and current state-of-the-art distillation methods are tested on the CIFAR-10 and CIFAR-100 datasets, and the experimental results demonstrate that the proposed method obtains higher classification accuracy.
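To make the Kullback-Leibler term concrete, here is a minimal PyTorch sketch of temperature-scaled distillation with a per-teacher KL loss. The adaptive weighting shown is only one plausible reading of the abstract's "instructional weights" (teachers closer to the student guide it more strongly); the real GPAC scheme may differ.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, T=4.0):
    """Temperature-scaled KL divergence between teacher and student outputs."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def multi_teacher_loss(student_logits, teacher_logits_list):
    # Hypothetical weighting: teachers whose outputs are closer to the student's
    # current predictions (smaller KL) get larger weights, easing learning.
    losses = torch.stack([kd_kl_loss(student_logits, t) for t in teacher_logits_list])
    weights = F.softmax(-losses.detach(), dim=0)
    return (weights * losses).sum()
```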
Affiliation(s)
- Ronghua Shang
- Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi'an, Shaanxi, China
- Wenzheng Li
- Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Guangzhou Institute of Technology, Xidian University, Guangzhou, Guangdong, China
- Songling Zhu
- Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi'an, Shaanxi, China
- Licheng Jiao
- Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi'an, Shaanxi, China
- Yangyang Li
- Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi'an, Shaanxi, China
4. Lv P, Hu S, Hao T. Contrastive Proposal Extension With LSTM Network for Weakly Supervised Object Detection. IEEE Trans Image Process 2022;31:6879-6892. [PMID: 36306305] [DOI: 10.1109/tip.2022.3216772]
Abstract
Weakly supervised object detection (WSOD) has attracted increasing attention since it uses only image-level labels and thus saves huge annotation costs. Most WSOD methods use Multiple Instance Learning (MIL) as their basic framework, treating detection as an instance classification problem. However, MIL-based methods tend to converge only on the most discriminative regions of instances rather than their complete extents; that is, they suffer from insufficient integrity. Inspired by how humans observe things, we propose to optimize initial proposals by comparing them with extended ones. Specifically, we introduce a contrastive proposal extension (CPE) strategy for WSOD, consisting of multiple directional contrastive proposal extensions (D-CPEs), each containing LSTM-based encoders and dual-stream decoders. First, the boundary of each initial MIL proposal is extended to different positions in a well-designed sequential order. The CPE then compares the extended and initial proposals by extracting their feature semantics with the encoders, and computes the integrity of the initial proposal to refine its score. These contrastive contextual semantics guide the base WSOD model to suppress poor proposals and raise the scores of good ones. In addition, a simple dual-stream network is designed as the decoder to constrain the temporal coding of the LSTM and further improve WSOD performance. Experiments on the PASCAL VOC 2007, VOC 2012 and MS-COCO datasets show that our method achieves state-of-the-art results.
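As a toy illustration of the directional extension step (the LSTM encoders and dual-stream decoders are beyond a short sketch), the helper below grows one side of a proposal box; the extension ratio is an assumed value, not the paper's.

```python
def extend_proposal(box, direction, ratio=0.2):
    """Grow one side of a [x1, y1, x2, y2] box by a fraction of its size."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if direction == "left":
        x1 -= ratio * w
    elif direction == "right":
        x2 += ratio * w
    elif direction == "up":
        y1 -= ratio * h
    elif direction == "down":
        y2 += ratio * h
    # Comparing features of the extended box against the original reveals
    # whether the proposal already covers the full object (high integrity).
    return [x1, y1, x2, y2]
```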
5. Coupled Global–Local object detection for large VHR aerial images. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.110097]
6. Zhao W, Rao Y, Tang Y, Zhou J, Lu J. VideoABC: A Real-World Video Dataset for Abductive Visual Reasoning. IEEE Trans Image Process 2022;31:6048-6061. [PMID: 36103440] [DOI: 10.1109/tip.2022.3205207]
Abstract
In this paper, we investigate the problem of abductive visual reasoning (AVR), which requires vision systems to infer the most plausible explanation for visual observations. Unlike previous work that performs visual reasoning on static images or synthesized scenes, we exploit long-term reasoning over instructional videos, which contain a wealth of detailed information about the physical world. We conceptualize two tasks for this emerging and challenging topic. The primary task is AVR: given the initial configuration and desired goal from an instructional video, the model must figure out the most plausible sequence of steps to achieve the goal. To rule out trivial solutions based on appearance rather than reasoning, we construct a second task, AVR++, which requires the model to explain why the unselected options are less plausible. We introduce a new dataset, VideoABC, consisting of 46,354 unique steps derived from 11,827 instructional videos and formulated as 13,526 abductive reasoning questions with an average reasoning duration of 51 seconds. Through an adversarial hard-hypothesis-mining algorithm, non-trivial, high-quality problems are generated efficiently and effectively. Toward human-level reasoning, we propose a Hierarchical Dual Reasoning Network (HDRNet) to capture the long-term dependencies among steps and observations. We establish a benchmark for abductive visual reasoning; our method sets the state of the art on AVR (~74%) and AVR++ (~45%), whereas humans easily achieve over 90% accuracy on both tasks. This large performance gap reveals the limitations of current video understanding models in temporal reasoning and leaves substantial room for future research on this challenging problem. Our dataset and code are available at https://github.com/wl-zhao/VideoABC.
7. Xu W, Zhang C, Wang Q, Dai P. FEA-Swin: Foreground Enhancement Attention Swin Transformer Network for Accurate UAV-Based Dense Object Detection. Sensors (Basel) 2022;22:6993. [PMID: 36146340] [PMCID: PMC9502707] [DOI: 10.3390/s22186993]
Abstract
UAV-based object detection has recently attracted a lot of attention due to its diverse applications. Most existing convolutional-neural-network object detection models perform well on common detection cases, but because objects in UAV images are spatially distributed in a very dense manner, their performance on UAV-based detection is limited. In this paper, we propose a novel transformer-based detection model to improve the accuracy of object detection in UAV images. To detect dense objects competently, a foreground enhancement attention Swin Transformer (FEA-Swin) framework is designed by integrating context information into the original Swin Transformer backbone. Moreover, to avoid losing information about small objects, an improved weighted bidirectional feature pyramid network (BiFPN) is presented by designing a skip-connection operation; it aggregates feature maps from four stages and retains abundant small-object information. Specifically, to balance detection accuracy and efficiency, we introduce an efficient BiFPN neck by removing a redundant network layer. Experimental results on both public datasets and a self-made dataset demonstrate the performance of our method against state-of-the-art methods in terms of detection accuracy.
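To illustrate the BiFPN-style fusion the abstract builds on, here is a hedged PyTorch sketch of a single weighted-fusion node using BiFPN's fast normalized fusion; the class name and channel handling are illustrative, and the paper's specific skip connection and pruned layer are not reproduced.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """One BiFPN-style node: fast normalized fusion of same-shape feature maps."""
    def __init__(self, n_inputs, channels):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))  # one learnable weight per input
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, feats):
        # feats: list of tensors with identical shape (B, C, H, W); a skip
        # connection is realized simply by appending an extra stage's map here.
        w = torch.relu(self.w)
        w = w / (w.sum() + 1e-4)  # fast normalized fusion (cheaper than softmax)
        fused = sum(wi * f for wi, f in zip(w, feats))
        return self.conv(self.act(fused))
```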
Affiliation(s)
- Wenyu Xu
- Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
- Science Island Branch of Graduate School, University of Science and Technology of China, Hefei 230026, China
- Chaofan Zhang
- Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
- Qi Wang
- Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
- Science Island Branch of Graduate School, University of Science and Technology of China, Hefei 230026, China
- Pangda Dai
- Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
8. A Novel Multi-Scale Transformer for Object Detection in Aerial Scenes. Drones 2022. [DOI: 10.3390/drones6080188]
Abstract
Deep learning has advanced object detection research in aerial scenes. However, most existing networks are limited by the large scale variation of objects and the confusion of category features. To overcome these limitations, this paper proposes a novel aerial object detection framework called DFCformer, mainly composed of three parts: the backbone network DMViT, which introduces deformation patch embedding and multi-scale adaptive self-attention to capture sufficient object features; FRGC, which guides feature interaction layer by layer to break the barriers between feature layers and improve the discrimination and processing of multi-scale critical features; and CAIM, which adopts an attention mechanism to fuse multi-scale features, performing hierarchical reasoning about the relationships between levels and fully exploiting the complementary information in multi-scale features. Extensive experiments on the FAIR1M dataset show the advantages of DFCformer, which achieves the highest scores with stronger scene adaptability.
9. Lin J, Zheng Z, Zhong Z, Luo Z, Li S, Yang Y, Sebe N. Joint Representation Learning and Keypoint Detection for Cross-View Geo-Localization. IEEE Trans Image Process 2022;31:3780-3792. [PMID: 35604972] [DOI: 10.1109/tip.2022.3175601]
Abstract
In this paper, we study the cross-view geo-localization problem of matching images from different viewpoints. The key motivation underpinning this task is to learn a discriminative, viewpoint-invariant visual representation. Inspired by the human visual system's habit of mining local patterns, we propose a new framework called RK-Net that jointly learns the discriminative Representation and detects salient Keypoints with a single Network. Specifically, we introduce a Unit Subtraction Attention Module (USAM) that automatically discovers representative keypoints from feature maps and draws attention to salient regions. USAM contains very few learnable parameters yet yields significant performance improvement, and it can easily be plugged into different networks. We demonstrate through extensive experiments that (1) by incorporating USAM, RK-Net facilitates end-to-end joint learning without requiring extra annotations. Representation learning and keypoint detection are two highly related tasks: representation learning aids keypoint detection, while keypoint detection, in turn, makes the model robust to the large appearance changes caused by viewpoint variation. (2) USAM is easy to implement and can be integrated with existing methods, further improving state-of-the-art performance. We achieve competitive geo-localization accuracy on three challenging datasets, i.e., University-1652, CVUSA and CVACT. Our code is available at https://github.com/AggMan96/RK-Net.
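The abstract does not detail USAM's internals, but a subtraction-based spatial attention can be sketched as follows: a unit-shifted copy of the feature map is subtracted to produce an edge-like response that gates the features. This is an assumption-laden approximation in the spirit of the module, not RK-Net's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnitSubtractAttention(nn.Module):
    """Subtraction-based spatial attention with very few learnable parameters."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(1, 1, kernel_size=1)  # single 1x1 conv

    def forward(self, x):
        # Subtract a one-pixel-shifted copy: large responses mark sharp local
        # changes, which is where salient keypoints tend to sit.
        shifted = F.pad(x, (1, 0, 1, 0))[:, :, : x.size(2), : x.size(3)]
        response = (x - shifted).abs().mean(dim=1, keepdim=True)
        attn = torch.sigmoid(self.proj(response))
        return x + x * attn  # residual form keeps the original features intact
```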
10. Adaptive dense pyramid network for object detection in UAV imagery. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.03.033]
11. Lightweight Detection Network Based on Sub-Pixel Convolution and Objectness-Aware Structure for UAV Images. Sensors (Basel) 2021;21:5656. [PMID: 34451098] [PMCID: PMC8402490] [DOI: 10.3390/s21165656]
Abstract
Unmanned Aerial Vehicles (UAVs) serve as an ideal mobile platform in various situations, and real-time object detection with on-board apparatus gives drones greater flexibility and a higher level of intelligence. To achieve good detection results on UAV images with complex ground scenes, small object sizes and high object density, most previous work introduced models with heavy computational burdens, making deployment on mobile platforms difficult. This paper puts forward a lightweight object detection framework. Besides being anchor-free, the framework combines a lightweight backbone with a simultaneous up-sampling and detection module to form a more efficient detection architecture. Meanwhile, we add an objectness branch to assist the multi-class center-point prediction, which notably improves detection accuracy while consuming very few computing resources. Experimental results indicate that the computational cost of our method is 92.78% lower than CenterNet with a ResNet18 backbone, while mAP is 2.8 points higher on the Visdrone-2018-VID dataset, and a frame rate of about 220 FPS is achieved. Additionally, we perform ablation experiments to verify the validity of each component, and we compare the proposed method with other representative lightweight object detection methods on UAV image datasets.
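A minimal PyTorch sketch of the two ideas named in the title, sub-pixel (PixelShuffle) up-sampling and an objectness-aware head, is given below; the layer sizes and the way objectness gates the class heatmap are illustrative assumptions rather than the paper's exact head.

```python
import torch.nn as nn

class SubPixelHead(nn.Module):
    """Sub-pixel up-sampling followed by objectness-gated class heatmaps."""
    def __init__(self, in_ch, num_classes, scale=2):
        super().__init__()
        # Conv expands channels by scale^2; PixelShuffle rearranges them into a
        # (scale x) larger map, a cheap alternative to transposed convolution.
        self.up = nn.Sequential(
            nn.Conv2d(in_ch, in_ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.cls = nn.Conv2d(in_ch, num_classes, 1)  # per-class center heatmap
        self.obj = nn.Conv2d(in_ch, 1, 1)            # class-agnostic objectness

    def forward(self, x):
        x = self.up(x)
        # The objectness branch gates the class heatmap, one simple way an
        # auxiliary objectness signal can sharpen center-point predictions.
        return self.cls(x).sigmoid() * self.obj(x).sigmoid()
```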