1. Guo M, Chen B, Yan Z, Wang Y, Ye Q. Virtual Classification: Modulating Domain-Specific Knowledge for Multidomain Crowd Counting. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:2958-2972. [PMID: 38241099] [DOI: 10.1109/tnnls.2024.3350363]
Abstract
Multidomain crowd counting aims to learn a general model for multiple diverse datasets. However, deep networks prefer modeling the distributions of the dominant domains instead of all domains, which is known as domain bias. In this study, we propose a simple yet effective modulating domain-specific knowledge network (MDKNet) to handle the domain bias issue in multidomain crowd counting. MDKNet builds on the idea of "modulating," enabling the deep network to balance and model the different distributions of diverse datasets with little bias. Specifically, we propose an instance-specific batch normalization (IsBN) module, which serves as a base modulator to refine the information flow to be adaptive to domain distributions. To precisely modulate the domain-specific information, a domain-guided virtual classifier (DVC) is then introduced to learn a domain-separable latent space. This space is employed as input guidance for the IsBN modulator, such that the mixture distributions of multiple datasets can be well treated. Extensive experiments on popular benchmarks, including ShanghaiTech A/B, QNRF, and NWPU, validate the superiority of MDKNet in tackling multidomain crowd counting and its effectiveness for multidomain learning. Code is available at https://github.com/csguomy/MDKNet.
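As a rough illustration of the modulation idea, the sketch below wires per-instance scale and shift, predicted from a domain-guidance embedding, around a batch normalization layer with shared statistics; the module name, tensor shapes, and guidance dimension are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class IsBN(nn.Module):
    """Batch norm with shared statistics but per-instance scale/shift from a guidance vector."""
    def __init__(self, channels, guide_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=False)   # shared normalization statistics
        self.to_gamma = nn.Linear(guide_dim, channels)      # per-instance scale from the guidance
        self.to_beta = nn.Linear(guide_dim, channels)       # per-instance shift from the guidance

    def forward(self, x, guide):
        # x: (B, C, H, W) feature map; guide: (B, D) domain-guidance embedding
        gamma = self.to_gamma(guide).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(guide).unsqueeze(-1).unsqueeze(-1)
        return self.bn(x) * (1 + gamma) + beta

feat = torch.randn(4, 64, 32, 32)
guide = torch.randn(4, 128)
out = IsBN(64, 128)(feat, guide)   # modulated features, same shape as feat
```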

2. Zhu J, Zhao W, Yao L, He Y, Hu M, Zhang X, Wang S, Li T, Lu H. Confusion Region Mining for Crowd Counting. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:18039-18051. [PMID: 37713223] [DOI: 10.1109/tnnls.2023.3311020]
Abstract
Existing works mainly focus on the crowd itself and ignore confusion regions, background areas whose appearance is extremely similar to that of the crowd, even though crowd counting must handle both at the same time. To address this issue, we propose a novel end-to-end trainable confusion region discriminating and erasing network called CDENet. Specifically, CDENet is composed of two modules: a confusion region mining module (CRM) and a guided erasing module (GEM). The CRM consists of a basic density estimation (BDE) network, a confusion-region-aware bridge, and a confusion region discriminating network. The BDE network first generates a primary density map, and then the confusion-region-aware bridge excavates the confusion regions by comparing the primary prediction with the ground-truth density map. Finally, the confusion region discriminating network learns the difference between feature representations in confusion regions and in crowds. Furthermore, the GEM produces the refined density map by erasing the confusion regions. We evaluate the proposed method on four crowd counting benchmarks, including ShanghaiTech Part_A, ShanghaiTech Part_B, UCF_CC_50, and UCF-QNRF, and our CDENet achieves superior performance compared with the state of the art.
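A minimal sketch of how confusion regions could be excavated by comparing a primary prediction with the ground-truth density map, and then erased; the thresholds and the erasing rule are illustrative assumptions, not the exact CRM/GEM formulation.

```python
import torch

def confusion_region_mask(pred_density, gt_density, fp_thresh=0.05, bg_thresh=1e-3):
    # Background pixels where the model nevertheless predicts noticeable density are
    # treated as confusion regions (background that looks like crowd).
    background = gt_density < bg_thresh
    overestimated = pred_density > fp_thresh
    return (background & overestimated).float()

pred = torch.rand(1, 1, 96, 96) * 0.1    # primary density prediction
gt = torch.zeros(1, 1, 96, 96)           # ground-truth density (empty scene here)
mask = confusion_region_mask(pred, gt)   # 1 where background is mistaken for crowd
refined = pred * (1 - mask)              # erasing-style refinement of the density map
```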

3. Yuan Z. SPCANet: congested crowd counting via strip pooling combined attention network. PeerJ Computer Science 2024; 10:e2273. [PMID: 39314741] [PMCID: PMC11419659] [DOI: 10.7717/peerj-cs.2273]
Abstract
Crowd counting aims to estimate the number and distribution of people in crowded places and is an important research direction in object counting. It is widely used in public place management, crowd behavior analysis, and other scenarios, showing its strong practicality. In recent years, crowd-counting technology has developed rapidly. However, in highly crowded and noisy scenes, the counting accuracy of most models is still seriously affected by perspective distortion, dense occlusion, and inconsistent crowd distribution. Perspective distortion causes crowds to appear at different sizes and shapes in the image, and dense occlusion and inconsistent crowd distributions mean that parts of the crowd are not captured completely. This ultimately results in imperfect capture of spatial information by the model. To solve such problems, we propose a strip pooling combined attention (SPCANet) network model based on normed-deformable convolution (NDConv). We model long-distance dependencies more efficiently by introducing strip pooling. In contrast to traditional square-kernel pooling, strip pooling uses long and narrow kernels (1×N or N×1) to deal with dense crowds, mutual occlusion, and overlap. Efficient channel attention (ECA), a mechanism for learning channel attention using a local cross-channel interaction strategy, is also introduced in SPCANet. This module generates channel attention through a fast 1D convolution to reduce model complexity while improving performance as much as possible. Extensive experiments on four mainstream datasets (Shanghai Tech Part A, Shanghai Tech Part B, UCF-QNRF, and UCF CC 50) show that the mean absolute error (MAE) improves on the baseline, reaching 60.9, 7.3, 90.8, and 161.1, respectively, validating the effectiveness of SPCANet. Meanwhile, the mean squared error (MSE) decreases by 5.7% on average over the four datasets, and robustness is greatly improved.
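The sketch below shows one common way strip pooling with long, narrow (1×N and N×1) contexts can be realized and used to gate features; the kernel sizes and the gating scheme are illustrative assumptions, not the SPCANet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripPooling(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))
        self.conv_v = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        row = F.adaptive_avg_pool2d(x, (h, 1))           # H x 1 strip per channel
        col = F.adaptive_avg_pool2d(x, (1, w))           # 1 x W strip per channel
        row = self.conv_v(row).expand(-1, -1, h, w)      # long narrow (Nx1) context
        col = self.conv_h(col).expand(-1, -1, h, w)      # long narrow (1xN) context
        return x * torch.sigmoid(self.fuse(row + col))   # gate features with strip context

out = StripPooling(32)(torch.randn(2, 32, 48, 64))       # same shape as the input
```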

Affiliation(s)
- Zhongyuan Yuan: College of Information and Intelligence, Hunan Agricultural University, Changsha, Hunan Province, China

4. Guo X, Gao M, Zou G, Bruno A, Chehri A, Jeon G. Object Counting via Group and Graph Attention Network. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:11884-11895. [PMID: 38051606] [DOI: 10.1109/tnnls.2023.3336894]
Abstract
Object counting, defined as the task of accurately predicting the number of objects in static images or videos, has recently attracted considerable interest. However, the unavoidable presence of background noise prevents counting performance from advancing further. To address this issue, we created a group and graph attention network (GGANet) for dense object counting. GGANet is an encoder-decoder architecture incorporating a group channel attention (GCA) module and a learnable graph attention (LGA) module. The GCA module splits the feature map into several subfeatures, each of which is assigned an attention factor through a shared channel attention. The LGA module views the feature map as a graph structure in which different channels represent feature vertices and the responses between channels represent edges. The GCA and LGA modules jointly avoid interference from irrelevant pixels and suppress background noise. Experiments are conducted on four crowd-counting datasets, two vehicle-counting datasets, one remote-sensing counting dataset, and one few-shot object-counting dataset. Comparative results show that the proposed GGANet achieves superior counting performance.
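A minimal sketch of group channel attention in the spirit described above: channels are split into groups and the same attention head re-weights each group from globally pooled statistics; the group count and layer names are illustrative assumptions, not the GGANet code.

```python
import torch
import torch.nn as nn

class GroupChannelAttention(nn.Module):
    def __init__(self, channels, groups=4):
        super().__init__()
        self.groups = groups
        # The same (identical) attention head is applied to every group.
        self.fc = nn.Sequential(nn.Linear(channels // groups, channels // groups), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        g = self.groups
        xg = x.view(b, g, c // g, h, w)                   # split channels into groups
        pooled = xg.mean(dim=(-2, -1))                    # global average per channel
        attn = self.fc(pooled).view(b, g, c // g, 1, 1)   # per-group attention factors
        return (xg * attn).view(b, c, h, w)

out = GroupChannelAttention(64, groups=4)(torch.randn(2, 64, 32, 32))
```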

5. Chen Y, Wang Q, Yang J, Chen B, Xiong H, Du S. Learning Discriminative Features for Crowd Counting. IEEE Transactions on Image Processing 2024; 33:3749-3764. [PMID: 38848225] [DOI: 10.1109/tip.2024.3408609]
Abstract
Crowd counting models in highly congested areas confront two main challenges: weak localization ability and difficulty in differentiating between foreground and background, leading to inaccurate estimations. The reason is that objects in highly congested areas are normally small, and the high-level features extracted by convolutional neural networks are less discriminative for representing small objects. To address these problems, we propose a learning discriminative features framework for crowd counting, which is composed of a masked feature prediction module (MPM) and a supervised pixel-level contrastive learning module (CLM). The MPM randomly masks feature vectors in the feature map and then reconstructs them, allowing the model to learn what is present in the masked regions and improving its ability to localize objects in high-density regions. The CLM pulls targets close to each other and pushes them far away from the background in the feature space, enabling the model to discriminate foreground objects from the background. Additionally, the proposed modules can be beneficial in various computer vision tasks, such as crowd counting and object detection, where dense scenes or cluttered environments pose challenges to accurate localization. The two proposed modules are plug-and-play; incorporating them into existing models can potentially boost their performance in these scenarios.
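The following sketch illustrates the masked-feature-prediction idea: a fraction of spatial feature vectors is zeroed out and a light head reconstructs them, with the loss computed only at masked positions; the mask ratio and the predictor architecture are illustrative assumptions, not the authors' MPM.

```python
import torch
import torch.nn as nn

def masked_feature_prediction_loss(feat, predictor, mask_ratio=0.3):
    # feat: (B, C, H, W) feature map; predictor: any module mapping C -> C per location.
    b, c, h, w = feat.shape
    mask = (torch.rand(b, 1, h, w, device=feat.device) < mask_ratio).float()
    masked = feat * (1 - mask)               # zero out the masked feature vectors
    recon = predictor(masked)                # reconstruct the full feature map
    # Only masked positions contribute to the reconstruction loss.
    return ((recon - feat.detach()) ** 2 * mask).sum() / (mask.sum() * c).clamp(min=1)

feat = torch.randn(2, 64, 32, 32)
predictor = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 64, 1))
loss = masked_feature_prediction_loss(feat, predictor)
```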

6. Zhou H, Jiang F, Si J, Ding Y, Lu H. BPJDet: Extended Object Representation for Generic Body-Part Joint Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024; 46:4314-4330. [PMID: 38227415] [DOI: 10.1109/tpami.2024.3354962]
Abstract
Detection of the human body and its parts has been intensively studied. However, most CNN-based detectors are trained independently, making it difficult to associate detected parts with bodies. In this paper, we focus on the joint detection of the human body and its parts. Specifically, we propose a novel extended object representation that integrates center-offsets of body parts, and construct an end-to-end generic Body-Part Joint Detector (BPJDet). In this way, body-part associations are neatly embedded in a unified representation containing both semantic and geometric contents. Therefore, we can optimize multiple losses to tackle multiple tasks synergistically. Moreover, this representation is suitable for both anchor-based and anchor-free detectors. BPJDet does not suffer from error-prone post matching and keeps a better trade-off between speed and accuracy. Furthermore, BPJDet can be generalized to detect a single body part or multiple body parts of either humans or quadruped animals. To verify the superiority of BPJDet, we conduct experiments on body-part datasets (CityPersons, CrowdHuman, and BodyHands) and body-parts datasets (COCOHumanParts and Animals5C). While keeping high detection accuracy, BPJDet achieves state-of-the-art association performance on all datasets. Besides, we show the benefits of advanced body-part association capability by improving the performance of two representative downstream applications: accurate crowd head detection and hand contact estimation.
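To make the extended representation concrete, the toy sketch below appends a body-to-part center offset to each body box and associates detected parts greedily via that offset; the field layout and the matching rule are illustrative assumptions, not BPJDet's actual decoding.

```python
import numpy as np

def associate_parts(bodies, parts):
    # bodies: (N, 6) = [cx, cy, w, h, off_x, off_y]; the offset points at the part center.
    # parts:  (M, 2) = detected part centers.
    matches = []
    for i, b in enumerate(bodies):
        predicted_part = b[:2] + b[4:6]                  # body center + learned offset
        d = np.linalg.norm(parts - predicted_part, axis=1)
        matches.append((i, int(d.argmin())))             # nearest detected part
    return matches

bodies = np.array([[100., 120., 60., 160., 0., -70.],
                   [300., 110., 55., 150., 2., -65.]])
parts = np.array([[100., 50.], [302., 45.]])
print(associate_parts(bodies, parts))                    # [(0, 0), (1, 1)]
```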

7. Shu W, Wan J, Chan AB. Generalized Characteristic Function Loss for Crowd Analysis in the Frequency Domain. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024; 46:2882-2899. [PMID: 37995158] [DOI: 10.1109/tpami.2023.3336196]
Abstract
Typical approaches that learn crowd density maps are limited to extracting supervisory information from the loosely organized spatial information in crowd dot/density maps. This paper tackles this challenge by performing the supervision in the frequency domain. More specifically, we devise a new loss function for crowd analysis called the generalized characteristic function loss (GCFL). This loss carries out two steps: 1) transforming the spatial information in density or dot maps into the frequency domain; 2) calculating a loss value between their frequency contents. For step 1, we establish a series of theoretical foundations by extending the definition of the characteristic function for probability distributions to density maps, and by proving some vital properties of the extended characteristic function. After taking the characteristic function of the density map, its information in the frequency domain is well organized and hierarchically distributed, whereas in the spatial domain it is loosely organized and dispersed everywhere. In step 2, we design a loss function that fits the information organization in the frequency domain, allowing the exploitation of the well-organized frequency information for the supervision of crowd analysis tasks. The loss function can be adapted to various crowd analysis tasks through the specification of its window functions. In this paper, we demonstrate its power on three tasks: crowd counting, crowd localization, and noisy crowd counting. We show the advantages of our GCFL compared to other SOTA losses and its competitiveness with other SOTA methods through theoretical analysis and empirical results on benchmark datasets. Our code is available at https://github.com/wbshu/Crowd_Counting_in_the_Frequency_Domain.
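For intuition, the sketch below computes an empirical characteristic function of a set of head points, phi(w) = (1/N) sum_k exp(i <w, x_k>), and compares prediction and ground truth over sampled frequencies; the frequency sampling and the plain absolute-difference loss are illustrative simplifications, not the GCFL with its window functions.

```python
import numpy as np

def characteristic_function(points, freqs):
    # points: (N, 2) head coordinates; freqs: (M, 2) sampled frequency vectors.
    phases = points @ freqs.T                   # (N, M) inner products <w, x_k>
    return np.exp(1j * phases).mean(axis=0)     # (M,) complex values

gt_pts = np.random.rand(500, 2) * 512
pred_pts = gt_pts + np.random.randn(500, 2) * 2.0    # slightly perturbed prediction
freqs = np.random.randn(64, 2) * 0.05
loss = np.abs(characteristic_function(pred_pts, freqs)
              - characteristic_function(gt_pts, freqs)).mean()
```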

8. Huang T, Luo T, Yan M, Zhou JT, Goh R. RCT: Resource Constrained Training for Edge AI. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:2575-2587. [PMID: 35998171] [DOI: 10.1109/tnnls.2022.3190451]
Abstract
Efficient neural network training is essential for in situ training of edge artificial intelligence (AI) and for carbon footprint reduction in general. Training neural networks on the edge is challenging because there is a large gap between the limited resources of edge devices and the resource requirements of current training methods. Existing training methods are based on the assumption that the underlying computing infrastructure has sufficient memory and energy supplies. They involve two copies of the model parameters, which is usually beyond the capacity of on-chip memory in processors, and the data movement between off-chip and on-chip memory consumes large amounts of energy. We propose resource constrained training (RCT) to realize resource-efficient training for edge devices and servers. RCT keeps only a quantized model throughout training, so that the memory requirement for model parameters is reduced. It adjusts the per-layer bitwidth dynamically to save energy when a model can learn effectively with lower precision. We carry out experiments with representative models and tasks in image classification, natural language processing, and crowd counting applications. Experiments show that, on average, an 8-15-bit weight update is sufficient for achieving SOTA performance in these applications. RCT saves 63.5%-80% of the memory for model parameters and saves more energy for communication. Through experiments, we observe that the common practice on the first/last layer in model compression does not apply to efficient training. Also, interestingly, the more challenging a dataset is, the lower the bitwidth required for efficient training.
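A minimal sketch of the core idea of keeping only a quantized copy of the weights during training, with a per-layer bitwidth that could be adjusted over time; the bitwidth, learning rate, and rounding rule are illustrative assumptions, not the RCT algorithm.

```python
import numpy as np

def quantize(w, bits):
    # Symmetric uniform quantization of a weight tensor to `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax if np.max(np.abs(w)) > 0 else 1.0
    return np.round(w / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 128)).astype(np.float32)
grad = rng.normal(size=w.shape).astype(np.float32)
bits = 8                                  # per-layer bitwidth, adjusted dynamically in RCT
w = quantize(w - 0.01 * grad, bits)       # update, then re-quantize; no FP32 master copy kept
```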

9. Zhai W, Gao M, Li Q, Jeon G, Anisetti M. FPANet: feature pyramid attention network for crowd counting. Applied Intelligence 2023. [DOI: 10.1007/s10489-023-04499-3]

10. Ren G, Lu X, Wang J, Li Y. Enhancement of Local Crowd Location and Count: Multiscale Counting Guided by Head RGB-Mask. Computational Intelligence and Neuroscience 2022; 2022:5708807. [PMID: 36059394] [PMCID: PMC9433205] [DOI: 10.1155/2022/5708807]
Abstract
Background: In crowded images, traditional detection models often suffer from inaccurate multiscale target counts and low recall. Methods: To solve these two problems, this paper proposes an MLP-CNN model that, combined with an FPN feature pyramid, can fuse low-resolution and high-resolution semantic feature maps with less computation and can effectively solve the problem of inaccurate head counts for multiscale crowds. The MLP-CNN "mid-term" fusion model can effectively fuse the features of the RGB head image and the RGB-Mask image. With the help of head RGB-Mask annotations and adaptive Gaussian kernel regression, an enhanced density map can be generated, which effectively addresses the low recall of head detection. Results: The MLP-CNN model was evaluated on ShanghaiTech, UCF_CC_50, and UCF-QNRF. The test results show that the error of the proposed method is significantly reduced, and the recall rate reaches 79.91%. Conclusion: The MLP-CNN model not only improves the accuracy of crowd counting via density map regression but also improves the detection rate of multiscale head targets.
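As background for the adaptive Gaussian kernel regression mentioned above, the sketch below generates a density map from head points with a geometry-adaptive sigma tied to the mean distance to each head's nearest neighbors; the parameter values are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import cKDTree

def adaptive_density_map(points, shape, k=3, beta=0.3):
    # points: (N, 2) array of (x, y) head locations; shape: (H, W) of the output map.
    density = np.zeros(shape, dtype=np.float32)
    if len(points) == 0:
        return density
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=min(k + 1, len(points)))
    for p, d in zip(points, dists):
        sigma = beta * d[1:].mean() if len(points) > 1 else 15.0   # geometry-adaptive sigma
        impulse = np.zeros(shape, dtype=np.float32)
        y, x = int(round(p[1])), int(round(p[0]))
        if 0 <= y < shape[0] and 0 <= x < shape[1]:
            impulse[y, x] = 1.0
            density += gaussian_filter(impulse, sigma)
    return density                              # integrates approximately to the head count

pts = np.random.rand(50, 2) * 256
dm = adaptive_density_map(pts, (256, 256))
print(dm.sum())                                 # close to 50 (slightly less near borders)
```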

Affiliation(s)
- Guoyin Ren: School of Mechanical Engineering and School of Information Engineering, Inner Mongolia University of Science & Technology, Baotou 014010, China
- Xiaoqi Lu: School of Mechanical Engineering, Inner Mongolia University of Science & Technology, Baotou 014010, China; Inner Mongolia University of Technology, Hohhot 010051, China
- Jingyu Wang: School of Information Engineering, Inner Mongolia University of Science & Technology, Baotou 014010, China
- Yuhao Li: School of Information Engineering, Inner Mongolia University of Science & Technology, Baotou 014010, China

11. Krishnan AM, Bouazizi M, Ohtsuki T. An Infrared Array Sensor-Based Approach for Activity Detection, Combining Low-Cost Technology with Advanced Deep Learning Techniques. Sensors 2022; 22:3898. [PMID: 35632305] [PMCID: PMC9145665] [DOI: 10.3390/s22103898]
Abstract
In this paper, we propose an activity detection system using a 24 × 32 resolution infrared array sensor placed on the ceiling. We first collect the data at different resolutions (i.e., 24 × 32, 12 × 16, and 6 × 8) and apply the advanced deep learning (DL) techniques of Super-Resolution (SR) and denoising to enhance the quality of the images. We then classify the images/sequences of images according to the activities the subject is performing, using a hybrid deep learning model combining a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network. We use data augmentation to improve the training of the neural networks by incorporating a wider variety of samples; the augmentation is performed by a Conditional Generative Adversarial Network (CGAN). By enhancing the images using SR, removing the noise, and adding more training samples via data augmentation, our target is to improve the classification accuracy of the neural network. Through experiments, we show that applying these deep learning techniques to low-resolution noisy infrared images leads to a noticeable improvement in performance. The classification accuracy improved from 78.32% to 84.43% (for images with 6 × 8 resolution) and from 90.11% to 94.54% (for images with 12 × 16 resolution) when we used the CNN and CNN + LSTM networks, respectively.
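A minimal sketch of a hybrid CNN + LSTM classifier over sequences of low-resolution infrared frames, in the spirit of the pipeline described above; the layer sizes, sequence length, and class count are illustrative assumptions, not the authors' network.

```python
import torch
import torch.nn as nn

class CnnLstmClassifier(nn.Module):
    def __init__(self, num_classes=6, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, frames):
        # frames: (B, T, 1, H, W) sequence of thermal frames, e.g. 24 x 32 pixels
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)   # per-frame CNN features
        out, _ = self.lstm(feats)                               # temporal modeling
        return self.head(out[:, -1])                            # classify from the last state

logits = CnnLstmClassifier()(torch.randn(2, 8, 1, 24, 32))
```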

Affiliation(s)
- Mondher Bouazizi: Faculty of Science and Technology, Keio University, Yokohama 223-8522, Japan
- Tomoaki Ohtsuki: Faculty of Science and Technology, Keio University, Yokohama 223-8522, Japan

12. Sindagi VA, Yasarla R, Patel VM. JHU-CROWD++: Large-Scale Crowd Counting Dataset and A Benchmark Method. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:2594-2609. [PMID: 33147141] [DOI: 10.1109/tpami.2020.3035969]
Abstract
We introduce a new large-scale unconstrained crowd counting dataset (JHU-CROWD++) that contains 4,372 images with 1.51 million annotations. In comparison to existing datasets, the proposed dataset is collected under a variety of diverse scenarios and environmental conditions. Specifically, the dataset includes several images with weather-based degradations and illumination variations, making it a very challenging dataset. Additionally, the dataset provides a rich set of annotations at both the image level and the head level. Several recent methods are evaluated and compared on this dataset. The dataset can be downloaded from http://www.crowd-counting.com. Furthermore, we propose a novel crowd counting network that progressively generates crowd density maps via residual error estimation. The proposed method uses VGG16 as the backbone network and employs the density map generated by the final layer as a coarse prediction, which is refined into finer density maps in a progressive fashion using residual learning. Additionally, the residual learning is guided by an uncertainty-based confidence weighting mechanism that permits the flow of only high-confidence residuals in the refinement path. The proposed Confidence Guided Deep Residual Counting Network (CG-DRCN) is evaluated on recent complex datasets and achieves significant improvements in error.
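The sketch below illustrates one refinement step of the kind described: a coarse density map is corrected by a predicted residual gated by a confidence map, so only high-confidence corrections flow through; the module names and layer choices are illustrative assumptions, not the CG-DRCN implementation.

```python
import torch
import torch.nn as nn

class ResidualRefine(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Conv2d(channels, 1, 3, padding=1)    # predicted correction
        self.confidence = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, coarse_density, feat):
        # coarse_density: (B, 1, H, W); feat: (B, C, H, W) at the same resolution
        r = self.residual(feat)
        c = self.confidence(feat)
        return coarse_density + c * r          # confidence-weighted residual update

feat = torch.randn(2, 64, 64, 64)
coarse = torch.rand(2, 1, 64, 64)
finer = ResidualRefine(64)(coarse, feat)
```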

13. Liu X, Hu Y, Zhang B, Zhen X, Luo X, Cao X. Attentive encoder-decoder networks for crowd counting. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.11.087]

14. Bhuiyan MR, Abdullah J, Hashim N, Farid FA, Uddin J, Abdullah N, Samsudin MA. Crowd density estimation using deep learning for Hajj pilgrimage video analytics. F1000Res 2021; 10:1190. [PMID: 35136582] [PMCID: PMC8787568] [DOI: 10.12688/f1000research.73156.2]
Abstract
Background: This paper focuses on advances in crowd control research, with an emphasis on high-density crowds, particularly Hajj crowds. Video analysis and visual surveillance have become increasingly important for enhancing the safety and security of pilgrimages in Makkah, Saudi Arabia. Hajj is considered a particularly distinctive event, with hundreds of thousands of people gathering in a small space, which does not allow precise analysis of video footage using advanced video and computer vision algorithms. This research proposes an algorithm based on a convolutional neural network model specifically for Hajj applications. Additionally, the work introduces a system for counting and then estimating crowd density. Methods: The model adopts an architecture that detects each person in the crowd, marks the head location with a bounding box, and performs counting on our own novel dataset (HAJJ-Crowd). Results: Our algorithm outperforms the state-of-the-art method, attaining a remarkable mean absolute error of 200 (an average improvement of 82.0) and a mean squared error of 240 (an average improvement of 135.54). Conclusions: For our new HAJJ-Crowd dataset, we provide density maps and prediction results of several standard methods for evaluation and testing.

Affiliation(s)
- Md Roman Bhuiyan: FCI, Multimedia University, Persiaran Multimedia, Cyberjaya, 63100, Malaysia
- Junaidi Abdullah: FCI, Multimedia University, Persiaran Multimedia, Cyberjaya, 63100, Malaysia
- Noramiza Hashim: FCI, Multimedia University, Persiaran Multimedia, Cyberjaya, 63100, Malaysia
- Fahmid Al Farid: FCI, Multimedia University, Persiaran Multimedia, Cyberjaya, 63100, Malaysia
- Jia Uddin: Technology Studies Department, Endicott College, Woosong University, Daejeon, 100-300, South Korea
- Norra Abdullah: WSA Venture Australia (M) Sdn Bhd, Cyberjaya, 63000, Malaysia

15. Li P, Zhang M, Wan J, Jiang M. Multiscale Aggregate Networks with Dense Connections for Crowd Counting. Computational Intelligence and Neuroscience 2021; 2021:9996232. [PMID: 34804153] [PMCID: PMC8601827] [DOI: 10.1155/2021/9996232]
Abstract
The most advanced method for crowd counting uses a fully convolutional network that extracts image features and then generates a crowd density map. However, this process often encounters multiscale and contextual loss problems. To address these problems, we propose a multiscale aggregation network (MANet) that includes a feature extraction encoder (FEE) and a density map decoder (DMD). The FEE uses a cascaded scale pyramid network to extract multiscale features and obtains contextual features through dense connections. The DMD uses deconvolution and fusion operations to generate features containing detailed information. These features can be further converted into high-quality density maps to accurately calculate the number of people in a crowd. An empirical comparison using four mainstream datasets (ShanghaiTech, WorldExpo'10, UCF_CC_50, and SmartCity) shows that the proposed method is more effective in terms of the mean absolute error and mean squared error. The source code is available at https://github.com/lpfworld/MANet.

Affiliation(s)
- Pengfei Li, Min Zhang, Jian Wan, Ming Jiang: Hangzhou Dianzi University, Baiyang Road No. 2, Hangzhou, China

16.

17. Peng S, Yin B, Hao X, Yang Q, Kumar A, Wang L. Depth and edge auxiliary learning for still image crowd density estimation. Pattern Analysis and Applications 2021. [DOI: 10.1007/s10044-021-01017-4]

18. A Wide Area Multiview Static Crowd Estimation System Using UAV and 3D Training Simulator. Remote Sensing 2021. [DOI: 10.3390/rs13142780]
Abstract
Crowd size estimation is a challenging problem, especially when the crowd is spread over a significant geographical area. It has applications in the monitoring of rallies and demonstrations and in calculating assistance requirements in humanitarian disasters. Accomplishing a crowd surveillance system for large crowds therefore constitutes a significant challenge. UAV-based techniques are an appealing choice for crowd estimation over a large region, but they present a variety of interesting challenges, such as integrating per-frame estimates through a video without counting individuals twice. Large quantities of annotated training data are required to design, train, and test such a system. In this paper, we first review several crowd estimation techniques, existing crowd simulators, and datasets available for crowd analysis. We then describe a simulation system that provides such data, avoiding the need for tedious and error-prone manual annotation, and evaluate synthetic video from the simulator using various existing single-frame crowd estimation techniques. Our findings show that the simulated data can be used to train and test crowd estimation methods, thereby providing a suitable platform to develop such techniques. We also propose an automated UAV-based 3D crowd estimation system that can be used for approximately static or slow-moving crowds, such as public events, political rallies, and natural or man-made disasters. We evaluate the results by applying our new framework to a variety of scenarios with varying crowd sizes. The proposed system gives promising results on widely accepted metrics, including MAE, RMSE, Precision, Recall, and F1 score.

19. Wang Y, Hou J, Hou X, Chau LP. A Self-Training Approach for Point-Supervised Object Detection and Counting in Crowds. IEEE Transactions on Image Processing 2021; 30:2876-2887. [PMID: 33539297] [DOI: 10.1109/tip.2021.3055632]
Abstract
In this article, we propose a novel self-training approach named Crowd-SDNet that enables a typical object detector trained only with point-level annotations (i.e., objects are labeled with points) to estimate both the center points and sizes of crowded objects. Specifically, during training, we utilize the available point annotations to directly supervise the estimation of the center points of objects. Based on a locally-uniform distribution assumption, we initialize pseudo object sizes from the point-level supervisory information, which are then leveraged to guide the regression of object sizes via a crowdedness-aware loss. Meanwhile, we propose a confidence and order-aware refinement scheme that continuously refines the initial pseudo object sizes, so that the detector becomes increasingly able to detect and count objects in crowds simultaneously. Moreover, to address extremely crowded scenes, we propose an effective decoding method to improve the detector's representation ability. Experimental results on the WiderFace benchmark show that our approach significantly outperforms state-of-the-art point-supervised methods on both detection and counting tasks, i.e., our method improves the average precision by more than 10% and reduces the counting error by 31.2%. Besides, our method obtains the best results on crowd counting and localization datasets (i.e., ShanghaiTech and NWPU-Crowd) and vehicle counting datasets (i.e., CARPK and PUCPR+) compared with state-of-the-art counting-by-detection methods. The code will be publicly available at https://github.com/WangyiNTU/Point-supervised-crowd-detection.
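A minimal sketch of initializing pseudo object sizes from point-level labels under a locally-uniform distribution assumption, using each point's nearest-neighbor distance; the scaling factor and clipping bounds are illustrative assumptions, not the paper's exact initialization.

```python
import numpy as np
from scipy.spatial import cKDTree

def pseudo_sizes(points, scale=0.5, min_size=4.0, max_size=96.0):
    # points: (N, 2) annotated object centers; returns one pseudo size per point.
    if len(points) < 2:
        return np.full(len(points), min_size)
    d, _ = cKDTree(points).query(points, k=2)    # d[:, 1] = nearest-neighbor distance
    return np.clip(scale * d[:, 1], min_size, max_size)

heads = np.random.rand(200, 2) * 1024
sizes = pseudo_sizes(heads)   # initial box sizes, to be refined by the self-training loop
```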

20. Sun Y, Jin J, Wu X, Ma T, Yang J. Counting Crowds with Perspective Distortion Correction via Adaptive Learning. Sensors 2020; 20:3781. [PMID: 32640552] [PMCID: PMC7374275] [DOI: 10.3390/s20133781]
Abstract
The goal of crowd counting is to estimate the number of people in an image. At present, regression-based counting is the mainstream approach, and with the development of convolutional neural networks (CNNs), CNN-based methods have become a research hotspot. Locating each person in the image is a more interesting problem than simply predicting the number of people. Perspective distortion is still a challenge, because it causes the apparent size of people in the crowd to vary across the image. To address perspective distortion and locate people more accurately, we design a novel framework named the Adaptive Learning Network (CAL). We use VGG as the backbone. After each pooling layer, we collect features at 1/2, 1/4, 1/8, and 1/16 of the original image resolution and combine them with weights learned by an adaptive learning branch. The adaptive learning branch operates on each image individually; by combining the output features of different sizes for each image, the challenge of drastic changes in crowd size caused by perspective transformation is reduced. We conducted experiments on four crowd counting datasets (i.e., ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50, and UCF-QNRF), and the results show that our model performs well.
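The sketch below fuses multi-scale backbone features (1/2, 1/4, 1/8, 1/16) with per-image weights predicted by a small adaptive branch, in the spirit of CAL; the weight-branch design and the fusion rule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    def __init__(self, channels, num_scales=4):
        super().__init__()
        self.weight_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, num_scales), nn.Softmax(dim=1))

    def forward(self, feats):
        # feats: list of (B, C, Hi, Wi) maps at 1/2, 1/4, 1/8, 1/16 resolution
        target = feats[0].shape[-2:]
        up = [F.interpolate(f, size=target, mode='bilinear', align_corners=False) for f in feats]
        w = self.weight_branch(feats[-1])                       # per-image scale weights
        return sum(w[:, i, None, None, None] * up[i] for i in range(len(up)))

feats = [torch.randn(2, 32, s, s) for s in (128, 64, 32, 16)]
fused = AdaptiveFusion(32)(feats)
```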