1
|
Guo M, Chen B, Yan Z, Wang Y, Ye Q. Virtual Classification: Modulating Domain-Specific Knowledge for Multidomain Crowd Counting. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:2958-2972. [PMID: 38241099 DOI: 10.1109/tnnls.2024.3350363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/21/2024]
Abstract
Multidomain crowd counting aims to learn a general model for multiple diverse datasets. However, deep networks prefer modeling distributions of the dominant domains instead of all domains, which is known as domain bias. In this study, we propose a simple-yet-effective modulating domain-specific knowledge network (MDKNet) to handle the domain bias issue in multidomain crowd counting. MDKNet employs the idea of "modulating," enabling a deep network to balance and model the different distributions of diverse datasets with little bias. Specifically, we propose an instance-specific batch normalization (IsBN) module, which serves as a base modulator to refine the information flow to be adaptive to domain distributions. To precisely modulate the domain-specific information, the domain-guided virtual classifier (DVC) is then introduced to learn a domain-separable latent space. This space is employed as an input guidance for the IsBN modulator, such that the mixture distributions of multiple datasets can be well treated. Extensive experiments performed on popular benchmarks, including Shanghai-tech A/B, QNRF, and NWPU, validate the superiority of MDKNet in tackling multidomain crowd counting and its effectiveness for multidomain learning. Code is available at https://github.com/csguomy/MDKNet.
|
2
|
Wu Y, Li R, Qin Z, Zhao X, Li X. HeightFormer: Explicit Height Modeling Without Extra Data for Camera-Only 3D Object Detection in Bird's Eye View. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2025; 34:689-700. [PMID: 39250369 DOI: 10.1109/tip.2024.3427701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/11/2024]
Abstract
Vision-based Bird's Eye View (BEV) representation is an emerging perception formulation for autonomous driving. The core challenge is to construct BEV space with multi-camera features, which is a one-to-many ill-posed problem. Diving into all previous BEV representation generation methods, we found that most of them fall into two types: modeling depths in image views or modeling heights in the BEV space, mostly in an implicit way. In this work, we propose to explicitly model heights in the BEV space, which needs no extra data like LiDAR and can fit arbitrary camera rigs and types compared to modeling depths. Theoretically, we give proof of the equivalence between height-based methods and depth-based methods. Considering the equivalence and some advantages of modeling heights, we propose HeightFormer, which models heights and uncertainties in a self-recursive way. Without any extra data, the proposed HeightFormer could estimate heights in BEV accurately. Benchmark results show that the performance of HeightFormer achieves SOTA compared with those camera-only methods.
|
3
|
Gao J, Huang Z, Lei Y, Shan H, Wang JZ, Wang FY, Zhang J. Deep Rank-Consistent Pyramid Model for Enhanced Crowd Counting. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:299-312. [PMID: 38090870 DOI: 10.1109/tnnls.2023.3336774] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 03/06/2025]
Abstract
Most conventional crowd counting methods utilize a fully-supervised learning framework to establish a mapping between scene images and crowd density maps. They usually rely on a large quantity of costly and time-intensive pixel-level annotations for training supervision. One way to mitigate the intensive labeling effort and improve counting accuracy is to leverage large amounts of unlabeled images. This is attributed to the inherent self-structural information and rank consistency within a single image, offering additional qualitative relation supervision during training. Contrary to earlier methods that utilized the rank relations at the original image level, we explore such rank-consistency relation within the latent feature spaces. This approach enables the incorporation of numerous pyramid partial orders, strengthening the model representation capability. A notable advantage is that it can also increase the utilization ratio of unlabeled samples. Specifically, we propose a Deep Rank-consistEnt pyrAmid Model (DREAM), which makes full use of rank consistency across coarse-to-fine pyramid features in latent spaces for enhanced crowd counting with massive unlabeled images. In addition, we have collected a new unlabeled crowd counting dataset, FUDAN-UCC, comprising 4000 images for training purposes. Extensive experiments on four benchmark datasets, namely UCF-QNRF, ShanghaiTech PartA and PartB, and UCF-CC-50, show the effectiveness of our method compared with previous semi-supervised methods. The codes are available at https://github.com/bridgeqiqi/DREAM.
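The rank-consistency idea in this abstract can be illustrated at the image level: a region nested inside a larger region cannot contain more people than the larger one, so an unlabeled image still yields an ordering constraint. The following is a minimal Python sketch under that assumption only. The names `region_count` and `rank_loss` and the `margin` parameter are hypothetical, and DREAM itself applies the constraint to pyramid features in latent space rather than to raw density maps.

```python
def region_count(density, top, left, height, width):
    """Sum a density map (a list of rows) over a rectangular region."""
    return sum(
        density[r][c]
        for r in range(top, top + height)
        for c in range(left, left + width)
    )


def rank_loss(density, outer, inner, margin=0.0):
    """Hinge penalty that is positive only when the nested (inner) region
    is predicted to hold more people than the region that encloses it.

    `outer` and `inner` are (top, left, height, width) tuples; the caller
    is assumed to pass an `inner` box that lies inside `outer`.
    """
    outer_count = region_count(density, *outer)
    inner_count = region_count(density, *inner)
    return max(0.0, inner_count - outer_count + margin)


# A consistent prediction: the inner box holds fewer people, so no penalty.
consistent = [[0.1, 0.2], [0.3, 0.4]]
loss_ok = rank_loss(consistent, outer=(0, 0, 2, 2), inner=(0, 0, 1, 1))

# An inconsistent prediction (a negative density value) is penalized.
inconsistent = [[1.0, -0.5], [0.0, 0.0]]
loss_bad = rank_loss(inconsistent, outer=(0, 0, 2, 2), inner=(0, 0, 1, 1))
```

Because the penalty needs no ground-truth count, it can be computed on unlabeled images, which is what makes this kind of supervision attractive in the semi-supervised setting.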
|
4
|
Chen Z, Zhang S, Zheng X, Zhao X, Kong Y. Crowd Counting Based on Multiscale Spatial Guided Perception Aggregation Network. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:17465-17478. [PMID: 37610898 DOI: 10.1109/tnnls.2023.3304348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/25/2023]
Abstract
Crowd counting has received extensive attention in the field of computer vision, and methods based on deep convolutional neural networks (CNNs) have made great progress in this task. However, challenges such as scale variation, nonuniform distribution, complex background, and occlusion in crowded scenes hinder the performance of these networks in crowd counting. In order to overcome these challenges, this article proposes a multiscale spatial guidance perception aggregation network (MGANet) to achieve efficient and accurate crowd counting. MGANet consists of three parts: multiscale feature extraction network (MFEN), spatial guidance network (SGN), and attention fusion network (AFN). Specifically, to alleviate the scale variation problem in crowded scenes, MFEN is introduced to enhance the scale adaptability and effectively capture multiscale features in scenes with drastic scale variation. To address the challenges of nonuniform distribution and complex background in population, an SGN is proposed. The SGN includes two parts: the spatial context network (SCN) and the guidance perception network (GPN). SCN is used to capture the detailed semantic information between the multiscale feature positions extracted by MFEN, and improve the ability of deep structured information exploration. At the same time, the dependence relationship between the spatial remote context is established to enhance the receptive field. GPN is used to enhance the information exchange between channels and guide the network to select appropriate multiscale features and spatial context semantic features. AFN is used to adaptively measure the importance of the above different features, and obtain accurate and effective feature representations from them. In addition, this article proposes a novel region-adaptive loss function, which optimizes the regions with large recognition errors in the image, and alleviates the inconsistency between the training target and the evaluation metric. 
In order to evaluate the performance of the proposed method, extensive experiments were carried out on challenging benchmarks including ShanghaiTech Part A and Part B, UCF-CC-50, UCF-QNRF, and JHU-CROWD++. Experimental results show that the proposed method has good performance on all four datasets. Especially on the ShanghaiTech Part A and Part B, UCF-QNRF, and JHU-CROWD++ datasets, compared with the state-of-the-art methods, our proposed method achieves superior recognition performance and better robustness.
|
5
|
Zhu J, Zhao W, Yao L, He Y, Hu M, Zhang X, Wang S, Li T, Lu H. Confusion Region Mining for Crowd Counting. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:18039-18051. [PMID: 37713223 DOI: 10.1109/tnnls.2023.3311020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/16/2023]
Abstract
Existing works mainly focus on the crowd and ignore the confusion regions in the background whose appearance is extremely similar to that of a crowd, while crowd counting needs to handle both at the same time. To address this issue, we propose a novel end-to-end trainable confusion region discriminating and erasing network called CDENet. Specifically, CDENet is composed of two modules: a confusion region mining module (CRM) and a guided erasing module (GEM). CRM consists of a basic density estimation (BDE) network, a confusion region aware bridge, and a confusion region discriminating network. The BDE network first generates a primary density map, and then the confusion region aware bridge excavates the confusion regions by comparing the primary prediction result with the ground-truth density map. Finally, the confusion region discriminating network learns the difference between feature representations in confusion regions and crowds. Furthermore, GEM gives the refined density map by erasing the confusion regions. We evaluate the proposed method on four crowd counting benchmarks, including ShanghaiTech Part_A, ShanghaiTech Part_B, UCF_CC_50, and UCF-QNRF, and our CDENet achieves superior performance compared with state-of-the-art methods.
|
6
|
Ma C, Neri F, Gu L, Wang Z, Wang J, Qing A, Wang Y. Crowd Counting Using Meta-Test-Time Adaptation. Int J Neural Syst 2024; 34:2450061. [PMID: 39252679 DOI: 10.1142/s0129065724500618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/11/2024]
Abstract
Machine learning algorithms are commonly used for quickly and efficiently counting people in a crowd. Test-time adaptation methods for crowd counting adjust model parameters and employ additional data augmentation to better adapt the model to the specific conditions encountered during testing. The majority of current studies concentrate on unsupervised domain adaptation. These approaches commonly perform hundreds of epochs of training iterations, requiring a sizable amount of unannotated data from every new target domain in addition to annotated data from the source domain. Unlike these methods, we propose a meta-test-time adaptive crowd counting approach called CrowdTTA, which integrates the concept of test-time adaptation into the meta-learning framework and makes it easier for the counting model to adapt to unknown test distributions. To obtain a reliable supervision signal at the pixel level, we introduce uncertainty by inserting a dropout layer into the counting model. The uncertainty is then used to generate valuable pseudo labels, serving as effective supervisory signals for adapting the model. In the context of meta-learning, one image can be regarded as one task for crowd counting. In each iteration, our approach performs a dual-level optimization process. In the inner update, we employ a self-supervised consistency loss function to optimize the model so as to simulate the parameter update process that occurs during the test phase. In the outer update, we actually update the parameters based on the image with ground truth, improving the model's performance and making the pseudo labels more accurate in the next iteration. At test time, the input image is used to adapt the model before the image is tested. In comparison to various supervised learning and domain adaptation methods, our results via extensive experiments on diverse datasets showcase the general adaptive capability of our approach across datasets with varying crowd densities and scales.
Affiliation(s)
- Chaoqun Ma
  - School of Electrical Engineering, Southwest Jiaotong University, Chengdu 611756, P. R. China
- Ferrante Neri
  - NICE Group, School of Computer Science and Electronic Engineering, University of Surrey, Guildford, Surrey GU2 7XH, UK
- Li Gu
  - Department of Computer Science and Software Engineering, Concordia University, Montreal, QC H3H 2L9, Canada
- Ziqiang Wang
  - Department of Computer Science and Software Engineering, Concordia University, Montreal, QC H3H 2L9, Canada
- Jian Wang
  - Faculty of Electric Power Engineering, Kunming University of Science and Technology, Kunming 650500, P. R. China
- Anyong Qing
  - School of Electrical Engineering, Southwest Jiaotong University, Chengdu 611756, P. R. China
- Yang Wang
  - Department of Computer Science and Software Engineering, Concordia University, Montreal, QC H3H 2L9, Canada
|
7
|
Chen J, Shi X, Zhang H, Li W, Li P, Yao Y, Miyazawa S, Song X, Shibasaki R. MobCovid: Confirmed Cases Dynamics Driven Time Series Prediction of Crowd in Urban Hotspot. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:13397-13410. [PMID: 37200115 DOI: 10.1109/tnnls.2023.3268291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/20/2023]
Abstract
Monitoring crowds in urban hot spots has been an important research topic in the field of urban management and has high social impact. It allows more flexible allocation of public resources, such as public transportation schedule adjustment and the arrangement of police forces. Since 2020, because of the COVID-19 epidemic, public mobility patterns have been deeply affected by the epidemic situation, as close physical contact is the dominant route of infection. In this study, we propose MobCovid, a confirmed-case-driven time-series prediction model for crowds in urban hot spots. The model is derived from Informer, a popular time-series prediction model proposed in 2021. It takes both the number of people staying downtown at night and the number of confirmed COVID-19 cases as input and predicts both targets. In the current period of COVID, many areas and countries have relaxed lockdown measures on public mobility, and outdoor travel is based on individual decisions. Reports of large numbers of confirmed cases would restrict public visits to crowded downtown areas. Still, governments publish policies to try to intervene in public mobility and control the spread of the virus. For example, in Japan, there are no compulsory measures to force people to stay at home, but there are measures to persuade people to stay away from downtown areas. Therefore, we also merge an encoding of government policies on mobility restriction into the model to improve precision. We use historical data on nighttime populations in crowded downtown areas and confirmed cases in the Tokyo and Osaka areas as a case study. Repeated comparisons with other baselines, including the original Informer model, demonstrate the effectiveness of our proposed method. We believe our work can contribute to current knowledge on forecasting crowd sizes in urban downtown areas during the COVID epidemic.
|
8
|
Dong L, Zhang H, Ma J, Xu X, Yang Y, Wu QMJ. CLRNet: A Cross Locality Relation Network for Crowd Counting in Videos. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:6408-6422. [PMID: 36215378 DOI: 10.1109/tnnls.2022.3209918] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
In this article, we propose a new cross locality relation network (CLRNet) to generate high-quality crowd density maps for crowd counting in videos. Specifically, a cross locality relation module (CLRM) is proposed to enhance feature representations by modeling local dependencies of pixels between adjacent frames with an adapted local self-attention mechanism. First, different from the existing methods which measure similarity between pixels by dot product, a new adaptive cosine similarity is advanced to measure the relationship between two positions. Second, the traditional self-attention modules usually integrate the reconstructed features with the same weights for all the positions. However, crowd movement and background changes in a video sequence are uneven in real-life applications. As a consequence, it is inappropriate to treat all the positions in reconstructed features equally. To address this issue, a scene consistency attention map (SCAM) is developed to make CLRM pay more attention to the positions with strong correlations in adjacent frames. Furthermore, CLRM is incorporated into the network in a coarse-to-fine way to further enhance the representational capability of features. Experimental results demonstrate the effectiveness of our proposed CLRNet in comparison to the state-of-the-art methods on four public video datasets. The codes are available at: https://github.com/Amelie01/CLRNet.
|
9
|
Lu H, Liu L, Wang H, Cao Z. Counting Crowd by Weighing Counts: A Sequential Decision-Making Perspective. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024; 35:5141-5154. [PMID: 36094991 DOI: 10.1109/tnnls.2022.3202652] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
We show that crowd counting can be formulated as a sequential decision-making (SDM) problem. Inspired by human counting, we evade one-step estimation mostly executed in existing counting models and decompose counting into sequential sub-decision problems. During implementation, a key insight is to interpret sequential counting as a physical process in reality: scale weighing. This analogy allows us to implement a novel "counting scale" termed LibraNet. Our idea is that, by placing a crowd image on the scale, LibraNet (agent) learns to place appropriate weights to match the count: at each step, one weight (action) is chosen from the weight box (the predefined action pool) conditioned on the image features and the placed weights (state) until the pointer (the agent output) informs balance. We investigate two forms of state definition and explore four types of LibraNet implementations under different learning paradigms, including deep Q-network (DQN), actor-critic (AC), imitation learning (IL), and mixed AC+IL. Experiments show that LibraNet indeed mimics scale weighing, that it outperforms or performs comparably against state-of-the-art approaches on five crowd counting benchmarks, that it can be used as a plug-in to improve off-the-shelf counting models, and particularly that it demonstrates remarkable cross-dataset generalization. Code and models are available at https://git.io/libranet.
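The scale-weighing analogy in this abstract can be made concrete with a toy loop that places one weight per step until the scale balances. This is only a sketch of the sequential process, not LibraNet: the real agent never sees the true count and chooses actions from image features and previously placed weights, whereas this illustration uses a greedy oracle that reads the true count directly. The function name `weigh` and the contents of `weight_box` are assumptions made for the example.

```python
def weigh(count, weight_box=(-10, -5, -2, -1, 1, 2, 5, 10), max_steps=100):
    """Place weights one at a time until their sum matches `count`.

    Returns the accumulated total and the ordered list of placed weights.
    A greedy oracle stands in for the learned policy: at each step it picks
    the weight that leaves the smallest remaining imbalance.
    """
    placed = []
    total = 0
    for _ in range(max_steps):
        residual = count - total
        if residual == 0:
            # The pointer indicates balance, so the episode ends.
            break
        action = min(weight_box, key=lambda w: abs(residual - w))
        placed.append(action)
        total += action
    return total, placed


total, placed = weigh(23)
```

Negative weights are included in the assumed weight box so that an overshoot can be corrected in a later step, which is the reason a sequential formulation can recover from an early mistake where a one-step estimate cannot.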
|
10
|
Luo Y, Lu J, Jiang X, Zhang B. Learning From Architectural Redundancy: Enhanced Deep Supervision in Deep Multipath Encoder-Decoder Networks. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:4271-4284. [PMID: 33587717 DOI: 10.1109/tnnls.2021.3056384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Deep encoder-decoders are the model of choice for pixel-level estimation due to their redundant deep architectures. Yet they still suffer from the vanishing supervision information issue that affects convergence because of their overly deep architectures. In this work, we propose and theoretically derive an enhanced deep supervision (EDS) method which improves on conventional deep supervision (DS) by incorporating variance minimization into the optimization. A new structure variance loss is introduced to build a bridge between deep encoder-decoders and variance minimization, and provides a new way to minimize the variance by forcing different intermediate decoding outputs (paths) to reach an agreement. We also design a focal weighting strategy to effectively combine multiple losses in a scale-balanced way, so that the supervision information is sufficiently enforced throughout the encoder-decoders. To evaluate the proposed method on the pixel-level estimation task, a novel multipath residual encoder is proposed and extensive experiments are conducted on four challenging density estimation and crowd counting benchmarks. The experimental results demonstrate the superiority of our EDS over other paradigms, and improved estimation performance is reported using our deeply supervised encoder-decoder.
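The abstract describes a loss that pushes different intermediate decoding outputs toward agreement by minimizing their variance. Its exact form is not given here, so the following Python sketch shows only one plausible reading: the per-pixel variance across path outputs, averaged over pixels. The function name `structure_variance` and the flattened-list representation of each output are assumptions for illustration.

```python
def structure_variance(path_outputs):
    """Mean per-pixel variance across the outputs of several decoding paths.

    `path_outputs` is a list of equally long lists, one flattened density
    map per path. The result is 0.0 when all paths agree exactly.
    """
    n_paths = len(path_outputs)
    n_pixels = len(path_outputs[0])
    total = 0.0
    for i in range(n_pixels):
        values = [output[i] for output in path_outputs]
        mean = sum(values) / n_paths
        # Population variance of this pixel across paths.
        total += sum((v - mean) ** 2 for v in values) / n_paths
    return total / n_pixels


agreeing = structure_variance([[0.2, 0.4], [0.2, 0.4]])
disagreeing = structure_variance([[0.0, 2.0], [2.0, 2.0]])
```

In a training setup this term would be added to the per-path supervised losses, so that paths are penalized both for being wrong and for disagreeing with one another.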
|
11
|
Meng C, Kang C, Lyu L. Hierarchical feature aggregation network with semantic attention for counting large‐scale crowd. INT J INTELL SYST 2022. [DOI: 10.1002/int.23023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Affiliation(s)
- Chen Meng
  - School of Information Science and Engineering, Shandong Normal University, Jinan, Shandong, China
- Chunmeng Kang
  - School of Information Science and Engineering, Shandong Normal University, Jinan, Shandong, China
  - Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Jinan, Shandong, China
- Lei Lyu
  - School of Information Science and Engineering, Shandong Normal University, Jinan, Shandong, China
  - Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Jinan, Shandong, China
|
12
|
Wang Q, Han T, Gao J, Yuan Y. Neuron Linear Transformation: Modeling the Domain Shift for Crowd Counting. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:3238-3250. [PMID: 33502985 DOI: 10.1109/tnnls.2021.3051371] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Indexed: 06/12/2023]
Abstract
Cross-domain crowd counting (CDCC) is a hot topic due to its importance in public safety. The purpose of CDCC is to alleviate the domain shift between the source and target domain. Recently, typical methods attempt to extract domain-invariant features via image translation and adversarial learning. When it comes to specific tasks, we find that the domain shifts are reflected in model parameters' differences. To describe the domain gap directly at the parameter level, we propose a neuron linear transformation (NLT) method, exploiting domain factor and bias weights to learn the domain shift. Specifically, for a specific neuron of a source model, NLT exploits few labeled target data to learn domain shift parameters. Finally, the target neuron is generated via a linear transformation. Extensive experiments and analysis on six real-world data sets validate that NLT achieves top performance compared with other domain adaptation methods. An ablation study also shows that the NLT is robust and more effective than supervised and fine-tune training. Code is available at https://github.com/taohan10200/NLT.
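The linear transformation described in this abstract maps a source-model neuron to a target-model neuron through a learned domain factor and bias. A minimal Python sketch of that mapping follows; it assumes an element-wise factor and bias per weight, which is an illustrative choice (the granularity used by NLT is not stated in the abstract), and `transform_neuron` is a hypothetical name.

```python
def transform_neuron(source_weights, factor, bias):
    """Generate target-domain weights as factor * source + bias.

    Only `factor` and `bias` would be learned from the few labeled target
    samples; `source_weights` stay fixed from the source model.
    """
    return [f * w + b for w, f, b in zip(source_weights, factor, bias)]


source = [1.0, -2.0]

# With factor 1 and bias 0 the target neuron equals the source neuron,
# a natural starting point before any target data is seen.
identity = transform_neuron(source, factor=[1.0, 1.0], bias=[0.0, 0.0])

# After adaptation the shift parameters move the weights.
adapted = transform_neuron(source, factor=[1.0, 0.5], bias=[0.0, 0.25])
```

Keeping the source weights frozen and learning only the shift parameters limits the number of trainable values, which is consistent with the abstract's claim that few labeled target samples suffice.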
|
13
|
Zhong X, Qin J, Guo M, Zuo W, Lu W. Offset-decoupled deformable convolution for efficient crowd counting. Sci Rep 2022; 12:12229. [PMID: 35851829 PMCID: PMC9293988 DOI: 10.1038/s41598-022-16415-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2022] [Accepted: 07/11/2022] [Indexed: 11/09/2022] Open
Abstract
Crowd counting is considered a challenging issue in computer vision. One of the most critical challenges in crowd counting is considering the impact of scale variations. Compared with other methods, better performance is achieved with CNN-based methods. However, given the limit of fixed geometric structures, the head-scale features are not completely obtained. Deformable convolution with additional offsets is widely used in the fields of image classification and pattern recognition, as it can successfully exploit the potential of spatial information. However, owing to the randomly generated parameters of offsets in network initialization, the sampling points of the deformable convolution are disorderly stacked, weakening the effectiveness of feature extraction. To handle the invalid learning of offsets and the inefficient utilization of deformable convolution, an offset-decoupled deformable convolution (ODConv) is proposed in this paper. It can completely obtain information within the effective region of sampling points, leading to better performance. In extensive experiments, average MAE of 62.3, 8.3, 91.9, and 159.3 are achieved using our method on the ShanghaiTech A, ShanghaiTech B, UCF-QNRF, and UCF_CC_50 datasets, respectively, outperforming the state-of-the-art methods and validating the effectiveness of the proposed ODConv.
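The MAE figures quoted in this abstract are mean absolute errors between predicted and ground-truth counts, averaged over test images. A minimal Python sketch of that standard definition is given below; the function name and the sample counts are made up for illustration and are not taken from the paper.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true counts."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one image is required")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Three hypothetical test images: errors of 12, 6, and 0 people.
mae = mean_absolute_error([100, 250, 40], [112, 244, 40])
```

Because the errors are not normalized by crowd size, MAE values are only comparable within a dataset, which is why the abstract reports one figure per benchmark.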
Affiliation(s)
- Xin Zhong
  - Department of Educational Technology, Ocean University of China, Qingdao, 266100, China
- Jing Qin
  - Department of Educational Technology, Ocean University of China, Qingdao, 266100, China
- Mingyue Guo
  - Department of Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, 100049, China
- Wangmeng Zuo
  - Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
- Weigang Lu
  - Department of Educational Technology, Ocean University of China, Qingdao, 266100, China
|
14
|
|
15
|
Gao J, Yuan Y, Wang Q. Feature-Aware Adaptation and Density Alignment for Crowd Counting in Video Surveillance. IEEE TRANSACTIONS ON CYBERNETICS 2021; 51:4822-4833. [PMID: 33259318 DOI: 10.1109/tcyb.2020.3034316] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/12/2023]
Abstract
With the development of deep neural networks, the performance of crowd counting and pixel-wise density estimation is continually being improved. Despite this, there are still two challenging problems in this field: 1) current supervised learning needs a large amount of training data, but collecting and annotating them is difficult and 2) existing methods cannot generalize well to the unseen domain. A recently released synthetic crowd dataset alleviates these two problems. However, the domain gap between the real-world data and synthetic images decreases the models' performance. To reduce the gap, in this article, we propose a domain-adaptation-style crowd counting method, which can effectively adapt the model from synthetic data to the specific real-world scenes. It consists of multilevel feature-aware adaptation (MFA) and structured density map alignment (SDA). To be specific, MFA encourages the model to extract domain-invariant features from multiple layers. SDA ensures that the network outputs fine density maps with a reasonable distribution on the real domain. Finally, we evaluate the proposed method on four mainstream surveillance crowd datasets, Shanghai Tech Part B, WorldExpo'10, Mall, and UCSD. Extensive experiments show that our approach outperforms the state-of-the-art methods for the same cross-domain counting problem.
|
16
|
Yang Y, Li G, Du D, Huang Q, Sebe N. Embedding Perspective Analysis Into Multi-Column Convolutional Neural Network for Crowd Counting. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2020; 30:1395-1407. [PMID: 33315562 DOI: 10.1109/tip.2020.3043122] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 06/12/2023]
Abstract
Crowd counting is challenging for deep networks due to several factors. For instance, the networks cannot efficiently analyze the perspective information of arbitrary scenes, and they are inherently inefficient at handling scale variations. In this work, we deliver a simple yet efficient multi-column network, which integrates the perspective analysis method with the counting network. The proposed method explicitly excavates the perspective information and drives the counting network to analyze the scenes. More concretely, we explore the perspective information from the estimated density maps and quantify the perspective space into several separate scenes. We then embed the perspective analysis into the multi-column framework with a recurrent connection. Therefore, the proposed network matches various scales with the different receptive fields efficiently. Secondly, we share the parameters of the branches with various receptive fields. This strategy drives the convolutional kernels to be sensitive to the instances with various scales. Furthermore, to improve the evaluation accuracy of the column with a large receptive field, we propose a transform dilated convolution. The transform dilated convolution breaks the fixed sampling structure of the deep network. Moreover, it needs no extra parameters and training, and the offsets are constrained in a local region, which is designed for the congested scenes. The proposed method achieves state-of-the-art performance on five datasets (ShanghaiTech, UCF CC 50, WorldEXPO'10, UCSD, and TRANCOS).
|
17
|
Peng S, Wang L, Yin B, Li Y, Xia Y, Hao X. Adaptive weighted crowd receptive field network for crowd counting. Pattern Anal Appl 2020. [DOI: 10.1007/s10044-020-00934-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/29/2022]
|