1
|
Li Y, Xiao Z, Yang L, Meng D, Zhou X, Fan H, Zhang L. AttMOT: Improving Multiple-Object Tracking by Introducing Auxiliary Pedestrian Attributes. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:5454-5468. [PMID: 38662556 DOI: 10.1109/tnnls.2024.3384446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/06/2025]
Abstract
Multiobject tracking (MOT) is a fundamental problem in computer vision with numerous applications, such as intelligent surveillance and automated driving. Despite the significant progress made in MOT, pedestrian attributes, such as gender, hairstyle, body shape, and clothing features, which contain rich and high-level information, have been less explored. To address this gap, we propose a simple, effective, and generic method to predict pedestrian attributes to support general reidentification (Re-ID) embedding. We first introduce attribute multi-object tracking (AttMOT), a large, highly enriched synthetic dataset for pedestrian tracking, containing over 80k frames and six million pedestrian identity switches (IDs) with different times, weather conditions, and scenarios. To the best of authors' knowledge, AttMOT is the first MOT dataset with semantic attributes. Subsequently, we explore different approaches to fuse Re-ID embedding and pedestrian attributes, including attention mechanisms, which we hope will stimulate the development of attribute-assisted MOT. The proposed method attribute-assisted method (AAM) demonstrates its effectiveness and generality on several representative pedestrian MOT benchmarks, including MOT17 and MOT20, through experiments on the AttMOT dataset. When applied to the state-of-the-art trackers, AAM achieves consistent improvements in multi-object tracking accuracy (MOTA), higher order tracking accuracy (HOTA), association accuracy (AssA), IDs, and IDF1 scores. For instance, on MOT17, the proposed method yields a +1.1 MOTA, +1.7 HOTA, and +1.8 IDF1 improvement when used with FairMOT. To further encourage related research, we release the data and code at https://github.com/HengLan/AttMOT.
Collapse
|
2
|
Jiang Z, Liu Z, Chen L, Tong L, Zhang X, Lan X, Crookes D, Yang MH, Zhou H. Detecting and Tracking of Multiple Mice Using Part Proposal Networks. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:9806-9820. [PMID: 35349456 DOI: 10.1109/tnnls.2022.3160800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
The study of mouse social behaviors has been increasingly undertaken in neuroscience research. However, automated quantification of mouse behaviors from the videos of interacting mice is still a challenging problem, where object tracking plays a key role in locating mice in their living spaces. Artificial markers are often applied for multiple mice tracking, which are intrusive and consequently interfere with the movements of mice in a dynamic environment. In this article, we propose a novel method to continuously track several mice and individual parts without requiring any specific tagging. First, we propose an efficient and robust deep-learning-based mouse part detection scheme to generate part candidates. Subsequently, we propose a novel Bayesian-inference integer linear programming (BILP) model that jointly assigns the part candidates to individual targets with necessary geometric constraints while establishing pair-wise association between the detected parts. There is no publicly available dataset in the research community that provides a quantitative test bed for part detection and tracking of multiple mice, and we here introduce a new challenging Multi-Mice PartsTrack dataset that is made of complex behaviors. Finally, we evaluate our proposed approach against several baselines on our new datasets, where the results show that our method outperforms the other state-of-the-art approaches in terms of accuracy. We also demonstrate the generalization ability of the proposed approach on tracking zebra and locust.
Collapse
|
3
|
Chang X, Pan H, Sun W, Gao H. YolTrack: Multitask Learning Based Real-Time Multiobject Tracking and Segmentation for Autonomous Vehicles. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2021; 32:5323-5333. [PMID: 33577460 DOI: 10.1109/tnnls.2021.3056383] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Modern autonomous vehicles are required to perform various visual perception tasks for scene construction and motion decision. The multiobject tracking and instance segmentation (MOTS) are the main tasks since they directly influence the steering and braking of the car. Implementing both tasks using a multitask learning neural network presents significant challenges in performance and complexity. Current work on MOTS devotes to improve the precision of the network with a two-stage tracking by detection model, which is difficult to satisfy the real-time requirement of autonomous vehicles. In this article, a real-time multitask network named YolTrack based on one-stage instance segmentation model is proposed to perform the MOTS task, achieving an inference speed of 29.5 frames per second (fps) with slight accuracy and precision drop. The YolTrack uses ShuffleNet V2 with feature pyramid network (FPN) as a backbone, from which two decoders are extended to generate instance segments and embedding vectors. Segmentation masks are used to improve the tracking performance by performing logic AND operation with feature maps, proving that foreground segmentation plays an important role in object tracking. The different scales of multiple tasks are balanced by the optimized geometric mean loss during the training phase. Experimental results on the KITTI MOTS data set show that YolTrack outperforms other state-of-the-art MOTS architectures in real-time aspect and is appropriate for deployment in autonomous vehicles.
Collapse
|
4
|
Li CJ, Qu Z, Wang SY, Liu L. A method of cross-layer fusion multi-object detection and recognition based on improved faster R-CNN model in complex traffic environment. Pattern Recognit Lett 2021. [DOI: 10.1016/j.patrec.2021.02.003] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
5
|
Li X, Chen M, Wang Q. Quantifying and Detecting Collective Motion in Crowd Scenes. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2020; 29:5571-5583. [PMID: 32286982 DOI: 10.1109/tip.2020.2985284] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
People in crowd scenes always exhibit consistent behaviors and form collective motions. The analysis of collective motion has motivated a surge of interest in computer vision. Nevertheless, the effort is hampered by the complex nature of collective motions. Considering the fact that collective motions are formed by individuals, this paper proposes a new framework for both quantifying and detecting collective motion by investigating the spatio-temporal behavior of individuals. The main contributions of this work are threefold: 1) an intention-aware model is built to fully capture the intrinsic dynamics of individuals; 2) a structure-based collectiveness measurement is developed to accurately quantify the collective properties of crowds; 3) a multistage clustering strategy is formulated to detect both the local and global behavior consistency in crowd scenes. Experiments on real world data sets show that our method is able to handle crowds with various structures and time-varying dynamics. Especially, the proposed method shows nearly 10% improvement over the competitors in terms of NMI, Purity and RI. Its applicability is illustrated in the context of anomaly detection and semantic scene segmentation.
Collapse
|
6
|
Li K, Kong Y, Fu Y. Visual Object Tracking via Multi-Stream Deep Similarity Learning Networks. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2019; 29:3311-3320. [PMID: 31869790 DOI: 10.1109/tip.2019.2959249] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Visual tracking remains a challenging research problem because of appearance variations of the object over time, changing cluttered background and requirement for real-time speed. In this paper, we investigate the problem of real-time accurate tracking in a instance-level tracking-by-verification mechanism. We propose a multi-stream deep similarity learning network to learn a similarity comparison model purely off-line. Our loss function encourages the distance between a positive patch and the background patches to be larger than that between the positive patch and the target template. Then, the learned model is directly used to determine the patch in each frame that is most distinctive to the background context and similar to the target template. Within the learned feature space, even if the distance between positive patches becomes large caused by the interference of background clutter, impact from hard distractors from the same class or the appearance change of the target, our method can still distinguish the target robustly using the relative distance. Besides, we also propose a complete framework considering the recovery from failures and the template updating to further improve the tracking performance without taking too much computing resource. Experiments on visual tracking benchmarks show the effectiveness of the proposed tracker when comparing with several recent real-time-speed trackers as well as trackers already included in the benchmarks.
Collapse
|
7
|
Song J, Guo Y, Gao L, Li X, Hanjalic A, Shen HT. From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2019; 30:3047-3058. [PMID: 30130235 DOI: 10.1109/tnnls.2018.2851077] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Video captioning, in essential, is a complex natural process, which is affected by various uncertainties stemming from video content, subjective judgment, and so on. In this paper, we build on the recent progress in using encoder-decoder framework for video captioning and address what we find to be a critical deficiency of the existing methods that most of the decoders propagate deterministic hidden states. Such complex uncertainty cannot be modeled efficiently by the deterministic models. In this paper, we propose a generative approach, referred to as multimodal stochastic recurrent neural networks (MS-RNNs), which models the uncertainty observed in the data using latent stochastic variables. Therefore, MS-RNN can improve the performance of video captioning and generate multiple sentences to describe a video considering different random factors. Specifically, a multimodal long short-term memory (LSTM) is first proposed to interact with both visual and textual features to capture a high-level representation. Then, a backward stochastic LSTM is proposed to support uncertainty propagation by introducing latent variables. Experimental results on the challenging data sets, microsoft video description and microsoft research video-to-text, show that our proposed MS-RNN approach outperforms the state-of-the-art video captioning benchmarks.
Collapse
|
8
|
Cao X, Qiu B, Li X, Shi Z, Xu G, Xu J. Multidimensional Balance-Based Cluster Boundary Detection for High-Dimensional Data. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2019; 30:1867-1880. [PMID: 30387747 DOI: 10.1109/tnnls.2018.2874458] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
The balance of neighborhood space around a central point is an important concept in cluster analysis. It can be used to effectively detect cluster boundary objects. The existing neighborhood analysis methods focus on the distribution of data, i.e., analyzing the characteristic of the neighborhood space from a single perspective, and could not obtain rich data characteristics. In this paper, we analyze the high-dimensional neighborhood space from multiple perspectives. By simulating each dimension of a data point's k nearest neighbors space ( k NNs) as a lever, we apply the lever principle to compute the balance fulcrum of each dimension after proving its inevitability and uniqueness. Then, we model the distance between the projected coordinate of the data point and the balance fulcrum on each dimension and construct the DHBlan coefficient to measure the balance of the neighborhood space. Based on this theoretical model, we propose a simple yet effective cluster boundary detection algorithm called Lever. Experiments on both low- and high-dimensional data sets validate the effectiveness and efficiency of our proposed algorithm.
Collapse
|
9
|
Hu H, Ma B, Shen J, Shao L. Manifold Regularized Correlation Object Tracking. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:1786-1795. [PMID: 28422697 DOI: 10.1109/tnnls.2017.2688448] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
In this paper, we propose a manifold regularized correlation tracking method with augmented samples. To make better use of the unlabeled data and the manifold structure of the sample space, a manifold regularization-based correlation filter is introduced, which aims to assign similar labels to neighbor samples. Meanwhile, the regression model is learned by exploiting the block-circulant structure of matrices resulting from the augmented translated samples over multiple base samples cropped from both target and nontarget regions. Thus, the final classifier in our method is trained with positive, negative, and unlabeled base samples, which is a semisupervised learning framework. A block optimization strategy is further introduced to learn a manifold regularization-based correlation filter for efficient online tracking. Experiments on two public tracking data sets demonstrate the superior performance of our tracker compared with the state-of-the-art tracking approaches.
Collapse
|
10
|
Li X, Liu L, Lu X. Person Reidentification Based on Elastic Projections. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:1314-1327. [PMID: 28422688 DOI: 10.1109/tnnls.2016.2602855] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Person reidentification usually refers to matching people in different camera views in nonoverlapping multicamera networks. Many existing methods learn a similarity measure by projecting the raw feature to a latent subspace to make the same target's distance smaller than different targets' distances. However, the same targets captured in different camera views should hold the same intrinsic attributes while different targets should hold different intrinsic attributes. Projecting all the data to the same subspace would cause loss of such an information and comparably poor discriminability. To address this problem, in this paper, a method based on elastic projections is proposed to learn a pairwise similarity measure for person reidentification. The proposed model learns two projections, positive projection and negative projection, which are both representative and discriminative. The representability refers to: for the same targets captured in two camera views, the positive projection can bridge the corresponding appearance variation and represent the intrinsic attributes of the same targets, while for the different targets captured in two camera views, the negative projection can explore and utilize the different attributes of different targets. The discriminability means that the intraclass distance should become smaller than its original distance after projection, while the interclass distance becomes larger on the contrary, which is the elastic property of the proposed model. In this case, prior information of the original data space is used to give guidance for the learning phase; more importantly, similar targets (but not the same) are effectively reduced by forcing the same targets to become more similar and different targets to become more distinct. The proposed model is evaluated on three benchmark data sets, including VIPeR, GRID, and CUHK, and achieves better performance than other methods.
Collapse
|
11
|
Zhang S, Lan X, Yao H, Zhou H, Tao D, Li X. A Biologically Inspired Appearance Model for Robust Visual Tracking. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2017; 28:2357-2370. [PMID: 27448375 DOI: 10.1109/tnnls.2016.2586194] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
In this paper, we propose a biologically inspired appearance model for robust visual tracking. Motivated in part by the success of the hierarchical organization of the primary visual cortex (area V1), we establish an architecture consisting of five layers: whitening, rectification, normalization, coding, and pooling. The first three layers stem from the models developed for object recognition. In this paper, our attention focuses on the coding and pooling layers. In particular, we use a discriminative sparse coding method in the coding layer along with spatial pyramid representation in the pooling layer, which makes it easier to distinguish the target to be tracked from its background in the presence of appearance variations. An extensive experimental study shows that the proposed method has higher tracking accuracy than several state-of-the-art trackers.
Collapse
|
12
|
Kumar M, Bhatnagar C. Zero-stopping constraint-based hybrid tracking model for dynamic and high-dense crowd videos. THE IMAGING SCIENCE JOURNAL 2017. [DOI: 10.1080/13682199.2016.1265707] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
13
|
Yeh JF, Tan YS, Lee CH. Topic detection and tracking for conversational content by using conceptual dynamic latent Dirichlet allocation. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.08.017] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
|
14
|
He Z, Li X, You X, Tao D, Tang YY. Connected Component Model for Multi-Object Tracking. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2016; 25:3698-3711. [PMID: 27214900 DOI: 10.1109/tip.2016.2570553] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
In multi-object tracking, it is critical to explore the data associations by exploiting the temporal information from a sequence of frames rather than the information from the adjacent two frames. Since straightforwardly obtaining data associations from multi-frames is an NP-hard multi-dimensional assignment (MDA) problem, most existing methods solve this MDA problem by either developing complicated approximate algorithms, or simplifying MDA as a 2D assignment problem based upon the information extracted only from adjacent frames. In this paper, we show that the relation between associations of two observations is the equivalence relation in the data association problem, based on the spatial-temporal constraint that the trajectories of different objects must be disjoint. Therefore, the MDA problem can be equivalently divided into independent subproblems by equivalence partitioning. In contrast to existing works for solving the MDA problem, we develop a connected component model (CCM) by exploiting the constraints of the data association and the equivalence relation on the constraints. Based upon CCM, we can efficiently obtain the global solution of the MDA problem for multi-object tracking by optimizing a sequence of independent data association subproblems. Experiments on challenging public data sets demonstrate that our algorithm outperforms the state-of-the-art approaches.
Collapse
|
15
|
Philip FM, Mukesh R. Hybrid tracking model for multiple object videos using second derivative based visibility model and tangential weighted spatial tracking model. INT J COMPUT INT SYS 2016. [DOI: 10.1080/18756891.2016.1237188] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022] Open
|
16
|
Zhang X, Guan N, Tao D, Qiu X, Luo Z. Online multi-modal robust non-negative dictionary learning for visual tracking. PLoS One 2015; 10:e0124685. [PMID: 25961715 PMCID: PMC4427315 DOI: 10.1371/journal.pone.0124685] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2014] [Accepted: 03/17/2015] [Indexed: 11/19/2022] Open
Abstract
Dictionary learning is a method of acquiring a collection of atoms for subsequent signal representation. Due to its excellent representation ability, dictionary learning has been widely applied in multimedia and computer vision. However, conventional dictionary learning algorithms fail to deal with multi-modal datasets. In this paper, we propose an online multi-modal robust non-negative dictionary learning (OMRNDL) algorithm to overcome this deficiency. Notably, OMRNDL casts visual tracking as a dictionary learning problem under the particle filter framework and captures the intrinsic knowledge about the target from multiple visual modalities, e.g., pixel intensity and texture information. To this end, OMRNDL adaptively learns an individual dictionary, i.e., template, for each modality from available frames, and then represents new particles over all the learned dictionaries by minimizing the fitting loss of data based on M-estimation. The resultant representation coefficient can be viewed as the common semantic representation of particles across multiple modalities, and can be utilized to track the target. OMRNDL incrementally learns the dictionary and the coefficient of each particle by using multiplicative update rules to respectively guarantee their non-negativity constraints. Experimental results on a popular challenging video benchmark validate the effectiveness of OMRNDL for visual tracking in both quantity and quality.
Collapse
Affiliation(s)
- Xiang Zhang
- Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha, Hunan, China
| | - Naiyang Guan
- Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha, Hunan, China
| | - Dacheng Tao
- The Centre for Quantum Computation & Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney, 81 Broadway Street, Ultimo, NSW 2007, Australia
- * E-mail:
| | - Xiaogang Qiu
- College of Information System and Management, National University of Defense Technology, Changsha, Hunan, 410073 China
| | - Zhigang Luo
- Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha, Hunan, China
| |
Collapse
|