1. Sikdar A, Liu Y, Kedarisetty S, Zhao Y, Ahmed A, Behera A. Interweaving Insights: High-Order Feature Interaction for Fine-Grained Visual Recognition. Int J Comput Vis 2024; 133:1755-1779. PMID: 40160952; PMCID: PMC11953118; DOI: 10.1007/s11263-024-02260-y.
Abstract
This paper presents a novel approach for Fine-Grained Visual Classification (FGVC) by exploring Graph Neural Networks (GNNs) to facilitate high-order feature interactions, with a specific focus on constructing both inter- and intra-region graphs. Unlike previous FGVC techniques that often isolate global and local features, our method combines both features seamlessly during learning via graphs. Inter-region graphs capture long-range dependencies to recognize global patterns, while intra-region graphs delve into finer details within specific regions of an object by exploring high-dimensional convolutional features. A key innovation is the use of shared GNNs with an attention mechanism coupled with the Approximate Personalized Propagation of Neural Predictions (APPNP) message-passing algorithm, enhancing information propagation efficiency for better discriminability and simplifying the model architecture for computational efficiency. Additionally, the introduction of residual connections improves performance and training stability. Comprehensive experiments showcase state-of-the-art results on benchmark FGVC datasets, affirming the efficacy of our approach. This work underscores the potential of GNNs in modeling high-order feature interactions, distinguishing it from previous FGVC methods that typically focus on singular aspects of feature representation. Our source code is available at https://github.com/Arindam-1991/I2-HOFI.
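
To make the propagation scheme concrete, here is a minimal NumPy sketch of the APPNP update the abstract refers to (Z <- (1-alpha)*A_hat*Z + alpha*H); the graph, feature sizes, and alpha are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def appnp_propagate(H, A, alpha=0.1, K=10):
    """APPNP message passing: iterate Z <- (1-alpha) * A_hat @ Z + alpha * H.

    H: (N, d) initial node features/predictions from a neural network.
    A: (N, N) adjacency matrix of the region graph (no self-loops).
    """
    A_tilde = A + np.eye(A.shape[0])               # add self-loops
    deg = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt      # symmetric normalization
    Z = H.copy()
    for _ in range(K):
        Z = (1.0 - alpha) * (A_hat @ Z) + alpha * H  # teleport back to H
    return Z

# Toy usage: 5 region nodes with 8-d features on a ring graph.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1.0
print(appnp_propagate(H, A).shape)  # (5, 8)
```

The teleport term alpha*H is what keeps node-local evidence from being washed out by repeated neighborhood averaging, which is the property the abstract credits for better discriminability.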
Affiliation(s)
- Arindam Sikdar, Department of Computer Science, Edge Hill University, Ormskirk, UK
- Yonghuai Liu, Department of Computer Science, Edge Hill University, Ormskirk, UK
- Siddhardha Kedarisetty, Department of Aerospace Engineering, Technion—Israel Institute of Technology, Haifa, Israel
- Yitian Zhao, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo, China
- Amr Ahmed, Department of Computer Science, Edge Hill University, Ormskirk, UK
- Ardhendu Behera, Department of Computer Science, Edge Hill University, Ormskirk, UK

2. Sun J, Su J, Yan Z, Gao Z, Sun Y, Liu L. Truck model recognition for an automatic overload detection system based on the improved MMAL-Net. Front Neurosci 2023; 17:1243847. PMID: 37638309; PMCID: PMC10448051; DOI: 10.3389/fnins.2023.1243847.
Abstract
Efficient and reliable transportation of goods through trucks is crucial for road logistics. However, the overloading of trucks poses serious challenges to road infrastructure and traffic safety. Detecting and preventing truck overloading is of utmost importance for maintaining road conditions and ensuring the safety of both road users and the goods transported. This paper introduces a novel method for detecting truck overloading. The method utilizes the improved MMAL-Net for truck model recognition. Vehicle identification uses frontal and side truck images, while APPM is applied for local segmentation of the side image to recognize individual parts. The proposed method analyzes the captured images to precisely identify the models of trucks passing through automatic weighing stations on the highway. The improved MMAL-Net achieved an accuracy of 95.03% on the competitive benchmark dataset, Stanford Cars, demonstrating its superiority over other established methods. Furthermore, our method also demonstrated outstanding performance on a small-scale dataset. In our experimental evaluation, our method achieved a recognition accuracy of 85% when the training set consisted of 20 sets of photos, and it reached 100% as the training set gradually increased to 50 sets of samples. By integrating this recognition system with weight data obtained from weighing stations and license plate information, the method enables real-time assessment of truck overloading. The implementation of the proposed method is of vital importance for multiple aspects of road traffic safety.
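
The final overload decision is a lookup-and-compare step on top of the recognition output; a hedged sketch of that fusion logic in Python (the rated-load table, tolerance, plate string, and field names are invented for illustration, not taken from the paper):

```python
# Hypothetical fusion of recognized truck model, weighed mass, and plate data.
RATED_LOAD_KG = {            # illustrative rated gross weights per model
    "ModelA-6x4": 25000,
    "ModelB-8x4": 31000,
}

def assess_overload(model_name, measured_kg, plate, tolerance=0.05):
    """Flag a truck as overloaded if the measured weight exceeds the rated
    load for its recognized model by more than `tolerance` (5% here)."""
    rated = RATED_LOAD_KG.get(model_name)
    if rated is None:
        return {"plate": plate, "status": "unknown model, manual check"}
    overloaded = measured_kg > rated * (1.0 + tolerance)
    return {"plate": plate, "model": model_name,
            "measured_kg": measured_kg, "rated_kg": rated,
            "overloaded": overloaded}

print(assess_overload("ModelA-6x4", 27100, "HU-1234"))
```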
Affiliation(s)
- Jiachen Sun, Shanghai Key Laboratory of Intelligent Manufacturing and Robotics, School of Mechatronic Engineering and Automation, Shanghai University, Shanghai, China
- Jin Su, College of Information Engineering, Lanzhou University of Finance and Economics, Lanzhou, China
- Zhenhao Yan, Shanghai Key Laboratory of Intelligent Manufacturing and Robotics, School of Mechatronic Engineering and Automation, Shanghai University, Shanghai, China
- Zenggui Gao, Shanghai Key Laboratory of Intelligent Manufacturing and Robotics, School of Mechatronic Engineering and Automation, Shanghai University, Shanghai, China
- Yanning Sun, Shanghai Key Laboratory of Intelligent Manufacturing and Robotics, School of Mechatronic Engineering and Automation, Shanghai University, Shanghai, China
- Lilan Liu, Shanghai Key Laboratory of Intelligent Manufacturing and Robotics, School of Mechatronic Engineering and Automation, Shanghai University, Shanghai, China

3. Chen W, Wang Y, Tang X, Yan P, Liu X, Lin L, Shi G, Robert E, Huang F. A specific fine-grained identification model for plasma-treated rice growth using multiscale shortcut convolutional neural network. Math Biosci Eng 2023; 20:10223-10243. PMID: 37322930; DOI: 10.3934/mbe.2023448.
Abstract
As an agricultural innovation, low-temperature plasma technology is an environmentally friendly green technology that increases crop quality and productivity. However, there is a lack of research on the identification of plasma-treated rice growth. Although traditional convolutional neural networks (CNNs) can automatically share convolution kernels and extract features, their outputs are only suitable for entry-level categorization. However, shortcuts from the bottom layers to the fully connected layers can feasibly be established to utilize the spatial and local information of the bottom layers, which contains the small distinctions necessary for fine-grained identification. In this work, 5000 original images containing the basic growth information of rice (including plasma-treated rice and control rice) at the tillering stage were collected. An efficient multiscale shortcut CNN (MSCNN) model utilizing key information and cross-layer features was proposed. The results show that MSCNN outperforms the mainstream models in terms of accuracy, recall, precision and F1 score with 92.64%, 90.87%, 92.88% and 92.69%, respectively. Finally, the ablation experiment, comparing the average precision of MSCNN with and without shortcuts, revealed that the MSCNN with three shortcuts achieved the best performance with the highest precision.
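
A minimal PyTorch sketch of the multiscale-shortcut idea, where globally pooled features from every convolutional block bypass directly to the fully connected classifier; the block sizes and the three-shortcut layout are assumptions, not the paper's exact MSCNN architecture.

```python
import torch
import torch.nn as nn

class ShortcutCNN(nn.Module):
    """Toy MSCNN-style network: three conv blocks, each contributing a
    globally pooled shortcut feature to the fully connected classifier."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                          nn.ReLU(), nn.MaxPool2d(2))
            for cin, cout in [(3, 16), (16, 32), (32, 64)]
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16 + 32 + 64, num_classes)  # concat of shortcuts

    def forward(self, x):
        shortcuts = []
        for block in self.blocks:
            x = block(x)
            shortcuts.append(self.pool(x).flatten(1))  # keep low-level cues
        return self.fc(torch.cat(shortcuts, dim=1))

logits = ShortcutCNN()(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```

The design point is that the classifier sees low-level spatial statistics directly, rather than only the heavily abstracted top-layer features.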
Affiliation(s)
- Wenzhuo Chen, College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
- Yuan Wang, College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
- Xiaojiang Tang, College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
- Pengfei Yan, College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
- Xin Liu, College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
- Lianfeng Lin, College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
- Guannan Shi, College of Information and Electrical Engineering, China Agricultural University, Beijing, 100083, China
- Eric Robert, GREMI, UMR 7344, CNRS/Université d'Orléans, 45067 Orléans Cedex, France
- Feng Huang, College of Science, China Agricultural University, Beijing, 100083, China; GREMI, UMR 7344, CNRS/Université d'Orléans, 45067 Orléans Cedex, France; LE STUDIUM Loire Valley Institute for Advanced Studies, Centre-Val de Loire region, France

4. Wei XS, Song YZ, Aodha OM, Wu J, Peng Y, Tang J, Yang J, Belongie S. Fine-Grained Image Analysis With Deep Learning: A Survey. IEEE Trans Pattern Anal Mach Intell 2022; 44:8927-8948. PMID: 34752384; DOI: 10.1109/tpami.2021.3126648.
Abstract
Fine-grained image analysis (FGIA) is a longstanding and fundamental problem in computer vision and pattern recognition, and underpins a diverse set of real-world applications. The task of FGIA targets analyzing visual objects from subordinate categories, e.g., species of birds or models of cars. The small inter-class and large intra-class variation inherent to fine-grained image analysis makes it a challenging problem. Capitalizing on advances in deep learning, in recent years we have witnessed remarkable progress in deep learning powered FGIA. In this paper we present a systematic survey of these advances, where we attempt to re-define and broaden the field of FGIA by consolidating two fundamental fine-grained research areas - fine-grained image recognition and fine-grained image retrieval. In addition, we also review other key issues of FGIA, such as publicly available benchmark datasets and related domain-specific applications. We conclude by highlighting several research directions and open problems which need further exploration from the community.

5. Kumar P, Raghavendran S, Silambarasan K, Kannan KS, Krishnan N. Mobile application using DCDM and cloud-based automatic plant disease detection. Environ Monit Assess 2022; 195:44. PMID: 36302915; DOI: 10.1007/s10661-022-10561-3.
Abstract
Farming has a plethora of difficult responsibilities, and plant monitoring is one of them. There is also an urgent need for more alternative techniques for detecting plant diseases, which are currently lacking. The agriculture and agricultural support sectors in India provide employment for the great majority of the country's people, and the country's agricultural production is directly connected to its economic growth rate. In order to sustain healthy plant development, a variety of processes must be followed, including consideration of environmental factors and water supply management for the optimal production of crops. The traditional method of watering is inefficient and uncertain in its outcomes. More than 18% of the world's agricultural produce is destroyed by disease attacks annually. Because it is difficult to execute these activities manually, identifying plant diseases is essential to decreasing losses in the agricultural product business. In addition to diagnosing a wide range of plant ailments, our method also includes the identification of infections as a prophylactic step. Below is a detailed description of a farm-based module that includes numerous cloud data centers and data conversion devices for accurately monitoring and managing farm information and environmental elements. This procedure involves imaging the plant's visually obvious signs in order to identify disease, and the recommended therapy is delivered through an application to minimize any harm. Increased productivity as a result of the suggested approach would help both the agricultural and irrigation sectors. The plant area module is fitted with a mobile camera that captures images of all the plants in the area, and all the plants' information is saved in a database, which is accessible from any computer with Internet access. The recorded information covers the plant's name, the type of illness afflicting it, and an image of the plant. In a wide range of applications, bots are used to collect images of various plants as well as to prevent disease transmission. To ensure that all information given is retained, data is collected and stored in cloud storage as it becomes essential to regulate the condition. According to our research on a wide range of images of healthy and ill fruit and plant leaves, real-time diagnosis of plant leaf diseases can be performed with 98.78% accuracy in a laboratory environment. We utilized 40,000 photographs and then analyzed 10,000 photos to construct a DCDM deep learning model, which was then used to train additional models on the data set. Using a cloud-based image diagnostic and classification service, consumers can receive information about their plants' condition in less than a second, with the process requiring only 0.349 s on average.
Affiliation(s)
- Parasuraman Kumar, Centre for Information Technology and Engineering, Manonmaniam Sundaranar University, Abishekaptti, Tirunelveli, Tamilnadu, 627 012, India
- Srinivasan Raghavendran, Centre for Information Technology and Engineering, Manonmaniam Sundaranar University, Abishekaptti, Tirunelveli, Tamilnadu, 627 012, India; Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, Tamilnadu, 600 062, India
- Karunagaran Silambarasan, Centre for Information Technology and Engineering, Manonmaniam Sundaranar University, Abishekaptti, Tirunelveli, Tamilnadu, 627 012, India
- Nallaperumal Krishnan, Centre for Information Technology and Engineering, Manonmaniam Sundaranar University, Abishekaptti, Tirunelveli, Tamilnadu, 627 012, India

6. Bera A, Wharton Z, Liu Y, Bessis N, Behera A. SR-GNN: Spatial Relation-aware Graph Neural Network for Fine-Grained Image Categorization. IEEE Trans Image Process 2022; PP:6017-6031. PMID: 36103441; DOI: 10.1109/tip.2022.3205215.
Abstract
Over the past few years, significant progress has been made in deep convolutional neural network (CNN)-based image recognition. This is mainly due to the strong ability of such networks in mining discriminative object pose and parts information from texture and shape. This is often inappropriate for fine-grained visual classification (FGVC), since it exhibits high intra-class and low inter-class variances due to occlusions, deformation, illumination, etc. Thus, an expressive feature representation describing global structural information is key to characterize an object/scene. To this end, we propose a method that effectively captures subtle changes by aggregating context-aware features from the most relevant image regions and their importance in discriminating fine-grained categories, avoiding bounding-box and/or distinguishable-part annotations. Our approach is inspired by recent advances in self-attention and graph neural networks (GNNs) and includes a simple yet effective relation-aware feature transformation and its refinement using a context-aware attention mechanism to boost the discriminability of the transformed feature in an end-to-end learning process. Our model is evaluated on eight benchmark datasets consisting of fine-grained objects and human-object interactions. It outperforms the state-of-the-art approaches by a significant margin in recognition accuracy.
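
A hedged NumPy sketch of the general recipe the abstract describes: build a relation graph from region-feature similarity, mix features over it, then weight regions with a context-aware attention score. This illustrates the pattern rather than the SR-GNN code; all names and sizes are invented.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_aware_transform(R, W, q):
    """R: (N, d) region features; W: (d, d) learnable transform;
    q: (d,) context vector used to score region importance."""
    sim = R @ R.T / np.sqrt(R.shape[1])     # pairwise region relations
    A = softmax(sim, axis=1)                # row-normalized relation graph
    H = np.tanh(A @ R @ W)                  # one step of relational mixing
    attn = softmax(H @ q)                   # context-aware region weights
    return (attn[:, None] * H).sum(axis=0)  # attended image representation

rng = np.random.default_rng(1)
R = rng.normal(size=(6, 16))                # 6 image regions
v = relation_aware_transform(R, rng.normal(size=(16, 16)) * 0.1,
                             rng.normal(size=16))
print(v.shape)  # (16,)
```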

7. Deng W, Marsh J, Gould S, Zheng L. Fine-Grained Classification via Categorical Memory Networks. IEEE Trans Image Process 2022; 31:4186-4196. PMID: 35700253; DOI: 10.1109/tip.2022.3181492.
Abstract
Motivated by the desire to exploit patterns shared across classes, we present a simple yet effective class-specific memory module for fine-grained feature learning. The memory module stores the prototypical feature representation for each category as a moving average. We hypothesize that the combination of similarities with respect to each category is itself a useful discriminative cue. To detect these similarities, we use attention as a querying mechanism. The attention scores with respect to each class prototype are used as weights to combine prototypes via weighted sum, producing a uniquely tailored response feature representation for a given input. The original and response features are combined to produce an augmented feature for classification. We integrate our class-specific memory module into a standard convolutional neural network, yielding a Categorical Memory Network. Our memory module significantly improves accuracy over baseline CNNs, achieving competitive accuracy with state-of-the-art methods on four benchmarks, including CUB-200-2011, Stanford Cars, FGVC Aircraft, and NABirds.
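
The memory mechanism is compact enough to sketch directly; a minimal NumPy version follows (the momentum value, feature size, and class count are illustrative, and details differ from the authors' implementation):

```python
import numpy as np

class CategoricalMemory:
    """Class prototypes kept as moving averages; attention over prototypes
    yields a response feature that augments the input feature."""
    def __init__(self, num_classes, dim, momentum=0.9):
        self.M = np.zeros((num_classes, dim))   # one prototype per class
        self.momentum = momentum

    def update(self, feat, label):
        self.M[label] = self.momentum * self.M[label] + (1 - self.momentum) * feat

    def respond(self, feat):
        scores = self.M @ feat                   # similarity to each prototype
        w = np.exp(scores - scores.max()); w /= w.sum()   # attention weights
        response = w @ self.M                    # weighted sum of prototypes
        return np.concatenate([feat, response])  # augmented feature

mem = CategoricalMemory(num_classes=3, dim=4)
mem.update(np.ones(4), label=0)
print(mem.respond(np.ones(4)).shape)  # (8,)
```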

8. Meng M, Zhang T, Yang W, Zhao J, Zhang Y, Wu F. Diverse Complementary Part Mining for Weakly Supervised Object Localization. IEEE Trans Image Process 2022; 31:1774-1788. PMID: 35104217; DOI: 10.1109/tip.2022.3145238.
Abstract
Weakly Supervised Object Localization (WSOL) aims to localize objects with only image-level labels, which has better scalability and practicability than fully supervised methods in actual deployment. However, a common limitation of available techniques based on classification networks is that they only highlight the most discriminative part of the object, not the entire object. To alleviate this problem, we propose a novel end-to-end part discovery model (PDM) to learn multiple discriminative object parts in a unified network for accurate object localization and classification. The proposed PDM enjoys several merits. First, to the best of our knowledge, it is the first work to directly model diverse and robust object parts by jointly exploiting part diversity, compactness, and importance for WSOL. Second, three effective mechanisms, namely diversity, compactness, and importance learning, are designed to learn robust object parts. Therefore, our model can exploit complementary spatial information and local details from the learned object parts, which helps to produce precise bounding boxes and discriminate different object categories. Extensive experiments on two standard benchmarks demonstrate that our PDM performs favorably against state-of-the-art WSOL approaches.
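
Of the three mechanisms, the diversity term is the easiest to make concrete; below is a hedged PyTorch sketch that penalizes overlap between part attention maps. This is one plausible form of such a loss, not necessarily the paper's exact formulation.

```python
import torch

def part_diversity_loss(parts):
    """parts: (B, P, H*W) spatial attention maps for P discovered parts.
    Penalizes pairwise cosine similarity so parts attend to distinct regions."""
    parts = torch.nn.functional.normalize(parts, dim=-1)
    gram = parts @ parts.transpose(1, 2)          # (B, P, P) similarities
    P = parts.shape[1]
    off_diag = gram - torch.eye(P, device=parts.device)  # zero the diagonal
    return off_diag.clamp(min=0).sum(dim=(1, 2)).mean() / (P * (P - 1))

maps = torch.rand(2, 4, 49)          # 2 images, 4 parts, 7x7 grid
print(part_diversity_loss(maps))     # scalar overlap penalty
```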

9. Han J, Yao X, Cheng G, Feng X, Xu D. P-CNN: Part-Based Convolutional Neural Networks for Fine-Grained Visual Categorization. IEEE Trans Pattern Anal Mach Intell 2022; 44:579-590. PMID: 31398107; DOI: 10.1109/tpami.2019.2933510.
Abstract
This paper proposes an end-to-end fine-grained visual categorization system, termed Part-based Convolutional Neural Network (P-CNN), which consists of three modules. The first module is a Squeeze-and-Excitation (SE) block, which learns to recalibrate channel-wise feature responses by emphasizing informative channels and suppressing less useful ones. The second module is a Part Localization Network (PLN) used to locate distinctive object parts, through which a bank of convolutional filters is learned as discriminative part detectors. Thus, a group of informative parts can be discovered by convolving the feature maps with each part detector. The third module is a Part Classification Network (PCN) that has two streams. The first stream classifies each individual object part into image-level categories. The second stream concatenates the part features and the global feature into a joint feature for the final classification. In order to learn powerful part features and boost the joint feature capability, we propose a Duplex Focal Loss used for metric learning and part classification, which focuses training on hard examples. We further merge PLN and PCN into a unified network for end-to-end training via a simple training technique. Comprehensive experiments and comparisons with state-of-the-art methods on three benchmark datasets demonstrate the effectiveness of our proposed method.
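
The SE block in the first module is a standard component; a minimal PyTorch version for reference (the reduction ratio of 16 is the conventional default, not necessarily the paper's setting):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global-average 'squeeze', two-layer
    'excitation' MLP, then sigmoid gating of the channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                 # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))            # squeeze: (B, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                      # recalibrated feature map

print(SEBlock(64)(torch.randn(2, 64, 8, 8)).shape)  # torch.Size([2, 64, 8, 8])
```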

10. From the whole to detail: Progressively sampling discriminative parts for fine-grained recognition. Knowl Based Syst 2022. DOI: 10.1016/j.knosys.2021.107651.

11. Wang C, Song S, Yang Q, Li X, Huang G. Fine-grained few shot learning with foreground object transformation. Neurocomputing 2021. DOI: 10.1016/j.neucom.2021.09.016.

12.
Abstract
Classifying fine-grained categories (e.g., bird species, car, and aircraft types) is a crucial problem in image understanding and is difficult due to intra-class and inter-class variance. Most existing fine-grained approaches individually utilize various parts and local information of objects to improve classification accuracy but neglect the mechanism of feature fusion between the object (global) and the object's parts (local) to reinforce fine-grained features. In this paper, we present a novel framework, namely object–part registration–fusion Net (OR-Net), which considers the mechanism of registration and fusion between an object's (global) and its parts' (local) features for fine-grained classification. Our model learns fine-grained features from the global and local regions of the object and fuses these features with the registration mechanism to reinforce each region's characteristics in the feature maps. Specifically, OR-Net consists of: (1) a multi-stream feature extraction net, which generates features from the global and various local regions of objects; and (2) a registration–fusion feature module, which calculates the dimension and location relationships between global (object) regions and local (parts) regions to generate registration information, and fuses the local features into the global features using that information to generate the fine-grained feature. Experiments executed on symmetric GPU devices with symmetric mini-batches verify that OR-Net surpasses the state-of-the-art approaches on the CUB-200-2011 (Birds), Stanford-Cars, and Stanford-Aircraft datasets.

13. Multi-level dictionary learning for fine-grained images categorization with attention model. Neurocomputing 2021. DOI: 10.1016/j.neucom.2020.07.147.

14. Liu X, Zhang L, Li T, Wang D, Wang Z. Dual attention guided multi-scale CNN for fine-grained image classification. Inf Sci 2021. DOI: 10.1016/j.ins.2021.05.040.

15.
Abstract
Service robots are appearing more and more in our daily life. The development of service robots combines multiple fields of research, from object perception to object manipulation, and the state of the art continues to improve toward a proper coupling between the two. This coupling is necessary for service robots not only to perform various tasks in a reasonable amount of time but also to continually adapt to new environments and safely interact with non-expert human users. Nowadays, robots are able to recognize various objects and quickly plan a collision-free trajectory to grasp a target object in predefined settings. Besides, in most cases there is a reliance on large amounts of training data. Therefore, the knowledge of such robots is fixed after the training phase, and any changes in the environment require complicated, time-consuming, and expensive robot re-programming by human experts. These approaches are thus still too rigid for real-life applications in unstructured environments, where a significant portion of the environment is unknown and cannot be directly sensed or controlled. In such environments, no matter how extensive the training data used for batch learning, a robot will always face new objects. Therefore, apart from batch learning, the robot should be able to continually learn about new object categories and grasp affordances from very few training examples on-site. Moreover, apart from robot self-learning, non-expert users could interactively guide the process of experience acquisition by teaching new concepts or by correcting insufficient or erroneous concepts. In this way, the robot will constantly learn how to help humans in everyday tasks by gaining more and more experience without the need for re-programming. In this paper, we review a set of previously published works and discuss advances in service robots from object perception to complex object manipulation, and shed light on the current challenges and bottlenecks.

16. Wu X, Chang J, Lai YK, Yang J, Tian Q. BiSPL: Bidirectional Self-Paced Learning for Recognition From Web Data. IEEE Trans Image Process 2021; 30:6512-6527. PMID: 34252026; DOI: 10.1109/tip.2021.3094744.
Abstract
Deep learning (DL) is inherently subject to the requirement of a large amount of well-labeled data, which is expensive and time-consuming to obtain manually. In order to broaden the reach of DL, leveraging free web data becomes an attractive strategy to alleviate the issue of data scarcity. However, directly utilizing collected web data to train a deep model is ineffective because of the mixed noisy data. To address such problems, we develop a novel bidirectional self-paced learning (BiSPL) framework which reduces the effect of noise by learning from web data in a meaningful order. Technically, the BiSPL framework consists of two essential steps. First, relying on distances defined between web samples and labeled source samples, the web samples with short distances are sampled and combined to form a new training set. Second, based on the new training set, both easy and hard samples are initially employed to train deep models for higher stability, and hard samples are gradually dropped to reduce the noise as training progresses. By iteratively alternating these steps, deep models converge to a better solution. We mainly focus on fine-grained visual classification (FGVC) tasks because their corresponding datasets are generally small and therefore face a more significant data-scarcity problem. Experiments conducted on six public FGVC tasks demonstrate that our proposed method outperforms the state-of-the-art approaches. Especially, BiSPL achieves the highest stable performance when the scale of the well-labeled training set decreases dramatically.
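
A hedged NumPy sketch of the first step (distance-based selection of web samples against labeled class centroids); the distance metric, keep ratio, and shapes are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def select_web_samples(web_feats, src_feats, src_labels, keep_ratio=0.5):
    """Keep the web samples closest to any labeled class centroid,
    mimicking the 'short distance first' sampling in self-paced learning."""
    centroids = np.stack([src_feats[src_labels == c].mean(axis=0)
                          for c in np.unique(src_labels)])
    d = np.linalg.norm(web_feats[:, None, :] - centroids[None], axis=-1)
    nearest = d.min(axis=1)                 # distance to the closest class
    k = int(len(web_feats) * keep_ratio)
    return np.argsort(nearest)[:k]          # indices of the easy web samples

rng = np.random.default_rng(2)
idx = select_web_samples(rng.normal(size=(100, 32)),
                         rng.normal(size=(20, 32)),
                         rng.integers(0, 4, size=20))
print(len(idx))  # 50
```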

17. Kim T, Hong K, Byun H. The feature generator of hard negative samples for fine-grained image recognition. Neurocomputing 2021. DOI: 10.1016/j.neucom.2020.10.032.

18. Zhu L, Fan H, Luo Y, Xu M, Yang Y. Few-Shot Common-Object Reasoning Using Common-Centric Localization Network. IEEE Trans Image Process 2021; 30:4253-4262. PMID: 33830923; DOI: 10.1109/tip.2021.3070733.
Abstract
In the few-shot common-localization task, given few support images without bounding box annotations at each episode, the goal is to localize the common object in the query image of unseen categories. The task involves reasoning about the common object across the given images and predicting the spatial locations of objects with different shapes, sizes, and orientations. In this work, we propose a common-centric localization (CCL) network for few-shot common-localization. The motivation of our network is to learn the common object features by dynamic feature relation reasoning via a graph convolutional network with conditional feature aggregation. First, we propose a local common object region generation pipeline to reduce background noise due to feature misalignment. Each support image predicts more accurate object spatial locations by replacing the query with the images in the support set. Second, we introduce a graph convolutional network with dynamic feature transformation to enforce the common object reasoning. To enhance the discriminability during feature matching and enable better generalization in unseen scenarios, we leverage a conditional feature encoding function to adaptively alter visual features according to the input query. Third, we introduce a common-centric relation structure to model the correlation between the common features and the query image feature. The generated common features guide the query image feature towards a more common-object-related representation. We evaluate our network on four datasets, i.e., CL-VOC-07, CL-VOC-12, CL-COCO, and CL-VID, and obtain significant improvements compared to the state of the art. Our quantitative results confirm the effectiveness of our network.

21. Liu X, Han Z, Liu YS, Zwicker M. Fine-Grained 3D Shape Classification With Hierarchical Part-View Attention. IEEE Trans Image Process 2021; 30:1744-1758. PMID: 33417547; DOI: 10.1109/tip.2020.3048623.
Abstract
Fine-grained 3D shape classification is important for shape understanding and analysis, yet it poses a challenging research problem. Studies on fine-grained 3D shape classification have rarely been conducted, due to the lack of fine-grained 3D shape benchmarks. To address this issue, we first introduce a new 3D shape dataset (named the FG3D dataset) with fine-grained class labels, which consists of three categories: airplane, car, and chair. Each category consists of several subcategories at a fine-grained level. According to our experiments on this fine-grained dataset, we find that state-of-the-art methods are significantly limited by the small variance among subcategories in the same category. To resolve this problem, we further propose a novel fine-grained 3D shape classification method named FG3D-Net to capture the fine-grained local details of 3D shapes from multiple rendered views. Specifically, we first train a Region Proposal Network (RPN) to detect the generally semantic parts inside multiple views under the benchmark of generally semantic part detection. Then, we design a hierarchical part-view attention aggregation module to learn a global shape representation by aggregating generally semantic part features, which preserves the local details of 3D shapes. The part-view attention module hierarchically leverages part-level and view-level attention to increase the discriminability of our features: part-level attention highlights the important parts in each view, while view-level attention highlights the discriminative views among all views of the same object. In addition, we integrate a Recurrent Neural Network (RNN) to capture the spatial relationships among sequential views from different viewpoints. Our results on the fine-grained 3D shape dataset show that our method outperforms other state-of-the-art methods. The FG3D dataset is available at https://github.com/liuxinhai/FG3D-Net.

22. Zhang CB, Jiang PT, Hou Q, Wei Y, Han Q, Li Z, Cheng MM. Delving Deep Into Label Smoothing. IEEE Trans Image Process 2021; 30:5984-5996. PMID: 34166191; DOI: 10.1109/tip.2021.3089942.
Abstract
Label smoothing is an effective regularization tool for deep neural networks (DNNs), which generates soft labels by applying a weighted average between the uniform distribution and the hard label. It is often used to reduce the overfitting problem of training DNNs and further improve classification performance. In this paper, we aim to investigate how to generate more reliable soft labels. We present an Online Label Smoothing (OLS) strategy, which generates soft labels based on the statistics of the model prediction for the target category. The proposed OLS constructs a more reasonable probability distribution between the target categories and non-target categories to supervise DNNs. Experiments demonstrate that based on the same classification models, the proposed approach can effectively improve the classification performance on CIFAR-100, ImageNet, and fine-grained datasets. Additionally, the proposed method can significantly improve the robustness of DNN models to noisy labels compared to current label smoothing approaches. The source code is available at our project page: https://mmcheng.net/ols/.
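
A minimal NumPy sketch of the OLS bookkeeping described above: predictions are accumulated per target class during an epoch, and the normalized statistics become the next epoch's soft labels. Restricting the accumulation to correctly classified samples is a simplifying assumption of this sketch.

```python
import numpy as np

class OnlineLabelSmoothing:
    """Per-class soft labels built from the model's own prediction statistics."""
    def __init__(self, num_classes):
        self.soft = np.full((num_classes, num_classes), 1.0 / num_classes)
        self._acc = np.zeros_like(self.soft)
        self._cnt = np.zeros(num_classes)

    def accumulate(self, probs, target, pred):
        if pred == target:                 # count only correct predictions
            self._acc[target] += probs
            self._cnt[target] += 1

    def end_epoch(self):
        mask = self._cnt > 0               # classes seen this epoch
        self.soft[mask] = self._acc[mask] / self._cnt[mask, None]
        self._acc[:] = 0
        self._cnt[:] = 0

    def target_for(self, label):
        return self.soft[label]            # soft supervision for next epoch

ols = OnlineLabelSmoothing(3)
ols.accumulate(np.array([0.7, 0.2, 0.1]), target=0, pred=0)
ols.end_epoch()
print(ols.target_for(0))  # [0.7 0.2 0.1]
```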

23. Kasaei SH, Lopes LS, Tome AM. Local-LDA: Open-Ended Learning of Latent Topics for 3D Object Recognition. IEEE Trans Pattern Anal Mach Intell 2020; 42:2567-2580. PMID: 31283495; DOI: 10.1109/tpami.2019.2926459.
Abstract
Service robots are expected to be more autonomous and work effectively in human-centric environments. This implies that robots should have special capabilities, such as learning from past experiences and real-time object category recognition. This paper proposes an open-ended 3D object recognition system which concurrently learns both the object categories and the statistical features for encoding objects. In particular, we propose an extension of Latent Dirichlet Allocation to learn structural semantic features (i.e., visual topics), from low-level feature co-occurrences, for each category independently. Moreover, topics in each category are discovered in an unsupervised fashion and are updated incrementally using new object views. In this way, the advantages of both the (hand-crafted) local features and the (learned) structural semantic features have been considered and combined in an efficient way. An extensive set of experiments has been performed to assess the performance of the proposed Local-LDA in terms of descriptiveness, scalability, and computation time. Experimental results show that the overall classification performance obtained with Local-LDA is clearly better than the best performances obtained with the state-of-the-art approaches. Moreover, the best scalability, in terms of number of learned categories, was obtained with the proposed Local-LDA approach, closely followed by a Bag-of-Words (BoW) approach. Concerning computation time, the best result was obtained with BoW, immediately followed by the Local-LDA approach.

24. Zhang R, Huang Y, Pu M, Zhang J, Guan Q, Zou Q, Ling H. Object Discovery From a Single Unlabeled Image by Mining Frequent Itemset With Multi-scale Features. IEEE Trans Image Process 2020; PP:8606-8621. PMID: 32813653; DOI: 10.1109/tip.2020.3015543.
Abstract
The goal of our work is to discover dominant objects in a very general setting where only a single unlabeled image is given. This is far more challenging than typical co-localization or weakly supervised localization tasks. To tackle this problem, we propose a simple but effective pattern-mining-based method, called Object Location Mining (OLM), which exploits the advantages of data mining and the feature representations of pretrained convolutional neural networks (CNNs). Specifically, we first convert the feature maps from a pretrained CNN model into a set of transactions, and then discover frequent patterns from the transaction database through pattern mining techniques. We observe that those discovered patterns, i.e., co-occurrence highlighted regions, typically hold appearance and spatial consistency. Motivated by this observation, we can easily discover and localize possible objects by merging relevant meaningful patterns. Extensive experiments on a variety of benchmarks demonstrate that OLM achieves competitive localization performance compared with the state-of-the-art methods. We also evaluate our approach against unsupervised saliency detection methods and achieve competitive results on seven benchmark datasets. Moreover, we conduct experiments on fine-grained classification to show that our proposed method can locate the entire object and its parts accurately, which can significantly improve classification results.
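
A hedged sketch of the transaction-building step in plain Python: each spatial position of a CNN feature map becomes a transaction of strongly firing channels, and frequent channel pairs are counted. The binarization threshold and the restriction to pairs are simplifications of the paper's itemset mining.

```python
from itertools import combinations
from collections import Counter
import numpy as np

def mine_frequent_channel_pairs(fmap, thresh=0.5, min_support=0.3):
    """fmap: (C, H, W) CNN activations. Each spatial position becomes a
    'transaction' of channels firing above `thresh`; frequent pairs of
    channels indicate co-occurring, likely object-related patterns."""
    C, H, W = fmap.shape
    flat = fmap.reshape(C, -1)
    transactions = [frozenset(np.where(flat[:, p] > thresh)[0])
                    for p in range(H * W)]
    counts = Counter(pair for t in transactions
                     for pair in combinations(sorted(t), 2))
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

rng = np.random.default_rng(3)
print(mine_frequent_channel_pairs(rng.random((8, 4, 4))))
```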

25. Hu X, Liu J, Ma J, Pan Y, Zhang L. Fine-Grained 3D-Attention Prototypes for Few-Shot Learning. Neural Comput 2020; 32:1664-1684. PMID: 32687772; DOI: 10.1162/neco_a_01302.
Abstract
In the real world, a limited number of labeled fine-grained images per class can hardly represent the class distribution effectively. Fine-grained images also exhibit more subtle visual differences than simple images with obvious objects; that is, there are smaller inter-class and larger intra-class variations. To solve these issues, we propose an end-to-end attention-based model for fine-grained few-shot image classification (AFG) with the recent episode training strategy. It is composed mainly of a feature learning module, an image reconstruction module, and a label distribution module. The feature learning module devises a 3D-Attention mechanism, which considers both the spatial positions and the different channel attentions of the image features, in order to learn more discriminative local features that better represent the class distribution. The image reconstruction module calculates the mappings between local features and the original images. It is constrained by a designed loss function as auxiliary supervised information, so that learning each local feature does not need extra annotations. The label distribution module is used to predict the label distribution of a given unlabeled sample, and we use the local features to represent the image features for classification. By conducting comprehensive experiments on Mini-ImageNet and three fine-grained data sets, we demonstrate that the proposed model achieves superior performance over the competitors.
Affiliation(s)
- Xin Hu, National Engineering Lab for Big Data Analytics and School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
- Jun Liu, National Engineering Lab for Big Data Analytics and School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
- Jie Ma, School of Computer Science and Technology and Shaanxi Province Key Laboratory of Satellite and Terrestrial Network Tech. R&D, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
- Yudai Pan, School of Computer Science and Technology and Shaanxi Province Key Laboratory of Satellite and Terrestrial Network Tech. R&D, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
- Lingling Zhang, Shaanxi Province Key Laboratory of Satellite and Terrestrial Network Tech. R&D, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China

26. Weakly Supervised Fine-Grained Image Classification via Salient Region Localization and Different Layer Feature Fusion. Appl Sci (Basel) 2020. DOI: 10.3390/app10134652.
Abstract
The fine-grained image classification task is about differentiating between visually similar object classes. Its difficulties are large intra-class variance and small inter-class variance. For this reason, improving models' accuracy on the task relies heavily on annotations of discriminative and regional parts, and such dependency on delicate annotations restricts the models' practicability. To tackle this issue, this article proposes a weakly supervised fine-grained image classification model built around a saliency module. Through our salient region localization module, the proposed model can localize essential regional parts using saliency maps, while only image-class annotations are provided. Besides, the bilinear attention module can improve feature extraction by using higher- and lower-level layers of the network to fuse regional features with global features. Building on the bilinear attention architecture, we propose the different-layer feature fusion module to improve the expression ability of model features. We tested and verified our model on public datasets released specifically for fine-grained image classification. The results show that our proposed model achieves close to state-of-the-art classification performance on various datasets while requiring the least training data. Such a result indicates that the practicality of our model is greatly improved, since fine-grained image datasets are expensive to produce.

28. Bameri F, Pourreza HR, Taherinia AH, Aliabadian M, Mortezapour HR, Abdilzadeh R. TMTCPT: The Tree Method based on the Taxonomic Categorization and the Phylogenetic Tree for fine-grained categorization. Biosystems 2020; 195:104137. PMID: 32360318; DOI: 10.1016/j.biosystems.2020.104137.
Abstract
Fine-grained categorization is one of the most challenging problems in machine vision. Recent methods based on convolutional neural networks have increased classification accuracy very significantly. Inspired by these methods, we offer a new framework for fine-grained categorization. Our tree method, named "TMTCPT", is based on taxonomic categorization, a phylogenetic tree, and convolutional neural network classifiers. The word "taxonomic" is derived from "taxonomical categorization", which categorizes objects and visual features and plays a prominent role here. It presents a hierarchical categorization that leads to multiple classification levels; the first level includes the general visual features having the lowest similarity level, whereas the other levels include strikingly similar visual features, following a top-to-bottom hierarchy. The phylogenetic tree presents the phylogenetic information of organisms. The convolutional neural network classifiers can classify the categories precisely. In this study, we created such a tree to increase classification accuracy and evaluated the effectiveness of the method on the challenging CUB-200-2011 dataset. The results demonstrate that the proposed method is efficient and robust. Its average classification accuracy was 88.34%, higher than those of all previous methods.
Affiliation(s)
- Fateme Bameri, Faculty of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran; Machine Vision Lab, Ferdowsi University of Mashhad, Mashhad, Iran
- Hamid-Reza Pourreza, Faculty of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran; Machine Vision Lab, Ferdowsi University of Mashhad, Mashhad, Iran
- Amir-Hossein Taherinia, Faculty of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran; Machine Vision Lab, Ferdowsi University of Mashhad, Mashhad, Iran
- Mansour Aliabadian, Department of Biology, Faculty of Sciences, Ferdowsi University of Mashhad, Mashhad, Iran
- Raziyeh Abdilzadeh, Department of Biology, Faculty of Sciences, Ferdowsi University of Mashhad, Mashhad, Iran

29. Wei XS, Wang P, Liu L, Shen C, Wu J. Piecewise Classifier Mappings: Learning Fine-Grained Learners for Novel Categories With Few Examples. IEEE Trans Image Process 2019; 28:6116-6125. PMID: 31265400; DOI: 10.1109/tip.2019.2924811.
Abstract
Humans are capable of learning a new fine-grained concept with very little supervision, e.g., a few exemplary images for a species of bird, yet our best deep learning systems need hundreds or thousands of labeled examples. In this paper, we try to reduce this gap by studying the fine-grained image recognition problem in a challenging few-shot learning setting, termed few-shot fine-grained recognition (FSFG). The task of FSFG requires the learning systems to build classifiers for novel fine-grained categories from few examples (only one, or fewer than five). To solve this problem, we propose an end-to-end trainable deep network, which is inspired by the state-of-the-art fine-grained recognition model and is tailored for the FSFG task. Specifically, our network consists of a bilinear feature learning module and a classifier mapping module: while the former encodes the discriminative information of an exemplar image into a feature vector, the latter maps the intermediate feature into the decision boundary of the novel category. The key novelty of our model is a "piecewise mappings" function in the classifier mapping module, which generates the decision boundary via learning a set of more attainable sub-classifiers in a more parameter-economic way. We learn the exemplar-to-classifier mapping based on an auxiliary dataset in a meta-learning fashion, which is expected to generalize to novel categories. By conducting comprehensive experiments on three fine-grained datasets, we demonstrate that the proposed method achieves superior performance over the competing baselines.

30. Pan Y, Xia Y, Shen D. Foreground Fisher Vector: Encoding Class-Relevant Foreground to Improve Image Classification. IEEE Trans Image Process 2019; 28:4716-4729. PMID: 30946666; DOI: 10.1109/tip.2019.2908795.
Abstract
Image classification is an essential and challenging task in computer vision. Despite its prevalence, the combination of the deep convolutional neural network (DCNN) and the Fisher vector (FV) encoding method has limited performance, since the class-irrelevant background used in traditional FV encoding may result in less discriminative image features. In this paper, we propose the foreground FV (fgFV) encoding algorithm and its fast approximation for image classification. We try to implicitly separate the class-relevant foreground from the class-irrelevant background during the encoding process by tuning the weights of the partial gradients corresponding to each Gaussian component under the supervision of image labels, and then use only those local descriptors extracted from the class-relevant foreground to estimate FVs. We have evaluated our fgFV against the widely used FV and improved FV (iFV) under the combined DCNN-FV framework and also compared them to several state-of-the-art image classification approaches on ten benchmark image datasets for the recognition of fine-grained natural species and artificial manufactures, categorization of coarse objects, and classification of scenes. Our results indicate that the proposed fgFV encoding algorithm can construct more discriminative image representations from local descriptors than FV and iFV, and the combined DCNN-fgFV algorithm can improve the performance of image classification.
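
For orientation, here is a minimal NumPy sketch of the standard FV mean-gradient terms that fgFV reweights (diagonal-covariance GMM, gradients w.r.t. means only; the full FV also carries variance terms, and the foreground weighting itself is omitted):

```python
import numpy as np

def fisher_vector_means(X, weights, means, sigmas):
    """Standard FV gradient w.r.t. GMM means (diagonal covariances).
    X: (N, d) local descriptors; weights: (K,); means, sigmas: (K, d)."""
    N, d = X.shape
    K = len(weights)
    # Posterior responsibilities gamma_{ik} of each descriptor under the GMM.
    log_p = np.stack([
        np.log(weights[k])
        - 0.5 * np.sum(((X - means[k]) / sigmas[k]) ** 2
                       + np.log(2 * np.pi * sigmas[k] ** 2), axis=1)
        for k in range(K)], axis=1)                       # (N, K)
    gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    fv = np.concatenate([
        (gamma[:, k:k+1] * (X - means[k]) / sigmas[k]).sum(axis=0)
        / (N * np.sqrt(weights[k]))
        for k in range(K)])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)              # L2 normalization

rng = np.random.default_rng(4)
fv = fisher_vector_means(rng.normal(size=(50, 8)), np.array([0.5, 0.5]),
                         rng.normal(size=(2, 8)), np.ones((2, 8)))
print(fv.shape)  # (16,)
```

fgFV's core idea, per the abstract, is to learn per-component weights on these partial gradients so that background-dominated components contribute less.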

31. Wu L, Wang Y, Li X, Gao J. Deep Attention-Based Spatially Recursive Networks for Fine-Grained Visual Recognition. IEEE Trans Cybern 2019; 49:1791-1802. PMID: 29993796; DOI: 10.1109/tcyb.2018.2813971.
Abstract
Fine-grained visual recognition is an important problem in pattern recognition applications. However, it is a challenging task due to the subtle inter-class difference and large intra-class variation. Recent visual attention models are able to automatically locate critical object parts and represent them against appearance variations. However, without considering spatial dependencies in discriminative feature learning, these methods underperform in classifying fine-grained objects. In this paper, we present a deep attention-based spatially recursive model that can learn to attend to critical object parts and encode them into spatially expressive representations. Our network is technically premised on bilinear pooling, enabling local pairwise feature interactions between outputs from two different convolutional neural networks (CNNs) that correspond to distinct region detection and relevant feature extraction. Then, spatial long short-term memory (LSTM) units are introduced to generate spatially meaningful hidden representations via long-range dependency on all features in two dimensions. The attention model is leveraged between the bilinear outcomes and the spatial LSTMs for dynamic selection over varied inputs. Our model, which is composed of two-stream CNN layers, bilinear pooling, and spatial recursive encoding with attention, is end-to-end trainable and serves as the part detector and feature extractor whereby relevant features are localized, extracted, and encoded spatially for recognition purposes. We demonstrate the superiority of our method on two typical fine-grained recognition tasks: fine-grained image classification and person re-identification.
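
The bilinear-pooling core of such two-stream models is easy to show in isolation; a minimal PyTorch sketch of the pairwise feature interactions with the usual signed-square-root and L2 normalization (the spatial LSTM and attention stages are omitted, and shapes are illustrative):

```python
import torch

def bilinear_pool(fa, fb):
    """fa: (B, Ca, H, W) and fb: (B, Cb, H, W) from two CNN streams.
    Returns (B, Ca*Cb) pooled pairwise interactions."""
    B, Ca, H, W = fa.shape
    Cb = fb.shape[1]
    fa = fa.reshape(B, Ca, H * W)
    fb = fb.reshape(B, Cb, H * W)
    z = torch.bmm(fa, fb.transpose(1, 2)) / (H * W)      # sum of outer products
    z = z.reshape(B, Ca * Cb)
    z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)  # signed square root
    return torch.nn.functional.normalize(z, dim=1)       # L2 normalization

out = bilinear_pool(torch.randn(2, 16, 7, 7), torch.randn(2, 32, 7, 7))
print(out.shape)  # torch.Size([2, 512])
```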

32. Which and How Many Regions to Gaze: Focus Discriminative Regions for Fine-Grained Visual Categorization. Int J Comput Vis 2019. DOI: 10.1007/s11263-019-01176-2.

33. Du Y, Wang H, Cui Y, Huang X. Fundamental Visual Concept Learning from Correlated Images and Text. IEEE Trans Image Process 2019; 28:3598-3612. PMID: 30794173; DOI: 10.1109/tip.2019.2899944.
Abstract
Heterogeneous web media consists of many visual concepts, such as objects, scenes, and activities, that cannot be semantically decomposed. The task of learning fundamental visual concepts (FVCs) plays an important role in automatically understanding the elements that compose all visual media, as well as in applications such as retrieval and annotation. In this paper, we formulate the problem of FVC learning and propose an approach to this problem called neighboring concept distributing (NCD). Our approach models all data using a concept graph, which considers the visual patches in images as nodes and generates inter-image edges between visual patches in different images and intra-image edges between visual patches in the same image. The NCD approach distributes semantic information from images to visual patches based on measurements over the concept graph, including fitness, distinctiveness, smoothness, and sparseness, without any pre-trained concept detectors or classifiers. We analyze the learnability of the proposed approach and find that, under some conditions, all concepts can be correctly learned with an arbitrarily high probability as the size of the data increases. We demonstrate the performance of the NCD approach using three public datasets. Experimental results show that our approach outperforms state-of-the-art approaches when learning visual concepts from correlated media.

34. Xu J, Wang C, Qi C, Shi C, Xiao B. Unsupervised Semantic-Based Aggregation of Deep Convolutional Features. IEEE Trans Image Process 2019; 28:601-611. PMID: 30281422; DOI: 10.1109/tip.2018.2867104.
Abstract
In this paper, we propose a simple but effective semantic-based aggregation (SBA) method. The proposed SBA utilizes the discriminative filters of deep convolutional layers as semantic detectors. Moreover, we propose an effective unsupervised strategy to select semantic detectors that generate "soft region proposals," which highlight certain discriminative patterns of objects and suppress background noise. The final global SBA representation is then acquired by aggregating the regional representations weighted by the selected "soft region proposals" corresponding to various semantic content. Our unsupervised SBA is easy to generalize and achieves excellent performance on various tasks. We conduct comprehensive experiments and show that our unsupervised SBA outperforms the state-of-the-art unsupervised and supervised aggregation methods on image retrieval, place recognition, and cloud classification.

35. Yang J, Sun X, Lai YK, Zheng L, Cheng MM. Recognition From Web Data: A Progressive Filtering Approach. IEEE Trans Image Process 2018; 27:5303-5315. PMID: 30010575; DOI: 10.1109/tip.2018.2855449.
Abstract
Leveraging abundant web data is a promising strategy for addressing data scarcity when training convolutional neural networks (CNNs). However, web images often carry incorrect tags, which may compromise the learned CNN model. To address this problem, this paper focuses on image classification and proposes to iterate between filtering out noisy web labels and fine-tuning the CNN model using the crawled web images. Overall, the proposed method benefits from the growing modeling capability of the learned model to correct labels for web images and from learning on such new data to produce a more effective model. Our contribution is two-fold. First, we propose an iterative method that progressively improves the discriminative ability of CNNs and the accuracy of web image selection. This method is beneficial for selecting high-quality web training images and expanding the training set as the model improves. Second, since web images are usually complex and may not be accurately described by a single tag, we propose to assign each web image multiple labels to reduce the impact of hard label assignment. This labeling strategy mines more training samples to improve the CNN model. In the experiments, we crawl 0.5 million web images covering all categories of four public image classification data sets. Compared with a baseline trained without web images, the proposed method brings notable improvement. We also report competitive recognition accuracy compared with the state of the art.
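The iterate-filter-retrain loop can be illustrated with a small sketch. The version below uses a linear classifier on precomputed features as a stand-in for the CNN, and a simplified acceptance rule (keep a web image if its crawled tag is among the model's top-k predictions) as a proxy for the paper's multi-label assignment; all names and parameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def progressive_filtering(clean_X, clean_y, web_X, web_tags,
                          rounds=3, top_k=3):
    """Iterate between web-image selection and retraining (sketch).

    clean_X, clean_y: trusted training features and labels.
    web_X, web_tags: crawled features and their (possibly noisy) tags,
    both as numpy arrays.
    """
    model = LogisticRegression(max_iter=1000).fit(clean_X, clean_y)
    for _ in range(rounds):
        proba = model.predict_proba(web_X)
        # Simplified multi-label acceptance: keep a web image if its
        # crawled tag appears among the top-k predicted labels.
        topk = model.classes_[np.argsort(proba, axis=1)[:, -top_k:]]
        keep = np.array([tag in row for tag, row in zip(web_tags, topk)])
        # Expand the training set with the accepted web images and retrain.
        X = np.vstack([clean_X, web_X[keep]])
        y = np.concatenate([clean_y, web_tags[keep]])
        model = LogisticRegression(max_iter=1000).fit(X, y)
    return model
```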
|
36
|
Large-Scale Fine-Grained Bird Recognition Based on a Triplet Network and Bilinear Model. APPLIED SCIENCES-BASEL 2018. [DOI: 10.3390/app8101906] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
The main purpose of fine-grained classification is to distinguish among many subcategories of a single basic category, such as birds or flowers. We propose a model based on a triplet network and bilinear methods for fine-grained bird identification. Our proposed model can be trained in an end-to-end manner, which effectively increases the inter-class distance of the features extracted by the network and improves the accuracy of bird recognition. When experimentally tested on 1096 birds in a custom-built dataset and on Caltech-UCSD (a public bird dataset), the model achieved accuracies of 88.91% and 85.58%, respectively. The experimental results confirm the high generalization ability of our model in fine-grained image classification. Moreover, our model requires no additional manual annotation such as object bounding boxes or part-location points, which guarantees good versatility and robustness in fine-grained bird recognition.
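The two named ingredients translate into short, standard routines. Below is a minimal numpy sketch of bilinear pooling (outer-product pooling with signed square-root and l2 normalization) and a hinge triplet loss; the margin and feature shapes are illustrative, not the paper's settings.

```python
import numpy as np

def bilinear_pool(feat_a, feat_b):
    """Bilinear pooling of two conv feature maps, each of shape (C, H, W)."""
    C, H, W = feat_a.shape
    a = feat_a.reshape(C, -1)
    b = feat_b.reshape(C, -1)
    x = (a @ b.T).ravel() / (H * W)           # outer-product pooling
    x = np.sign(x) * np.sqrt(np.abs(x))       # signed square-root
    return x / (np.linalg.norm(x) + 1e-8)     # l2 normalization

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge triplet loss: pushes inter-class distance above the margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

Training with the triplet loss on bilinear-pooled features is what drives the inter-class distances apart, which is the effect the abstract credits for the accuracy gain.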
|
37
|
Peng Y, He X, Zhao J. Object-Part Attention Model for Fine-Grained Image Classification. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2018; 27:1487-1500. [PMID: 29990123 DOI: 10.1109/tip.2017.2774041] [Citation(s) in RCA: 75] [Impact Index Per Article: 10.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/23/2023]
Abstract
Fine-grained image classification aims to recognize hundreds of subcategories belonging to the same basic-level category, such as 200 subcategories of birds, and is highly challenging due to large variance within the same subcategory and small variance among different subcategories. Existing methods generally first locate the objects or parts and then discriminate which subcategory the image belongs to. However, they mainly have two limitations: 1) they rely on object or part annotations, which are heavily labor-consuming; and 2) they ignore the spatial relationships between the object and its parts, as well as among these parts, both of which are significantly helpful for finding discriminative parts. Therefore, this paper proposes the object-part attention model (OPAM) for weakly supervised fine-grained image classification, whose main novelties are: 1) the object-part attention model integrates two levels of attention: object-level attention localizes the objects in images, and part-level attention selects discriminative parts of the object. Both are jointly employed to learn multi-view and multi-scale features and to enhance their mutual promotion; and 2) the object-part spatial constraint model combines two spatial constraints: the object spatial constraint ensures that selected parts are highly representative, and the part spatial constraint eliminates redundancy and enhances the discrimination of selected parts. Both are jointly employed to exploit subtle and local differences for distinguishing subcategories. Importantly, neither object nor part annotations are used in our approach, which avoids the heavy labor of annotation. Compared with more than ten state-of-the-art methods on four widely used datasets, our OPAM approach achieves the best performance.
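Object-level attention of this kind is commonly realized with a saliency map derived from convolutional activations. The sketch below is a simplified, CAM-style stand-in for OPAM's object-level attention, not the paper's implementation; part-level attention and the spatial constraint model are omitted.

```python
import numpy as np

def object_attention_crop(conv_feat):
    """Object-level attention sketch: localize the object from activations.

    conv_feat: (C, H, W). Sums channel activations into a saliency map,
    thresholds it at its mean, and returns the bounding box of all
    activated cells (a simplification of region selection).
    """
    saliency = conv_feat.sum(axis=0)
    mask = saliency > saliency.mean()
    ys, xs = np.where(mask)
    if len(ys) == 0:
        # Degenerate case: fall back to the full feature map.
        return 0, 0, conv_feat.shape[1] - 1, conv_feat.shape[2] - 1
    return ys.min(), xs.min(), ys.max(), xs.max()  # (y0, x0, y1, x1)
```

In OPAM, discriminative parts are then selected inside this box subject to the object and part spatial constraints, which require the full constraint model and are beyond this sketch.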
|
38
|
Lu H, Cao Z, Xiao Y, Fang Z, Zhu Y. Towards fine-grained maize tassel flowering status recognition: Dataset, theory and practice. Appl Soft Comput 2017. [DOI: 10.1016/j.asoc.2017.02.026] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
39
|
Wei XS, Luo JH, Wu J, Zhou ZH. Selective Convolutional Descriptor Aggregation for Fine-Grained Image Retrieval. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2017; 26:2868-2881. [PMID: 28368819 DOI: 10.1109/tip.2017.2688133] [Citation(s) in RCA: 86] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Deep convolutional neural network models pre-trained for the ImageNet classification task have been successfully adopted in other domains, such as texture description and object proposal generation, but these tasks require annotations for images in the new domain. In this paper, we focus on a novel and challenging task in the purely unsupervised setting: fine-grained image retrieval. Even with image labels, fine-grained images are difficult to classify, let alone retrieve without supervision. We propose the selective convolutional descriptor aggregation (SCDA) method. SCDA first localizes the main object in fine-grained images, a step that discards the noisy background and keeps useful deep descriptors. The selected descriptors are then aggregated and reduced to a short feature vector using the best practices we found. SCDA is unsupervised, using no image labels or bounding box annotations. Experiments on six fine-grained data sets confirm the effectiveness of SCDA for fine-grained image retrieval. In addition, visualization of the SCDA features shows that they correspond to visual attributes (even subtle ones), which might explain SCDA's high mean average precision in fine-grained retrieval. Moreover, on general image retrieval data sets, SCDA achieves retrieval results comparable with state-of-the-art general image retrieval approaches.
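The select-then-aggregate pipeline maps almost directly onto a few numpy operations. The sketch below assumes a (C, H, W) activation tensor from the last convolutional layer and uses mean-thresholded channel-sum masking with average- plus max-pooling, a simplified rendering of the procedure the abstract describes.

```python
import numpy as np

def scda_feature(conv_feat):
    """Selective convolutional descriptor aggregation (simplified sketch).

    conv_feat: (C, H, W) activations of the last conv layer.
    """
    # Aggregation map: sum over channels, threshold at its mean to get a
    # mask that localizes the main object and drops background.
    agg = conv_feat.sum(axis=0)
    mask = agg > agg.mean()

    # Keep only the deep descriptors at selected locations: (n_kept, C).
    descriptors = conv_feat.reshape(conv_feat.shape[0], -1).T[mask.ravel()]

    # Aggregate selected descriptors by average- and max-pooling, then
    # concatenate and l2-normalize into a short feature vector.
    feat = np.concatenate([descriptors.mean(axis=0), descriptors.max(axis=0)])
    return feat / (np.linalg.norm(feat) + 1e-8)
```

Retrieval then reduces to nearest-neighbor search over these vectors, e.g. by cosine similarity between query and database features.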
|
40
|
Wang L, Guo S, Huang W, Xiong Y, Qiao Y. Knowledge Guided Disambiguation for Large-Scale Scene Classification With Multi-Resolution CNNs. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2017; 26:2055-2068. [PMID: 28252402 DOI: 10.1109/tip.2017.2675339] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Convolutional neural networks (CNNs) have made remarkable progress on scene recognition, partially due to recent large-scale scene datasets such as Places and Places2. Scene categories are often defined by multi-level information, including local objects, global layout, and background environment, leading to large intra-class variations. In addition, with the increasing number of scene categories, label ambiguity has become another crucial issue in large-scale classification. This paper focuses on large-scale scene recognition and makes two major contributions to tackle these issues. First, we propose a multi-resolution CNN architecture that captures visual content and structure at multiple levels. The multi-resolution CNNs are composed of coarse-resolution CNNs and fine-resolution CNNs, which are complementary to each other. Second, we design two knowledge-guided disambiguation techniques to deal with label ambiguity: 1) we exploit the knowledge in the confusion matrix computed on validation data to merge ambiguous classes into a super category, and 2) we utilize the knowledge of extra networks to produce a soft label for each image. The super categories or soft labels are then employed to guide CNN training on Places2. We conduct extensive experiments on three large-scale image datasets (ImageNet, Places, and Places2), demonstrating the effectiveness of our approach. Furthermore, our method took part in two major scene recognition challenges, achieving second place in the Places2 challenge at ILSVRC 2015 and first place in the LSUN challenge at CVPR 2016. Finally, we directly test the learned representations on other scene benchmarks and obtain new state-of-the-art results on MIT Indoor67 (86.7%) and SUN397 (72.0%). We release the code and models at https://github.com/wanglimin/MRCNN-Scene-Recognition.
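The second disambiguation technique (soft labels) is easy to sketch. Below is an illustrative numpy version that blends the ground-truth one-hot label with the prediction of an extra knowledge network and scores it with a soft-target cross-entropy; the blending weight alpha is a hypothetical choice, not the paper's value.

```python
import numpy as np

def soft_label(one_hot, extra_net_probs, alpha=0.7):
    """Knowledge-guided soft label: blend the hard label with the
    softmax output of an extra network. alpha is illustrative."""
    return alpha * one_hot + (1.0 - alpha) * extra_net_probs

def soft_cross_entropy(pred_probs, target_probs, eps=1e-8):
    """Cross-entropy loss against a soft target distribution."""
    return -np.sum(target_probs * np.log(pred_probs + eps))
```

The confusion-matrix technique works analogously at the label level: classes that the validation confusion matrix shows are frequently confounded are merged into one super category before training.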
|
41
|
Xu Z, Tao D, Huang S, Zhang Y. Friend or Foe: Fine-Grained Categorization With Weak Supervision. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2017; 26:135-146. [PMID: 27810811 DOI: 10.1109/tip.2016.2621661] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Multi-instance learning (MIL) is widely acknowledged as a fundamental method for solving weakly supervised problems. While MIL is usually effective in standard weakly supervised object recognition tasks, in this paper we investigate its applicability to an extreme case of weakly supervised learning: fine-grained visual categorization, in which intra-class variance can be larger than inter-class variance due to the subtle differences between subordinate categories. For this challenging task, we propose a new method that generalizes the standard multi-instance learning framework, for which a novel multi-task co-localization algorithm is proposed to take advantage of the relationships among fine-grained categories while also serving as an effective initialization strategy for the non-convex multi-instance objective. The localization results also enable object-level, domain-specific fine-tuning of deep neural networks, which significantly boosts performance. Experimental results on three fine-grained datasets reveal the effectiveness of the proposed method, especially the importance of exploiting inter-class relationships between object categories in weakly supervised fine-grained recognition.
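For readers unfamiliar with the MIL framing, the standard formulation that the paper generalizes can be stated in a few lines: an image is a bag of instances (e.g. region proposals), and the bag's class score is the maximum over its instances. The sketch below shows only this baseline assumption, with instance scores assumed given; it is not the paper's multi-task co-localization algorithm.

```python
import numpy as np

def mil_bag_scores(instance_scores):
    """Standard MIL assumption: a bag's score per class is the max over
    its instances. instance_scores: (num_instances, num_classes)."""
    return instance_scores.max(axis=0)

def mil_loss(instance_scores, bag_label):
    """Negative log-likelihood of the bag label under softmaxed bag scores."""
    s = mil_bag_scores(instance_scores)
    p = np.exp(s - s.max())
    p /= p.sum()
    return -np.log(p[bag_label] + 1e-8)
```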
|
42
|
Do MN. Action Recognition in Still Images With Minimum Annotation Efforts. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2016; 25:5479-5490. [PMID: 27608461 DOI: 10.1109/tip.2016.2605305] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
We focus on the problem of still-image-based human action recognition, which essentially involves making predictions by analyzing human poses and their interaction with objects in the scene. Besides image-level action labels (e.g., riding, phoning), existing works usually require additional human bounding boxes as input during both training and testing to facilitate the characterization of the underlying human-object interactions. We argue that this additional input requirement might severely discourage potential applications and is not strictly necessary. To this end, we develop a systematic approach to address this challenging problem with minimum annotation effort, i.e., to perform recognition when only image-level action labels are available in the training stage. Experimental results on three benchmark data sets demonstrate that, compared with state-of-the-art methods that have privileged access to additional human bounding-box annotations, our approach achieves comparable or even superior recognition accuracy using only action annotations in training. Interestingly, in many cases our approach is able, as a by-product, to segment out the precise regions of the underlying human-object interactions.
|
43
|
Zhang Y, Wu J, Cai J. Compact Representation of High-Dimensional Feature Vectors for Large-Scale Image Recognition and Retrieval. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2016; 25:2407-2419. [PMID: 27046897 DOI: 10.1109/tip.2016.2549360] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
In large-scale visual recognition and image retrieval tasks, feature vectors such as the Fisher vector (FV) or the vector of locally aggregated descriptors (VLAD) have achieved state-of-the-art results. However, the combination of large numbers of examples and high-dimensional vectors necessitates dimensionality reduction in order to bring storage and CPU costs into a reasonable range. In spite of the popularity of various feature compression methods, this paper shows that feature (dimension) selection is a better choice for high-dimensional FV/VLAD than feature (dimension) compression methods such as product quantization. We show that strong correlation among the feature dimensions in the FV and the VLAD may not exist, which makes feature selection a natural choice. We also show that many dimensions in FV/VLAD are noise; discarding them via feature selection is better than compressing them together with useful dimensions using feature compression methods. To choose features, we propose an efficient importance sorting algorithm that covers both the supervised and unsupervised cases, for visual recognition and image retrieval, respectively. Combined with 1-bit quantization, feature selection achieves both higher accuracy and lower computational cost than feature compression methods, such as product quantization, on the FV and VLAD image representations.
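The pipeline reduces to three steps: rank dimensions by importance, keep the top k, and binarize to sign bits. The sketch below is illustrative; the ranking criteria used here (absolute classifier weights for the supervised case, per-dimension variance for the unsupervised case) are plausible stand-ins for the paper's importance sorting algorithm, not its exact formulation.

```python
import numpy as np

def select_and_quantize(X, k, weights=None):
    """Feature selection + 1-bit quantization for FV/VLAD vectors (sketch).

    X: (N, D) high-dimensional image representations.
    weights: optional (D,) importance scores, e.g. absolute weights of a
    linear classifier (supervised case); falls back to per-dimension
    variance (unsupervised case). Both criteria are stand-ins.
    """
    importance = np.abs(weights) if weights is not None else X.var(axis=0)
    keep = np.argsort(importance)[-k:]          # indices of top-k dimensions
    X_sel = X[:, keep]
    bits = X_sel > 0                            # 1-bit (sign) quantization
    return keep, np.packbits(bits, axis=1)      # compact binary codes
```

With sign bits, similarity search over the compressed codes can be done with Hamming distance, which is where the computational savings over product quantization come from.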
|