1. Wei W, Wei P, Liao Z, Qin J, Cheng X, Liu M, Zheng N. Semantic Consistency Reasoning for 3-D Object Detection in Point Clouds. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:3356-3369. [PMID: 38113156] [DOI: 10.1109/tnnls.2023.3341097]
Abstract
Point cloud-based 3-D object detection is a significant and critical issue in numerous applications. While most existing methods attempt to capitalize on the geometric characteristics of point clouds, they neglect the internal semantic properties of points and the consistency between semantic and geometric clues. In this article, we introduce a semantic consistency (SC) mechanism for 3-D object detection by reasoning about the semantic relations between 3-D object boxes and their internal points. This mechanism is based on a natural principle: the semantic category of a 3-D bounding box should be consistent with the categories of all points within the box. Driven by the SC mechanism, we propose a novel SC network (SCNet) to detect 3-D objects from point clouds. Specifically, the SCNet is composed of a feature extraction module, a detection decision module, and a semantic segmentation module. In inference, the feature extraction and detection decision modules are used to detect 3-D objects. In training, the semantic segmentation module is jointly trained with the other two modules to produce more robust and applicable model parameters. The performance is greatly boosted by reasoning about the relations between the output 3-D object boxes and the segmented points. The proposed SC mechanism is model-agnostic and can be integrated into other base 3-D object detection models. We test the proposed model on three challenging indoor and outdoor benchmark datasets: ScanNetV2, SUN RGB-D, and KITTI. Furthermore, to validate the universality of the SC mechanism, we implement it in three different 3-D object detectors. The experiments show that the performance is impressively improved, and extensive ablation studies also demonstrate the effectiveness of the proposed model.
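As a rough illustration of how such a consistency term could be implemented, the sketch below penalizes disagreement between a predicted box's class distribution and the mean segmentation distribution of the points it contains; the tensor shapes, names, and the KL form of the penalty are our assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(box_logits, point_logits, inside_mask):
    """box_logits: (B, C) logits for B predicted boxes;
    point_logits: (N, C) segmentation logits for N points;
    inside_mask: (B, N) bool, True where point n lies inside box b."""
    box_prob = box_logits.softmax(dim=-1)
    point_prob = point_logits.softmax(dim=-1)
    loss = box_logits.new_zeros(())
    for b in range(box_prob.size(0)):
        pts = point_prob[inside_mask[b]]          # points inside box b
        if pts.numel() == 0:
            continue
        # the box's class distribution should match its points' mean distribution
        loss = loss + F.kl_div(pts.mean(dim=0).log(), box_prob[b], reduction="sum")
    return loss / max(box_prob.size(0), 1)
```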
2. Bi Q, Zhou B, Ji W, Xia GS. Universal Fine-grained Visual Categorization by Concept Guided Learning. IEEE Transactions on Image Processing 2025; PP:394-409. [PMID: 40030876] [DOI: 10.1109/tip.2024.3523802]
Abstract
Existing fine-grained visual categorization (FGVC) methods assume that the fine-grained semantics rest in the informative parts of an image. This assumption works well on favorable front-view object-centric images, but can face great challenges in many real-world scenarios, such as scene-centric images (e.g., street view) and adverse viewpoints (e.g., object re-identification, remote sensing). In such scenarios, mis- or over-activation of features is likely to confuse the part selection and degrade the fine-grained representation. In this paper, we set out to design a universal FGVC framework for real-world scenarios. More precisely, we propose concept guided learning (CGL), which models the concepts of a fine-grained category as a combination of concepts inherited from its parent coarse-grained category and discriminative concepts of its own. The discriminative concepts are used to guide the fine-grained representation learning. Specifically, three key steps are designed, namely, concept mining, concept fusion, and concept constraint. In addition, to bridge the FGVC dataset gap under scene-centric and adverse-viewpoint scenarios, a Fine-grained Land-cover Categorization Dataset (FGLCD) with 59,994 fine-grained samples is proposed. Extensive experiments show that the proposed CGL: 1) achieves competitive performance on conventional FGVC; 2) achieves state-of-the-art performance on fine-grained aerial scenes and scene-centric street scenes; and 3) generalizes well to object re-identification and fine-grained aerial object detection. The dataset and source code will be available at https://github.com/BiQiWHU/CGL.
3. Guo Y, Du R, Sain A, Liang K, Dong Y, Song YZ, Ma Z. Understanding Episode Hardness in Few-Shot Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2025; 47:616-633. [PMID: 39378258] [DOI: 10.1109/tpami.2024.3476075]
Abstract
Generalization in deep learning models is usually bottlenecked by the scarcity of annotated samples. As a common way of tackling this issue, few-shot learning focuses on "episodes", i.e., sampled tasks that help the model acquire knowledge that generalizes to unseen categories: the better the episodes, the higher a model's generalizability. Despite extensive research, the characteristics of episodes and their potential effects remain relatively under-explored. A recent paper observed that different episodes exhibit different prediction difficulty and coined a metric, "hardness", to quantify episodes; its range, however, varies too widely across datasets, which keeps it impractical for realistic applications. In this paper, we therefore conduct, for the first time, an algebraic analysis of the critical factors influencing episode hardness, supported by experimental demonstrations. The analysis reveals that episode hardness largely depends on the classes within an episode, and accordingly we propose an efficient pre-sampling hardness assessment technique named the Inverse-Fisher Discriminant Ratio (IFDR). It enables sampling hard episodes at the class level via a class-level (CL) sampling scheme that drastically decreases the quantification cost. Delving deeper, we also develop a variant called class-pair-level (CPL) sampling, which further reduces the sampling cost while guaranteeing the sampled distribution. Finally, comprehensive experiments on benchmark datasets verify the efficacy of our proposed method.
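The abstract does not give the exact formula, but an inverse-Fisher-discriminant-ratio style hardness score can be sketched from class-level statistics alone, which is what makes pre-sampling assessment cheap; the scatter definitions below are our illustrative reading, not the published definition.

```python
import numpy as np

def ifdr(class_means, class_vars):
    """class_means: (K, D) feature means of the K candidate episode classes;
    class_vars: (K, D) per-class feature variances."""
    mu = class_means.mean(axis=0)
    between = ((class_means - mu) ** 2).sum(axis=1).mean()  # between-class scatter
    within = class_vars.sum(axis=1).mean()                  # within-class scatter
    return within / (between + 1e-8)  # larger => classes overlap more => harder

# Class-level sampling: score candidate class subsets once, before training,
# and keep the hardest ones instead of estimating hardness per episode.
```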
4. Cai D, Chen J, Zhao J, Xue Y, Yang S, Yuan W, Feng M, Weng H, Liu S, Peng Y, Zhu J, Wang K, Jackson C, Tang H, Huang J, Wang X. HiCervix: An Extensive Hierarchical Dataset and Benchmark for Cervical Cytology Classification. IEEE Transactions on Medical Imaging 2024; 43:4344-4355. [PMID: 38923481] [DOI: 10.1109/tmi.2024.3419697]
Abstract
Cervical cytology is a critical screening strategy for early detection of pre-cancerous and cancerous cervical lesions. The challenge lies in accurately classifying various cervical cytology cell types. Existing automated cervical cytology methods are primarily trained on databases covering a narrow range of coarse-grained cell types, which fail to provide a comprehensive and detailed performance analysis that accurately represents real-world cytopathology conditions. To overcome these limitations, we introduce HiCervix, the most extensive, multi-center cervical cytology dataset currently available to the public. HiCervix includes 40,229 cervical cells from 4,496 whole slide images, categorized into 29 annotated classes. These classes are organized within a three-level hierarchical tree to capture fine-grained subtype information. To exploit the semantic correlation inherent in this hierarchical tree, we propose HierSwin, a hierarchical vision transformer-based classification network. HierSwin serves as a benchmark for detailed feature learning in both coarse-level and fine-level cervical cancer classification tasks. In our comprehensive experiments, HierSwin demonstrated remarkable performance, achieving 92.08% accuracy for coarse-level classification and 82.93% accuracy averaged across all three levels. When compared to board-certified cytopathologists, HierSwin achieved high classification performance (0.8293 versus 0.7359 averaged accuracy), highlighting its potential for clinical applications. This newly released HiCervix dataset, along with our benchmark HierSwin method, is poised to make a substantial impact on the advancement of deep learning algorithms for rapid cervical cancer screening and greatly improve cancer prevention and patient outcomes in real-world clinical settings.
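A minimal sketch of the hierarchical-head idea, one classifier per level of the label tree trained with a summed cross-entropy, is given below; the feature dimension and per-level class counts (29 fine classes under two coarser levels) are assumptions based on the abstract, not HierSwin's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalHead(nn.Module):
    def __init__(self, feat_dim=768, classes_per_level=(5, 14, 29)):
        super().__init__()
        # one linear classifier per level of the three-level label tree
        self.heads = nn.ModuleList(nn.Linear(feat_dim, c) for c in classes_per_level)

    def forward(self, feats):                      # feats: (B, feat_dim)
        return [head(feats) for head in self.heads]

def hierarchical_loss(logits_per_level, labels_per_level):
    # supervise every level so coarse predictions stay consistent with fine ones
    return sum(F.cross_entropy(lg, lb)
               for lg, lb in zip(logits_per_level, labels_per_level))
```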
5. Ullah MA, Zia T, Kim J, Kadry S. An inherently interpretable deep learning model for local explanations using visual concepts. PLoS One 2024; 19:e0311879. [PMID: 39466770] [PMCID: PMC11516011] [DOI: 10.1371/journal.pone.0311879]
Abstract
Over the past decade, deep learning has become the leading approach for various computer vision tasks and decision support systems. However, the opaque nature of deep learning models raises significant concerns about their fairness, reliability, and the underlying inferences they make. Many existing methods attempt to approximate the relationship between low-level input features and outcomes, yet humans tend to understand and reason in terms of high-level concepts rather than low-level input features. To bridge this gap, several concept-based interpretable methods have been developed. Most of these methods compute the importance of each discovered concept for a specific class, but they often fail to provide local explanations. Additionally, these approaches typically rely on labeled concepts or learn directly from datasets, leading to the extraction of irrelevant concepts, and they tend to overlook the potential of these concepts for interpreting model predictions effectively. This research proposes a two-stream model called the Cross-Attentional Fast/Slow Thinking Network (CA-SoftNet) to address these issues. The model is inspired by dual-process theory and integrates two key components: a shallow convolutional neural network (sCNN) as System-I for rapid, implicit pattern recognition, and a cross-attentional concept memory network as System-II for transparent, controllable, and logical reasoning. Our evaluation across diverse datasets demonstrates the model's competitive accuracy, achieving 85.6%, 83.7%, 93.6%, and 90.3% on CUB-200-2011, Stanford Cars, ISIC 2016, and ISIC 2017, respectively. This outperforms existing interpretable models and is comparable to non-interpretable counterparts. Furthermore, our novel concept extraction method facilitates identifying and selecting salient concepts, which are then used to generate concept-based local explanations that align with human thinking. Additionally, the model's ability to share similar concepts across distinct classes, as in fine-grained classification, enhances its scalability for large datasets and induces human-like cognition and reasoning within the proposed framework.
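To make the System-II component concrete, here is a hedged sketch of a cross-attentional concept memory: patch features attend over a learnable bank of concept vectors, and the attention weights indicate which concepts support a prediction. All sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConceptMemoryAttention(nn.Module):
    def __init__(self, dim=256, n_concepts=64):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(n_concepts, dim))  # concept bank
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feats):                      # feats: (B, T, dim) patch features
        mem = self.concepts.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, weights = self.attn(feats, mem, mem)  # features query the concept bank
        # `weights` (B, T, n_concepts) can be read off for local explanations
        return out, weights
```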
Affiliation(s)
- Mirza Ahsan Ullah
- Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
- Department of Software Engineering, University of Gujrat, Gujrat, Pakistan
- Tehseen Zia
- Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
- Jungeun Kim
- Department of Computer Engineering, Inha University, Incheon, Republic of Korea
- Seifedine Kadry
- Department of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon
- Department of Applied Data Science, Noroff University College, Kristiansand, Norway
6. Wu T, Wu W, Yang Y, Fan FL, Zeng T. Retinex Image Enhancement Based on Sequential Decomposition With a Plug-and-Play Framework. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:14559-14572. [PMID: 37279121] [DOI: 10.1109/tnnls.2023.3280037]
Abstract
The Retinex model is one of the most representative and effective methods for low-light image enhancement. However, it does not explicitly tackle noise and often produces unsatisfactory enhancement results. In recent years, deep learning models, owing to their excellent performance, have been widely used in low-light image enhancement. These methods have two limitations. First, deep learning achieves the desired performance only when a large amount of labeled data is available, and it is not easy to curate massive low-/normal-light paired data. Second, deep learning models are notoriously black boxes: it is difficult to explain their inner working mechanisms and understand their behaviors. In this article, using a sequential Retinex decomposition strategy, we design a plug-and-play framework based on the Retinex theory for simultaneous image enhancement and noise removal. Meanwhile, we integrate a convolutional neural network (CNN)-based denoiser into the proposed plug-and-play framework to generate the reflectance component. The final image is enhanced by combining the illumination and reflectance with gamma correction. The proposed plug-and-play framework facilitates both post hoc and ad hoc interpretability. Extensive experiments on different datasets demonstrate that our framework outperforms state-of-the-art methods in both image enhancement and denoising.
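A simplified sketch of such a plug-and-play loop follows: alternate an illumination update with a learned denoiser acting as the reflectance prior, then recombine with gamma correction. The initialization, update rules, and iteration count are our assumptions; only the overall decomposition-plus-gamma structure is taken from the abstract.

```python
import numpy as np

def enhance(img, denoiser, iters=3, gamma=2.2, eps=1e-6):
    """img: (H, W) low-light luminance in [0, 1]; denoiser: callable R -> R."""
    L = np.maximum(img, eps)                # crude initial illumination estimate
    for _ in range(iters):
        R = np.clip(denoiser(img / L), 0.0, 1.0)         # reflectance via CNN denoiser prior
        L = np.clip(img / np.maximum(R, eps), eps, 1.0)  # illumination update
    return np.clip(R * L ** (1.0 / gamma), 0.0, 1.0)     # gamma-corrected recombination
```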
7. Zhao LJ, Chen ZD, Ma ZX, Luo X, Xu XS. Angular Isotonic Loss Guided Multi-Layer Integration for Few-Shot Fine-Grained Image Classification. IEEE Transactions on Image Processing 2024; 33:3778-3792. [PMID: 38870000] [DOI: 10.1109/tip.2024.3411474]
Abstract
Recent research on few-shot fine-grained image classification (FSFG) has predominantly focused on extracting discriminative features. The limited attention paid to the role of loss functions has resulted in weaker preservation of similarity relationships between query and support instances, thereby potentially limiting the performance of FSFG. In this regard, we analyze the limitations of widely adopted cross-entropy loss and introduce a novel Angular ISotonic (AIS) loss. The AIS loss introduces an angular margin to constrain the prototypes to maintain a certain distance from a pre-set threshold. It guides the model to converge more stably, learn clearer boundaries among highly similar classes, and achieve higher accuracy faster with limited instances. Moreover, to better accommodate the feature requirements of the AIS loss and fully exploit its potential in FSFG, we propose a Multi-Layer Integration (MLI) network that captures object features from multiple perspectives to provide more comprehensive and informative representations of the input images. Extensive experiments demonstrate the effectiveness of our proposed method on four standard fine-grained benchmarks. Codes are available at: https://github.com/Legenddddd/AIS-MLI.
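One plausible reading of the AIS idea, cosine similarities to class prototypes with a margin on the true class plus a hinge keeping true-class similarity above a pre-set threshold, is sketched below; the scale, margin, and threshold values, and indeed this exact decomposition, are our assumptions rather than the published loss.

```python
import torch
import torch.nn.functional as F

def angular_margin_loss(feats, prototypes, labels, s=16.0, m=0.2, t=0.5):
    """feats: (B, D) query embeddings; prototypes: (C, D) class prototypes."""
    cos = F.normalize(feats) @ F.normalize(prototypes).t()  # (B, C) cosines
    target = F.one_hot(labels, cos.size(1)).bool()
    logits = torch.where(target, cos - m, cos)              # angular margin on true class
    ce = F.cross_entropy(s * logits, labels)
    hinge = F.relu(t - cos[target]).mean()   # keep true-class similarity above t
    return ce + hinge
```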
8. Wan L, Zhu W, Dai Y, Zhou G, Chen G, Jiang Y, Zhu M, He M. Identification of Pepper Leaf Diseases Based on TPSAO-AMWNet. Plants (Basel) 2024; 13:1581. [PMID: 38891389] [PMCID: PMC11174783] [DOI: 10.3390/plants13111581]
Abstract
Pepper is a high-economic-value agricultural crop that faces diverse disease challenges such as blight and anthracnose. These diseases not only reduce pepper yield but, in severe cases, can also cause significant economic losses and threaten food security, so timely and accurate identification is crucial. Image recognition technology plays a key role here: it automates and speeds up disease identification, helps agricultural workers adopt effective control strategies, and thereby improves production efficiency and supports sustainable agricultural development. To address edge blurring and the extraction of minute features in pepper disease images, as well as the difficulty of determining the optimal learning rate when training traditional identification networks, a new recognition model based on the TPSAO-AMWNet is proposed. First, an Adaptive Residual Pyramid Convolution (ARPC) structure combined with a Squeeze-and-Excitation (SE) module is proposed to solve the edge-blurring problem by exploiting adaptivity and channel attention. Second, to address micro-feature extraction, Minor Triplet Disease Focus Attention (MTDFA) is proposed to enhance the capture of local details of pepper leaf disease features while maintaining attention to global features and reducing interference from irrelevant regions. Third, a mixed loss function combining weighted focal loss and L2 regularization (WfrLoss) is introduced to refine the learning strategy, enhancing the model's performance and generalization capabilities while preventing overfitting. Finally, to tackle the challenge of determining the optimal learning rate, the tent particle snow ablation optimizer (TPSAO) is developed to identify the most effective learning rate accurately. The TPSAO-AMWNet model, trained on our custom datasets, is evaluated against other existing methods. It attains an average accuracy of 93.52% and an F1 score of 93.15%, demonstrating robust effectiveness and practicality in classifying pepper diseases. These results also offer valuable insights for disease detection in other crops.
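As a rough sketch of the WfrLoss combination described above, weighted focal loss plus an L2 penalty on the weights, consider the following; the class weights, focusing parameter, and regularization coefficient are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def wfr_loss(logits, labels, model, class_weights, gamma=2.0, l2=1e-4):
    ce = F.cross_entropy(logits, labels, reduction="none")  # per-sample CE
    pt = torch.exp(-ce)                    # model's probability for the true class
    w = class_weights[labels]              # per-class weighting
    focal = (w * (1.0 - pt) ** gamma * ce).mean()          # down-weights easy examples
    reg = sum((p ** 2).sum() for p in model.parameters())  # L2 regularization
    return focal + l2 * reg
```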
Affiliation(s)
- Li Wan
- College of Electronic Information & Physics, Central South University of Forestry and Technology, Changsha 410004, China
- Wenke Zhu
- College of Bangor, Central South University of Forestry and Technology, Changsha 410004, China
- Yixi Dai
- College of Bangor, Central South University of Forestry and Technology, Changsha 410004, China
- Guoxiong Zhou
- College of Electronic Information & Physics, Central South University of Forestry and Technology, Changsha 410004, China
- Guiyun Chen
- College of Computer & Mathematics, Central South University of Forestry and Technology, Changsha 410004, China
- Yichu Jiang
- Hunan Polytechnic of Environment and Biology, Hengyang 421005, China
- Ming’e Zhu
- College of Computer & Mathematics, Central South University of Forestry and Technology, Changsha 410004, China
- Mingfang He
- College of Electronic Information & Physics, Central South University of Forestry and Technology, Changsha 410004, China
9. Wang S, Chang J, Wang Z, Li H, Ouyang W, Tian Q. Content-Aware Rectified Activation for Zero-Shot Fine-Grained Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024; 46:4366-4380. [PMID: 38236683] [DOI: 10.1109/tpami.2024.3355461]
Abstract
Fine-grained image retrieval mainly focuses on learning salient features from seen subcategories as discriminative embeddings, while neglecting the problems behind zero-shot settings. We argue that retrieving fine-grained objects from unseen subcategories may rely on more diverse clues, which are easily suppressed by the salient features learnt from seen subcategories. To address this issue, we propose a novel content-aware rectified activation model, which suppresses the activation on salient regions while preserving their discrimination, and spreads activation to adjacent non-salient regions, thus mining more diverse discriminative features for retrieving unseen subcategories. Specifically, we construct a content-aware rectified prototype (CARP) by perceiving the semantics of salient regions. CARP acts as a channel-wise, non-destructive upper bound on activation and can be selectively applied to suppress salient regions and obtain rectified features. Moreover, two regularizations are proposed: 1) a semantic coherency constraint that enforces semantic coherency between CARP and the salient regions, aiming to propagate the discriminative ability of salient regions to CARP; and 2) a feature-navigated constraint that further guides the model to adaptively balance the discrimination power of rectified features and the suppression power of salient features. Experimental results on fine-grained and product retrieval benchmarks demonstrate that our method consistently outperforms the state-of-the-art methods.
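The channel-wise upper-bound mechanism can be sketched as follows: build a prototype from salient positions and clamp the feature map against it, suppressing peaks without zeroing them. The masked-average construction of CARP here is our illustrative assumption; the paper derives it by perceiving the semantics of salient regions.

```python
import torch

def rectify(feats, salient_mask):
    """feats: (B, C, H, W) feature maps; salient_mask: (B, 1, H, W) in {0, 1}."""
    area = salient_mask.sum(dim=(2, 3)).clamp(min=1.0)    # (B, 1)
    carp = (feats * salient_mask).sum(dim=(2, 3)) / area  # (B, C) prototype
    upper = carp.unsqueeze(-1).unsqueeze(-1)              # channel-wise bound
    # clamp instead of zeroing: salient channels are suppressed, not destroyed
    return torch.minimum(feats, upper)
```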
10. Du R, Chang D, Ma Z, Liang K, Song YZ, Guo J. Semi-Supervised Learning for FGVC With Out-of-Category Data. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024; 46:2658-2671. [PMID: 37801380] [DOI: 10.1109/tpami.2023.3322463]
Abstract
Despite great strides made on fine-grained visual classification (FGVC), current methods are still heavily reliant on fully-supervised paradigms where ample expert labels are called for. Semi-supervised learning (SSL) techniques, acquiring knowledge from unlabeled data, offer a considerable way forward and have shown great promise for coarse-grained problems. However, existing SSL paradigms mostly assume in-category (i.e., category-aligned) unlabeled data, which hinders their effectiveness when re-purposed for FGVC. In this paper, we put forward a novel design specifically aimed at making out-of-category data work for semi-supervised FGVC. We work off an important assumption that all fine-grained categories naturally follow a hierarchical structure (e.g., the phylogenetic tree of "Aves" that covers all bird species). It follows that, instead of operating on individual samples, we can instead predict sample relations within this tree structure as the optimization goal of SSL. Beyond this, we further introduce two strategies uniquely brought by these tree structures to achieve inter-sample consistency regularization and reliable pseudo-relations. Our experimental results reveal that (i) the proposed method yields good robustness against out-of-category data, and (ii) it can be combined with prior methods, boosting their performance and thus yielding state-of-the-art results.
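A hedged sketch of how a label tree can supply pairwise relation targets: the relation between two samples is scored by the depth of their lowest common ancestor, so the network predicts "how related" rather than "which class". The root-to-leaf path encoding is an illustrative assumption.

```python
def lca_depth(path_a, path_b):
    """path_a, path_b: root-to-leaf label paths,
    e.g. ("Aves", "Anatidae", "Mallard")."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth  # deeper shared ancestry => stronger relation target
```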
11. Pu Y, Han Y, Wang Y, Feng J, Deng C, Huang G. Fine-Grained Recognition With Learnable Semantic Data Augmentation. IEEE Transactions on Image Processing 2024; 33:3130-3144. [PMID: 38662557] [DOI: 10.1109/tip.2024.3364500]
Abstract
Fine-grained image recognition is a longstanding computer vision challenge that focuses on differentiating objects belonging to multiple subordinate categories within the same meta-category. Since images belonging to the same meta-category usually share similar visual appearances, mining discriminative visual cues is the key to distinguishing fine-grained categories. Although commonly used image-level data augmentation techniques have achieved great success in generic image classification, they are rarely applied in fine-grained scenarios because their random editing-region behavior is prone to destroy the discriminative visual cues residing in subtle regions. In this paper, we propose diversifying the training data at the feature level to alleviate the discriminative region loss problem. Specifically, we produce diversified augmented samples by translating image features along semantically meaningful directions. The semantic directions are estimated with a covariance prediction network, which predicts a sample-wise covariance matrix to adapt to the large intra-class variation inherent in fine-grained images. Furthermore, the covariance prediction network is jointly optimized with the classification network in a meta-learning manner to alleviate the degenerate-solution problem. Experiments on four competitive fine-grained recognition benchmarks (CUB-200-2011, Stanford Cars, FGVC-Aircraft, NABirds) demonstrate that our method significantly improves the generalization performance of several popular classification networks (e.g., ResNets, DenseNets, EfficientNets, RegNets, and ViT). Combined with a recently proposed method, our semantic data augmentation approach achieves state-of-the-art performance on the CUB-200-2011 dataset. Source code is available at https://github.com/LeapLabTHU/LearnableISDA.
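The feature-translation step can be sketched as below: a small network predicts a per-sample (here, diagonal) covariance, and features are perturbed along directions drawn from it. The diagonal simplification and network shape are our assumptions; the paper predicts a sample-wise covariance matrix and trains this predictor jointly via meta-learning.

```python
import torch
import torch.nn as nn

class SemanticAugment(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.cov_net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, dim), nn.Softplus())

    def forward(self, feats):          # feats: (B, D) image features
        var = self.cov_net(feats)      # per-sample diagonal covariance
        noise = torch.randn_like(feats) * var.sqrt()
        return feats + noise           # feature translated along semantic directions
```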
12. Cui S, Hui B. Dual-Dependency Attention Transformer for Fine-Grained Visual Classification. Sensors (Basel) 2024; 24:2337. [PMID: 38610547] [PMCID: PMC11014298] [DOI: 10.3390/s24072337]
Abstract
Vision transformers (ViTs) are widely used in various visual tasks, such as fine-grained visual classification (FGVC). However, the self-attention mechanism, the core module of vision transformers, incurs quadratic computational and memory complexity. The sparse-attention and local-attention approaches currently adopted by most researchers are not suitable for FGVC tasks, which require dense feature extraction and global dependency modeling. To address this challenge, we propose a dual-dependency attention transformer model that decouples global token interactions into two paths: a position-dependency attention pathway based on the intersection of two types of grouped attention, and a semantic-dependency attention pathway based on dynamic central aggregation. This approach enhances high-quality semantic modeling of discriminative cues while reducing the computational cost to linear complexity. In addition, we develop discriminative enhancement strategies that increase the sensitivity of high-confidence discriminative cue tracking using a knowledge-based representation approach. Experiments on three datasets, NABirds, CUB, and Dogs, show that the method is well suited to fine-grained image classification and strikes a balance between computational cost and performance.
Affiliation(s)
- Shiyan Cui
- Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences, Shenyang 110016, China
- Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
- Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110169, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Bin Hui
- Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences, Shenyang 110016, China
- Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
- Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110169, China
13. Chang D, Pang K, Du R, Tong Y, Song YZ, Ma Z, Guo J. Making a Bird AI Expert Work for You and Me. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:12068-12084. [PMID: 37159309] [DOI: 10.1109/tpami.2023.3274593]
Abstract
As powerful as fine-grained visual classification (FGVC) is, responding to your query with a bird name like "Whip-poor-will" or "Mallard" probably does not make much sense. This, however commonly accepted in the literature, underlines a fundamental question at the interface of AI and humans: what constitutes transferable knowledge for humans to learn from AI? This paper sets out to answer this very question using FGVC as a test bed. Specifically, we envisage a scenario where a trained FGVC model (the AI expert) functions as a knowledge provider that enables average people (you and me) to become better domain experts ourselves. Assuming an AI expert trained using expert human labels, we anchor our focus on asking and providing solutions for two questions: (i) what is the best transferable knowledge we can extract from the AI, and (ii) what is the most practical means to measure the gains in expertise given that knowledge? We propose to represent knowledge as highly discriminative visual regions that are expert-exclusive, and instantiate this via a novel multi-stage learning framework. A human study of 15,000 trials shows that our method consistently helps people of divergent bird expertise recognise once-unrecognisable birds. We further propose a crude but benchmarkable metric, TEMI, which allows future efforts in this direction to be compared to ours without the need for large-scale human studies.
14. Yu Y, Wang J. Hybrid Granularities Transformer for Fine-Grained Image Recognition. Entropy (Basel) 2023; 25:601. [PMID: 37190389] [PMCID: PMC10137422] [DOI: 10.3390/e25040601]
Abstract
Many current approaches to image classification concentrate solely on the most prominent features within an image, but in fine-grained image recognition even subtle features can play a significant role in classification. In addition, the large variations within a class and the small differences between categories that are unique to fine-grained image recognition pose a great challenge for extracting discriminative features. We therefore present two lightweight modules that help the network discover more detailed information. (1) The Patches Hidden Integrator (PHI) module randomly selects patches from images and replaces them with patches from other images of the same class. It allows the network to glean diverse discriminative region information and prevents over-reliance on a single feature, which can lead to misclassification, without increasing training time. (2) The Consistency Feature Learning (CFL) module aggregates patch tokens from the last layer, mining local feature information and fusing it with the class token for classification. CFL also uses an inconsistency loss to force the network to learn features common to both tokens, thereby guiding the network to focus on salient regions. We conducted experiments on three datasets, CUB-200-2011, Stanford Dogs, and Oxford 102 Flowers, achieving accuracies of 91.6%, 92.7%, and 99.5%, respectively, which is competitive with other works.
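PHI is concrete enough to sketch directly: random patches of each image are overwritten with the corresponding patches of a same-class donor image, so no foreign-class pixels are introduced. The patch size and swap rate below are assumptions.

```python
import torch

def phi_mix(images, labels, patch=16, rate=0.25):
    """images: (B, C, H, W) with H and W divisible by `patch`."""
    out = images.clone()
    gh, gw = images.size(2) // patch, images.size(3) // patch
    for i in range(images.size(0)):
        same = (labels == labels[i]).nonzero().flatten()
        j = same[torch.randint(len(same), (1,))].item()  # same-class donor
        swap = torch.rand(gh, gw) < rate                 # which grid cells to replace
        for r, c in swap.nonzero().tolist():
            ys, xs = r * patch, c * patch
            out[i, :, ys:ys + patch, xs:xs + patch] = \
                images[j, :, ys:ys + patch, xs:xs + patch]
    return out
```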
Affiliation(s)
- Ying Yu
- School of Software, East China Jiaotong University, Nanchang 330013, China
- Jinghui Wang
- School of Software, East China Jiaotong University, Nanchang 330013, China
15. Fine-grained image recognition via trusted multi-granularity information fusion. International Journal of Machine Learning and Cybernetics 2022. [DOI: 10.1007/s13042-022-01685-6]
16. Chen J, Li H, Liang J, Su X, Zhai Z, Chai X. Attention-based cropping and erasing learning with coarse-to-fine refinement for fine-grained visual classification. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.06.041]
17. Exploring more concentrated and consistent activation regions for cross-domain semantic segmentation. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.05.059]
18. Guo Y, Du R, Li X, Xie J, Ma Z, Dong Y. Learning Calibrated Class Centers for Few-Shot Classification by Pair-Wise Similarity. IEEE Transactions on Image Processing 2022; 31:4543-4555. [PMID: 35767479] [DOI: 10.1109/tip.2022.3184813]
Abstract
Metric-based methods achieve promising performance on few-shot classification by learning clusters over support samples and generating shared decision boundaries for query samples. However, existing methods ignore the inaccurate class-center approximation introduced by the limited number of support samples, which leads to biased inference. Therefore, in this paper, we propose to reduce the approximation error by class-center calibration. Specifically, we introduce the so-called Pair-wise Similarity Module (PSM), which generates calibrated class centers adapted to each query sample by capturing the semantic correlations between the support and query samples and enhancing the discriminative regions of the support representation. It is worth noting that the proposed PSM is a simple plug-and-play module that can be inserted into most metric-based few-shot learning models. Through extensive experiments with metric-based models, we demonstrate that the module significantly improves the performance of conventional few-shot classification methods on four few-shot image classification benchmark datasets. Codes are available at: https://github.com/PRIS-CV/Pair-wise-Similarity-module.
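A minimal sketch of query-adaptive center calibration in the spirit of PSM: each query re-weights a class's support shots by similarity, replacing the plain support mean with a query-conditioned center. The temperature and the einsum formulation are our assumptions.

```python
import torch
import torch.nn.functional as F

def calibrated_centers(support, query, tau=0.1):
    """support: (C, K, D), K shots per class; query: (Q, D)."""
    s = F.normalize(support, dim=-1)
    q = F.normalize(query, dim=-1)
    sim = torch.einsum("qd,ckd->qck", q, s) / tau    # query-to-shot similarities
    w = sim.softmax(dim=-1)                          # (Q, C, K) attention over shots
    return torch.einsum("qck,ckd->qcd", w, support)  # per-query calibrated centers
```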
19. Dai D, Tang X, Liu Y, Xia S, Wang G. Multi-granularity association learning for on-the-fly fine-grained sketch-based image retrieval. Knowledge-Based Systems 2022. [DOI: 10.1016/j.knosys.2022.109447]
20. Liu F, Deng X, Zou C, Lai YK, Chen K, Zuo R, Ma C, Liu YJ, Wang H. SceneSketcher-v2: Fine-Grained Scene-Level Sketch-Based Image Retrieval Using Adaptive GCNs. IEEE Transactions on Image Processing 2022; 31:3737-3751. [PMID: 35594232] [DOI: 10.1109/tip.2022.3175403]
Abstract
Sketch-based image retrieval (SBIR) is a long-standing research topic in computer vision. Existing methods mainly focus on category-level or instance-level image retrieval. This paper investigates the fine-grained scene-level SBIR problem, where a free-hand sketch depicting a scene is used to retrieve desired images. This problem is useful yet challenging mainly because of two entangled facts: 1) achieving an effective representation of the input query and scene-level images is difficult, as it requires modeling information across multiple modalities such as object layout, relative size, and visual appearance; and 2) there is a great domain gap between the query sketch and the target images. We present SceneSketcher-v2, a graph convolutional network (GCN) based architecture that addresses these challenges. SceneSketcher-v2 employs a carefully designed graph convolutional network to fuse the multi-modality information in the query sketch and target images, and uses a triplet training process in an end-to-end training manner to alleviate the domain gap. Extensive experiments demonstrate that SceneSketcher-v2 outperforms state-of-the-art scene-level SBIR models by a significant margin.
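The triplet objective mentioned above is standard and easy to sketch: pull a sketch embedding toward its matching image embedding and push it away from a non-match by a margin (the margin value is an assumption).

```python
import torch.nn.functional as F

def triplet_loss(sketch, pos_img, neg_img, margin=0.3):
    """sketch, pos_img, neg_img: (B, D) embeddings from the network branches."""
    d_pos = F.pairwise_distance(sketch, pos_img)  # distance to matching image
    d_neg = F.pairwise_distance(sketch, neg_img)  # distance to non-matching image
    return F.relu(d_pos - d_neg + margin).mean()
```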