1. Nareklishvili M, Geitle M. Deep Ensemble Transformers for Dimensionality Reduction. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:2091-2102. [PMID: 38294917] [DOI: 10.1109/tnnls.2024.3357621]
Abstract
We propose deep ensemble transformers (DETs), a fast, scalable approach to dimensionality reduction problems. The method leverages the power of deep neural networks and employs cascade ensemble techniques as its fundamental feature-extraction tool. To handle high-dimensional data, our approach applies a flexible number of intermediate layers sequentially; these layers progressively transform the input data into decision tree predictions. To further enhance prediction performance, the output of the final intermediate layer is fed through a feed-forward neural network for the final prediction. We derive an upper bound on the disparity between the generalization error and the empirical error and show that it converges to zero, which highlights the generalizability of our method to parameter estimation and feature selection problems. In our experimental evaluations, DETs outperform existing models in prediction accuracy, representation learning ability, and computational time. Specifically, the method achieves over 95% accuracy on gene expression data and trains on average 50% faster than traditional artificial neural networks (ANNs).
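To make the cascade idea concrete, here is a minimal sketch in the spirit of DETs, assuming each intermediate layer is a tree ensemble whose class-probability outputs augment the representation passed forward, with a small feed-forward head on top. The `CascadeTreeNet` class, the layer and tree counts, and the use of scikit-learn forests are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical cascade: tree-ensemble layers feed an MLP head (not the authors' code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

class CascadeTreeNet:
    def __init__(self, n_layers=3, n_trees=50):
        self.layers = [RandomForestClassifier(n_estimators=n_trees)
                       for _ in range(n_layers)]
        self.head = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)

    def fit(self, X, y):
        rep = X
        for layer in self.layers:
            layer.fit(rep, y)
            # augment the raw features with this layer's tree predictions
            # (a production version would use out-of-fold predictions instead)
            rep = np.hstack([X, layer.predict_proba(rep)])
        self.head.fit(rep, y)
        return self

    def predict(self, X):
        rep = X
        for layer in self.layers:
            rep = np.hstack([X, layer.predict_proba(rep)])
        return self.head.predict(rep)
```

Carrying the raw features forward alongside each layer's predictions mirrors the sequential intermediate-layer design the abstract describes.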
2. Fan J, Huang L, Gong C, You Y, Gan M, Wang Z. KMT-PLL: K-Means Cross-Attention Transformer for Partial Label Learning. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:2789-2800. [PMID: 38194387] [DOI: 10.1109/tnnls.2023.3347792]
Abstract
Partial label learning (PLL) studies the problem of learning an instance classifier from a set of candidate labels of which only one is correct. While recent works have shown that the Vision Transformer (ViT) achieves good results when trained on clean data, its application to PLL remains limited and challenging. To address this issue, we rethink the relationship between instances and object queries and propose the K-means cross-attention transformer for PLL (KMT-PLL), which continuously learns cluster centers that can be used for downstream disambiguation. More specifically, K-means cross-attention works as a clustering process that learns cluster centers to represent label classes. The purpose of this operation is to make the similarity between instances and labels measurable, which is effective for detecting noisy labels. Furthermore, we propose a new corrected cross-entropy formulation that assigns weights to candidate labels according to instance-to-label relevance to guide the training of the instance classifier. As training proceeds, the ground-truth label is progressively identified, and the refined labels and cluster centers in turn help improve the classifier. Simulation results demonstrate the advantage of KMT-PLL and its suitability for PLL.
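The following hedged PyTorch sketch pairs the two mechanisms the abstract names: learnable per-class cluster centers used as cross-attention queries over instance tokens, and a cross-entropy whose per-label weights come from instance-to-center similarity restricted to the candidate set. The module names, the softmax weighting, and all dimensions are assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KMeansCrossAttention(nn.Module):
    def __init__(self, n_classes, dim):  # dim must be divisible by num_heads
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, dim))  # one center per class
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, dim) instance features
        q = self.centers.unsqueeze(0).expand(tokens.size(0), -1, -1)
        updated, _ = self.attn(q, tokens, tokens)    # centers aggregate instance evidence
        return updated                               # (B, n_classes, dim)

def weighted_pll_loss(logits, feats, centers, candidate_mask):
    # candidate_mask: (B, C), 1 for candidate labels; weights favor the candidates
    # whose cluster centers are closest to the instance feature.
    sim = feats @ centers.t()                        # (B, C) instance-to-label relevance
    w = torch.softmax(sim.masked_fill(candidate_mask == 0, float('-inf')), dim=1)
    return -(w * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```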
3. Liu L, Liu M, Li G, Wu Z, Lin J, Lin L. Road Network-Guided Fine-Grained Urban Traffic Flow Inference. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:1119-1132. [PMID: 37922186] [DOI: 10.1109/tnnls.2023.3327386]
Abstract
Accurate inference of fine-grained traffic flow from coarse-grained measurements is an emerging yet crucial problem, which can greatly reduce the number of traffic monitoring sensors required and thereby cut costs. In this work, we note that traffic flow is highly correlated with the road network, which previous works either ignored completely or treated simply as an external factor. To address this problem, we propose a novel road-aware traffic flow magnifier (RATFM) that explicitly exploits prior knowledge of road networks to fully learn the road-aware spatial distribution of fine-grained traffic flow. Specifically, a multidirectional 1-D convolutional layer is first introduced to extract semantic features of the road network. Subsequently, we incorporate the road network features and coarse-grained flow features to regularize the short-range spatial distribution modeling of road-relative traffic flow. Furthermore, we take the road network feature as a query to capture the long-range spatial distribution of traffic flow with a transformer architecture. Benefiting from this road-aware inference mechanism, our method can generate high-quality fine-grained traffic flow maps. Extensive experiments on three real-world datasets show that the proposed RATFM outperforms state-of-the-art models under various scenarios. Our code and datasets are released at https://github.com/luimoli/RATFM.
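A loose sketch of two ingredients the abstract names: direction-aware 1-D convolutions over a rasterized road-network map, and a cross-attention step in which the road feature serves as the query over coarse flow features. Channel counts, the two directions (row-wise and column-wise), and the module names are assumptions.

```python
import torch
import torch.nn as nn

class RoadFeatureExtractor(nn.Module):
    def __init__(self, c=32):
        super().__init__()
        # 1xk and kx1 kernels approximate multidirectional 1-D convolutions
        self.h = nn.Conv2d(1, c, kernel_size=(1, 9), padding=(0, 4))
        self.v = nn.Conv2d(1, c, kernel_size=(9, 1), padding=(4, 0))

    def forward(self, road):                         # road: (B, 1, H, W) binary road map
        return torch.relu(self.h(road) + self.v(road))

class RoadQueryAttention(nn.Module):
    def __init__(self, c=32, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)

    def forward(self, road_feat, flow_feat):         # both: (B, C, H, W)
        B, C, H, W = road_feat.shape
        q = road_feat.flatten(2).transpose(1, 2)     # road feature as query tokens
        kv = flow_feat.flatten(2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)                # long-range flow distribution
        return out.transpose(1, 2).reshape(B, C, H, W)
```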
4. Chen G, Wang M, Zhang Q, Yuan L, Yue Y. Full Transformer Framework for Robust Point Cloud Registration With Deep Information Interaction. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:13368-13382. [PMID: 37163402] [DOI: 10.1109/tnnls.2023.3267333]
Abstract
Point cloud registration is an essential technology in computer vision and robotics. Recently, transformer-based methods have achieved advanced performance in point cloud registration by exploiting the transformer's order-invariance and its ability to model dependencies when aggregating information. However, they still suffer from indistinct feature extraction and sensitivity to noise and outliers, owing to three major limitations: 1) the adopted CNNs fail to model global relations because of their local receptive fields, making the extracted features susceptible to noise; 2) the shallow-wide architecture of the transformers and the lack of positional information lead to indistinct feature extraction due to inefficient information interaction; and 3) insufficient consideration of geometric compatibility leads to ambiguous identification of incorrect correspondences. To address these limitations, we propose a novel full transformer network for point cloud registration, named the deep interaction transformer (DIT), which incorporates: 1) a point cloud structure extractor (PSE) that retrieves structural information and models global relations with a local feature integrator (LFI) and transformer encoders; 2) a deep-narrow point feature transformer (PFT) that facilitates deep information interaction across a pair of point clouds with positional information, such that the transformers establish comprehensive associations and directly learn the relative positions between points; and 3) a geometric matching-based correspondence confidence evaluation (GMCCE) method that measures spatial consistency and estimates correspondence confidence with a designed triangulated descriptor. Extensive experiments on the ModelNet40, ScanObjectNN, and 3DMatch datasets demonstrate that our method precisely aligns point clouds and achieves superior performance compared with state-of-the-art methods. The code is publicly available at https://github.com/CGuangyan-BIT/DIT.
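A rough sketch of geometric-compatibility scoring in the spirit of GMCCE: for sampled triplets of putative correspondences, compare triangle side lengths in the source and target clouds, since consistent correspondences preserve lengths under a rigid transform. The sampling scheme, tolerance, and scoring are assumptions, not the paper's exact triangulated descriptor.

```python
import torch

def triangle_compatibility(src, tgt, n_triplets=256, tol=0.05):
    # src, tgt: (M, 3) matched points (src[i] putatively corresponds to tgt[i])
    M = src.size(0)
    idx = torch.randint(0, M, (n_triplets, 3))
    a, b, c = idx[:, 0], idx[:, 1], idx[:, 2]

    def side_lengths(p):
        return torch.stack([(p[a] - p[b]).norm(dim=1),
                            (p[b] - p[c]).norm(dim=1),
                            (p[c] - p[a]).norm(dim=1)], dim=1)

    diff = (side_lengths(src) - side_lengths(tgt)).abs()       # (n_triplets, 3)
    ok = (diff < tol).all(dim=1).float()                       # rigid-consistent triangles
    # credit each correspondence for every consistent triangle it participates in
    score = torch.zeros(M).index_add_(0, idx.flatten(), ok.repeat_interleave(3))
    return score / score.max().clamp(min=1e-8)                 # per-correspondence confidence
```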
5. Zang S, Tu S, Xu L. Self-Organizing a Latent Hierarchy of Sketch Patterns for Controllable Sketch Synthesis. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:14506-14518. [PMID: 37279131] [DOI: 10.1109/tnnls.2023.3279410]
Abstract
Encoding sketches as Gaussian mixture model (GMM)-distributed latent codes is an effective way to control sketch synthesis. Each Gaussian component represents a specific sketch pattern, and a code randomly sampled from that Gaussian can be decoded into a sketch with the target pattern. However, existing methods treat the Gaussians as individual clusters, neglecting the relationships between them. For example, giraffe and horse sketches facing left are related to each other through their face orientation. The relationships between sketch patterns are important cues for revealing the cognitive knowledge embedded in sketch data. It is therefore promising to learn accurate sketch representations by modeling the pattern relationships as a latent structure. In this article, we construct a tree-structured taxonomic hierarchy over the clusters of sketch codes. Clusters with more specific descriptions of sketch patterns are placed at the lower levels, while those with more general patterns are ranked at the higher levels. Clusters at the same rank relate to each other through the inheritance of features from common ancestors. We propose a hierarchical expectation-maximization (EM)-like algorithm to learn the hierarchy explicitly, jointly with the training of the encoder-decoder network. Moreover, the learned latent hierarchy is utilized to regularize sketch codes with structural constraints. Experimental results show that our method significantly improves controllable synthesis performance and obtains effective sketch analogy results.
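For orientation, here are the bare-bones EM updates for a flat GMM over latent sketch codes, the starting point the abstract builds on; the paper's hierarchical variant additionally organizes the learned components into a tree, which is only indicated by a comment here.

```python
import torch

def em_step(z, mu, var, pi):
    # z: (N, D) latent codes; mu, var: (K, D) diagonal Gaussians; pi: (K,) mixing weights
    # E-step: responsibilities r[n, k] ∝ pi_k * N(z_n | mu_k, var_k)
    # (constant terms are omitted; they cancel in the softmax)
    log_p = (-0.5 * ((z.unsqueeze(1) - mu) ** 2 / var + var.log()).sum(-1)
             + pi.log())
    r = torch.softmax(log_p, dim=1)                  # (N, K)
    # M-step: re-estimate parameters from the soft assignments
    nk = r.sum(0).clamp(min=1e-8)                    # (K,)
    mu = (r.t() @ z) / nk.unsqueeze(1)
    var = (r.t() @ z.pow(2)) / nk.unsqueeze(1) - mu.pow(2)
    pi = nk / z.size(0)
    # a hierarchical variant would now group similar components under parent
    # nodes, placing general patterns above specific ones in the tree
    return mu, var.clamp(min=1e-6), pi
```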
6. Chen S, Hong Z, Xie G, Peng Q, You X, Ding W, Shao L. GNDAN: Graph Navigated Dual Attention Network for Zero-Shot Learning. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:4516-4529. [PMID: 35507624] [DOI: 10.1109/tnnls.2022.3155602]
Abstract
Zero-shot learning (ZSL) tackles the unseen-class recognition problem by transferring semantic knowledge from seen classes to unseen ones. Typically, to guarantee desirable knowledge transfer, a direct embedding is adopted to associate the visual and semantic domains in ZSL. However, most existing ZSL methods focus on learning the embedding from implicit global features or image regions to the semantic space. They therefore fail to: 1) exploit the appearance relationship priors between various local regions of a single image, which correspond to the semantic information, and 2) jointly learn cooperative global and local features for discriminative feature representations. In this article, we propose the novel graph navigated dual attention network (GNDAN) for ZSL to address these drawbacks. GNDAN employs a region-guided attention network (RAN) and a region-guided graph attention network (RGAT) to jointly learn a discriminative local embedding and incorporate global context for exploiting explicit global embeddings under the guidance of a graph. Specifically, RAN uses soft spatial attention to discover discriminative regions for generating local embeddings. Meanwhile, RGAT employs attribute-based attention to obtain attribute-based region features, where each attribute focuses on the most relevant image regions. Motivated by graph neural networks (GNNs), which are beneficial for representing structural relationships, RGAT further leverages a graph attention network to exploit the relationships between the attribute-based region features for explicit global embedding representations. Based on a self-calibration mechanism, the learned joint visual embedding is matched with the semantic embedding to form the final prediction. Extensive experiments on three benchmark datasets demonstrate that the proposed GNDAN achieves superior performance to state-of-the-art methods. Our code and trained models are available at https://github.com/shiming-chen/GNDAN.
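A hedged sketch of attribute-based region attention as the abstract describes it: each attribute embedding attends over image region features to produce an attribute-specific region feature. The graph-attention stage over these features is omitted, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttributeRegionAttention(nn.Module):
    def __init__(self, n_attrs, dim):
        super().__init__()
        self.attr_emb = nn.Parameter(torch.randn(n_attrs, dim))  # one query per attribute

    def forward(self, regions):                      # regions: (B, R, dim) local features
        # each attribute focuses on its most relevant image regions
        scores = torch.einsum('ad,brd->bar', self.attr_emb, regions)
        attn = torch.softmax(scores, dim=-1)                     # (B, A, R)
        return torch.einsum('bar,brd->bad', attn, regions)       # (B, A, dim)
```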
7. Xu P, Zhu X, Clifton DA. Multimodal Learning With Transformers: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:12113-12132. [PMID: 37167049] [DOI: 10.1109/tpami.2023.3275156]
Abstract
The Transformer is a promising neural network learner that has achieved great success in various machine learning tasks. Thanks to the recent prevalence of multimodal applications and big data, Transformer-based multimodal learning has become a hot topic in AI research. This paper presents a comprehensive survey of Transformer techniques oriented at multimodal data. The main contents of this survey include: (1) a background of multimodal learning, the Transformer ecosystem, and the multimodal big-data era; (2) a systematic review of the vanilla Transformer, the Vision Transformer, and multimodal Transformers from a geometrically topological perspective; (3) a review of multimodal Transformer applications via two important paradigms, i.e., multimodal pretraining and specific multimodal tasks; (4) a summary of the common challenges and designs shared by multimodal Transformer models and applications; and (5) a discussion of open problems and potential research directions for the community.
8. Wang H, Zhang J, Huang Y, Cai B. FBANet: Transfer Learning for Depression Recognition Using a Feature-Enhanced Bi-Level Attention Network. Entropy (Basel) 2023; 25:1350. [PMID: 37761649] [PMCID: PMC10529103] [DOI: 10.3390/e25091350]
Abstract
The House-Tree-Person (HTP) sketch test is a psychological analysis technique designed to assess the mental health status of test subjects, and mature methods now exist for recognizing depression with it. However, existing works rely primarily on manual analysis of drawing features, which suffers from strong subjectivity and low automation. Only a small number of works recognize depression automatically using machine learning and deep learning methods, and their complex data preprocessing pipelines and multi-stage computation still reflect a relatively low level of automation. To overcome these issues, we present a novel deep learning-based one-stage approach for depression recognition in HTP sketches that combines a simple data preprocessing pipeline and calculation process with a high accuracy rate. For data, we use a hand-drawn HTP sketch dataset containing drawings by healthy individuals and by patients with depression. For the model, we design a novel network called the Feature-Enhanced Bi-Level Attention Network (FBANet), which contains feature enhancement and bi-level attention modules. Because the collected data are limited in size, transfer learning is employed: the model is pre-trained on a large-scale sketch dataset and fine-tuned on the HTP sketch dataset. Using cross-validation on the HTP sketch dataset, FBANet achieves a maximum accuracy of 99.07% on the validation set, with an average accuracy of 97.71%, outperforming traditional classification models and previous works. In summary, the proposed FBANet demonstrates superior performance on the HTP sketch dataset after pre-training and is a promising method for the auxiliary diagnosis of depression.
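The abstract does not specify the bi-level attention design, so the following is only a speculative sketch of the general pattern such a module might follow: one attention over spatial positions and one over channels, applied in sequence. Every layer choice here is an assumption.

```python
import torch
import torch.nn as nn

class BiLevelAttention(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.spatial = nn.Conv2d(c, 1, kernel_size=1)     # level 1: where to look
        self.channel = nn.Sequential(                     # level 2: which features matter
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())

    def forward(self, x):                 # x: (B, C, H, W) sketch feature map
        x = x * torch.sigmoid(self.spatial(x))           # reweight spatial positions
        return x * self.channel(x)                        # reweight channels
```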
Affiliation(s)
- Bo Cai
- Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China; (H.W.); (J.Z.)
9. Ali S, Aslam N, Kim D, Abbas A, Tufail S, Azhar B. Context awareness based Sketch-DeepNet architecture for hand-drawn sketches classification and recognition in AIoT. PeerJ Computer Science 2023; 9:e1186. [PMID: 37346539] [PMCID: PMC10280188] [DOI: 10.7717/peerj-cs.1186]
Abstract
A sketch is a black-and-white, 2-D graphical representation of an object that contains fewer visual details than a colored image. Despite the reduced detail, humans can recognize a sketch and its context efficiently and consistently across languages, cultures, and age groups, yet it is difficult for computers to recognize such low-detail sketches and extract context from them. With the tremendous increase in the popularity of IoT devices such as smartphones and smart cameras, recognizing free hand-drawn sketches has become more critical in computer vision and human-computer interaction for building a successful artificial intelligence of things (AIoT) system that can first recognize the sketches and then understand the context of multiple drawings. Earlier models that addressed this problem are the scale-invariant feature transform (SIFT) and bag-of-words (BoW). Both used hand-crafted features and scale-invariant algorithms, but they are complex and time-consuming because of the manual feature-setup process. Deep neural networks (DNNs) perform well at object recognition on many large-scale datasets such as ImageNet and CIFAR-10. However, the standard DNN approach cannot be carried over directly to hand-drawn sketch problems, because photo datasets label objects at a coarse level (e.g., 'bird') rather than at the specific category level (e.g., 'sparrow'). Some deep learning approaches to sketch recognition exist in the literature, but their results leave room for improvement. This article proposes a convolutional neural network (CNN) architecture called Sketch-DeepNet for the sketch recognition task. The proposed Sketch-DeepNet was evaluated on the TU-Berlin dataset for classification. The experimental results show that the proposed method beats the state-of-the-art sketch classification methods: the model achieves 95.05% accuracy, compared with DeformNet (62.6%), Sketch-DNN (72.2%), Sketch-a-Net (77.95%), SketchNet (80.42%), Thinning-DNN (74.3%), CNN-PCA-SVM (72.5%), Hybrid-CNN (84.42%), and a human recognition accuracy of 73% on the TU-Berlin dataset.
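To make the task setting concrete, here is a minimal sketch-classification CNN; the actual Sketch-DeepNet architecture is not detailed in the abstract, so the layer sizes below are generic choices for grayscale TU-Berlin sketches with 250 classes.

```python
import torch.nn as nn

def make_sketch_cnn(n_classes=250):
    # generic downsampling CNN with a global-pooled linear classifier
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(128, n_classes))
```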
Affiliation(s)
- Safdar Ali
- Department of Software Engineering, University of Lahore, Lahore, Punjab, Pakistan
- Nouraiz Aslam
- Department of Software Engineering, University of Lahore, Lahore, Punjab, Pakistan
- DoHyeun Kim
- Department of Computer Engineering, Jeju National University, Jeju, South Korea
- Asad Abbas
- Department of Computer Science, University of Central Punjab, Lahore, Punjab, Pakistan
- Sania Tufail
- Department of Software Engineering, University of Lahore, Lahore, Punjab, Pakistan
- Beenish Azhar
- Department of Software Engineering, University of Lahore, Lahore, Punjab, Pakistan
10. Xu P, Hospedales TM, Yin Q, Song YZ, Xiang T, Wang L. Deep Learning for Free-Hand Sketch: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:285-312. [PMID: 35130149] [DOI: 10.1109/tpami.2022.3148853]
Abstract
Free-hand sketches are highly illustrative and have been widely used by humans to depict objects or stories from ancient times to the present. The recent prevalence of touchscreen devices has made sketch creation far easier than ever and has consequently made sketch-oriented applications increasingly popular. The progress of deep learning has immensely benefited free-hand sketch research and applications. This paper presents a comprehensive survey of the deep learning techniques oriented at free-hand sketch data and the applications they enable. The main contents of this survey include: (i) a discussion of the intrinsic traits and unique challenges of free-hand sketch, to highlight the essential differences between sketch data and other data modalities, e.g., natural photos; (ii) a review of the developments of free-hand sketch research in the deep learning era, surveying existing datasets, research topics, and state-of-the-art methods through a detailed taxonomy and experimental evaluation; and (iii) promotion of future work via a discussion of bottlenecks, open problems, and potential research directions for the community.
11. Li H, Jiang X, Guan B, Wang R, Thalmann NM. Multistage Spatio-Temporal Networks for Robust Sketch Recognition. IEEE Transactions on Image Processing 2022; 31:2683-2694. [PMID: 35320102] [DOI: 10.1109/tip.2022.3160240]
Abstract
Sketch recognition relies on two types of information: spatial contexts, such as the local structures in images, and temporal contexts, such as the order of strokes. Existing methods usually adopt convolutional neural networks (CNNs) to model spatial contexts and recurrent neural networks (RNNs) for temporal contexts. However, most of them combine spatial and temporal features with late fusion or a single-stage transformation, which is prone to losing informative details in sketches. To tackle this problem, we propose a novel framework aimed at multi-stage interaction and refinement of spatial and temporal features. Specifically, given a sketch represented by a stroke array, we first generate a temporal-enriched image (TEI), a pseudo-color image retaining the temporal order of strokes, to overcome the difficulty CNNs have in leveraging temporal information. We then construct a dual-branch network in which an RNN branch and a CNN branch process the stroke array and the TEI, respectively. In the early stages of our network, considering the limited ability of RNNs to capture spatial structure, we utilize multiple enhancement modules to enhance the stroke features with the TEI features. In the last stage, we propose a spatio-temporal enhancement module that refines stroke features and TEI features in a joint feature space. Furthermore, a bidirectional temporal-compatible unit that adaptively merges features in opposite temporal orders is proposed to help the RNNs handle abrupt strokes. Comprehensive experimental results on QuickDraw and TU-Berlin demonstrate that the proposed method is a robust and efficient solution for sketch recognition.
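An illustrative construction of a temporal-enriched image: strokes are rasterized onto an RGB canvas whose color encodes each stroke's temporal order, so a CNN can "see" drawing order. The colormap and line rendering are assumptions about the encoding, not the paper's exact recipe.

```python
import numpy as np
from PIL import Image, ImageDraw

def strokes_to_tei(strokes, size=256):
    # strokes: list of [(x, y), ...] point lists in drawing order, coords in [0, 1]
    img = Image.new('RGB', (size, size), 'white')
    draw = ImageDraw.Draw(img)
    n = max(len(strokes), 1)
    for t, pts in enumerate(strokes):
        hue = int(255 * t / n)                     # later strokes shift toward red
        xy = [(x * size, y * size) for x, y in pts]
        draw.line(xy, fill=(hue, 64, 255 - hue), width=2)
    return np.asarray(img)                         # (size, size, 3) pseudo-color TEI
```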