1. Guo X, Feng Q, Guo F. CMTNet: a hybrid CNN-transformer network for UAV-based hyperspectral crop classification in precision agriculture. Sci Rep 2025; 15:12383. PMID: 40216979; PMCID: PMC11992135; DOI: 10.1038/s41598-025-97052-w.
Abstract
Hyperspectral imaging acquired from unmanned aerial vehicles (UAVs) offers detailed spectral and spatial data that holds transformative potential for precision agriculture applications, such as crop classification, health monitoring, and yield estimation. However, traditional methods struggle to effectively capture both local and global features, particularly in complex agricultural environments with diverse crop types, varying growth stages, and imbalanced data distributions. To address these challenges, we propose CMTNet, an innovative deep learning framework that integrates convolutional neural networks (CNNs) and Transformers for hyperspectral crop classification. The model combines a spectral-spatial feature extraction module to capture shallow features, a dual-branch architecture that extracts both local and global features simultaneously, and a multi-output constraint module to enhance classification accuracy through cross-constraints among multiple feature levels. Extensive experiments were conducted on three UAV-acquired datasets: WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu. The experimental results demonstrate that CMTNet achieved overall accuracy (OA) values of 99.58%, 97.29%, and 98.31% on these three datasets, surpassing the current state-of-the-art method (CTMixer) by 0.19% (LongKou), 1.75% (HanChuan), and 2.52% (HongHu) in OA values, respectively. These findings indicate its superior potential for UAV-based agricultural monitoring in complex environments. These results advance the precision and reliability of hyperspectral crop classification, offering a valuable solution for precision agriculture challenges.
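The dual-branch idea translates into a compact sketch: a shared shallow stem feeds a convolutional (local) branch and a transformer (global) branch, each with its own classification head plus a fused head for the cross-constraint. The sketch below is illustrative only; the layer sizes, names, and the 270-band/9-class setup are assumptions, not the CMTNet implementation.

```python
# Minimal dual-branch (CNN + Transformer) sketch for HSI patch classification.
# Illustrative only -- sizes and names are assumptions, not the CMTNet code.
import torch
import torch.nn as nn

class DualBranchHSI(nn.Module):
    def __init__(self, bands=270, dim=64, classes=9):
        super().__init__()
        # Shallow spectral-spatial stem: 1x1 conv mixes bands, 3x3 mixes space.
        self.stem = nn.Sequential(
            nn.Conv2d(bands, dim, 1), nn.BatchNorm2d(dim), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.ReLU())
        # Local branch: plain convolutions.
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.ReLU())
        # Global branch: transformer encoder over the patch tokens.
        self.global_ = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        # One auxiliary head per branch plus a fused head (multi-output constraint).
        self.head_local = nn.Linear(dim, classes)
        self.head_global = nn.Linear(dim, classes)
        self.head_fused = nn.Linear(2 * dim, classes)

    def forward(self, x):                       # x: (B, bands, patch, patch)
        f = self.stem(x)
        loc = self.local(f).mean(dim=(2, 3))    # (B, dim)
        tokens = f.flatten(2).transpose(1, 2)   # (B, patch*patch, dim)
        glo = self.global_(tokens).mean(dim=1)  # (B, dim)
        return (self.head_local(loc), self.head_global(glo),
                self.head_fused(torch.cat([loc, glo], dim=1)))

model = DualBranchHSI()
logits = model(torch.randn(2, 270, 9, 9))       # three outputs, jointly supervised
```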
Affiliation(s)
- Xihong Guo: Dingxi Sanniu Agricultural Machinery Manufacturing Co., Ltd., Dingxi, 743000, China
- Quan Feng: College of Mechanical and Electrical Engineering, Gansu Agricultural University, Lanzhou, 730070, China
- Faxu Guo: College of Mechanical and Electrical Engineering, Gansu Agricultural University, Lanzhou, 730070, China

2. Wan X, Chen F, Gao W, Mo D, Liu H. Fusion of circulant singular spectrum analysis and multiscale local ternary patterns for effective spectral-spatial feature extraction and small sample hyperspectral image classification. Sci Rep 2025; 15:6972. PMID: 40011533; DOI: 10.1038/s41598-025-90926-z.
Abstract
Hyperspectral images (HSIs) contain rich spectral and spatial information, motivating the development of a novel circulant singular spectrum analysis (CiSSA) and multiscale local ternary pattern fusion method for joint spectral-spatial feature extraction and classification. Due to the high dimensionality and redundancy in HSIs, principal component analysis (PCA) is used during preprocessing to reduce dimensionality and enhance computational efficiency. CiSSA is then applied to the PCA-reduced images for robust spatial pattern extraction via circulant matrix decomposition. The spatial features are combined with the global spectral features from PCA to form a unified spectral-spatial feature set (SSFS). Local ternary pattern (LTP) is further applied to the principal components (PCs) to capture local grayscale and rotation-invariant texture features at multiple scales. Finally, the performance of the SSFS and multiscale LTP features is evaluated separately using a support vector machine (SVM), followed by decision-level fusion to combine results from each pipeline based on probability outputs. Experimental results on three popular HSIs show that, under 1% training samples, the proposed method achieves 95.98% accuracy on the Indian Pines dataset, 98.49% on the Pavia University dataset, and 92.28% on the Houston2013 dataset, outperforming several traditional classification methods and state-of-the-art deep learning approaches.
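As a rough illustration of the decision-level fusion step, the sketch below trains two SVMs on different feature sets and averages their class-probability outputs. PCA and a placeholder transform stand in for the CiSSA and multiscale-LTP extractors, which are not reproduced here; all sizes and the toy data are assumptions.

```python
# Decision-level fusion sketch: two feature pipelines, each scored by an SVM,
# fused by averaging class probabilities. PCA stands in for the preprocessing;
# the CiSSA and multiscale-LTP extractors themselves are not reproduced here.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))              # 200 pixels x 100 bands (toy data)
y = rng.integers(0, 3, size=200)             # 3 classes

pcs = PCA(n_components=10).fit_transform(X)  # spectral features
spatial = np.cumsum(pcs, axis=1)             # placeholder for CiSSA/LTP features

svm_a = SVC(probability=True).fit(pcs, y)
svm_b = SVC(probability=True).fit(spatial, y)

# Fuse by averaging the per-class probability outputs of the two pipelines.
proba = (svm_a.predict_proba(pcs) + svm_b.predict_proba(spatial)) / 2
pred = proba.argmax(axis=1)
```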
Affiliation(s)
- Xiaoqing Wan: College of Computer Science and Technology, Hengyang Normal University, Hengyang, 421002, China; Hunan Provincial Key Laboratory of Intelligent Information Processing and Application, Hengyang, 421002, China
- Feng Chen: College of Computer Science and Technology, Hengyang Normal University, Hengyang, 421002, China
- Weizhe Gao: College of Computer Science and Technology, Hengyang Normal University, Hengyang, 421002, China
- Dongtao Mo: College of Computer Science and Technology, Hengyang Normal University, Hengyang, 421002, China
- Hui Liu: College of Computer Science and Technology, Hengyang Normal University, Hengyang, 421002, China

3. Domínguez-Cid S, Larios DF, Barbancho J, Molina FJ, Guerra JA, León C. Identification of Olives Using In-Field Hyperspectral Imaging with Lightweight Models. Sensors (Basel) 2024; 24:1370. PMID: 38474904; DOI: 10.3390/s24051370.
Abstract
During the growing season, olives progress through nine phenological stages, from bud development to senescence, and undergo changes in external color and chemical properties. To capture these changes, we acquired hyperspectral images of olives still on the tree throughout the entire growing season, directly in the field and without artificial light sources. The objective of this study was to develop a lightweight model capable of identifying olives in hyperspectral images using their spectral information. Images were taken on-site every week from 9:00 to 11:00 a.m. UTC to avoid light saturation and glare. The data were used to train and test several classifiers, including Decision Tree, Logistic Regression, Random Forest, and Support Vector Machine, on labeled datasets. The Logistic Regression model showed the best balance between classification success rate, size, and inference time, achieving a 98% F1-score with less than 1 KB of parameters. The model size was reduced by analyzing which wavelengths were critical to the decision, reducing the dimensionality of the hypercube. With this model, olives can be identified in hyperspectral images throughout the season, providing data to enhance a farmer's decision-making through further automatic applications.
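A minimal version of this pipeline fits in a few lines of scikit-learn: train a logistic regression on full spectra, rank wavelengths by coefficient magnitude, and retrain on the top bands. The toy data, band count, and top-10 cutoff below are assumptions, not the paper's setup.

```python
# Per-pixel olive/background classifier sketch: logistic regression on a
# reduced band set. The band indices and data here are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 150))                 # 1000 pixels x 150 bands (toy)
y = rng.integers(0, 2, size=1000)                # olive vs. not-olive labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
# Keep only the most influential wavelengths to shrink the model (<1 KB goal).
top = np.argsort(np.abs(clf.coef_[0]))[-10:]
Xr_train, Xr_test, y_train, y_test = train_test_split(X[:, top], y, random_state=1)
small = LogisticRegression(max_iter=1000).fit(Xr_train, y_train)
print(f1_score(y_test, small.predict(Xr_test)))
```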
Affiliation(s)
- Samuel Domínguez-Cid: Department of Electronic Technology, Escuela Politecnica Superior, Universidad de Sevilla, 41011 Seville, Spain
- Diego Francisco Larios: Department of Electronic Technology, Escuela Politecnica Superior, Universidad de Sevilla, 41011 Seville, Spain
- Julio Barbancho: Department of Electronic Technology, Escuela Politecnica Superior, Universidad de Sevilla, 41011 Seville, Spain
- Francisco Javier Molina: Department of Electronic Technology, Escuela Politecnica Superior, Universidad de Sevilla, 41011 Seville, Spain
- Javier Antonio Guerra: Department of Electronic Technology, Escuela Politecnica Superior, Universidad de Sevilla, 41011 Seville, Spain
- Carlos León: Department of Electronic Technology, Escuela Politecnica Superior, Universidad de Sevilla, 41011 Seville, Spain

4. Liu S, Yin C, Zhang H. CESA-MCFormer: An Efficient Transformer Network for Hyperspectral Image Classification by Eliminating Redundant Information. Sensors (Basel) 2024; 24:1187. PMID: 38400345; PMCID: PMC10891997; DOI: 10.3390/s24041187.
Abstract
Hyperspectral image (HSI) classification is a highly challenging task, particularly in fields like crop yield prediction and agricultural infrastructure detection. These applications often involve complex image types, such as soil, vegetation, water bodies, and urban structures, encompassing a variety of surface features. In HSI, the strong correlation between adjacent bands leads to redundancy in spectral information, while using image patches as the basic unit of classification causes redundancy in spatial information. To extract key information from this massive redundancy more effectively, we propose the CESA-MCFormer model, which builds on the transformer architecture by introducing a Center Enhanced Spatial Attention (CESA) module and Morphological Convolution (MC). The CESA module combines hard coding and soft coding to provide the model with prior spatial information before spatial features are mixed, introducing comprehensive spatial information. MC employs a series of learnable pooling operations, not only extracting key details in both spatial and spectral dimensions but also effectively merging this information. By integrating the CESA module and MC, the CESA-MCFormer employs a "Selection-Extraction" feature processing strategy, enabling precise classification with minimal samples, without relying on dimension reduction techniques such as PCA. To thoroughly evaluate our method, we conducted extensive experiments on the IP, UP, and Chikusei datasets, comparing our method with the latest advanced approaches. The experimental results demonstrate that the CESA-MCFormer achieved outstanding performance on all three test datasets, with Kappa coefficients of 96.38%, 98.24%, and 99.53%, respectively.
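One plausible reading of the CESA idea, a fixed center-distance prior ("hard coding") combined with a learned gate ("soft coding"), is sketched below; the Gaussian prior, patch size, and module layout are assumptions, not the authors' implementation.

```python
# Center-enhanced spatial attention sketch: a fixed center-distance prior
# ("hard coding") combined with a learned per-pixel gate ("soft coding").
# One plausible reading of CESA, not the authors' implementation.
import torch
import torch.nn as nn

class CenterEnhancedAttention(nn.Module):
    def __init__(self, channels, patch=9, sigma=2.0):
        super().__init__()
        yy, xx = torch.meshgrid(torch.arange(patch), torch.arange(patch), indexing="ij")
        c = (patch - 1) / 2
        d2 = (yy - c) ** 2 + (xx - c) ** 2
        self.register_buffer("prior", torch.exp(-d2 / (2 * sigma ** 2)))    # hard prior
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())  # soft gate

    def forward(self, x):               # x: (B, C, patch, patch)
        return x * self.prior * self.gate(x)

attn = CenterEnhancedAttention(32)
out = attn(torch.randn(4, 32, 9, 9))    # center pixels weighted up
```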
Affiliation(s)
- Changqing Yin: School of Software, Tongji University, Shanghai 201800, China (affiliation shared with co-authors S.L. and H.Z.)

5. Wang Q, Zhou B, Zhang J, Xie J, Wang Y. Joint Classification of Hyperspectral Images and LiDAR Data Based on Dual-Branch Transformer. Sensors (Basel) 2024; 24:867. PMID: 38339584; PMCID: PMC10856822; DOI: 10.3390/s24030867.
Abstract
In complex scenarios, the limited information available to classification tasks dominated by a single modality creates a performance bottleneck, and the joint use of multimodal remote sensing data for surface observation has therefore attracted widespread attention. However, sample differences between modalities and the lack of correlation among physical features limit classification performance, and establishing effective interaction between multimodal data remains a significant challenge. To fully integrate heterogeneous information from multiple modalities and enhance classification performance, this paper proposes a dual-branch cross-Transformer feature fusion network for joint land cover classification of hyperspectral imagery (HSI) and Light Detection and Ranging (LiDAR) data. The core idea is to leverage the potential of convolutional operators to represent spatial features, combined with the advantages of the Transformer architecture in learning long-range dependencies. The framework employs an improved self-attention mechanism to aggregate features within each modality, highlighting the spectral information of HSI and the spatial (elevation) information of LiDAR. A feature fusion module based on cross-attention integrates deep features from the two modalities, achieving complementary information through cross-modal attention. Classification is performed on the jointly obtained spectral and spatial features. Experiments on three multi-source remote sensing classification datasets demonstrate the effectiveness of the proposed model compared with existing methods.
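The cross-attention fusion can be sketched with two nn.MultiheadAttention blocks, one per direction: HSI tokens query LiDAR tokens and vice versa, and the pooled results are concatenated for classification. The dimensions and names below are illustrative, not the paper's configuration.

```python
# Cross-modal attention sketch: HSI tokens query LiDAR tokens and vice versa,
# then fused features feed a classifier. Dimensions are illustrative.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=64, heads=4, classes=6):
        super().__init__()
        self.h2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.l2h = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, classes)

    def forward(self, hsi, lidar):            # (B, N, dim) token sequences
        h, _ = self.h2l(hsi, lidar, lidar)    # HSI attends to LiDAR elevation cues
        l, _ = self.l2h(lidar, hsi, hsi)      # LiDAR attends to HSI spectra
        fused = torch.cat([h.mean(1), l.mean(1)], dim=-1)
        return self.head(fused)

model = CrossModalFusion()
logits = model(torch.randn(2, 81, 64), torch.randn(2, 81, 64))
```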
Affiliation(s)
- Qingyan Wang: School of Measurement-Control and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, China
- Binbin Zhou: School of Measurement-Control and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, China
- Junping Zhang: School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
- Jinbao Xie: College of Physics and Electronic Engineering, Hainan Normal University, Haikou 571158, China
- Yujing Wang: School of Measurement-Control and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, China

6. Wang T, Xu Z, Hu H, Xu H, Zhao Y, Mao X. Identification of Turtle-Shell Growth Year Using Hyperspectral Imaging Combined with an Enhanced Spatial-Spectral Attention 3DCNN and a Transformer. Molecules 2023; 28:6427. PMID: 37687257; PMCID: PMC10490299; DOI: 10.3390/molecules28176427.
Abstract
Turtle shell (Chinemys reevesii) is a prized traditional Chinese dietary therapy, and the growth year of turtle shell has a significant impact on its quality attributes. In this study, a hyperspectral imaging (HSI) technique combined with a proposed deep learning (DL) network algorithm was investigated for the objective determination of the growth year of turtle shells. Hyperspectral images were acquired in the near-infrared range (948.72-2512.97 nm) from samples spanning five different growth years. To fully exploit the spatial and spectral information while simultaneously reducing redundancy in hyperspectral data, three modules were developed. First, a spectral-spatial attention (SSA) module was developed to better preserve the spectral correlation among bands and capture fine-grained spatial information of hyperspectral images. Second, a 3D convolutional neural network (CNN), more suitable for the extracted 3D feature map, was employed to facilitate the joint spatial-spectral feature representation. Third, to overcome the constraints of convolution kernels and better capture long-range correlations between spectral bands, a transformer encoder (TE) module was designed. These modules were harmoniously orchestrated to effectively leverage both spatial and spectral information within hyperspectral data, collectively enhancing the model's capacity to extract joint spatial and spectral features to discern growth years accurately. Experimental studies demonstrated that the proposed model (named SSA-3DTE) achieved superior classification accuracy, with 98.94% on average for five-category classification, outperforming traditional machine learning methods using only spectral information as well as representative deep learning methods. Ablation experiments also confirmed the effectiveness of each module in improving performance. These encouraging results reveal the potential of HSI combined with DL as an efficient and non-destructive method for the quality control of turtle shells.
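The joint 3D-convolution-plus-transformer design can be sketched as follows: a 3D convolution forms a spatial-spectral feature map, the spectral axis is tokenized, and a transformer encoder models long-range band correlations. Sizes and the five-class head are illustrative, not the SSA-3DTE configuration.

```python
# 3D-CNN + transformer encoder sketch in the spirit of SSA-3DTE: 3D convolutions
# form a joint spatial-spectral feature map whose spectral slices become tokens
# for a transformer encoder. Sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

class Conv3DTransformer(nn.Module):
    def __init__(self, classes=5, dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, dim, kernel_size=(7, 3, 3), padding=(3, 1, 1)),
            nn.BatchNorm3d(dim), nn.ReLU())
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                    # x: (B, 1, bands, H, W)
        f = self.conv(x).mean(dim=(3, 4))    # pool space -> (B, dim, bands)
        tokens = f.transpose(1, 2)           # one token per spectral band
        return self.head(self.encoder(tokens).mean(dim=1))

model = Conv3DTransformer()
logits = model(torch.randn(2, 1, 64, 9, 9))  # five growth-year classes
```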
Affiliation(s)
- Tingting Wang: School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China
- Zhenyu Xu: School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China
- Huiqiang Hu: School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China
- Huaxing Xu: School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China
- Yuping Zhao: China Academy of Chinese Medical Sciences, Beijing 100700, China
- Xiaobo Mao: School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China; Research Center for Intelligent Science and Engineering Technology of Traditional Chinese Medicine, Zhengzhou University, Zhengzhou 450001, China

7. Zhao X, Zhang S, Shi R, Yan W, Pan X. Multi-Temporal Hyperspectral Classification of Grassland Using Transformer Network. Sensors (Basel) 2023; 23:6642. PMID: 37514934; PMCID: PMC10385388; DOI: 10.3390/s23146642.
Abstract
In recent years, grassland monitoring has shifted from traditional field surveys to remote-sensing-based methods, but the desired level of accuracy has not yet been obtained. Multi-temporal hyperspectral data contain valuable information about species and growth-season differences, making them a promising tool for grassland classification. Transformer networks can directly extract long-sequence features, which is superior to other commonly used analysis methods. This study explores the transformer network's potential for multi-temporal hyperspectral data by fine-tuning it and applying it to demanding grassland detection tasks, and proposes the multi-temporal hyperspectral classification of grassland samples using a transformer network (MHCgT). First, a total of 16,800 multi-temporal hyperspectral samples were collected from grassland at different growth stages over several years using a hyperspectral imager in the wavelength range of 400-1000 nm. Second, the MHCgT network was established with a hierarchical architecture that generates a multi-resolution representation beneficial for classifying grass hyperspectral time series. MHCgT employs a multi-head self-attention mechanism to extract features, avoiding information loss. Finally, an ablation study of MHCgT and comparative experiments with state-of-the-art methods were conducted. The results showed that the proposed framework achieved a high accuracy of 98.51% in identifying grassland from multi-temporal hyperspectral data, outperforming CNN, LSTM-RNN, SVM, RF, and DT by 6.42-26.23%. Moreover, the average classification accuracy of each species was above 95%, and the August mature period was easier to identify than the June growth stage. Overall, the proposed MHCgT framework shows great potential for precisely identifying species from multi-temporal hyperspectral data and has significant applications in sustainable grassland management and species diversity assessment.
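Treating each band's reflectance as a token makes the core of such a spectral-sequence transformer easy to sketch. The toy sizes below (300 bands, 10 classes) and the flat, non-hierarchical layout are assumptions rather than the MHCgT design.

```python
# Spectral-sequence transformer sketch: each band reflectance is embedded as a
# token and a multi-head self-attention encoder classifies the sequence.
# A toy stand-in for the MHCgT idea; the hierarchy is not reproduced.
import torch
import torch.nn as nn

class SpectralTransformer(nn.Module):
    def __init__(self, bands=300, dim=32, classes=10):
        super().__init__()
        self.embed = nn.Linear(1, dim)                       # per-band token embedding
        self.pos = nn.Parameter(torch.zeros(1, bands, dim))  # learned band positions
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, classes)

    def forward(self, spectra):                  # (B, bands) reflectance values
        tokens = self.embed(spectra.unsqueeze(-1)) + self.pos
        return self.head(self.encoder(tokens).mean(dim=1))

model = SpectralTransformer()
logits = model(torch.randn(8, 300))              # 8 grass spectra -> 10 species
```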
Affiliation(s)
- Xuanhe Zhao: College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China
- Shengwei Zhang: College of Water Conservancy and Civil Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China
- Ruifeng Shi: Center of Information and Network Technology, Inner Mongolia Agricultural University, Hohhot 010018, China
- Weihong Yan: Institute of Grassland Research of CAAS, Hohhot 010010, China
- Xin Pan: College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China

8. Xie E, Chen N, Peng J, Sun W, Du Q, You X. Semantic and spatial-spectral feature fusion transformer network for the classification of hyperspectral image. CAAI Transactions on Intelligence Technology 2023. DOI: 10.1049/cit2.12201.
Affiliation(s)
- Erxin Xie: Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University, Wuhan, China
- Na Chen: Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University, Wuhan, China
- Jiangtao Peng: Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University, Wuhan, China
- Weiwei Sun: Department of Geography and Spatial Information Techniques, Ningbo University, Ningbo, China
- Qian Du: Department of Electrical and Computer Engineering, Mississippi State University, Mississippi State, Mississippi, USA
- Xinge You: School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China

9. Liao D, Shi C, Wang L. A complementary integrated Transformer network for hyperspectral image classification. CAAI Transactions on Intelligence Technology 2023. DOI: 10.1049/cit2.12150.
Affiliation(s)
- Diling Liao: College of Communication and Electronic Engineering, Qiqihar University, Qiqihar, China
- Cuiping Shi: College of Communication and Electronic Engineering, Qiqihar University, Qiqihar, China
- Liguo Wang: College of Information and Communication Engineering, Dalian Nationalities University, Dalian, China

10. Ye F, Zhou Z, Wu Y, Enkhtur B. Application of convolutional neural network in fusion and classification of multi-source remote sensing data. Front Neurorobot 2022; 16:1095717. PMID: 36620484; PMCID: PMC9815026; DOI: 10.3389/fnbot.2022.1095717.
Abstract
Introduction: Remote sensing images allow us to understand and observe terrain, with broad applications in fields such as agriculture and the military.
Methods: To achieve more accurate and efficient fusion and classification of multi-source remote sensing data, this study proposes the DB-CNN algorithm and compares it against SVM and ELM algorithms in a series of experiments.
Results: The results show that with the dual-branch CNN structure, joint classification of hyperspectral and LiDAR data achieves higher accuracy: across the datasets, the global classification accuracy of the joint method is 98.46%. The DB-CNN model has the highest training accuracy and the fastest training and testing speed. It also has the lowest test error, about 0.026, which is 0.037 lower than that of the ELM model and 0.056 lower than that of the SVM model, and the AUC of its ROC curve is about 0.922, higher than that of the other two models.
Discussion: The proposed method significantly improves the fusion and classification of multi-source remote sensing data and has practical value.
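A dual-branch CNN of this kind reduces to two convolutional stems whose pooled features are concatenated before a shared classifier. The sketch below uses illustrative shapes (100 HSI bands, one LiDAR channel) and is not the DB-CNN code.

```python
# Dual-branch CNN sketch: one convolutional branch per modality, features
# concatenated before the classifier. Shapes are illustrative assumptions.
import torch
import torch.nn as nn

class DBCNN(nn.Module):
    def __init__(self, hsi_bands=100, classes=6):
        super().__init__()
        self.hsi = nn.Sequential(nn.Conv2d(hsi_bands, 32, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1))
        self.lidar = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(64, classes)

    def forward(self, hsi, lidar):
        h = self.hsi(hsi).flatten(1)       # (B, 32) spectral-branch features
        l = self.lidar(lidar).flatten(1)   # (B, 32) elevation-branch features
        return self.head(torch.cat([h, l], dim=1))

model = DBCNN()
logits = model(torch.randn(2, 100, 11, 11), torch.randn(2, 1, 11, 11))
```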
Affiliation(s)
- Fanghong Ye: Land Satellite Remote Sensing Application Center, Ministry of Natural Resources of the People's Republic of China, Beijing, China; School of Resource and Environmental Sciences, Wuhan University, Wuhan, China (corresponding author)
- Zheng Zhou: Ecology and Environment Monitoring and Scientific Research Center, Ministry of Ecology and Environment of the People's Republic of China, Wuhan, China
- Yue Wu: Department of Natural Resources of Heilongjiang Province, Heilongjiang Provincial Institute of Land and Space Planning, Harbin, China
- Bayarmaa Enkhtur: Geospatial Information and Technology Department, Agency for Land Administration and Management, Geodesy and Cartography, Ulaanbaatar, Mongolia

11. Generative Adversarial Networks Based on Transformer Encoder and Convolution Block for Hyperspectral Image Classification. Remote Sensing 2022. DOI: 10.3390/rs14143426.
Abstract
Nowadays, hyperspectral image (HSI) classification can reach high accuracy when sufficient labeled samples are available as a training set, but the performance of existing methods decreases sharply when they are trained on few labeled samples. Existing few-shot methods usually require another dataset to improve classification accuracy, yet a cross-domain problem then arises because of the significant spectral shift between the target and source domains. Considering these issues, we propose a new method that requires no external dataset, combining a Generative Adversarial Network, a Transformer Encoder, and a convolution block in a unified framework. The proposed method has both a global receptive field, provided by the Transformer Encoder, and a local receptive field, provided by the convolution block. Experiments conducted on the Indian Pines, PaviaU, and KSC datasets demonstrate that our method exceeds the results of existing deep learning methods for hyperspectral image classification on the few-shot learning problem.
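One common way to realize such a combination is a semi-supervised GAN whose discriminator doubles as a (K+1)-way classifier, with a convolutional stem for the local receptive field and a transformer encoder for the global one. The sketch below follows that general pattern under assumed sizes; it is not the paper's network.

```python
# Few-shot HSI GAN sketch: the discriminator doubles as a (K+1)-way classifier
# (K classes plus "fake"), mixing a conv stem with a transformer encoder for
# local and global receptive fields. A toy layout, not the paper's network.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, bands=100, dim=32, classes=9):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(bands, dim, 3, padding=1), nn.ReLU())
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
        self.head = nn.Linear(dim, classes + 1)    # extra logit flags generated samples

    def forward(self, x):                          # (B, bands, 7, 7) patches
        f = self.conv(x).flatten(2).transpose(1, 2)
        return self.head(self.encoder(f).mean(dim=1))

gen = nn.Sequential(nn.Linear(16, 100 * 7 * 7), nn.Tanh())   # toy generator
fake = gen(torch.randn(2, 16)).view(2, 100, 7, 7)
logits = Discriminator()(fake)                     # last class = "fake"
```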

12. Wide and Deep Fourier Neural Network for Hyperspectral Remote Sensing Image Classification. Remote Sensing 2022. DOI: 10.3390/rs14122931.
Abstract
Hyperspectral remote sensing image (HSI) classification is very useful in different applications, and recently, deep learning has been applied for HSI classification successfully. However, the number of training samples is usually limited, making it difficult to use very deep learning models. We propose a wide and deep Fourier network to learn features efficiently by using pruned features extracted in the frequency domain. It is composed of multiple wide Fourier layers that extract hierarchical features layer by layer efficiently. Each wide Fourier layer includes a large number of Fourier transforms to extract features in the frequency domain from local spatial areas using sliding windows with given strides. The extracted features are pruned to retain important features and reduce computation, and the transform amplitudes are used for nonlinear processing of the pruned features. The weights in the final fully connected layers are computed using least squares. The proposed method was evaluated on HSI datasets including the Pavia University, KSC, and Salinas datasets. The overall accuracies (OAs) of the proposed method reach 99.77%, 99.97%, and 99.95%, respectively; the average accuracies (AAs) achieve 99.55%, 99.95%, and 99.95%; and the Kappa coefficients are as high as 99.69%, 99.96%, and 99.94%. The experimental results show that the proposed method achieves excellent performance compared with other methods. It can be used for classification and image segmentation tasks and can be implemented on lightweight embedded computing platforms. Future work will extend the method to applications including object detection and time-series prediction, and to faster implementations.
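The frequency-domain pipeline can be mimicked in a few NumPy lines: take FFT amplitudes of spatial patches, prune to a small feature subset, and solve the output weights in closed form with least squares. The window handling, the variance-based pruning rule, and all sizes below are assumptions, not the paper's exact procedure.

```python
# Fourier-feature sketch: FFT amplitudes from spatial patches, pruned to the
# highest-variance components, with a least-squares linear readout. A toy
# version of the wide-Fourier-layer idea; choices here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8, 8))                 # 500 samples of 8x8 patches (toy)
y = rng.integers(0, 4, size=500)

amps = np.abs(np.fft.fft2(X)).reshape(500, -1)   # frequency-domain amplitudes
keep = np.argsort(amps.var(axis=0))[-32:]        # prune: keep high-variance features
F = amps[:, keep]

Y = np.eye(4)[y]                                 # one-hot targets
W, *_ = np.linalg.lstsq(F, Y, rcond=None)        # closed-form output weights
pred = (F @ W).argmax(axis=1)
```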

13. Zhang Z, Li T, Tang X, Hu X, Peng Y. CAEVT: Convolutional Autoencoder Meets Lightweight Vision Transformer for Hyperspectral Image Classification. Sensors (Basel) 2022; 22:3902. PMID: 35632310; PMCID: PMC9146051; DOI: 10.3390/s22103902.
Abstract
Convolutional neural networks (CNNs) have been prominent in most hyperspectral image (HSI) processing applications due to their advantages in extracting local information. Despite their success, the locality of the convolutional layers within CNNs leads to heavyweight models and time-consuming processing. In this study, inspired by the excellent performance of transformers for long-range representation learning in computer vision tasks, we built a lightweight vision transformer for HSI classification that can extract local and global information simultaneously, thereby facilitating accurate classification. Moreover, as traditional dimensionality reduction methods are limited in their linear representation ability, a three-dimensional convolutional autoencoder was adopted to capture the nonlinear characteristics between spectral bands. Based on the three-dimensional convolutional autoencoder and lightweight vision transformer, we designed an HSI classification network, namely the "convolutional autoencoder meets lightweight vision transformer" (CAEVT). Finally, we validated the performance of the proposed CAEVT network using four widely used hyperspectral datasets. Our approach showed superiority, especially in the absence of sufficient labeled samples, which demonstrates the effectiveness and efficiency of the CAEVT network.
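The nonlinear band-reduction stage can be sketched as a small 3D convolutional autoencoder that strides along the spectral axis and trains with a reconstruction loss. The channel counts, strides, and names below are illustrative, not the CAEVT configuration.

```python
# 3D convolutional autoencoder sketch for nonlinear band reduction: encode the
# spectral axis, decode it back, and train with a reconstruction loss.
# Channel counts and strides are illustrative assumptions.
import torch
import torch.nn as nn

class SpectralAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(                       # compress the band axis 4x
            nn.Conv3d(1, 8, (7, 3, 3), stride=(4, 1, 1), padding=(3, 1, 1)), nn.ReLU())
        self.dec = nn.Sequential(                       # reconstruct the full cube
            nn.ConvTranspose3d(8, 1, (7, 3, 3), stride=(4, 1, 1), padding=(3, 1, 1),
                               output_padding=(3, 0, 0)))

    def forward(self, x):
        z = self.enc(x)                                 # (B, 8, bands/4, H, W)
        return self.dec(z), z

x = torch.randn(2, 1, 64, 9, 9)
recon, latent = SpectralAE()(x)
loss = nn.functional.mse_loss(recon, x)                 # reconstruction objective
```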
Affiliation(s)
- Zhiwen Zhang: The State Key Laboratory of High-Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China
- Teng Li: Beijing Institute for Advanced Study, National University of Defense Technology, Beijing 100020, China; College of Advanced Interdisciplinary Studies, National University of Defense Technology, Changsha 410073, China
- Xuebin Tang: The State Key Laboratory of High-Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China
- Xiang Hu: The State Key Laboratory of High-Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China
- Yuanxi Peng: The State Key Laboratory of High-Performance Computing, College of Computer, National University of Defense Technology, Changsha 410073, China

14. Multiscale Feature Fusion Network Incorporating 3D Self-Attention for Hyperspectral Image Classification. Remote Sensing 2022. DOI: 10.3390/rs14030742.
Abstract
In recent years, deep learning-based hyperspectral image (HSI) classification methods have achieved great success, and convolutional neural network (CNN) methods have achieved good performance on the HSI classification task. However, the convolutional operation works only on local neighborhoods: it is effective at extracting local features but struggles to capture interactions over long distances, which limits classification accuracy. At the same time, HSI data are three-dimensional, redundant, and noisy. To solve these problems, we propose a 3D self-attention multiscale feature fusion network (3DSA-MFN) that integrates 3D multi-head self-attention. 3DSA-MFN first uses differently sized convolution kernels to extract multiscale features, samples the different granularities of the feature map, and effectively fuses the spatial and spectral features of the feature map. Then, we propose an improved 3D multi-head self-attention mechanism that provides local feature details for the self-attention branch and fully exploits the context of the input matrix. To verify the performance of the proposed method, we compared it with six current methods on three public datasets. The experimental results show that the proposed 3DSA-MFN achieves competitive performance on the HSI classification task.
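The multiscale extraction step reduces to parallel convolutions with different kernel sizes whose outputs are concatenated. The sketch below shows just that stage (the 3D self-attention that follows in 3DSA-MFN is omitted), with illustrative branch widths and kernels.

```python
# Multiscale fusion sketch: parallel convolutions with different kernel sizes,
# concatenated into one feature map; a 3D self-attention stage would follow.
# Branch widths and kernels here are illustrative assumptions.
import torch
import torch.nn as nn

class MultiscaleBlock(nn.Module):
    def __init__(self, cin=30, cout=16):
        super().__init__()
        self.b3 = nn.Conv2d(cin, cout, 3, padding=1)   # fine detail
        self.b5 = nn.Conv2d(cin, cout, 5, padding=2)   # mid-range context
        self.b7 = nn.Conv2d(cin, cout, 7, padding=3)   # broad context

    def forward(self, x):
        return torch.relu(torch.cat([self.b3(x), self.b5(x), self.b7(x)], dim=1))

feats = MultiscaleBlock()(torch.randn(2, 30, 15, 15))  # (2, 48, 15, 15)
```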

15. Object Detection of Road Assets Using Transformer-Based YOLOX with Feature Pyramid Decoder on Thai Highway Panorama. Information 2021. DOI: 10.3390/info13010005.
Abstract
Detecting road assets of widely varying sizes, such as kilometer stones, remains a challenge that directly impacts the accuracy of object counts. Transformers have demonstrated impressive results in various natural language processing (NLP) and image processing tasks due to their long-range dependency modeling. This paper proposes an extension of the you only look once (YOLO) series with two contributions: (i) we employ a pre-training objective to obtain visual tokens from image patches of road asset images; using a pre-trained Vision Transformer (ViT) as a backbone, we fine-tune the model weights on downstream tasks by joining task layers upon the pre-trained encoder. (ii) We apply Feature Pyramid Network (FPN) decoder designs to our deep learning network to learn the importance of different input features rather than simply summing or concatenating them, which can cause feature mismatch and performance degradation. Our proposed method (Transformer-based YOLOX with FPN) learns very general object representations and significantly outperforms state-of-the-art (SOTA) detectors, including YOLOv5S, YOLOv5M, and YOLOv5L, reaching 61.5% AP on the Thailand highway corpus and surpassing the current best practice (YOLOv5L) by 2.56% AP on the test-dev data set.

16. Transformer-Based Decoder Designs for Semantic Segmentation on Remotely Sensed Images. Remote Sensing 2021. DOI: 10.3390/rs13245100.
Abstract
Transformers have demonstrated remarkable accomplishments in several natural language processing (NLP) tasks as well as image processing tasks. Herein, we present a deep-learning (DL) model that improves the semantic segmentation network in two ways. First, using a pre-trained Swin Transformer (SwinTF), a Vision Transformer (ViT) variant, as a backbone, the model weights are fine-tuned on downstream tasks by joining task layers upon the pretrained encoder. Second, three decoder designs, U-Net, pyramid scene parsing (PSP) network, and feature pyramid network (FPN), are applied to our DL network to perform pixel-level segmentation. The results are compared with other state-of-the-art (SOTA) image labeling methods, such as the global convolutional network (GCN) and ViT. Extensive experiments show that our SwinTF with decoder designs reached a new state of the art on the Thailand Isan Landsat-8 corpus (89.8% F1 score) and the Thailand North Landsat-8 corpus (63.12% F1 score), with competitive results on ISPRS Vaihingen. Moreover, both of our best methods (SwinTF-PSP and SwinTF-FPN) outperformed SwinTF with supervised ImageNet-1K pre-training on the Thailand Landsat-8 and ISPRS Vaihingen corpora.
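The FPN decoder design mentioned here follows a standard top-down pattern, sketched below with generic channel sizes; it is a minimal FPN head under assumed feature shapes, not the SwinTF configuration.

```python
# FPN-style decoder sketch: top-down pathway that upsamples deep features and
# fuses them with laterally projected shallow features before pixel prediction.
# A generic decoder shape with assumed channel sizes, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNDecoder(nn.Module):
    def __init__(self, in_dims=(96, 192, 384), dim=64, classes=5):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_dims)
        self.classify = nn.Conv2d(dim, classes, 1)

    def forward(self, feats):                    # shallow -> deep feature maps
        x = self.lateral[-1](feats[-1])
        for lat, f in zip(self.lateral[-2::-1], feats[-2::-1]):
            x = F.interpolate(x, size=f.shape[-2:], mode="nearest") + lat(f)
        return self.classify(x)                  # per-pixel class logits

feats = [torch.randn(1, 96, 64, 64), torch.randn(1, 192, 32, 32),
         torch.randn(1, 384, 16, 16)]
masks = FPNDecoder()(feats)                      # (1, 5, 64, 64)
```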

17. Memory-Augmented Transformer for Remote Sensing Image Semantic Segmentation. Remote Sensing 2021. DOI: 10.3390/rs13224518.
Abstract
The semantic segmentation of remote sensing images requires distinguishing local regions of different classes and exploiting a uniform global representation of same-class instances. Such requirements make it necessary for segmentation methods to extract discriminative local features between different classes and to explore representative features for all instances of a given class. While common deep convolutional neural networks (DCNNs) can effectively focus on local features, they are limited by their receptive field in obtaining consistent global information. In this paper, we propose a memory-augmented transformer (MAT) to effectively model both local and global information. The feature extraction pipeline of the MAT is split into a memory-based global relationship guidance module and a local feature extraction module. The local feature extraction module mainly consists of a transformer, which is used to extract features from the input images. The global relationship guidance module maintains a memory bank for the consistent encoding of global information, and global guidance is performed by memory interaction. Bidirectional information flow between the global and local branches is conducted by a memory-query module and a memory-update module. Experimental results on the ISPRS Potsdam and ISPRS Vaihingen datasets demonstrate that our method performs competitively with state-of-the-art methods.
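The memory-query/memory-update interplay can be sketched as cross-attention against a learned memory bank plus a slow, momentum-style refresh. The slot count, momentum value, and update rule below are assumptions, offered as a toy rendering of the idea rather than the MAT implementation.

```python
# Memory-bank guidance sketch: local features query a global memory bank via
# cross-attention ("memory-query"), and the bank is refreshed from the batch
# ("memory-update"). Slot count, momentum, and update rule are assumptions.
import torch
import torch.nn as nn

class MemoryGuidance(nn.Module):
    def __init__(self, dim=64, slots=8, momentum=0.9):
        super().__init__()
        self.register_buffer("memory", torch.randn(slots, dim))  # global memory bank
        self.query = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.momentum = momentum

    def forward(self, feats):                      # (B, N, dim) local features
        mem = self.memory.expand(feats.size(0), -1, -1)
        guided, _ = self.query(feats, mem, mem)    # memory-query: read global cues
        with torch.no_grad():                      # memory-update: slow refresh
            self.memory = (self.momentum * self.memory
                           + (1 - self.momentum) * feats.mean(dim=(0, 1)))
        return feats + guided

out = MemoryGuidance()(torch.randn(2, 256, 64))
```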

18. Building Extraction from Remote Sensing Images with Sparse Token Transformers. Remote Sensing 2021. DOI: 10.3390/rs13214441.
Abstract
Deep learning methods have achieved considerable progress in remote sensing image building extraction. Most building extraction methods are based on Convolutional Neural Networks (CNN). Recently, vision transformers have provided a better perspective for modeling long-range context in images, but they usually suffer from high computational complexity and memory usage. In this paper, we explore the potential of using transformers for efficient building extraction. We design an efficient dual-pathway transformer structure that learns the long-term dependency of tokens in both their spatial and channel dimensions and achieves state-of-the-art accuracy on benchmark building extraction datasets. Since single buildings in remote sensing images usually occupy only a very small part of the image pixels, we represent buildings as a set of "sparse" feature vectors in their feature space by introducing a new module called the "sparse token sampler". With this design, the computational complexity of the transformer can be reduced by over an order of magnitude. We refer to our method as Sparse Token Transformers (STT). Experiments conducted on the Wuhan University Aerial Building Dataset (WHU) and the Inria Aerial Image Labeling Dataset (INRIA) suggest the effectiveness and efficiency of our method. Compared with some widely used segmentation methods and some state-of-the-art building extraction methods, STT achieves the best performance with low time cost.
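The "sparse token sampler" can be read as a learned top-k selection over spatial tokens, sketched below. The linear scoring head and k=32 are assumptions, not the STT implementation.

```python
# Sparse token sampler sketch: score every spatial token, keep only the top-k
# for the transformer, since buildings cover few pixels. A minimal reading of
# the "sparse token" idea; the scoring head and k are assumptions.
import torch
import torch.nn as nn

class SparseTokenSampler(nn.Module):
    def __init__(self, dim=64, k=32):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.k = k

    def forward(self, tokens):                       # (B, N, dim), N = H*W
        s = self.score(tokens).squeeze(-1)           # (B, N) saliency scores
        idx = s.topk(self.k, dim=1).indices          # keep the k highest
        batch = torch.arange(tokens.size(0)).unsqueeze(1)
        return tokens[batch, idx]                    # (B, k, dim) sparse token set

sparse = SparseTokenSampler()(torch.randn(2, 1024, 64))  # 1024 -> 32 tokens
```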