1
Chen K, Li Z, Zhou F, Yu Z. CASF-Net: Underwater Image Enhancement with Color Correction and Spatial Fusion. Sensors (Basel, Switzerland) 2025; 25:2574. [PMID: 40285262; PMCID: PMC12030774; DOI: 10.3390/s25082574] [Received: 02/17/2025; Revised: 03/28/2025; Accepted: 04/15/2025; Indexed: 04/29/2025]
Abstract
With the exploration and exploitation of marine resources, underwater images, which serve as crucial carriers of underwater information, significantly influence the advancement of related fields. Although dozens of underwater image enhancement (UIE) methods have been proposed, insufficient contrast and surface-texture distortion during UIE remain underappreciated. To address these challenges, we propose a novel UIE method, channel-adaptive and spatial-fusion Net (CASF-Net), which uses a channel-adaptive correction module (CACM) to enhance feature extraction and color correction, resolving the problem of insufficient contrast. In addition, CASF-Net utilizes a spatial multi-scale fusion module (SMFM) to address surface-texture distortion and effectively improve underwater image saturation. Furthermore, we propose a Large-scale High-resolution Underwater Image Enhancement dataset (LHUI), which contains 13,080 pairs of high-resolution images with sufficient diversity for efficient UIE training. Experimental results show that the proposed network performs well on the UIE task compared with existing methods.
Affiliation(s)
- Kai Chen
- Key Laboratory of Ocean Observation and Information of Hainan Province, Sanya Oceanographic Institution, Ocean University of China, Sanya 572000, China; (K.C.); (Z.L.); (F.Z.)
- Zhenhao Li
- Key Laboratory of Ocean Observation and Information of Hainan Province, Sanya Oceanographic Institution, Ocean University of China, Sanya 572000, China; (K.C.); (Z.L.); (F.Z.)
- Fanting Zhou
- Key Laboratory of Ocean Observation and Information of Hainan Province, Sanya Oceanographic Institution, Ocean University of China, Sanya 572000, China; (K.C.); (Z.L.); (F.Z.)
- Zhibin Yu
- Key Laboratory of Ocean Observation and Information of Hainan Province, Sanya Oceanographic Institution, Ocean University of China, Sanya 572000, China; (K.C.); (Z.L.); (F.Z.)
- Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266100, China
2
Sun H, Ding X, Li Z, Sun J, Yu H, Zhang J. Linguistic-visual based multimodal Yi character recognition. Sci Rep 2025; 15:11874. [PMID: 40195531; PMCID: PMC11977249; DOI: 10.1038/s41598-025-96397-6] [Received: 11/26/2024; Accepted: 03/27/2025; Indexed: 04/09/2025]
Abstract
The recognition of Yi characters is challenged by considerable variability in their morphological structures and complex semantic relationships, leading to decreased recognition accuracy. This paper presents a multimodal Yi character recognition method that comprehensively incorporates linguistic and visual features. In the visual modeling phase, a vision transformer integrated with deformable convolution captures key features and adapts to variations in Yi character images, improving recognition accuracy, particularly for images with deformations and complex backgrounds. In the linguistic modeling phase, a Pyramid Pooling Transformer incorporates semantic contextual information across multiple scales, enhancing feature representation and capturing detailed linguistic structure. Finally, a fusion strategy based on the cross-attention mechanism refines the relationships between feature regions and combines features from the two modalities, achieving high-precision character recognition. Experimental results demonstrate that the proposed method achieves a recognition accuracy of 99.5%, surpassing baseline methods by 3.4% and validating its effectiveness.
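The cross-attention fusion step summarized in this abstract can be sketched in a few lines. The following is an illustrative single-head NumPy version with identity projections and made-up shapes, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, linguistic):
    """Fuse two modalities: visual tokens query the linguistic tokens.

    visual:     (N_v, d) array of visual token features
    linguistic: (N_l, d) array of linguistic token features
    Returns (N_v, d) visual tokens enriched with linguistic context.
    """
    d = visual.shape[-1]
    Q, K, V = visual, linguistic, linguistic   # identity projections for brevity
    scores = Q @ K.T / np.sqrt(d)              # (N_v, N_l) cross-modal affinities
    return softmax(scores, axis=-1) @ V        # weighted sum of linguistic tokens

rng = np.random.default_rng(0)
vis = rng.standard_normal((16, 64))   # e.g. 16 visual patches
lin = rng.standard_normal((8, 64))    # e.g. 8 linguistic tokens
fused = cross_attention(vis, lin)
print(fused.shape)  # (16, 64)
```

A real module would add learned query/key/value projections and multiple heads; the core data flow is the same.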
Affiliation(s)
- Haipeng Sun
- Key Laboratory of Ethnic Language Intelligent Analysis and Security Management of MOE, Minzu University of China, Beijing, 100081, China
- School of Chinese Ethnic Minority Languages and Literatures, Minzu University of China, Beijing, 100081, China
- School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116600, China
- Xueyan Ding
- School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116600, China.
- Zimeng Li
- School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116600, China
- Jian Sun
- School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116600, China
- Hua Yu
- Yi Language Research Room, China Ethnic Languages Translation Centre, Beijing, 100080, China
- Jianxin Zhang
- School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116600, China.
3
Zhang X, Xiao Z, Wu X, Chen Y, Zhao J, Hu Y, Liu J. Pyramid Pixel Context Adaption Network for Medical Image Classification With Supervised Contrastive Learning. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:6802-6815. [PMID: 38829749; DOI: 10.1109/tnnls.2024.3399164] [Indexed: 06/05/2024]
Abstract
The spatial attention (SA) mechanism has been widely incorporated into deep neural networks (DNNs), significantly improving performance in computer vision tasks via long-range dependency modeling. However, it may perform poorly in medical image analysis, and existing efforts are often unaware that long-range dependency modeling has limitations in highlighting subtle lesion regions. To overcome this limitation, we propose a practical yet lightweight architectural unit, the pyramid pixel context adaption (PPCA) module, which exploits multiscale pixel context information to dynamically recalibrate pixel positions in a pixel-independent manner. PPCA first applies a well-designed cross-channel pyramid pooling (CCPP) to aggregate multiscale pixel context information, then eliminates the inconsistency among scales with a pixel normalization (PN) step, and finally estimates a per-pixel attention weight via pixel context integration. By embedding PPCA into a DNN with negligible overhead, the PPCA network (PPCANet) is developed for medical image classification. In addition, we introduce supervised contrastive learning to enhance feature representation by exploiting label information via a supervised contrastive loss. Extensive experiments on six medical image datasets show that PPCANet outperforms state-of-the-art (SOTA) attention-based networks and recent DNNs. We also provide a visual analysis and an ablation study to explain the behavior of PPCANet in the decision-making process.
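As a rough illustration of the PPCA idea (multiscale channel-context pooling, normalization, per-pixel gating), the following NumPy sketch is a simplification under stated assumptions; the module's learned components and the exact CCPP/PN formulations are not reproduced here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ppca_sketch(feat, scales=(1, 2, 4)):
    """Per-pixel attention from multiscale channel context (illustrative only).

    feat: (C, H, W) feature map, C divisible by every scale. For each scale s,
    channels are split into s groups and averaged, giving s context maps; the
    pooled contexts are standardized over spatial positions and combined into
    one per-pixel weight that recalibrates every pixel independently.
    """
    C, H, W = feat.shape
    contexts = []
    for s in scales:
        groups = feat.reshape(s, C // s, H, W).mean(axis=1)  # (s, H, W)
        contexts.append(groups)
    ctx = np.concatenate(contexts, axis=0)                   # (sum(scales), H, W)
    mu = ctx.mean(axis=(1, 2), keepdims=True)                # normalize each map
    sd = ctx.std(axis=(1, 2), keepdims=True) + 1e-6          # over spatial dims
    ctx = (ctx - mu) / sd
    weight = sigmoid(ctx.mean(axis=0))                       # (H, W) attention
    return feat * weight                                     # recalibrated map
```

Note that each pixel's weight depends only on pooled context at that position, which is what makes the recalibration pixel-independent.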
4
Lagzouli A, Pivonka P, Cooper DML, Sansalone V, Othmani A. A robust deep learning approach for segmenting cortical and trabecular bone from 3D high resolution µCT scans of mouse bone. Sci Rep 2025; 15:8656. [PMID: 40082604; PMCID: PMC11906900; DOI: 10.1038/s41598-025-92954-1] [Received: 08/06/2024; Accepted: 03/04/2025; Indexed: 03/16/2025]
Abstract
Recent advancements in deep learning have significantly enhanced the segmentation of high-resolution microcomputed tomography (µCT) bone scans. In this paper, we present the dual-branch attention-based hybrid network (DBAHNet), a deep learning architecture designed for automatically segmenting the cortical and trabecular compartments in 3D µCT scans of mouse tibiae. DBAHNet's hierarchical structure combines transformers and convolutional neural networks to capture long-range dependencies and local features for improved contextual representation. We trained DBAHNet on a limited dataset of 3D µCT scans of mouse tibiae and evaluated its performance on a diverse dataset collected from seven different research studies. This evaluation covered variations in resolutions, ages, mouse strains, drug treatments, surgical procedures, and mechanical loading. DBAHNet demonstrated excellent performance, achieving high accuracy, particularly in challenging scenarios with significantly altered bone morphology. The model's robustness and generalization capabilities were rigorously tested under diverse and unseen conditions, confirming its effectiveness in the automated segmentation of high-resolution µCT mouse tibia scans. Our findings highlight DBAHNet's potential to provide reliable and accurate 3D µCT mouse tibia segmentation, thereby enhancing and accelerating preclinical bone studies in drug development. The model and code are available at https://github.com/bigfahma/DBAHNet.
Affiliation(s)
- Amine Lagzouli
- School of Mechanical, Medical, and Process Engineering, Queensland University of Technology, Brisbane, Australia.
- Univ Paris Est Créteil, Univ Gustave Eiffel, CNRS, UMR 8208, MSME, F-94010, Créteil, France.
- Peter Pivonka
- School of Mechanical, Medical, and Process Engineering, Queensland University of Technology, Brisbane, Australia
- David M L Cooper
- Department of Anatomy, Physiology, and Pharmacology, University of Saskatchewan, Saskatoon, SK, Canada
- Vittorio Sansalone
- Univ Paris Est Créteil, Univ Gustave Eiffel, CNRS, UMR 8208, MSME, F-94010, Créteil, France
- Alice Othmani
- LISSI, Université Paris-Est Creteil (UPEC), 94400, Vitry sur Seine, France.
5
Zhou X, Jiang Z, Zhou S, Ren Z, Zhang Y, Yu T, Liu Y. Frequency-Assisted Local Attention in Lower Layers of Visual Transformers. Int J Neural Syst 2025:2550015. [PMID: 40016195; DOI: 10.1142/s0129065725500157] [Indexed: 03/01/2025]
Abstract
Since vision transformers excel at establishing global relationships between features, they play an important role in current vision tasks. However, the global attention mechanism restricts the capture of local features, making convolutional assistance necessary. This paper shows that transformer-based models can attend to local information without using convolutional blocks, similar to convolutional kernels, by employing a special initialization method. Accordingly, this paper proposes a novel hybrid multi-scale model called the Frequency-Assisted Local Attention Transformer (FALAT). FALAT introduces a Frequency-Assisted Window-based Positional Self-Attention (FWPSA) module that limits the attention distance of query tokens, enabling the capture of local content in the early stages, while information from value tokens in the frequency domain enhances information diversity during self-attention computation. Additionally, the traditional convolutional downsampling is replaced with a depth-wise separable convolution in the spatial-reduction attention module for long-distance content in the later stages. Experimental results demonstrate that FALAT-S achieves 83.0% accuracy on ImageNet-1k with an input size of [Formula: see text] using 29.9M parameters and 5.6G FLOPs. The model outperforms Next-ViT-S by 0.9 APb/0.8 APm with Mask R-CNN [Formula: see text] on COCO and surpasses the recent FastViT-SA36 by 3.1% mIoU with FPN on ADE20k.
Affiliation(s)
- Xin Zhou
- School of Mechanical Engineering and Automation, Northeastern University, Wenhua Road, Shenyang, Liaoning, P. R. China
- Zeyu Jiang
- School of Mechanical Engineering and Automation, Northeastern University, Wenhua Road, Shenyang, Liaoning, P. R. China
- Shihua Zhou
- School of Mechanical Engineering and Automation, Northeastern University, Wenhua Road, Shenyang, Liaoning, P. R. China
- Zhaohui Ren
- School of Mechanical Engineering and Automation, Northeastern University, Wenhua Road, Shenyang, Liaoning, P. R. China
- Yongchao Zhang
- School of Mechanical Engineering and Automation, Northeastern University, Wenhua Road, Shenyang, Liaoning, P. R. China
- Tianzhuang Yu
- School of Mechanical Engineering and Automation, Northeastern University, Wenhua Road, Shenyang, Liaoning, P. R. China
- Yulin Liu
- School of Mechanical Engineering and Automation, Northeastern University, Wenhua Road, Shenyang, Liaoning, P. R. China
6
Zhang T, Chen C, Liu Y, Geng X, Aly MMS, Lin J. PSRR-MaxpoolNMS++: Fast Non-Maximum Suppression With Discretization and Pooling. IEEE Transactions on Pattern Analysis and Machine Intelligence 2025; 47:978-993. [PMID: 39466860; DOI: 10.1109/tpami.2024.3485898] [Indexed: 10/30/2024]
Abstract
Non-maximum suppression (NMS) is an essential post-processing step for object detection. The de facto standard, GreedyNMS, is not parallelizable and can thus be the performance bottleneck in object detection pipelines. MaxpoolNMS was introduced as a fast and parallelizable alternative, but it can only replace GreedyNMS at the first stage of two-stage detectors such as Faster R-CNN. To address this issue, we observe that MaxpoolNMS discards the nested-loop pipeline of GreedyNMS by combining box coordinate discretization with local score argmax calculation, which enables parallelizable implementations. In this paper, we introduce a simple Relationship Recovery module and a Pyramid Shifted MaxpoolNMS module to improve these two stages, respectively. With these two modules, our PSRR-MaxpoolNMS is a generic and parallelizable approach that can completely replace GreedyNMS at all stages in all detectors. Furthermore, we extend PSRR-MaxpoolNMS to the more powerful PSRR-MaxpoolNMS++: for box coordinate discretization, we propose density-based discretization for better adherence to the target suppression density, and for local score argmax calculation, we propose an adjacent-scale pooling scheme that mines duplicated box pairs more accurately and efficiently. Extensive experiments demonstrate that both PSRR-MaxpoolNMS and PSRR-MaxpoolNMS++ outperform MaxpoolNMS by a large margin, and that PSRR-MaxpoolNMS++ not only surpasses PSRR-MaxpoolNMS but also attains competitive accuracy and much better efficiency than GreedyNMS. PSRR-MaxpoolNMS++ is therefore a parallelizable NMS solution that can effectively replace GreedyNMS at all stages in all detectors.
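The discretize-then-local-argmax idea underlying MaxpoolNMS-style methods can be sketched as follows. This is a toy NumPy version with an assumed grid size; the paper's PSRR modules and pyramid of shifted score maps are not modeled:

```python
import numpy as np

def maxpool_nms(boxes, scores, grid=32, win=3):
    """Parallel-friendly NMS sketch: map each box center onto a score grid,
    then keep only boxes whose score is the maximum within a win x win
    window (the "max-pooling" step replaces greedy nested loops)."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    cx = ((boxes[:, 0] + boxes[:, 2]) / 2 * (grid - 1)).astype(int)
    cy = ((boxes[:, 1] + boxes[:, 3]) / 2 * (grid - 1)).astype(int)
    smap = np.zeros((grid, grid))
    np.maximum.at(smap, (cy, cx), scores)        # best score per grid cell
    pad = win // 2
    padded = np.pad(smap, pad, constant_values=-np.inf)
    local = np.full_like(smap, -np.inf)          # window maximum for each cell
    for dy in range(win):
        for dx in range(win):
            local = np.maximum(local, padded[dy:dy + grid, dx:dx + grid])
    return [i for i in range(len(scores))
            if scores[i] == smap[cy[i], cx[i]] == local[cy[i], cx[i]]]

boxes = [[0.10, 0.10, 0.30, 0.30],   # two overlapping detections of one object
         [0.12, 0.12, 0.32, 0.32],
         [0.70, 0.70, 0.90, 0.90]]   # one distant object
scores = [0.9, 0.8, 0.7]
print(maxpool_nms(boxes, scores))    # [0, 2]: one box per object survives
```

Every step is elementwise or a fixed-size pooling, which is why this family of methods parallelizes well, unlike the sequential greedy loop.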
7
Wu Z, Wang W, Wang L, Li Y, Lv F, Xia Q, Chen C, Hao A, Li S. Pixel is All You Need: Adversarial Spatio-Temporal Ensemble Active Learning for Salient Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 2025; 47:858-877. [PMID: 39383082; DOI: 10.1109/tpami.2024.3476683] [Indexed: 10/11/2024]
Abstract
Although weakly-supervised techniques can reduce the labeling effort, it is unclear whether a saliency model trained with weakly-supervised data (e.g., point annotations) can achieve performance equivalent to its fully-supervised version. This paper attempts to answer this unexplored question by proving a hypothesis: there exists a point-labeled dataset on which trained saliency models achieve performance equivalent to training on a densely annotated dataset. To prove this conjecture, we propose a novel yet effective adversarial spatio-temporal ensemble active learning method. Our contributions are four-fold: 1) our proposed uncertainty-triggering adversarial attack overcomes the overconfidence of existing active learning methods and accurately locates uncertain pixels; 2) our spatio-temporal ensemble strategy not only achieves outstanding performance but also significantly reduces the model's computational cost; 3) our relationship-aware diversity sampling avoids oversampling while boosting model performance; and 4) we provide a theoretical proof of the existence of such a point-labeled dataset. Experimental results show that our approach can find such a point-labeled dataset: a saliency model trained on it obtains 98%-99% of the performance of its fully-supervised version with only ten annotated points per image.
8
Song X, Tian Y, Liu H, Wang L, Niu J. PPLA-Transformer: An Efficient Transformer for Defect Detection with Linear Attention Based on Pyramid Pooling. Sensors (Basel, Switzerland) 2025; 25:828. [PMID: 39943467; PMCID: PMC11820098; DOI: 10.3390/s25030828] [Received: 12/15/2024; Revised: 01/15/2025; Accepted: 01/28/2025; Indexed: 02/16/2025]
Abstract
Defect detection is crucial for quality control of industrial products. Defects in industrial products are typically subtle, which reduces detection accuracy, and industrial defect detection often requires high efficiency to meet operational demands. Deep learning-based algorithms for surface defect detection are increasingly applied in industrial production, and among them the Swin Transformer has achieved remarkable success in many visual tasks. However, the computational burden imposed by numerous image tokens limits its application. To enhance both detection accuracy and efficiency, this paper proposes a linear attention mechanism based on pyramid pooling. It uses a more concise linear attention mechanism to reduce the computational load, improving detection efficiency, and enhances global feature extraction through pyramid pooling, improving detection accuracy. Additionally, incorporating partial convolution into the model improves local feature extraction, further enhancing detection precision. Our model demonstrates satisfactory performance at minimal computational cost: it outperforms the Swin Transformer by 1.2% mAP and 52 FPS on our self-constructed SIM-card-slot defect dataset and by 1.7% mAP and 51 FPS on the public PKU-Market-PCB dataset. These results validate the generality of the proposed approach.
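Linear attention, as mentioned above, exploits the associativity of matrix products to avoid forming the quadratic score matrix. A minimal NumPy sketch, using the common elu+1 feature map as an assumption (the paper's pyramid-pooling compression of keys and values is omitted):

```python
import numpy as np

def elu_plus_one(x):
    # ELU(x) + 1: a strictly positive feature map often used in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(N) attention: associativity lets us compute phi(K)^T V once,
    a (d x d_v) summary, instead of the (N x N) softmax score matrix."""
    Qp, Kp = elu_plus_one(Q), elu_plus_one(K)
    kv = Kp.T @ V                        # (d, d_v) summary, independent of N
    z = Qp @ Kp.sum(axis=0)              # (N,) per-query normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
Q = rng.standard_normal((100, 16))
K = rng.standard_normal((100, 16))
V = rng.standard_normal((100, 16))
out = linear_attention(Q, K, V)
print(out.shape)  # (100, 16)
```

Because each output row is a normalized positive combination of value rows, feeding identical value rows returns them unchanged, which is a handy sanity check.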
Affiliation(s)
- Lijun Wang
- School of Mechanical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China; (X.S.); (Y.T.); (H.L.); (J.N.)
9
Yi J, Liu X, Cheng S, Chen L, Zeng S. Multi-scale window transformer for cervical cytopathology image recognition. Comput Struct Biotechnol J 2024; 24:314-321. [PMID: 38681132; PMCID: PMC11046249; DOI: 10.1016/j.csbj.2024.04.028] [Received: 10/10/2023; Revised: 04/09/2024; Accepted: 04/10/2024; Indexed: 05/01/2024]
Abstract
Cervical cancer is a major global health issue, particularly in developing countries where access to healthcare is limited. Early detection of pre-cancerous lesions is crucial for successful treatment and reducing mortality rates. However, traditional screening and diagnostic processes require cytopathology doctors to manually interpret a huge number of cells, which is time-consuming, costly, and dependent on human experience. In this paper, we propose a Multi-scale Window Transformer (MWT) for cervical cytopathology image recognition. We design multi-scale window multi-head self-attention (MW-MSA) to simultaneously integrate cell features of different scales: small-window self-attention extracts local cell detail features, and large-window self-attention integrates features from the smaller-scale window attention to achieve window-to-window information interaction. Our design enables long-range feature integration but avoids the whole-image self-attention (SA) of ViT and the twice-computed local window SA of the Swin Transformer. We also find convolutional feed-forward networks (CFFN) more efficient than the original MLP-based FFN for representing cytopathology images. The overall model adopts a pyramid architecture. We establish two multi-center cervical cell classification datasets: a two-category set of 192,123 images and a four-category set of 174,138 images. Extensive experiments demonstrate that our MWT outperforms state-of-the-art general classification networks and specialized cytopathology image classifiers on both the internal and external test sets, and the results on these large-scale datasets prove the effectiveness and generalization of the proposed model. Our work provides a reliable cytopathology image recognition method and helps establish computer-aided screening for cervical cancer. Our code is available at https://github.com/nmyz669/MWT, and our web service tool can be accessed at https://huggingface.co/spaces/nmyz/MWTdemo.
Affiliation(s)
- Jiaxiang Yi
- Britton Chance Center and MoE Key Laboratory for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics-Huazhong University of Science and Technology, Wuhan, China
- Xiuli Liu
- Britton Chance Center and MoE Key Laboratory for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics-Huazhong University of Science and Technology, Wuhan, China
- Shenghua Cheng
- School of Biomedical Engineering and Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou, China
- Li Chen
- Department of Clinical Laboratory, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Shaoqun Zeng
- Britton Chance Center and MoE Key Laboratory for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics-Huazhong University of Science and Technology, Wuhan, China
10
Zhao X, Xu W. NFMPAtt-Unet: Neighborhood Fuzzy C-means Multi-scale Pyramid Hybrid Attention Unet for medical image segmentation. Neural Netw 2024; 178:106489. [PMID: 38959598; DOI: 10.1016/j.neunet.2024.106489] [Received: 02/07/2024; Revised: 05/27/2024; Accepted: 06/20/2024; Indexed: 07/05/2024]
Abstract
Medical image segmentation is crucial for understanding anatomical or pathological changes, playing a key role in computer-aided diagnosis and intelligent healthcare. Two important issues remain to be addressed: segmenting blurry edge regions and the generalizability of segmentation models. This study therefore targets different medical image segmentation tasks and the problem of blurriness, improving diagnostic efficiency and accuracy. To optimize segmentation performance and leverage feature information, we propose a Neighborhood Fuzzy C-means Multi-scale Pyramid Hybrid Attention Unet (NFMPAtt-Unet) model. NFMPAtt-Unet comprises three core components: a multiscale dynamic-weight feature pyramid module (MDWFP), a hybrid weighted attention mechanism (HWA), and a neighborhood-rough-set-based fuzzy C-means feature extraction module (NFCMFE). The MDWFP dynamically adjusts weights across multiple scales, improving the capture of feature information. The HWA enhances the network's ability to capture and utilize crucial features, while the NFCMFE, grounded in neighborhood rough set concepts, performs fuzzy C-means feature extraction, addressing complex structures and uncertainties in medical images and thereby enhancing adaptability. Experimental results demonstrate that NFMPAtt-Unet outperforms state-of-the-art models, highlighting its efficacy in medical image segmentation.
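For readers unfamiliar with the fuzzy C-means component, a plain NumPy implementation of standard FCM (not the paper's neighborhood-rough-set variant) looks like this:

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, iters=50, seed=0):
    """Soft clustering: each point gets a membership in every cluster.

    X: (N, d) data. Returns (centers (c, d), memberships U (N, c), rows
    summing to 1). m > 1 controls fuzziness (m -> 1 approaches hard k-means).
    """
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m                                    # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted means
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1) + 1e-12  # (N, c)
        inv = d2 ** (-1.0 / (m - 1))                  # closer => larger
        U = inv / inv.sum(axis=1, keepdims=True)      # renormalize rows
    return centers, U
```

The soft memberships are what make FCM attractive for blurry edge regions: a boundary pixel can belong partially to both tissue classes instead of being forced into one.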
Affiliation(s)
- Xinpeng Zhao
- College of Artificial Intelligence, Southwest University, Chongqing, 400715, PR China.
- Weihua Xu
- College of Artificial Intelligence, Southwest University, Chongqing, 400715, PR China.
11
Peng J, Mei H, Yang R, Meng K, Shi L, Zhao J, Zhang B, Xuan F, Wang T, Zhang T. Olfactory Diagnosis Model for Lung Health Evaluation Based on Pyramid Pooling and SHAP-Based Dual Encoders. ACS Sens 2024; 9:4934-4946. [PMID: 39248698; DOI: 10.1021/acssensors.4c01584] [Indexed: 09/10/2024]
Abstract
This study introduces a novel deep learning framework for lung health evaluation using exhaled gas. The framework synergistically integrates pyramid pooling and a dual-encoder network, leveraging SHapley Additive exPlanations (SHAP)-derived feature importance to enhance its predictive capability. It is specifically designed to distinguish between smokers, individuals with chronic obstructive pulmonary disease (COPD), and control subjects. The pyramid pooling structure aggregates multilevel global information by pooling features at four scales, and SHAP assesses the importance of features from the eight sensors. Two encoder architectures handle different feature sets according to their importance, optimizing performance. In addition, the model's robustness is enhanced with a sliding-window technique and white-noise augmentation of the original data. In 5-fold cross-validation, the model achieved an average accuracy of 96.40%, surpassing a single-encoder pyramid pooling model by 10.77%. Further optimizing the filters in the transformer convolutional layer and the pooling sizes in the pyramid module increased the accuracy to 98.46%. This study offers an efficient tool for identifying the effects of smoking and COPD, and a novel approach to applying deep learning to complex biomedical problems.
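The four-scale pyramid pooling described above can be illustrated on a 1-D sensor sequence. The bin counts and mean pooling below are assumptions for the sketch, not the paper's exact configuration:

```python
import numpy as np

def pyramid_pool(seq, bins=(1, 2, 4, 8)):
    """Aggregate a 1-D feature sequence at several scales.

    seq: (T, d) per-timestep features, e.g. readings from a sensor array.
    For each scale, the sequence is split into that many roughly equal bins,
    each bin is mean-pooled, and all pooled vectors are concatenated,
    giving a fixed-size multilevel global descriptor regardless of T.
    """
    out = []
    for b in bins:
        for chunk in np.array_split(seq, b, axis=0):
            out.append(chunk.mean(axis=0))
    return np.concatenate(out)        # shape (sum(bins) * d,)

seq = np.arange(16.0).reshape(16, 1)  # toy 16-step, 1-feature sequence
desc = pyramid_pool(seq)
print(desc.shape)  # (15,): 1 + 2 + 4 + 8 pooled values
```

The coarsest bin captures the global average while the finest bins preserve local trends, which is the "multilevel global information" the abstract refers to.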
Affiliation(s)
- Jingyi Peng
- Key Lab Intelligent Rehabil & Barrier Free Disable (Ministry of Education), Changchun University, Changchun 130022, China
- Haixia Mei
- Key Lab Intelligent Rehabil & Barrier Free Disable (Ministry of Education), Changchun University, Changchun 130022, China
- Ruiming Yang
- Key Lab Intelligent Rehabil & Barrier Free Disable (Ministry of Education), Changchun University, Changchun 130022, China
- Keyu Meng
- Key Lab Intelligent Rehabil & Barrier Free Disable (Ministry of Education), Changchun University, Changchun 130022, China
- Lijuan Shi
- Key Lab Intelligent Rehabil & Barrier Free Disable (Ministry of Education), Changchun University, Changchun 130022, China
- Jian Zhao
- Key Lab Intelligent Rehabil & Barrier Free Disable (Ministry of Education), Changchun University, Changchun 130022, China
- Bowei Zhang
- Shanghai Key Laboratory of Intelligent Sensing and Detection Technology, School of Mechanical and Power Engineering, East China University of Science and Technology, Shanghai 200237, China
- Fuzhen Xuan
- Shanghai Key Laboratory of Intelligent Sensing and Detection Technology, School of Mechanical and Power Engineering, East China University of Science and Technology, Shanghai 200237, China
- Tao Wang
- Shanghai Key Laboratory of Intelligent Sensing and Detection Technology, School of Mechanical and Power Engineering, East China University of Science and Technology, Shanghai 200237, China
- Tong Zhang
- State Key Laboratory of Integrated Optoelectronics, College of Electronic Science and Engineering, Jilin University, Changchun 130012, China
12
Pei J, Jiang T, Tang H, Liu N, Jin Y, Fan DP, Heng PA. CalibNet: Dual-Branch Cross-Modal Calibration for RGB-D Salient Instance Segmentation. IEEE Transactions on Image Processing 2024; 33:4348-4362. [PMID: 39074016; DOI: 10.1109/tip.2024.3432328] [Indexed: 07/31/2024]
Abstract
In this study, we propose a novel approach for RGB-D salient instance segmentation using a dual-branch cross-modal feature calibration architecture called CalibNet. Our method simultaneously calibrates depth and RGB features in the kernel and mask branches to generate instance-aware kernels and mask features. CalibNet consists of three simple modules: a dynamic interactive kernel (DIK) module and a weight-sharing fusion (WSF) module, which work together to generate effective instance-aware kernels and integrate cross-modal features, and a depth similarity assessment (DSA) module placed before DIK and WSF to improve the quality of the depth features. In addition, we contribute a new DSIS dataset, which contains 1,940 images with elaborate instance-level annotations. Extensive experiments on three challenging benchmarks show that CalibNet yields promising results, i.e., 58.0% AP with a 320×480 input size on the COME15K-E test set, significantly surpassing alternative frameworks. Our code and dataset will be publicly available at: https://github.com/PJLallen/CalibNet.
13
Hong Q, Liu W, Zhu Y, Ren T, Shi C, Lu Z, Yang Y, Deng R, Qian J, Tan C. CTHNet: a network for wheat ear counting with local-global features fusion based on hybrid architecture. Frontiers in Plant Science 2024; 15:1425131. [PMID: 39015290; PMCID: PMC11250278; DOI: 10.3389/fpls.2024.1425131] [Received: 04/29/2024; Accepted: 06/18/2024; Indexed: 07/18/2024]
Abstract
Accurate wheat ear counting is one of the key indicators for wheat phenotyping. Convolutional neural network (CNN) algorithms for counting wheat ears have evolved into sophisticated tools; however, because of the limitations of their receptive fields, CNNs cannot model global context information, which affects counting performance. In this study, we present a hybrid attention network (CTHNet) for wheat ear counting from RGB images that combines local features and global context information. On the one hand, to extract multi-scale local features, a convolutional neural network is built on the Cross Stage Partial framework. On the other hand, to acquire better global context information, tokenized image patches from the convolutional feature maps are encoded as input sequences by a Pyramid Pooling Transformer. A feature fusion module then merges the local features with the global context information, significantly enhancing the feature representation. The Global Wheat Head Detection Dataset and the Wheat Ear Detection Dataset were used to assess the proposed model, yielding mean absolute errors of 3.40 and 5.21, respectively; this performance is significantly better than that of previous studies.
Affiliation(s)
- Qingqing Hong, Wei Liu, Yue Zhu, Tianyu Ren, Changrong Shi, Zhixin Lu, Yunqin Yang, Ruiting Deng, Jing Qian, and Changwei Tan (all authors share the two affiliations below)
- Jiangsu Key Laboratory of Crop Genetics and Physiology/Jiangsu Key Laboratory of Crop Cultivation and Physiology, Agricultural College of Yangzhou University, Yangzhou, China
- Jiangsu Co-Innovation Center for Modern Production Technology of Grain Crops/Joint International Research Laboratory of Agriculture and Agri-Product Safety of the Ministry of Education of China/Jiangsu Province Engineering Research Center of Knowledge Management and Intelligent Service, College of Information Engineering, Yangzhou University, Yangzhou, China
14
Zhu E, Feng H, Chen L, Lai Y, Chai S. MP-Net: A Multi-Center Privacy-Preserving Network for Medical Image Segmentation. IEEE TRANSACTIONS ON MEDICAL IMAGING 2024; 43:2718-2729. [PMID: 38478456 DOI: 10.1109/tmi.2024.3377248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/02/2024]
Abstract
In this paper, we present the Multi-Center Privacy-Preserving Network (MP-Net), a novel framework designed for secure medical image segmentation in multi-center collaborations. Our methodology offers a new approach to multi-center collaborative learning, capable of reducing the volume of data transmission and enhancing data privacy protection. Unlike federated learning, which requires the transmission of model data between the central server and local servers in each round, our method only necessitates a single transfer of encrypted data. The proposed MP-Net comprises a three-layer model, consisting of encryption, segmentation, and decryption networks. We encrypt the image data into ciphertext using an encryption network and introduce an improved U-Net for image ciphertext segmentation. Finally, the segmentation mask is obtained through a decryption network. This architecture enables ciphertext-based image segmentation through computable image encryption. We evaluate the effectiveness of our approach on three datasets, including two cardiac MRI datasets and a CTPA dataset. Our results demonstrate that the MP-Net can securely utilize data from multiple centers to establish a more robust and information-rich segmentation model.
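The encrypt-segment-decrypt pipeline can be illustrated with a toy "computable encryption" (a keyed pixel permutation) and a pixelwise stand-in segmenter; MP-Net's encryption and segmentation networks are learned, so everything below is an assumption for intuition only:

```python
import numpy as np

def encrypt(img, key):
    """Toy computable encryption: permute pixel positions with a keyed
    permutation. A pixelwise segmenter can then run on the ciphertext."""
    flat = img.ravel()
    perm = np.random.default_rng(key).permutation(flat.size)
    return flat[perm].reshape(img.shape), perm

def decrypt_mask(mask, perm):
    """Invert the keyed permutation on the ciphertext-domain mask."""
    flat = np.empty(mask.size, dtype=mask.dtype)
    flat[perm] = mask.ravel()
    return flat.reshape(mask.shape)

def segment(img, thr=0.5):
    """Stand-in segmenter: pixelwise thresholding (elementwise, so it
    commutes with the permutation)."""
    return (img > thr).astype(np.uint8)
```

Because the stand-in segmenter is elementwise, segmenting the ciphertext and decrypting the mask gives the same result as segmenting the plaintext, which is the property the architecture relies on.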
15
Liu Y, Wu YH, Zhang SC, Liu L, Wu M, Cheng MM. Revisiting Computer-Aided Tuberculosis Diagnosis. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2024; 46:2316-2332. [PMID: 37934644 DOI: 10.1109/tpami.2023.3330825] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2023]
Abstract
Tuberculosis (TB) is a major global health threat, causing millions of deaths annually. Although early diagnosis and treatment can greatly improve the chances of survival, it remains a major challenge, especially in developing countries. Recently, computer-aided tuberculosis diagnosis (CTD) using deep learning has shown promise, but progress is hindered by limited training data. To address this, we establish a large-scale dataset, namely the Tuberculosis X-ray (TBX11K) dataset, which contains 11,200 chest X-ray (CXR) images with corresponding bounding box annotations for TB areas. This dataset enables the training of sophisticated detectors for high-quality CTD. Furthermore, we propose a strong baseline, SymFormer, for simultaneous CXR image classification and TB infection area detection. SymFormer incorporates Symmetric Search Attention (SymAttention) to tackle the bilateral symmetry property of CXR images for learning discriminative features. Since CXR images may not strictly adhere to the bilateral symmetry property, we also propose Symmetric Positional Encoding (SPE) to facilitate SymAttention through feature recalibration. To promote future research on CTD, we build a benchmark by introducing evaluation metrics, evaluating baseline models reformed from existing detectors, and running an online challenge. Experiments show that SymFormer achieves state-of-the-art performance on the TBX11K dataset.
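A minimal sketch of symmetry-aware attention, with queries from a feature row and keys/values from its horizontal mirror so that each position attends to the opposite side of a roughly bilaterally symmetric image; this simplification is an assumption and omits SymAttention's search mechanism and the SPE recalibration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mirror_attention(feat):
    """Toy symmetry-aware attention for one feature row of width W:
    queries come from the row itself, keys/values from its horizontal
    mirror. feat: (W, C)."""
    mirrored = feat[::-1]
    attn = softmax(feat @ mirrored.T / np.sqrt(feat.shape[-1]))
    return attn @ mirrored
```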
16
Li ZY, Gao S, Cheng MM. SERE: Exploring Feature Self-Relation for Self-Supervised Transformer. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2023; 45:15619-15631. [PMID: 37647184 DOI: 10.1109/tpami.2023.3309979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/01/2023]
Abstract
Learning representations with self-supervision for convolutional networks (CNN) has been validated to be effective for vision tasks. As an alternative to CNN, vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks. Recent works reveal that self-supervised learning helps unleash the great potential of ViT. Still, most works follow self-supervised strategies designed for CNN, e.g., instance-level discrimination of samples, but they ignore the properties of ViT. We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks. To enforce this property, we explore the feature SElf-RElation (SERE) for training self-supervised ViT. Specifically, instead of conducting self-supervised learning solely on feature embeddings from multiple views, we utilize the feature self-relations, i.e., spatial/channel self-relations, for self-supervised learning. Self-relation based learning further enhances the relation modeling ability of ViT, resulting in stronger representations that stably improve performance on multiple downstream tasks.
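The spatial and channel self-relations can be sketched as normalized Gram matrices of a feature map; the softmax normalization below is a simplifying assumption rather than SERE's exact target:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_relations(feat):
    """Spatial and channel self-relation matrices for a feature map.
    feat: (C, N) with C channels and N spatial positions.
    Returns a (N, N) spatial relation and a (C, C) channel relation,
    each row-normalized with softmax."""
    C, N = feat.shape
    spatial = softmax(feat.T @ feat / np.sqrt(C), axis=-1)
    channel = softmax(feat @ feat.T / np.sqrt(N), axis=-1)
    return spatial, channel
```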
17
Wang J, Quan H, Wang C, Yang G. Pyramid-based self-supervised learning for histopathological image classification. Comput Biol Med 2023; 165:107336. [PMID: 37708715 DOI: 10.1016/j.compbiomed.2023.107336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2022] [Revised: 07/14/2023] [Accepted: 08/07/2023] [Indexed: 09/16/2023]
Abstract
Large-scale labeled datasets are crucial for the success of supervised learning in medical imaging. However, annotating histopathological images is a time-consuming and labor-intensive task that requires highly trained professionals. To address this challenge, self-supervised learning (SSL) can be used to pre-train models on large amounts of unlabeled data and transfer the learned representations to various downstream tasks. In this study, we propose a self-supervised Pyramid-based Local Wavelet Transformer (PLWT) model for effectively extracting rich image representations. The PLWT model extracts both local and global features to pre-train on a large number of unlabeled histopathology images in a self-supervised manner. A wavelet transform replaces average pooling in the downsampling within the multi-head attention, significantly reducing the information lost when transmitting image features. Additionally, we introduce a Local Squeeze-and-Excitation (Local SE) module in the feedforward network, combined with the inverted residual, to capture local image information. We evaluate PLWT's performance on three histopathological image datasets and demonstrate the impact of pre-training. Our experimental results indicate that PLWT with self-supervised learning is highly competitive with other SSL methods, and that the transferability of visual representations generated by SSL on domain-relevant histopathological images exceeds that of a supervised baseline trained on ImageNet.
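Replacing average pooling with a wavelet decomposition can be illustrated with a 2×2 Haar transform: the low-pass band equals average pooling, while the three detail bands retain exactly the information pooling would discard (a sketch of the idea, not PLWT's actual layer):

```python
import numpy as np

def haar_downsample(x):
    """2x2 Haar decomposition of a single-channel map. Returns the
    low-pass band (identical to 2x2 average pooling) plus three detail
    bands, so the decomposition is lossless. x: (H, W), H and W even."""
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 4.0   # low-pass == average pooling
    lh = (a - b + c - d) / 4.0
    hl = (a + b - c - d) / 4.0
    hh = (a - b - c + d) / 4.0
    return ll, lh, hl, hh
```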
Affiliation(s)
- Junjie Wang
- Ningbo Artificial Intelligence Institute of Shanghai Jiao Tong University, Zhejiang 315000, PR China; Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, PR China.
- Hao Quan
- College of Medicine and Biological Information Engineering, Northeastern University, Shenyang 110016, PR China.
- Chengguang Wang
- Ningbo Industrial Internet Institute, Zhejiang 315000, PR China.
- Genke Yang
- Ningbo Artificial Intelligence Institute of Shanghai Jiao Tong University, Zhejiang 315000, PR China; Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, PR China.
18
Lei X, Cai X, Lu L, Cui Z, Jiang Z. SU2GE-Net: a saliency-based approach for non-specific class foreground segmentation. Sci Rep 2023; 13:13263. [PMID: 37582948 PMCID: PMC10427708 DOI: 10.1038/s41598-023-40175-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2023] [Accepted: 08/06/2023] [Indexed: 08/17/2023] Open
Abstract
Salient object detection is vital for non-specific class subject segmentation in computer vision applications. However, accurately segmenting foreground subjects with complex backgrounds and intricate boundaries remains a challenge for existing methods. To address these limitations, our study proposes SU2GE-Net, which introduces several novel improvements. We replace the traditional CNN-based backbone with the transformer-based Swin-TransformerV2, known for its effectiveness in capturing long-range dependencies and rich contextual information. To tackle under- and over-attention phenomena, we introduce Gated Channel Transformation (GCT). Furthermore, we adopt an edge-based loss (Edge Loss) for network training to capture spatial structural details. Additionally, we propose a training-only augmentation loss (TTA Loss) to enhance spatial stability using augmented data. Our method is evaluated on six common datasets, achieving an impressive [Formula: see text] score of 0.883 on DUTS-TE. Compared with other models, SU2GE-Net demonstrates excellent performance in various segmentation scenarios.
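Gated Channel Transformation can be sketched following its published formulation (Yang et al., 2020), which SU2GE-Net adopts: a per-channel embedding, cross-channel normalization, and a tanh gate; the parameter names below are assumptions:

```python
import numpy as np

def gct(x, alpha, gamma, beta, eps=1e-5):
    """Gated Channel Transformation sketch. x: (C, H, W) feature map;
    alpha, gamma, beta: (C,) learnable parameters. Each channel is
    embedded by its (scaled) L2 norm, normalized across channels, and
    gated with 1 + tanh(.)."""
    C = x.shape[0]
    embed = alpha * np.sqrt((x ** 2).sum(axis=(1, 2)) + eps)               # (C,)
    norm = gamma * embed * np.sqrt(C) / np.sqrt((embed ** 2).sum() + eps) + beta
    gate = 1.0 + np.tanh(norm)                                             # (C,)
    return x * gate[:, None, None]
```

With gamma = 0 and beta = 0 the gate is identically 1, so GCT starts as an identity mapping, a property the original formulation uses for stable training.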
Affiliation(s)
- Xiaochun Lei
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541010, Guangxi, China
- Guangxi Key Laboratory of Image and Graphic Intelligent Processing, Guilin University of Electronic Technology, Guilin, 541004, Guangxi, China
- Xiang Cai
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541010, Guangxi, China
- Linjun Lu
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541010, Guangxi, China
- Zihang Cui
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541010, Guangxi, China
- Zetao Jiang
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541010, Guangxi, China.
- Guangxi Key Laboratory of Image and Graphic Intelligent Processing, Guilin University of Electronic Technology, Guilin, 541004, Guangxi, China.
19
Zheng Z, Ye R, Hou Q, Ren D, Wang P, Zuo W, Cheng MM. Localization Distillation for Object Detection. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2023; 45:10070-10083. [PMID: 37027640 DOI: 10.1109/tpami.2023.3248583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]
Abstract
Previous knowledge distillation (KD) methods for object detection mostly focus on feature imitation instead of mimicking the prediction logits, due to the latter's inefficiency in distilling localization information. In this paper, we investigate whether logit mimicking always lags behind feature imitation. Towards this goal, we first present a novel localization distillation (LD) method which can efficiently transfer localization knowledge from the teacher to the student. Second, we introduce the concept of a valuable localization region that aids in selectively distilling the classification and localization knowledge for a certain region. Combining these two new components, we show for the first time that logit mimicking can outperform feature imitation, and that the absence of localization distillation is a critical reason why logit mimicking has under-performed for years. The thorough studies exhibit the great potential of logit mimicking, which can significantly alleviate localization ambiguity, learn robust feature representations, and ease the training difficulty in the early stage. We also provide a theoretical connection between the proposed LD and classification KD, showing that they share an equivalent optimization effect. Our distillation scheme is simple as well as effective and can be easily applied to both dense horizontal object detectors and rotated object detectors. Extensive experiments on the MS COCO, PASCAL VOC, and DOTA benchmarks demonstrate that our method can achieve considerable AP improvement without any sacrifice on the inference speed. Our source code and pretrained models are publicly available at https://github.com/HikariTJU/LD.
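Distilling localization knowledge from logits can be sketched as a temperature-scaled KL divergence between teacher and student distributions over discretized box-edge bins (as in generalized focal loss heads); the bin layout and names below are assumptions rather than the paper's exact loss:

```python
import numpy as np

def softmax_t(z, t=1.0):
    """Temperature-scaled softmax over the last axis."""
    s = z / t
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def localization_distillation_loss(student_logits, teacher_logits, t=2.0):
    """Toy LD loss: each of the four box edges is a distribution over
    n discrete bins; distill with KL(teacher || student), averaged over
    edges and scaled by t^2 as in standard KD.
    logits: (4, n_bins) for one box."""
    p = softmax_t(teacher_logits, t)
    q = softmax_t(student_logits, t)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * t * t)
```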
20
Li X, Jiang A, Qiu Y, Li M, Zhang X, Yan S. TPFR-Net: U-shaped model for lung nodule segmentation based on transformer pooling and dual-attention feature reorganization. Med Biol Eng Comput 2023:10.1007/s11517-023-02852-9. [PMID: 37243853 DOI: 10.1007/s11517-023-02852-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2022] [Accepted: 05/17/2023] [Indexed: 05/29/2023]
Abstract
Accurate segmentation of lung nodules is key to diagnosing the lesion type of a lung nodule. The complex boundaries of lung nodules and their visual similarity to surrounding tissues make precise segmentation challenging. Traditional CNN-based lung nodule segmentation models focus on extracting local features from neighboring pixels and ignore global contextual information, which makes them prone to incomplete segmentation of lung nodule boundaries. In the U-shaped encoder-decoder structure, variations in image resolution caused by up-sampling and down-sampling result in the loss of feature information, which reduces the reliability of the output features. This paper proposes a transformer pooling module and a dual-attention feature reorganization module to address these two defects. The transformer pooling module fuses the self-attention layer and the pooling layer of the transformer, which compensates for the limitations of the convolution operation, reduces the loss of feature information during pooling, and significantly decreases the computational complexity of the transformer. The dual-attention feature reorganization module employs a dual channel-and-spatial attention mechanism to improve sub-pixel convolution, minimizing the loss of feature information during up-sampling. In addition, two convolutional modules are proposed, which together with the transformer pooling module form an encoder that can adequately extract local features and global dependencies. We use a fusion loss function and a deep supervision strategy in the decoder to train the model. The proposed model has been extensively evaluated on the LIDC-IDRI dataset, achieving a highest Dice similarity coefficient of 91.84 and a highest sensitivity of 92.66, indicating that its overall capability surpasses that of the state-of-the-art UTNet. The proposed model thus offers superior segmentation performance for lung nodules and can support a more in-depth assessment of their shape, size, and other characteristics, which is of clinical significance in assisting physicians with the early diagnosis of lung nodules.
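Fusing a pooling step into attention can be illustrated by attending over average-pooled keys and values, which cuts the quadratic cost of self-attention; this is a sketch in the spirit of the transformer pooling module, not its exact design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pooled_attention(q, k, v, pool=2):
    """Self-attention over average-pooled keys/values: cost drops from
    O(N^2) to O(N * N/pool), and the pooled summaries play the role of
    the pooling layer fused into the attention.
    q, k, v: (N, d); N must be divisible by pool."""
    n, d = k.shape
    k_p = k.reshape(n // pool, pool, d).mean(axis=1)
    v_p = v.reshape(n // pool, pool, d).mean(axis=1)
    attn = softmax(q @ k_p.T / np.sqrt(d))
    return attn @ v_p
```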
Affiliation(s)
- Xiaotian Li
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, 030000, China
- Ailian Jiang
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, 030000, China.
- Yanfang Qiu
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, 030000, China
- Mengyang Li
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, 030000, China
- Xinyue Zhang
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, 030000, China
- Shuotian Yan
- College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, 050000, China
21
Nanni L, Fantozzi C, Loreggia A, Lumini A. Ensembles of Convolutional Neural Networks and Transformers for Polyp Segmentation. SENSORS (BASEL, SWITZERLAND) 2023; 23:4688. [PMID: 37430601 DOI: 10.3390/s23104688] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/13/2023] [Revised: 04/29/2023] [Accepted: 05/09/2023] [Indexed: 07/12/2023]
Abstract
In the realm of computer vision, semantic segmentation is the task of recognizing objects in images at the pixel level, performed by classifying each pixel. The task is complex and requires sophisticated skills and contextual knowledge to identify object boundaries. The importance of semantic segmentation in many domains is undisputed: in medical diagnostics, it simplifies the early detection of pathologies, thus mitigating their possible consequences. In this work, we provide a review of the literature on deep ensemble learning models for polyp segmentation and develop new ensembles based on convolutional neural networks and transformers. The development of an effective ensemble entails ensuring diversity among its components. To this end, we combined different models (HarDNet-MSEG, Polyp-PVT, and HSNet) trained with different data augmentation techniques, optimization methods, and learning rates, which we experimentally demonstrate to be useful for forming a better ensemble. Most importantly, we introduce a new method to obtain the segmentation mask by averaging intermediate masks after the sigmoid layer. In our extensive experimental evaluation, the average performance of the proposed ensembles over five prominent datasets beats any other solution that we know of. Furthermore, the ensembles also performed better than the state of the art on two of the five datasets when considered individually, without having been specifically trained for them.
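The fusion rule the paper argues for (averaging masks after the sigmoid layer, then thresholding, rather than voting on binary masks) can be sketched directly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ensemble_mask(logit_maps, thr=0.5):
    """Average the post-sigmoid probability maps of several models,
    then threshold once to obtain the final binary mask.
    logit_maps: list of (H, W) raw logit maps, one per model."""
    probs = np.mean([sigmoid(m) for m in logit_maps], axis=0)
    return (probs > thr).astype(np.uint8)
```

Averaging probabilities (a soft vote) lets a confident model outweigh a marginal one, which a hard majority vote on thresholded masks would not.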
Affiliation(s)
- Loris Nanni
- Department of Information Engineering, University of Padova, 35122 Padova, Italy
- Carlo Fantozzi
- Department of Information Engineering, University of Padova, 35122 Padova, Italy
- Andrea Loreggia
- Department of Information Engineering, University of Brescia, 25121 Brescia, Italy
- Alessandra Lumini
- Department of Computer Science and Engineering, University of Bologna, 40126 Bologna, Italy
22
Pan W, Huang L, Liang J, Hong L, Zhu J. Progressively Hybrid Transformer for Multi-Modal Vehicle Re-Identification. SENSORS (BASEL, SWITZERLAND) 2023; 23:4206. [PMID: 37177410 PMCID: PMC10181439 DOI: 10.3390/s23094206] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/09/2023] [Revised: 04/05/2023] [Accepted: 04/17/2023] [Indexed: 05/15/2023]
Abstract
Multi-modal (i.e., visible, near-infrared, and thermal-infrared) vehicle re-identification has good potential for searching vehicles of interest under low illumination. However, because different modalities have varying imaging characteristics, proper fusion of multi-modal complementary information is crucial to multi-modal vehicle re-identification. To that end, this paper proposes a progressively hybrid transformer (PHT). The PHT method consists of two parts: random hybrid augmentation (RHA) and a feature hybrid mechanism (FHM). For RHA, an image random cropper and a local region hybrider are designed. The image random cropper simultaneously crops multi-modal images at random positions, with random numbers, sizes, and aspect ratios, to generate local regions. The local region hybrider fuses the cropped regions so that the regions of each modality carry local structural characteristics of all modalities, mitigating modal differences at the beginning of feature learning. For the FHM, a modal-specific controller and a modal information embedding are designed to effectively fuse multi-modal information at the feature level. Experimental results show the proposed method outperforms the state-of-the-art method by 2.7% mAP on RGBNT100 and by 6.6% mAP on RGBN300, demonstrating that the proposed method can learn multi-modal complementary information effectively.
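Random hybrid augmentation can be sketched as cutting the same random patch from every modality and redistributing the patches across modalities; the donor-permutation rule below is an illustrative assumption, not the paper's exact hybrider:

```python
import numpy as np

def random_hybrid_augment(modalities, rng, size):
    """Toy random hybrid augmentation: cut the same random patch out of
    every modality image and give each modality the patch of a randomly
    chosen (possibly different) modality.
    modalities: list of (H, W) arrays of equal shape; size: (ph, pw)."""
    h, w = modalities[0].shape
    ph, pw = size
    y = rng.integers(0, h - ph + 1)
    x = rng.integers(0, w - pw + 1)
    donors = rng.permutation(len(modalities))   # which modality supplies each patch
    out = [m.copy() for m in modalities]
    for i, d in enumerate(donors):
        out[i][y:y + ph, x:x + pw] = modalities[d][y:y + ph, x:x + pw]
    return out
```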
Affiliation(s)
- Wenjie Pan
- College of Engineering, Huaqiao University, Quanzhou 362021, China
- Linhan Huang
- College of Engineering, Huaqiao University, Quanzhou 362021, China
- Jianbao Liang
- College of Engineering, Huaqiao University, Quanzhou 362021, China
- Lan Hong
- College of Engineering, Huaqiao University, Quanzhou 362021, China
- Jianqing Zhu
- College of Engineering, Huaqiao University, Quanzhou 362021, China
- Xiamen Yealink Network Technology Company Limited, No. 666, Hu'an Road, High-Tech Park, Huli District, Xiamen 361015, China
23
Jia Z, You K, He W, Tian Y, Feng Y, Wang Y, Jia X, Lou Y, Zhang J, Li G, Zhang Z. Event-Based Semantic Segmentation With Posterior Attention. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2023; 32:1829-1842. [PMID: 37028052 DOI: 10.1109/tip.2023.3249579] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]
Abstract
In recent years, attention-based transformers have swept across the field of computer vision, ushering in a new generation of backbones for semantic segmentation. Nevertheless, semantic segmentation under poor lighting conditions remains an open problem. Moreover, most work on semantic segmentation targets images produced by commodity frame-based cameras with a limited frame rate, hindering deployment in auto-driving systems that require perception and response within milliseconds. An event camera is a new sensor that generates event data at microsecond resolution and can work in poor light conditions with a high dynamic range. It looks promising to leverage event cameras to enable perception where commodity cameras are incompetent, but algorithms for event data are far from mature. Pioneering researchers stack event data into frames so that event-based segmentation is converted into frame-based segmentation, but the characteristics of event data are left unexplored. Noticing that event data naturally highlight moving objects, we propose a posterior attention module that adjusts the standard attention using the prior knowledge provided by event data. The posterior attention module can be readily plugged into many segmentation backbones. Plugging it into the recently proposed SegFormer network, we obtain EvSegFormer (the event-based version of SegFormer), which achieves state-of-the-art performance on two datasets (MVSEC and DDD-17) collected for event-based segmentation. Code is available at https://github.com/zexiJia/EvSegFormer to facilitate research on event-based vision.
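Posterior attention can be sketched as standard attention reweighted by an event-density prior over key positions and renormalized; this is a simplified reading of the idea, not the module's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def posterior_attention(q, k, v, event_prior):
    """Toy posterior attention: scaled dot-product attention whose
    weights are multiplied by a per-position prior derived from event
    density (moving objects fire more events), then renormalized.
    q, k, v: (N, d); event_prior: (N,) nonnegative weights over keys."""
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    post = attn * event_prior[None, :]
    post = post / post.sum(axis=-1, keepdims=True)
    return post @ v
```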
24
Xia Z, Kim J. Enhancing Mask Transformer with Auxiliary Convolution Layers for Semantic Segmentation. SENSORS (BASEL, SWITZERLAND) 2023; 23:581. [PMID: 36679377 PMCID: PMC9867439 DOI: 10.3390/s23020581] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/06/2022] [Revised: 12/30/2022] [Accepted: 12/31/2022] [Indexed: 06/17/2023]
Abstract
Transformer-based semantic segmentation methods have achieved excellent performance in recent years. Mask2Former is one of the well-known transformer-based methods, unifying common image segmentation tasks into a universal model. However, it performs relatively poorly at capturing local features and segmenting small objects because it relies heavily on transformers. To this end, we propose a simple yet effective architecture that introduces auxiliary branches to Mask2Former during training to capture dense local features on the encoder side. The obtained features help improve the learning of local information and the segmentation of small objects. Since the proposed auxiliary convolution layers are required only for training and can be removed during inference, the performance gain comes without additional computation at inference. Experimental results show that our model achieves state-of-the-art performance on the ADE20K (57.6% mIoU) and Cityscapes (84.8% mIoU) datasets.
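The training-only auxiliary-branch pattern can be sketched as follows; the names and linear stand-in layers are assumptions, not Mask2Former's actual interface:

```python
import numpy as np

def forward(x, w_main, w_aux=None, training=False):
    """Illustrative training-only auxiliary branch: during training the
    model returns both the main prediction and an auxiliary one (which
    feeds an extra loss); at inference the auxiliary branch is simply
    dropped, so no extra computation remains.
    x: (N, d_in); w_main, w_aux: (d_in, d_out) stand-in layers."""
    main = x @ w_main
    if training and w_aux is not None:
        return main, x @ w_aux   # auxiliary head supervised only in training
    return main
```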
Affiliation(s)
- Joohee Kim
- Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA