1. Xie T, Dai K, Jiang Z, Li R, Mao S, Wang K, Zhao L. ViT-MVT: A Unified Vision Transformer Network for Multiple Vision Tasks. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:3027-3041. [PMID: 38127606] [DOI: 10.1109/tnnls.2023.3342141]
Abstract
In this work, we seek to learn multiple mainstream vision tasks concurrently using a unified network, which is storage-efficient as numerous networks with task-shared parameters can be implanted into a single consolidated network. Our framework, vision transformer (ViT)-MVT, built on a plain and nonhierarchical ViT, incorporates numerous visual tasks into a modest supernet and optimizes them jointly across various dataset domains. For the design of ViT-MVT, we augment the ViT with a multihead self-attention (MHSE) to offer complementary cues in the channel and spatial dimensions, as well as a local perception unit (LPU) and locality feed-forward network (locality FFN) for information exchange in the local region, thus endowing ViT-MVT with the ability to effectively optimize multiple tasks. Besides, we construct a search space comprising potential architectures with a broad spectrum of model sizes to offer various optimal candidates for diverse tasks. After that, we design a layer-adaptive sharing technique that automatically determines whether each layer of the transformer block is shared or not for all tasks, enabling ViT-MVT to obtain task-shared parameters for a reduction of storage and task-specific parameters to learn task-related features, thereby boosting performance. Finally, we introduce a joint-task evolutionary search algorithm to discover an optimal backbone for all tasks under a total model size constraint, which challenges the conventional wisdom that visual tasks are typically supplied with backbone networks developed for image classification. Extensive experiments reveal that ViT-MVT delivers exceptional performance for multiple visual tasks over state-of-the-art methods while necessitating considerably fewer total storage costs. We further demonstrate that, once trained, ViT-MVT is capable of incremental learning when generalized to new tasks while retaining identical performance for previously trained tasks. The code is available at https://github.com/XT-1997/vitmvt.
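The layer-adaptive sharing idea above, where each transformer layer is either one shared copy for all tasks or a separate copy per task, can be pictured with a minimal PyTorch sketch. Everything below (block type, task names, the fixed sharing flags) is an illustrative assumption; in ViT-MVT the per-layer sharing decision is learned automatically rather than fixed by hand.

```python
# Illustrative sketch only: a multi-task backbone whose layers are either task-shared
# or task-specific, controlled by per-layer flags. Not the authors' implementation.
import torch
import torch.nn as nn

class SharedOrTaskSpecificLayer(nn.Module):
    """One transformer block, kept either as a single shared copy or one copy per task."""
    def __init__(self, dim, num_heads, tasks, shared):
        super().__init__()
        self.shared = shared
        if shared:
            self.block = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        else:
            self.blocks = nn.ModuleDict(
                {t: nn.TransformerEncoderLayer(dim, num_heads, batch_first=True) for t in tasks})

    def forward(self, x, task):
        return self.block(x) if self.shared else self.blocks[task](x)

class MultiTaskBackbone(nn.Module):
    def __init__(self, dim=192, num_heads=3, tasks=("cls", "det", "seg"),
                 share_flags=(True, True, False, True, False, True)):
        super().__init__()
        # share_flags[i] = True -> layer i uses task-shared parameters (storage saving),
        # False -> layer i keeps task-specific parameters (task-related features).
        self.layers = nn.ModuleList(
            [SharedOrTaskSpecificLayer(dim, num_heads, tasks, s) for s in share_flags])

    def forward(self, tokens, task):
        for layer in self.layers:
            tokens = layer(tokens, task)
        return tokens

if __name__ == "__main__":
    model = MultiTaskBackbone()
    tokens = torch.randn(2, 197, 192)               # (batch, patches + cls token, dim)
    print(model(tokens, task="seg").shape)          # torch.Size([2, 197, 192])
```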
2. Lian J, Wang L, Sun H, Huang H. GT-HAD: Gated Transformer for Hyperspectral Anomaly Detection. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:3631-3645. [PMID: 38347690] [DOI: 10.1109/tnnls.2024.3355166]
Abstract
Hyperspectral anomaly detection (HAD) aims to distinguish between the background and anomalies in a scene, which has been widely adopted in various applications. Deep neural network (DNN)-based methods have emerged as the predominant solution, wherein the standard paradigm is to discern the background and anomalies based on the error of self-supervised hyperspectral image (HSI) reconstruction. However, current DNN-based methods cannot guarantee correspondence between the background, anomalies, and reconstruction error, which limits the performance of HAD. In this article, we propose a novel gated transformer network for HAD (GT-HAD). Our key observation is that the spatial-spectral similarity in HSI can effectively distinguish between the background and anomalies, which aligns with the fundamental definition of HAD. Consequently, we develop GT-HAD to exploit the spatial-spectral similarity during HSI reconstruction. GT-HAD consists of two distinct branches that model the features of the background and anomalies, respectively, with content similarity as constraints. Furthermore, we introduce an adaptive gating unit to regulate the activation states of these two branches based on a content-matching method (CMM). Extensive experimental results demonstrate the superior performance of GT-HAD. The original code is publicly available at https://github.com/jeline0110/GT-HAD, along with a comprehensive benchmark of state-of-the-art HAD methods.
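The reconstruction-error paradigm that GT-HAD builds on can be summarized in a few lines: a background-oriented model reconstructs the hyperspectral cube, and pixels that reconstruct poorly are scored as anomalous. The tiny spectral autoencoder below is a stand-in for illustration, not the gated transformer itself.

```python
# Illustrative sketch of reconstruction-error-based hyperspectral anomaly scoring.
# The autoencoder is a placeholder background model, not GT-HAD.
import torch
import torch.nn as nn

class SpectralAutoencoder(nn.Module):
    def __init__(self, bands, hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(bands, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, bands)

    def forward(self, x):                       # x: (num_pixels, bands)
        return self.decoder(self.encoder(x))

def anomaly_map(hsi, model):
    """hsi: (H, W, bands) cube -> (H, W) map of per-pixel squared reconstruction error."""
    h, w, b = hsi.shape
    pixels = hsi.reshape(-1, b)
    with torch.no_grad():
        recon = model(pixels)
    return (recon - pixels).pow(2).sum(dim=1).reshape(h, w)

if __name__ == "__main__":
    cube = torch.rand(64, 64, 100)              # toy 100-band scene
    print(anomaly_map(cube, SpectralAutoencoder(bands=100)).shape)  # torch.Size([64, 64])
```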
3. Chen J, Huang W, Zhang J, Debattista K, Han J. Addressing inconsistent labeling with cross image matching for scribble-based medical image segmentation. IEEE Transactions on Image Processing 2025; PP:842-853. [PMID: 40031274] [DOI: 10.1109/tip.2025.3530787]
Abstract
In recent years, there has been a notable surge in the adoption of weakly-supervised learning for medical image segmentation, utilizing scribble annotation as a means to potentially reduce annotation costs. However, the inherent characteristics of scribble labeling, marked by incompleteness, subjectivity, and a lack of standardization, introduce inconsistencies into the annotations. These inconsistencies become significant challenges for the network's learning process, ultimately affecting the performance of segmentation. To address this challenge, we propose creating a reference set to guide pixel-level feature matching, constructed from class-specific tokens and pixel-level features extracted from various images. Serving as a repository showcasing diverse pixel styles and classes, the reference set becomes the cornerstone for a pixel-level feature matching strategy. This strategy enables the effective comparison of unlabeled pixels, offering guidance, particularly in learning scenarios characterized by inconsistent and incomplete scribbles. The proposed strategy incorporates smoothing and regression techniques to align pixel-level features across different images. By leveraging the diversity of pixel sources, our matching approach enhances the network's ability to learn consistent patterns from the reference set. This, in turn, mitigates the impact of inconsistent and incomplete labeling, resulting in improved segmentation outcomes. Extensive experiments conducted on three publicly available datasets demonstrate the superiority of our approach over state-of-the-art methods in terms of segmentation accuracy and stability. The code will be made publicly available at https://github.com/jingkunchen/scribble-medical-segmentation.
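A rough sketch of the reference-set idea, collecting class-specific feature prototypes from scribble-labeled pixels across several images and then guiding unlabeled pixels by matching them to those prototypes, is given below. The cosine-similarity matching rule and all names are illustrative assumptions rather than the paper's actual strategy.

```python
# Illustrative sketch: class prototypes from scribble-labeled pixels of several images
# guide unlabeled pixels via feature matching. Not the authors' implementation.
import torch
import torch.nn.functional as F

def build_reference_set(features, scribbles, num_classes):
    """features: (N, C, H, W); scribbles: (N, H, W) with class ids, -1 for unlabeled.
    Returns a (num_classes, C) bank of mean features per class."""
    bank = torch.zeros(num_classes, features.shape[1])
    feats_last = features.permute(0, 2, 3, 1)             # (N, H, W, C)
    for k in range(num_classes):
        mask = scribbles == k
        if mask.any():
            bank[k] = feats_last[mask].mean(dim=0)         # average over labeled pixels
    return bank

def match_pixels(features, bank):
    """Assign every pixel the class of its most similar reference prototype."""
    n, c, h, w = features.shape
    flat = F.normalize(features.permute(0, 2, 3, 1).reshape(-1, c), dim=1)
    sims = flat @ F.normalize(bank, dim=1).t()             # cosine similarity to each class
    return sims.argmax(dim=1).reshape(n, h, w)

if __name__ == "__main__":
    feats = torch.randn(2, 16, 8, 8)
    scribbles = torch.randint(-1, 3, (2, 8, 8))            # sparse labels for 3 classes
    bank = build_reference_set(feats, scribbles, num_classes=3)
    print(match_pixels(feats, bank).shape)                 # torch.Size([2, 8, 8])
```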
4. Płotka S, Szczepański T, Szenejko P, Korzeniowski P, Calvo JR, Khalil A, Shamshirsaz A, Brawura-Biskupski-Samaha R, Išgum I, Sánchez CI, Sitek A. Real-time placental vessel segmentation in fetoscopic laser surgery for Twin-to-Twin Transfusion Syndrome. Med Image Anal 2025; 99:103330. [PMID: 39260033] [DOI: 10.1016/j.media.2024.103330]
Abstract
Twin-to-Twin Transfusion Syndrome (TTTS) is a rare condition that affects about 15% of monochorionic pregnancies, in which identical twins share a single placenta. Fetoscopic laser photocoagulation (FLP) is the standard treatment for TTTS, which significantly improves the survival of fetuses. The aim of FLP is to identify abnormal connections between blood vessels and to laser ablate them in order to equalize blood supply to both fetuses. However, performing fetoscopic surgery is challenging due to limited visibility, a narrow field of view, and significant variability among patients and domains. In order to enhance the visualization of placental vessels during surgery, we propose TTTSNet, a network architecture designed for real-time and accurate placental vessel segmentation. Our network architecture incorporates a novel channel attention module and multi-scale feature fusion module to precisely segment tiny placental vessels. To address the challenges posed by FLP-specific fiberscope and amniotic sac-based artifacts, we employed novel data augmentation techniques. These techniques simulate various artifacts, including laser pointer, amniotic sac particles, and structural and optical fiber artifacts. By incorporating these simulated artifacts during training, our network architecture demonstrated robust generalizability. We trained TTTSNet on a publicly available dataset of 2060 video frames from 18 independent fetoscopic procedures and evaluated it on a multi-center external dataset of 24 in-vivo procedures with a total of 2348 video frames. Our method achieved significant performance improvements compared to state-of-the-art methods, with a mean Intersection over Union of 78.26% for all placental vessels and 73.35% for a subset of tiny placental vessels. Moreover, our method achieved 172 and 152 frames per second on an A100 GPU, and Clara AGX, respectively. This potentially opens the door to real-time application during surgical procedures. The code is publicly available at https://github.com/SanoScience/TTTSNet.
Affiliation(s)
- Szymon Płotka
- Sano Centre for Computational Medicine, Cracow, Poland; Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands; Department of Biomedical Engineering and Physics, Amsterdam University Medical Center, University of Amsterdam, Amsterdam, The Netherlands
- Paula Szenejko
- First Department of Obstetrics and Gynecology, The University Center for Women and Newborn Health, Medical University of Warsaw, Warsaw, Poland
- Jesús Rodriguez Calvo
- Fetal Medicine Unit, Obstetrics and Gynecology Division, Complutense University of Madrid, Madrid, Spain
- Asma Khalil
- Fetal Medicine Unit, Saint George's Hospital, University of London, London, United Kingdom
- Alireza Shamshirsaz
- Maternal Fetal Care Center, Boston Children's Hospital, Boston, MA, United States of America; Harvard Medical School, Boston, MA, United States of America
- Ivana Išgum
- Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands; Department of Biomedical Engineering and Physics, Amsterdam University Medical Center, University of Amsterdam, Amsterdam, The Netherlands; Department of Radiology and Nuclear Medicine, Amsterdam University Medical Center, University of Amsterdam, Amsterdam, The Netherlands
- Clara I Sánchez
- Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands; Department of Biomedical Engineering and Physics, Amsterdam University Medical Center, University of Amsterdam, Amsterdam, The Netherlands
- Arkadiusz Sitek
- Harvard Medical School, Boston, MA, United States of America; Center for Advanced Medical Computing and Simulation, Massachusetts General Hospital, Boston, MA, United States of America.
5. Liu L, Aviles-Rivero AI, Schonlieb CB. Contrastive Registration for Unsupervised Medical Image Segmentation. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:147-159. [PMID: 37983143] [DOI: 10.1109/tnnls.2023.3332003]
Abstract
Medical image segmentation is an important task in medical imaging, as it serves as the first step for clinical diagnosis and treatment planning. While major success has been reported using deep learning supervised techniques, they assume a large and well-representative labeled set. This is a strong assumption in the medical domain where annotations are expensive, time-consuming, and prone to human bias. To address this problem, unsupervised segmentation techniques have been proposed in the literature. Yet, none of the existing unsupervised segmentation techniques reach accuracies that come close to those of state-of-the-art supervised segmentation methods. In this work, we present a novel optimization model framed in a new convolutional neural network (CNN)-based contrastive registration architecture for unsupervised medical image segmentation called CLMorph. The core idea of our approach is to exploit image-level registration and feature-level contrastive learning to perform registration-based segmentation. First, we propose an architecture to capture the image-to-image transformation mapping via registration for unsupervised medical image segmentation. Second, we embed a contrastive learning mechanism in the registration architecture to enhance the discriminative capacity of the network at the feature level. We show that our proposed CLMorph technique mitigates the major drawbacks of existing unsupervised techniques. We demonstrate, through numerical and visual experiments, that our technique substantially outperforms the current state-of-the-art unsupervised segmentation methods on two major medical image datasets.
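At a high level, the recipe is an image-level registration loss combined with a feature-level contrastive loss. The toy objective below illustrates that combination; the similarity terms, temperature, and the weight lam are assumptions for illustration and do not reproduce CLMorph's actual losses.

```python
# Illustrative sketch: joint image-level registration loss + feature-level InfoNCE loss.
import torch
import torch.nn.functional as F

def registration_loss(warped_moving, fixed):
    """Image-level similarity between the warped moving image and the fixed image."""
    return F.mse_loss(warped_moving, fixed)

def info_nce(z_moving, z_fixed, temperature=0.1):
    """Feature-level contrastive term: matching pairs are positives, the rest negatives."""
    z_moving, z_fixed = F.normalize(z_moving, dim=1), F.normalize(z_fixed, dim=1)
    logits = z_moving @ z_fixed.t() / temperature          # (B, B) similarity matrix
    return F.cross_entropy(logits, torch.arange(z_moving.shape[0]))

def joint_loss(warped_moving, fixed, z_moving, z_fixed, lam=0.1):
    return registration_loss(warped_moving, fixed) + lam * info_nce(z_moving, z_fixed)

if __name__ == "__main__":
    warped, fixed = torch.rand(4, 1, 64, 64), torch.rand(4, 1, 64, 64)
    zm, zf = torch.randn(4, 128), torch.randn(4, 128)      # encoder embeddings of each pair
    print(joint_loss(warped, fixed, zm, zf).item())
```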
6. Song Z, Kang X, Wei X, Li S. Pixel-Centric Context Perception Network for Camouflaged Object Detection. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:18576-18589. [PMID: 37819817] [DOI: 10.1109/tnnls.2023.3319323]
Abstract
Camouflaged object detection (COD) aims to identify object pixels visually embedded in the background environment. Existing deep learning methods fail to utilize the context information around different pixels adequately and efficiently. In order to solve this problem, a novel pixel-centric context perception network (PCPNet) is proposed, the core of which is to customize a personalized context for each pixel based on the automatic estimation of its surroundings. Specifically, PCPNet first employs an elegant encoder equipped with the designed vital component generation (VCG) module to obtain a set of compact features rich in low-level spatial and high-level semantic information across multiple subspaces. Then, we present a parameter-free pixel importance estimation (PIE) function based on multiwindow information fusion. Object pixels with complex backgrounds are assigned higher PIE values. Subsequently, PIE is utilized to regularize the optimization loss. In this way, the network can pay more attention to those pixels with higher PIE values in the decoding stage. Finally, a local continuity refinement module (LCRM) is used to refine the detection results. Extensive experiments on four COD benchmarks, five salient object detection (SOD) benchmarks, and five polyp segmentation benchmarks demonstrate the superiority of PCPNet with respect to other state-of-the-art methods.
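The PIE mechanism, a parameter-free, multi-window estimate of how "important" each pixel is that is then used to weight the loss, can be approximated as below. The local-variance statistic, window sizes, and weighting formula are illustrative assumptions, not the published PIE function.

```python
# Illustrative sketch: a parameter-free pixel-importance map from several local windows,
# used to reweight a segmentation loss. The statistic and windows are assumptions.
import torch
import torch.nn.functional as F

def pixel_importance(image, window_sizes=(3, 7, 15)):
    """image: (B, 1, H, W). Importance = mean local intensity variance over the windows,
    so pixels with complex surroundings get larger weights (rescaled to [1, 2])."""
    maps = []
    for k in window_sizes:
        mean = F.avg_pool2d(image, k, stride=1, padding=k // 2)
        sq_mean = F.avg_pool2d(image ** 2, k, stride=1, padding=k // 2)
        maps.append((sq_mean - mean ** 2).clamp(min=0))
    pie = torch.stack(maps).mean(dim=0)
    return 1.0 + pie / (pie.amax(dim=(2, 3), keepdim=True) + 1e-8)

def weighted_bce(logits, target, weights):
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (weights * loss).mean()

if __name__ == "__main__":
    img = torch.rand(2, 1, 64, 64)
    logits = torch.randn(2, 1, 64, 64)
    gt = torch.randint(0, 2, (2, 1, 64, 64)).float()
    print(weighted_bce(logits, gt, pixel_importance(img)).item())
```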
7. Ma J, Bai Y, Zhong B, Zhang W, Yao T, Mei T. Visualizing and Understanding Patch Interactions in Vision Transformer. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:13671-13680. [PMID: 37224360] [DOI: 10.1109/tnnls.2023.3270479]
Abstract
Vision transformer (ViT) has become a leading tool in various computer vision tasks, owing to its unique self-attention mechanism that learns visual representations explicitly through cross-patch information interactions. Despite this success, the literature seldom explores the explainability of ViT, and there is no clear picture of how the attention mechanism, with respect to correlations across patches, impacts performance, or of its further potential. In this work, we propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches for ViT. Specifically, we first introduce a quantification indicator to measure the impact of patch interaction and verify this quantification on attention window design and the removal of indiscriminative patches. Then, we exploit the effective receptive field of each patch in ViT and devise a window-free transformer (WinfT) architecture accordingly. Extensive experiments on ImageNet demonstrate that the designed quantitative method facilitates ViT model learning, improving top-1 accuracy by up to 4.28%. More remarkably, the results on downstream fine-grained recognition tasks further validate the generalization of our proposal.
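One simple way to picture a patch-interaction indicator is to measure, for each query patch, how much attention mass it spends on other patches rather than on itself. The snippet below does exactly that; it is an illustrative stand-in, not the quantification indicator defined in the paper.

```python
# Illustrative sketch: per-patch cross-patch interaction score from a ViT attention map.
import torch

def patch_interaction_scores(attn):
    """attn: (batch, heads, num_patches, num_patches) softmaxed attention weights.
    Returns (batch, num_patches): 1 - self-attention mass, averaged over heads."""
    self_mass = attn.diagonal(dim1=-2, dim2=-1)          # (batch, heads, num_patches)
    return (1.0 - self_mass).mean(dim=1)

if __name__ == "__main__":
    attn = torch.randn(2, 12, 196, 196).softmax(dim=-1)  # toy attention for 14x14 patches
    scores = patch_interaction_scores(attn)
    print(scores.shape, float(scores.min()), float(scores.max()))
```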
8. Fiorentino MC, Villani FP, Benito Herce R, González Ballester MA, Mancini A, López-Linares Román K. An intensity-based self-supervised domain adaptation method for intervertebral disc segmentation in magnetic resonance imaging. Int J Comput Assist Radiol Surg 2024; 19:1753-1761. [PMID: 38976178] [PMCID: PMC11365836] [DOI: 10.1007/s11548-024-03219-7]
Abstract
BACKGROUND AND OBJECTIVE Accurate IVD segmentation is crucial for diagnosing and treating spinal conditions. Traditional deep learning methods depend on extensive, annotated datasets, which are hard to acquire. This research proposes an intensity-based self-supervised domain adaptation, using unlabeled multi-domain data to reduce reliance on large annotated datasets. METHODS The study introduces an innovative method using intensity-based self-supervised learning for IVD segmentation in MRI scans. This approach is particularly suited for IVD segmentations due to its ability to effectively capture the subtle intensity variations that are characteristic of spinal structures. The model, a dual-task system, simultaneously segments IVDs and predicts intensity transformations. This intensity-focused method has the advantages of being easy to train and computationally light, making it highly practical in diverse clinical settings. Trained on unlabeled data from multiple domains, the model learns domain-invariant features, adeptly handling intensity variations across different MRI devices and protocols. RESULTS Testing on three public datasets showed that this model outperforms baseline models trained on single-domain data. It handles domain shifts and achieves higher accuracy in IVD segmentation. CONCLUSIONS This study demonstrates the potential of intensity-based self-supervised domain adaptation for IVD segmentation. It suggests new directions for research in enhancing generalizability across datasets with domain shifts, which can be applied to other medical imaging fields.
Affiliation(s)
- Rafael Benito Herce
- Digital Health and Biomedical Technologies, Vicomtech Foundation, San Sebastian, Spain
- Miguel Angel González Ballester
- BCN MedTech, Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Spain
- Institución Catalana de Investigación y Estudios Avanzados (ICREA), Barcelona, Spain
- Adriano Mancini
- Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy
- Karen López-Linares Román
- Digital Health and Biomedical Technologies, Vicomtech Foundation, San Sebastian, Spain
- eHealth Group, Bioengineering Area, Biogipuzkoa Health Research Institute, San Sebastian, Spain
9. Song Y, Teoh JYC, Choi KS, Qin J. Dynamic Loss Weighting for Multiorgan Segmentation in Medical Images. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:10651-10662. [PMID: 37027749] [DOI: 10.1109/tnnls.2023.3243241]
Abstract
Deep neural networks often suffer from performance inconsistency for multiorgan segmentation in medical images; some organs are segmented far worse than others. The main reason might be that organs differ in their learning difficulty for segmentation mapping, due to variations such as size, texture complexity, shape irregularity, and imaging quality. In this article, we propose a principled class-reweighting algorithm, termed dynamic loss weighting, which dynamically assigns a larger loss weight to organs if they are discriminated as more difficult to learn according to the data and the network's status, forcing the network to learn more from them and thereby maximally promote performance consistency. This new algorithm uses an extra autoencoder to measure the discrepancy between the segmentation network's output and the ground truth and dynamically estimates each organ's loss weight according to its contribution to the newly updated discrepancy. It can capture the variation in organs' learning difficulty during training, and it is neither sensitive to data properties nor dependent on human priors. We evaluate this algorithm in two multiorgan segmentation tasks: abdominal organs and head-neck structures, on publicly available datasets, with positive results from extensive experiments confirming its validity and effectiveness. Source codes are available at: https://github.com/YouyiSong/Dynamic-Loss-Weighting.
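The core mechanism, giving a larger loss weight to the organs the network currently segments worse, can be sketched in a few lines. Here the per-organ discrepancy is measured directly with a soft Dice error instead of the extra autoencoder the authors use, so this is an approximation of the idea, not their algorithm.

```python
# Illustrative sketch: per-class loss weights proportional to the current per-class error.
import torch
import torch.nn.functional as F

def per_class_dice_error(probs, target_onehot, eps=1e-6):
    """probs, target_onehot: (B, C, H, W). Returns (C,) soft Dice error per class."""
    dims = (0, 2, 3)
    inter = (probs * target_onehot).sum(dims)
    denom = probs.sum(dims) + target_onehot.sum(dims)
    return 1.0 - (2 * inter + eps) / (denom + eps)

def dynamic_loss(probs, target_onehot):
    errors = per_class_dice_error(probs, target_onehot)
    weights = (errors / (errors.sum() + 1e-8)).detach()   # harder organs -> larger weight
    return (weights * errors).sum()

if __name__ == "__main__":
    probs = torch.randn(2, 4, 32, 32).softmax(dim=1)      # 4 "organs"
    labels = torch.randint(0, 4, (2, 32, 32))
    onehot = F.one_hot(labels, 4).permute(0, 3, 1, 2).float()
    print(dynamic_loss(probs, onehot).item())
```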
10. Zhao L, Tan G, Wu Q, Pu B, Ren H, Li S, Li K. FARN: Fetal Anatomy Reasoning Network for Detection With Global Context Semantic and Local Topology Relationship. IEEE J Biomed Health Inform 2024; 28:4866-4877. [PMID: 38648141] [DOI: 10.1109/jbhi.2024.3392531]
Abstract
Accurate recognition of fetal anatomical structure is a pivotal task in ultrasound (US) image analysis. Sonographers naturally apply anatomical knowledge and clinical expertise to recognizing key anatomical structures in complex US images. However, mainstream object detection approaches usually treat each structure recognition separately, overlooking anatomical correlations between different structures in fetal US planes. In this work, we propose a Fetal Anatomy Reasoning Network (FARN) that incorporates two kinds of relationship forms: a global context semantic block summarized with visual similarity and a local topology relationship block depicting structural pair constraints. Specifically, by designing the Adaptive Relation Graph Reasoning (ARGR) module, anatomical structures are treated as nodes, with two kinds of relationships between nodes modeled as edges. The flexibility of the model is enhanced by constructing the adaptive relationship graph in a data-driven way, enabling adaptation to various data samples without the need for predefined additional constraints. The feature representation is further refined by aggregating the outputs of the ARGR module. Comprehensive experimental results demonstrate that FARN achieves promising performance in detecting 37 anatomical structures across key US planes in tertiary obstetric screening. FARN effectively utilizes key relationships to improve detection performance, demonstrates robustness to small-scale, similar, and indistinct structures, and avoids some detection errors that deviate from anatomical norms. Overall, our study serves as a resource for developing efficient and concise approaches to model inter-anatomy relationships.
11. Liu M, Wu S, Chen R, Lin Z, Wang Y, Meijering E. Brain Image Segmentation for Ultrascale Neuron Reconstruction via an Adaptive Dual-Task Learning Network. IEEE Transactions on Medical Imaging 2024; 43:2574-2586. [PMID: 38373129] [DOI: 10.1109/tmi.2024.3367384]
Abstract
Accurate morphological reconstruction of neurons in whole brain images is critical for brain science research. However, due to the wide range of whole brain imaging, uneven staining, and optical system fluctuations, there are significant differences in image properties between different regions of the ultrascale brain image, such as dramatically varying voxel intensities and inhomogeneous distribution of background noise, posing an enormous challenge to neuron reconstruction from whole brain images. In this paper, we propose an adaptive dual-task learning network (ADTL-Net) to quickly and accurately extract neuronal structures from ultrascale brain images. Specifically, this framework includes an External Features Classifier (EFC) and a Parameter Adaptive Segmentation Decoder (PASD), which share the same Multi-Scale Feature Encoder (MSFE). MSFE introduces an attention module named Channel Space Fusion Module (CSFM) to extract structure and intensity distribution features of neurons at different scales to address the problem of anisotropy in 3D space. Then, EFC is designed to classify these feature maps based on external features, such as foreground intensity distributions and image smoothness, and to select specific PASD parameters to decode the feature maps of each class, yielding accurate segmentation results. PASD contains multiple parameter sets, each trained on representative image blocks with different complex signal-to-noise distributions, to handle various images more robustly. Experimental results prove that, compared with other advanced segmentation methods for neuron reconstruction, the proposed method achieves state-of-the-art results in the task of neuron reconstruction from ultrascale brain images, with an improvement of about 49% in speed and 12% in F1 score.
12. Wang J, Tang Y, Xiao Y, Zhou JT, Fang Z, Yang F. GREnet: Gradually REcurrent Network With Curriculum Learning for 2-D Medical Image Segmentation. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:10018-10032. [PMID: 37022080] [DOI: 10.1109/tnnls.2023.3238381]
Abstract
Medical image segmentation is a vital stage in medical image analysis. Numerous deep-learning methods have emerged to improve the performance of 2-D medical image segmentation, owing to the fast growth of convolutional neural networks. Generally, the manually defined ground truth is utilized directly to supervise models in the training phase. However, direct supervision of the ground truth often results in ambiguity and distractors as complex challenges appear simultaneously. To alleviate this issue, we propose a gradually recurrent network with curriculum learning, which is supervised by gradually revealed information from the ground truth. The whole model is composed of two independent networks. One is the segmentation network denoted as GREnet, which formulates 2-D medical image segmentation as a temporal task supervised by pixel-level gradual curricula in the training phase. The other is a curriculum-mining network. To a certain degree, the curriculum-mining network provides curricula of increasing difficulty from the ground truth of the training set by progressively uncovering hard-to-segment pixels in a data-driven manner. Given that segmentation is a pixel-level dense-prediction challenge, this is, to the best of our knowledge, the first work to formulate 2-D medical image segmentation as a temporal task with pixel-level curriculum learning. In GREnet, the naive UNet is adopted as the backbone, while ConvLSTM is used to establish the temporal link between gradual curricula. In the curriculum-mining network, UNet++ supplemented by a transformer is designed to deliver curricula through the outputs of the modified UNet++ at different layers. Experimental results have demonstrated the effectiveness of GREnet on seven datasets, i.e., three lesion segmentation datasets in dermoscopic images, an optic disc and cup segmentation dataset and a blood vessel segmentation dataset in retinal images, a breast lesion segmentation dataset in ultrasound images, and a lung segmentation dataset in computed tomography (CT).
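The pixel-level curriculum idea, supervising with easy pixels first and uncovering harder ones as training proceeds, can be approximated as follows. Here "difficulty" is the current prediction error and the supervised fraction grows with a progress parameter; the real GREnet mines its curricula with a separate network, so this is only a conceptual stand-in.

```python
# Illustrative sketch: pixel-level curriculum supervision that grows with training progress.
import torch
import torch.nn.functional as F

def curriculum_mask(logits, target, progress):
    """Keep the easiest `progress` fraction of pixels (smallest per-pixel BCE error)."""
    err = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    k = max(1, int(progress * err.numel()))
    threshold = err.flatten().kthvalue(k).values
    return (err <= threshold).float()

def curriculum_loss(logits, target, progress):
    mask = curriculum_mask(logits, target, progress).detach()
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (mask * loss).sum() / mask.sum().clamp(min=1)

if __name__ == "__main__":
    logits = torch.randn(2, 1, 32, 32)
    gt = torch.randint(0, 2, (2, 1, 32, 32)).float()
    for progress in (0.3, 0.6, 1.0):              # curriculum grows as training proceeds
        print(progress, curriculum_loss(logits, gt, progress).item())
```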
13. Liu M, Han Y, Wang J, Wang C, Wang Y, Meijering E. LSKANet: Long Strip Kernel Attention Network for Robotic Surgical Scene Segmentation. IEEE Transactions on Medical Imaging 2024; 43:1308-1322. [PMID: 38015689] [DOI: 10.1109/tmi.2023.3335406]
Abstract
Surgical scene segmentation is a critical task in Robotic-assisted surgery. However, the complexity of the surgical scene, which mainly includes local feature similarity (e.g., between different anatomical tissues), intraoperative complex artifacts, and indistinguishable boundaries, poses significant challenges to accurate segmentation. To tackle these problems, we propose the Long Strip Kernel Attention network (LSKANet), including two well-designed modules named Dual-block Large Kernel Attention module (DLKA) and Multiscale Affinity Feature Fusion module (MAFF), which can implement precise segmentation of surgical images. Specifically, by introducing strip convolutions with different topologies (cascaded and parallel) in two blocks and a large kernel design, DLKA can make full use of region- and strip-like surgical features and extract both visual and structural information to reduce the false segmentation caused by local feature similarity. In MAFF, affinity matrices calculated from multiscale feature maps are applied as feature fusion weights, which helps to address the interference of artifacts by suppressing the activations of irrelevant regions. Besides, the hybrid loss with Boundary Guided Head (BGH) is proposed to help the network segment indistinguishable boundaries effectively. We evaluate the proposed LSKANet on three datasets with different surgical scenes. The experimental results show that our method achieves new state-of-the-art results on all three datasets with improvements of 2.6%, 1.4%, and 3.4% mIoU, respectively. Furthermore, our method is compatible with different backbones and can significantly increase their segmentation accuracy. Code is available at https://github.com/YubinHan73/LSKANet.
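A large-kernel strip-convolution attention block of the general kind described here can be sketched as a pair of cascaded 1×k and k×1 depthwise convolutions whose output gates the input features. The kernel size and wiring below are assumptions, not the published DLKA module.

```python
# Illustrative sketch: strip-shaped large-kernel attention over feature maps.
import torch
import torch.nn as nn

class StripAttention(nn.Module):
    def __init__(self, channels, k=11):
        super().__init__()
        # Cascaded strips approximate a k x k kernel with far fewer parameters.
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2),
                                    groups=channels)
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0),
                                  groups=channels)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        attn = self.proj(self.vertical(self.horizontal(x)))
        return x * torch.sigmoid(attn)            # strip-shaped context gates the features

if __name__ == "__main__":
    block = StripAttention(channels=64)
    print(block(torch.randn(2, 64, 48, 48)).shape)   # torch.Size([2, 64, 48, 48])
```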
14. Xu S, Duan L, Zhang Y, Zhang Z, Sun T, Tian L. Graph- and transformer-guided boundary aware network for medical image segmentation. Computer Methods and Programs in Biomedicine 2023; 242:107849. [PMID: 37837887] [DOI: 10.1016/j.cmpb.2023.107849]
Abstract
BACKGROUND AND OBJECTIVE Despite the considerable progress achieved by U-Net-based models, medical image segmentation remains a challenging task due to complex backgrounds, irrelevant noises, and ambiguous boundaries. In this study, we present a novel approach called U-shaped Graph- and Transformer-guided Boundary Aware Network (GTBA-Net) to tackle these challenges. METHODS GTBA-Net uses the pre-trained ResNet34 as its basic structure, and involves Global Feature Aggregation (GFA) modules for target localization, Graph-based Dynamic Feature Fusion (GDFF) modules for effective noise suppression, and Uncertainty-based Boundary Refinement (UBR) modules for accurate delineation of ambiguous boundaries. The GFA modules employ an efficient self-attention mechanism to facilitate coarse target localization amidst complex backgrounds, without introducing additional computational complexity. The GDFF modules leverage graph attention mechanism to aggregate information hidden among high- and low-level features, effectively suppressing target-irrelevant noises while preserving valuable spatial details. The UBR modules introduce an uncertainty quantification strategy and auxiliary loss to guide the model's focus towards target regions and uncertain "ridges", gradually mitigating boundary uncertainty and ultimately achieving accurate boundary delineation. RESULTS Comparative experiments on five datasets encompassing diverse modalities (including X-ray, CT, endoscopic procedures, and ultrasound) demonstrate that the proposed GTBA-Net outperforms existing methods in various challenging scenarios. Subsequent ablation studies further demonstrate the efficacy of the GFA, GDFF, and UBR modules in target localization, noise suppression, and ambiguous boundary delineation, respectively. CONCLUSIONS GTBA-Net exhibits substantial potential for extensive application in the field of medical image segmentation, particularly in scenarios involving complex backgrounds, target-irrelevant noises, or ambiguous boundaries.
Affiliation(s)
- Shanshan Xu
- School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China; Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China
- Lianhong Duan
- The Second School of Clinical Medicine, Southern Medical University, Guangzhou, China; Senior Department of Orthopedics, The Fourth Medical Center of PLA General Hospital, Beijing, China
- Yang Zhang
- Senior Department of Orthopedics, The Fourth Medical Center of PLA General Hospital, Beijing, China
- Zhicheng Zhang
- Senior Department of Orthopedics, The Fourth Medical Center of PLA General Hospital, Beijing, China
- Tiansheng Sun
- The Second School of Clinical Medicine, Southern Medical University, Guangzhou, China; Senior Department of Orthopedics, The Fourth Medical Center of PLA General Hospital, Beijing, China.
- Lixia Tian
- School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China.
15. Jiang X, Zhu Y, Liu Y, Wang N, Yi L. MC-DC: An MLP-CNN Based Dual-path Complementary Network for Medical Image Segmentation. Computer Methods and Programs in Biomedicine 2023; 242:107846. [PMID: 37806121] [DOI: 10.1016/j.cmpb.2023.107846]
Abstract
BACKGROUND Fusing the CNN and Transformer in the encoder has recently achieved outstanding performance in medical image segmentation. However, two obvious limitations require addressing: (1) the utilization of the Transformer leads to a heavy parameter count, and its intricate structure demands ample data and resources for training, and (2) most previous research has predominantly focused on enhancing the performance of the feature encoder, with little emphasis placed on the design of the feature decoder. METHODS To this end, we propose a novel MLP-CNN based dual-path complementary (MC-DC) network for medical image segmentation, which replaces the complex Transformer with a cost-effective Multi-Layer Perceptron (MLP). Specifically, a dual-path complementary (DPC) module is designed to effectively fuse multi-level features from the MLP and CNN. To respectively reconstruct global and local information, a dual-path decoder is proposed, mainly composed of a cross-scale global feature fusion (CS-GF) module and a cross-scale local feature fusion (CS-LF) module. Moreover, we leverage a simple and efficient segmentation mask feature fusion (SMFF) module to merge the segmentation outcomes generated by the dual-path decoder. RESULTS Comprehensive experiments were performed on three typical medical image segmentation tasks. For skin lesion segmentation, our MC-DC network achieved 91.69% Dice and 9.52 mm ASSD on the ISIC2018 dataset. In addition, 91.6% and 94.4% Dice were obtained on the Kvasir-SEG dataset and CVC-ClinicDB dataset, respectively, for polyp segmentation. Moreover, we also conducted experiments on the private COVID-DS36 dataset for lung lesion segmentation. Our MC-DC achieved 87.6% [87.1%, 88.1%], and 92.3% [91.8%, 92.7%] on ground-glass opacity, interstitial infiltration, and lung consolidation, respectively. CONCLUSIONS The experimental results indicate that the proposed MC-DC network exhibits exceptional generalization capability and surpasses other state-of-the-art methods, achieving higher accuracy with lower computational complexity.
Affiliation(s)
- Xiaoben Jiang
- School of Information Science and Technology, East China University of Science and Technology, Shanghai, 200237, China
- Yu Zhu
- School of Information Science and Technology, East China University of Science and Technology, Shanghai, 200237, China.
- Yatong Liu
- School of Information Science and Technology, East China University of Science and Technology, Shanghai, 200237, China
- Nan Wang
- School of Information Science and Technology, East China University of Science and Technology, Shanghai, 200237, China
- Lei Yi
- Department of Burn, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China.
16. Shen W, Wang Y, Liu M, Wang J, Ding R, Zhang Z, Meijering E. Branch Aggregation Attention Network for Robotic Surgical Instrument Segmentation. IEEE Transactions on Medical Imaging 2023; 42:3408-3419. [PMID: 37342952] [DOI: 10.1109/tmi.2023.3288127]
Abstract
Surgical instrument segmentation is of great significance to robot-assisted surgery, but the noise caused by reflection, water mist, and motion blur during the surgery, as well as the different forms of surgical instruments, greatly increases the difficulty of precise segmentation. A novel method called Branch Aggregation Attention network (BAANet) is proposed to address these challenges, which adopts a lightweight encoder and two designed modules, named Branch Balance Aggregation module (BBA) and Block Attention Fusion module (BAF), for efficient feature localization and denoising. By introducing the unique BBA module, features from multiple branches are balanced and optimized through a combination of addition and multiplication to complement strengths and effectively suppress noise. Furthermore, to fully integrate the contextual information and capture the region of interest, the BAF module is proposed in the decoder, which receives adjacent feature maps from the BBA module and localizes the surgical instruments from both global and local perspectives by utilizing a dual-branch attention mechanism. According to the experimental results, the proposed method has the advantage of being lightweight while outperforming the second-best method by 4.03%, 1.53%, and 1.34% in mIoU scores on three challenging surgical instrument datasets, respectively. Code is available at https://github.com/SWT-1014/BAANet.
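The add-and-multiply branch aggregation can be pictured with the toy module below: each branch is first balanced by a 1×1 projection, then the branches are combined by element-wise addition (to complement strengths) and multiplication (to suppress inconsistent activations). The branch count and wiring are illustrative assumptions, not the BBA module itself.

```python
# Illustrative sketch: balancing and aggregating multiple branches with addition and
# multiplication. Not the authors' BBA module.
import torch
import torch.nn as nn

class BranchAggregation(nn.Module):
    def __init__(self, channels, num_branches=3):
        super().__init__()
        self.balance = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels),
                          nn.ReLU(inplace=True))
            for _ in range(num_branches)])

    def forward(self, branches):
        balanced = [proj(b) for proj, b in zip(self.balance, branches)]
        summed = torch.stack(balanced).sum(dim=0)        # addition: complements strengths
        product = torch.stack(balanced).prod(dim=0)      # multiplication: suppresses noise
        return summed + product

if __name__ == "__main__":
    feats = [torch.randn(2, 32, 40, 40) for _ in range(3)]
    print(BranchAggregation(32)(feats).shape)            # torch.Size([2, 32, 40, 40])
```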
17. Kuang H, Wang Y, Liang Y, Liu J, Wang J. BEA-Net: Body and Edge Aware Network With Multi-Scale Short-Term Concatenation for Medical Image Segmentation. IEEE J Biomed Health Inform 2023; 27:4828-4839. [PMID: 37578920] [DOI: 10.1109/jbhi.2023.3304662]
Abstract
Medical image segmentation is indispensable for diagnosis and prognosis of many diseases. To improve the segmentation performance, this study proposes a new 2D body and edge aware network with multi-scale short-term concatenation for medical image segmentation. Multi-scale short-term concatenation modules which concatenate successive convolution layers with different receptive fields, are proposed for capturing multi-scale representations with fewer parameters. Body generation modules with feature adjustment based on weight map computing via enlarging the receptive fields, and edge generation modules with multi-scale convolutions using Sobel kernels for edge detection, are proposed to separately learn body and edge features from convolutional features in decoders, making the proposed network be body and edge aware. Based on the body and edge modules, we design parallel body and edge decoders whose outputs are fused to achieve the final segmentation. Besides, deep supervision from the body and edge decoders is applied to ensure the effectiveness of the generated body and edge features and further improve the final segmentation. The proposed method is trained and evaluated on six public medical image segmentation datasets to show its effectiveness and generality. Experimental results show that the proposed method achieves better average Dice similarity coefficient and 95% Hausdorff distance than several benchmarks on all used datasets. Ablation studies validate the effectiveness of the proposed multi-scale representation learning modules, body and edge generation modules and deep supervision.
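Edge generation with Sobel kernels, as mentioned above, essentially means convolving feature maps with fixed gradient filters and taking the gradient magnitude. A depthwise version is sketched below; applying one Sobel pair per channel is an illustrative choice, not the paper's exact module.

```python
# Illustrative sketch: fixed Sobel kernels applied depthwise to produce edge-aware features.
import torch
import torch.nn as nn

class SobelEdge(nn.Module):
    def __init__(self, channels):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        weight = torch.stack([gx, gy]).repeat(channels, 1, 1).unsqueeze(1)   # (2C, 1, 3, 3)
        self.conv = nn.Conv2d(channels, 2 * channels, 3, padding=1,
                              groups=channels, bias=False)
        self.conv.weight = nn.Parameter(weight, requires_grad=False)         # fixed kernels

    def forward(self, x):
        grads = self.conv(x)                               # per-channel x- and y-gradients
        gx, gy = grads[:, 0::2], grads[:, 1::2]
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)        # gradient magnitude per channel

if __name__ == "__main__":
    edge = SobelEdge(channels=16)
    print(edge(torch.randn(2, 16, 32, 32)).shape)          # torch.Size([2, 16, 32, 32])
```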
18. Liu Z, Lv Q, Yang Z, Li Y, Lee CH, Shen L. Recent progress in transformer-based medical image analysis. Comput Biol Med 2023; 164:107268. [PMID: 37494821] [DOI: 10.1016/j.compbiomed.2023.107268]
Abstract
The transformer is primarily used in the field of natural language processing. Recently, it has been adopted and shows promise in the computer vision (CV) field. Medical image analysis (MIA), as a critical branch of CV, also greatly benefits from this state-of-the-art technique. In this review, we first recap the core component of the transformer, the attention mechanism, and the detailed structures of the transformer. After that, we depict the recent progress of the transformer in the field of MIA. We organize the applications in a sequence of different tasks, including classification, segmentation, captioning, registration, detection, enhancement, localization, and synthesis. The mainstream classification and segmentation tasks are further divided into eleven medical image modalities. A large number of experiments studied in this review illustrate that the transformer-based method outperforms existing methods through comparisons with multiple evaluation metrics. Finally, we discuss the open challenges and future opportunities in this field. This task-modality review with the latest contents, detailed information, and comprehensive comparison may greatly benefit the broad MIA community.
Affiliation(s)
- Zhaoshan Liu
- Department of Mechanical Engineering, National University of Singapore, 9 Engineering Drive 1, Singapore, 117575, Singapore.
- Qiujie Lv
- Department of Mechanical Engineering, National University of Singapore, 9 Engineering Drive 1, Singapore, 117575, Singapore; School of Intelligent Systems Engineering, Sun Yat-sen University, No. 66, Gongchang Road, Guangming District, 518107, China.
- Ziduo Yang
- Department of Mechanical Engineering, National University of Singapore, 9 Engineering Drive 1, Singapore, 117575, Singapore; School of Intelligent Systems Engineering, Sun Yat-sen University, No. 66, Gongchang Road, Guangming District, 518107, China.
- Yifan Li
- Department of Mechanical Engineering, National University of Singapore, 9 Engineering Drive 1, Singapore, 117575, Singapore.
- Chau Hung Lee
- Department of Radiology, Tan Tock Seng Hospital, 11 Jalan Tan Tock Seng, Singapore, 308433, Singapore.
- Lei Shen
- Department of Mechanical Engineering, National University of Singapore, 9 Engineering Drive 1, Singapore, 117575, Singapore.
19. Wang F, Xiao C, Jia T, Pan L, Du F, Wang Z. Hepatobiliary surgery based on intelligent image segmentation technology. Open Life Sci 2023; 18:20220674. [PMID: 37671090] [PMCID: PMC10476479] [DOI: 10.1515/biol-2022-0674]
Abstract
Liver disease is an important disease that seriously threatens human health. It accounts for a high proportion of malignant tumors, and its incidence and mortality are on the rise. Modern imaging has developed rapidly, but the application of image segmentation in liver tumor surgery is still rare. The application of image processing technology represented by artificial intelligence (AI) in surgery can greatly improve the efficiency of surgery, reduce surgical complications, and reduce the cost of surgery. Hepatocellular carcinoma is the most common malignant tumor in the world, and its mortality is second only to lung cancer. The resection rate of liver cancer surgery is high, and it is a multidisciplinary operation, so it is necessary to explore the possibility of effective coordination between different disciplines. Resection of hepatobiliary and pancreatic tumors is one of the most challenging and lethal surgical procedures; it requires a high level of surgical experience and understanding of anatomical structures, and surgical segmentation is slow and may involve obvious complications. Therefore, the surgical system needs to make full use of the relevant functions of AI technology and computer vision analysis software, combining processing strategies based on image processing algorithms and computer vision analysis models. Intelligent optimization algorithms, also known as modern heuristic algorithms, offer global optimization performance, strong generality, and suitability for parallel processing. Such algorithms generally have a rigorous theoretical basis rather than relying solely on expert experience, and in theory an optimal or near-optimal solution can be found within a bounded time. This work studies hepatobiliary surgery supported by intelligent image segmentation technology and analyzes the results with an intelligent optimization algorithm. The results showed that, with other conditions equal, three patients (10%) undergoing hepatobiliary surgery with intelligent image segmentation technology had adverse reactions, whereas nine patients (30%) undergoing surgery by conventional methods had adverse reactions, a significantly higher rate, indicating a positive contribution of intelligent image segmentation technology to hepatobiliary surgery.
Affiliation(s)
- Fuchuan Wang
- Faculty of Hepatology Medicine, Chinese People’s Liberation Army (PLA) General Hospital, Beijing 100039, China
- Chaohui Xiao
- Faculty of Hepato-Biliary-Pancreatic Surgery, Chinese People’s Liberation Army (PLA) General Hospital, Beijing 100853, China
- Tianye Jia
- Department of Laboratory, Fifth Medical Center, Chinese People’s Liberation Army (PLA) General Hospital, Beijing 100039, China
- Liru Pan
- Faculty of Hepato-Biliary-Pancreatic Surgery, Chinese People’s Liberation Army (PLA) General Hospital, Beijing 100853, China
- Fengxia Du
- Faculty of Hepatology Medicine, Chinese People’s Liberation Army (PLA) General Hospital, Beijing 100039, China
- Zhaohai Wang
- Faculty of Hepato-Biliary-Pancreatic Surgery, Chinese People’s Liberation Army (PLA) General Hospital, Beijing 100853, China
20. Khan S, Ali H, Shah Z. Identifying the role of vision transformer for skin cancer-A scoping review. Front Artif Intell 2023; 6:1202990. [PMID: 37529760] [PMCID: PMC10388102] [DOI: 10.3389/frai.2023.1202990]
Abstract
Introduction Detecting and accurately diagnosing early melanocytic lesions is challenging due to extensive intra- and inter-observer variabilities. Dermoscopy images are widely used to identify and study skin cancer, but the blurred boundaries between lesions and surrounding tissues can lead to incorrect identification. Artificial Intelligence (AI) models, including vision transformers, have been proposed as a solution, but variations in symptoms and underlying effects hinder their performance. Objective This scoping review synthesizes and analyzes the literature that uses vision transformers for skin lesion detection. Methods The review follows the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines. The review searched online repositories such as IEEE Xplore, Scopus, Google Scholar, and PubMed to retrieve relevant articles. After screening and pre-processing, 28 studies that fulfilled the inclusion criteria were included. Results and discussions The review found that the use of vision transformers for skin cancer detection has rapidly increased from 2020 to 2022 and has shown outstanding performance for skin cancer detection using dermoscopy images. Along with highlighting intrinsic visual ambiguities, irregular skin lesion shapes, and many other unwanted challenges, the review also discusses the key problems that obfuscate the trustworthiness of vision transformers in skin cancer diagnosis. This review provides new insights for practitioners and researchers to understand the current state of knowledge in this specialized research domain and outlines the best segmentation techniques to identify accurate lesion boundaries and perform melanoma diagnosis. These findings will ultimately assist practitioners and researchers in making more informed decisions promptly.
Affiliation(s)
- Zubair Shah
- College of Science and Engineering, Hamad Bin Khalifa University, Qatar Foundation, Doha, Qatar
21. Liu L, Liang C, Xue Y, Chen T, Chen Y, Lan Y, Wen J, Shao X, Chen J. An Intelligent Diagnostic Model for Melasma Based on Deep Learning and Multimode Image Input. Dermatol Ther (Heidelb) 2023; 13:569-579. [PMID: 36577888] [PMCID: PMC9884721] [DOI: 10.1007/s13555-022-00874-z]
Abstract
INTRODUCTION The diagnosis of melasma is often based on the naked-eye judgment of physicians. However, this is a challenge for inexperienced physicians and non-professionals, and incorrect treatment might have serious consequences. Therefore, it is important to develop an accurate method for melasma diagnosis. The objective of this study is to develop and validate an intelligent diagnostic system based on deep learning for melasma images. METHODS A total of 8010 images in the VISIA system, comprising 4005 images of patients with melasma and 4005 images of patients without melasma, were collected for training and testing. Inspired by four high-performance structures (i.e., DenseNet, ResNet, Swin Transformer, and MobileNet), the performances of deep learning models in melasma and non-melasma binary classifiers were evaluated. Furthermore, considering that there were five modes of images for each shot in VISIA, we fused these modes via multichannel image input in different combinations to explore whether multimode images could improve network performance. RESULTS The proposed network based on DenseNet121 achieved the best performance with an accuracy of 93.68% and an area under the curve (AUC) of 97.86% on the test set for the melasma classifier. The results of the Gradient-weighted Class Activation Mapping showed that it was interpretable. In further experiments, for the five modes of the VISIA system, we found the best performing mode to be "BROWN SPOTS." Additionally, the combination of "NORMAL," "BROWN SPOTS," and "UV SPOTS" modes significantly improved the network performance, achieving the highest accuracy of 97.4% and AUC of 99.28%. CONCLUSIONS In summary, deep learning is feasible for diagnosing melasma. The proposed network not only has excellent performance with clinical images of melasma, but can also acquire high accuracy by using multiple modes of images in VISIA.
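The multimode-input idea can be realized by stacking the selected VISIA capture modes along the channel axis and widening the first convolution of a DenseNet121 backbone. The sketch below (requires torchvision) follows the three-mode combination named in the abstract, but the stem re-initialization and everything else are assumptions, not the authors' code.

```python
# Illustrative sketch: multichannel image input for a DenseNet121 binary classifier.
import torch
import torch.nn as nn
from torchvision import models

def densenet_multimode(num_modes=3, num_classes=2):
    model = models.densenet121(weights=None)
    old = model.features.conv0                       # original stem: 3 -> 64, 7x7, stride 2
    model.features.conv0 = nn.Conv2d(3 * num_modes, 64, kernel_size=7, stride=2,
                                     padding=3, bias=False)
    with torch.no_grad():                            # reuse the RGB stem weights for every mode
        model.features.conv0.weight.copy_(old.weight.repeat(1, num_modes, 1, 1) / num_modes)
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    return model

if __name__ == "__main__":
    net = densenet_multimode()
    modes = torch.rand(1, 9, 224, 224)               # e.g., NORMAL + BROWN SPOTS + UV SPOTS stacked
    print(net(modes).shape)                          # torch.Size([1, 2])
```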
Affiliation(s)
- Lin Liu
- Department of Dermatology, The First Affiliated Hospital of Chongqing Medical University, No.1 Youyi Road, Yuzhong District, Chongqing, 400016, China
- Medical Data Science Academy, Chongqing Medical University, Chongqing, China
- Chen Liang
- College of Computer Science, Sichuan University, Chengdu, Sichuan, China
- Yuzhou Xue
- Department of Cardiology and Institute of Vascular Medicine, Peking University Third Hospital, Beijing, China
- Tingqiao Chen
- Department of Dermatology, The First Affiliated Hospital of Chongqing Medical University, No.1 Youyi Road, Yuzhong District, Chongqing, 400016, China
- Yangmei Chen
- Department of Dermatology, The First Affiliated Hospital of Chongqing Medical University, No.1 Youyi Road, Yuzhong District, Chongqing, 400016, China
- Yufan Lan
- Chongqing Medical University, Chongqing, China
- Jiamei Wen
- Department of Otolaryngology-Head and Neck Surgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Xinyi Shao
- Department of Dermatology, The First Affiliated Hospital of Chongqing Medical University, No.1 Youyi Road, Yuzhong District, Chongqing, 400016, China
- Jin Chen
- Department of Dermatology, The First Affiliated Hospital of Chongqing Medical University, No.1 Youyi Road, Yuzhong District, Chongqing, 400016, China.