1. Wu J, Wang Z, Hong M, Ji W, Fu H, Xu Y, Xu M, Jin Y. Medical SAM adapter: Adapting segment anything model for medical image segmentation. Med Image Anal 2025;102:103547. [PMID: 40121809] [DOI: 10.1016/j.media.2025.103547]
Abstract
The Segment Anything Model (SAM) has recently gained popularity in the field of image segmentation due to its impressive capabilities in various segmentation tasks and its prompt-based interface. However, recent studies and individual experiments have shown that SAM underperforms in medical image segmentation due to the lack of medical-specific knowledge. This raises the question of how to enhance SAM's segmentation capability for medical images. We propose the Medical SAM Adapter (Med-SA), which is one of the first methods to integrate SAM into medical image segmentation. Med-SA uses a light yet effective adaptation technique instead of fine-tuning the SAM model, incorporating domain-specific medical knowledge into the segmentation model. We also propose Space-Depth Transpose (SD-Trans) to adapt 2D SAM to 3D medical images and Hyper-Prompting Adapter (HyP-Adpt) to achieve prompt-conditioned adaptation. Comprehensive evaluation experiments on 17 medical image segmentation tasks across various modalities demonstrate the superior performance of Med-SA while updating only 2% of the SAM parameters (13M). Our code is released at https://github.com/KidsWithTokens/Medical-SAM-Adapter.
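The adapter idea described above can be pictured with a minimal sketch: a small bottleneck module is attached to each frozen transformer block so that only the adapter weights are trained. The module names and dimensions below are illustrative assumptions, not the authors' Med-SA implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wraps a frozen backbone block and adds a trainable adapter after it."""
    def __init__(self, frozen_block: nn.Module, dim: int):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False      # the backbone stays frozen
        self.adapter = Adapter(dim)      # only these weights are updated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))

# toy usage: nn.Identity stands in for a pre-trained transformer block
blk = AdaptedBlock(nn.Identity(), dim=768)
out = blk(torch.randn(2, 196, 768))      # only blk.adapter parameters require grad
```

Training then optimizes just the adapter parameters, which is how a parameter-efficient scheme of this kind can update only a small fraction of the backbone weights.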
Affiliation(s)
- Junde Wu
- Department of Biomedical Engineering, National University of Singapore, Singapore
- Ziyue Wang
- Department of Electrical and Computer Engineering, National University of Singapore, Singapore
- Mingxuan Hong
- Department of Biomedical Engineering, National University of Singapore, Singapore
- Wei Ji
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2R3, Canada
- Huazhu Fu
- Institute of High-Performance Computing, Agency for Science, Technology and Research, 138632, Singapore
- Yanwu Xu
- Singapore Eye Research Institute, Singapore
- Min Xu
- Computer Vision Department, Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates; Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States of America
- Yueming Jin
- Department of Biomedical Engineering, National University of Singapore, Singapore; Department of Electrical and Computer Engineering, National University of Singapore, Singapore
2. Zheng S, Rao J, Zhang J, Zhou L, Xie J, Cohen E, Lu W, Li C, Yang Y. Cross-Modal Graph Contrastive Learning with Cellular Images. Adv Sci (Weinh) 2024;11:e2404845. [PMID: 39031820] [PMCID: PMC11348220] [DOI: 10.1002/advs.202404845]
Abstract
Constructing discriminative representations of molecules lies at the core of domains such as drug discovery, chemistry, and medicine. State-of-the-art methods employ graph neural networks and self-supervised learning (SSL) to learn structural representations from unlabeled data, which can then be fine-tuned for downstream tasks. Albeit powerful, these methods are pre-trained solely on molecular structures and thus often struggle with tasks involving intricate biological processes. Here, it is proposed to assist molecular representation learning with perturbed high-content cell microscopy images at the phenotypic level. To incorporate the cross-modal pre-training, a unified framework is constructed to align the two modalities through multiple types of contrastive loss functions, which proves effective on the newly formulated tasks of mutually retrieving molecules and their corresponding images. More importantly, the model can infer functional molecules from cellular images generated by genetic perturbations. In parallel, the proposed model transfers non-trivially to molecular property prediction and shows substantial improvement on clinical outcome prediction. These results suggest that such cross-modality learning can bridge molecules and phenotypes and play an important role in drug discovery.
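As an illustration of the cross-modal alignment described above, the sketch below shows a generic symmetric InfoNCE objective that pulls matched (molecule, cell-image) embedding pairs together. The encoder outputs are random stand-ins and this is not the paper's exact combination of contrastive losses.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(mol_emb: torch.Tensor,
                        img_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched (molecule, image) pairs sit on the diagonal."""
    mol = F.normalize(mol_emb, dim=-1)
    img = F.normalize(img_emb, dim=-1)
    logits = mol @ img.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(mol.size(0), device=mol.device)
    loss_m2i = F.cross_entropy(logits, targets)      # retrieve image from molecule
    loss_i2m = F.cross_entropy(logits.t(), targets)  # retrieve molecule from image
    return 0.5 * (loss_m2i + loss_i2m)

# toy usage with random tensors standing in for GNN / image-encoder outputs
loss = cross_modal_infonce(torch.randn(8, 256), torch.randn(8, 256))
```

An objective of this shape also explains why the trained model supports mutual retrieval: ranking the off-diagonal similarities directly answers "which image matches this molecule" and vice versa.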
Affiliation(s)
- Shuangjia Zheng
- Global Institute of Future Technology, Shanghai Jiao Tong University, Shanghai 200240, China
- Jiahua Rao
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Lianyu Zhou
- School of Informatics, Xiamen University, Xiamen 361005, China
- Jiancong Xie
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Ethan Cohen
- IBENS, Ecole Normale Supérieure, PSL Research Institute, Paris, France
- Wei Lu
- Galixir Technologies, Shanghai 200100, China
- Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
3. Liu L, Song X, Wang M, Dai Y, Liu Y, Zhang L. AGDF-Net: Learning Domain Generalizable Depth Features With Adaptive Guidance Fusion. IEEE Trans Pattern Anal Mach Intell 2024;46:3137-3155. [PMID: 38090832] [DOI: 10.1109/tpami.2023.3342634]
Abstract
Cross-domain generalizable depth estimation aims to estimate the depth of target domains (i.e., real-world) using models trained on the source domains (i.e., synthetic). Previous methods mainly use additional real-world domain datasets to extract depth specific information for cross-domain generalizable depth estimation. Unfortunately, due to the large domain gap, adequate depth specific information is hard to obtain and interference is difficult to remove, which limits the performance. To relieve these problems, we propose a domain generalizable feature extraction network with adaptive guidance fusion (AGDF-Net) to fully acquire essential features for depth estimation at multi-scale feature levels. Specifically, our AGDF-Net first separates the image into initial depth and weak-related depth components with reconstruction and contrary losses. Subsequently, an adaptive guidance fusion module is designed to sufficiently intensify the initial depth features for domain generalizable intensified depth features acquisition. Finally, taking intensified depth features as input, an arbitrary depth estimation network can be used for real-world depth estimation. Using only synthetic datasets, our AGDF-Net can be applied to various real-world datasets (i.e., KITTI, NYUDv2, NuScenes, DrivingStereo and CityScapes) with state-of-the-art performances. Furthermore, experiments with a small amount of real-world data in a semi-supervised setting also demonstrate the superiority of AGDF-Net over state-of-the-art approaches.
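The "adaptive guidance fusion" step can be pictured with a simple gated-fusion sketch: a learned per-pixel gate decides how strongly guidance features intensify the initial depth features. This is a generic formulation under assumed tensor shapes, not the AGDF-Net module itself.

```python
import torch
import torch.nn as nn

class AdaptiveGuidanceFusion(nn.Module):
    """Gated fusion: a learned gate weights how much the guidance features
    reinforce the initial depth features before refinement (illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, depth_feat: torch.Tensor, guide_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([depth_feat, guide_feat], dim=1))  # per-pixel weights in [0, 1]
        return self.refine(depth_feat + g * guide_feat)

fused = AdaptiveGuidanceFusion(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```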
4. Liu Y, Li G, Lin L. Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering. IEEE Trans Pattern Anal Mach Intell 2023;45:11624-11641. [PMID: 37289602] [DOI: 10.1109/tpami.2023.3284038]
Abstract
Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes that fail to capture event temporality, causality, and dynamics spanning over the video. In this work, to address the task of event-level visual question answering, we propose a framework for cross-modal causal relational reasoning. In particular, a set of causal intervention operations is introduced to discover the underlying causal structures across visual and linguistic modalities. Our framework, named Cross-Modal Causal RelatIonal Reasoning (CMCIR), involves three modules: i) Causality-aware Visual-Linguistic Reasoning (CVLR) module for collaboratively disentangling the visual and linguistic spurious correlations via front-door and back-door causal interventions; ii) Spatial-Temporal Transformer (STT) module for capturing the fine-grained interactions between visual and linguistic semantics; iii) Visual-Linguistic Feature Fusion (VLFF) module for learning the global semantic-aware visual-linguistic representations adaptively. Extensive experiments on four event-level datasets demonstrate the superiority of our CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.
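For context, the back-door and front-door adjustments that causal-intervention modules of this kind build on are, in Pearl's standard notation (general formulas, not the paper's specific instantiation):

```latex
% Back-door adjustment over a confounder set Z
P(Y \mid do(X=x)) = \sum_{z} P(Y \mid X=x, Z=z)\, P(Z=z)

% Front-door adjustment through a mediator M
P(Y \mid do(X=x)) = \sum_{m} P(M=m \mid X=x) \sum_{x'} P(Y \mid M=m, X=x')\, P(X=x')
```

Intuitively, the back-door form removes spurious correlations by averaging over confounders, while the front-door form routes the effect through a mediator when confounders are unobserved; the CVLR module described above applies interventions of this flavor to the visual and linguistic features.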
5. Lin W, Ding X, Huang Y, Zeng H. Self-Supervised Video-Based Action Recognition With Disturbances. IEEE Trans Image Process 2023;32:2493-2507. [PMID: 37099471] [DOI: 10.1109/tip.2023.3269228]
Abstract
Self-supervised video-based action recognition is a challenging task that requires extracting the principal information characterizing an action from content-diversified videos over large unlabeled datasets. However, most existing methods exploit the natural spatio-temporal properties of video to obtain effective action representations from a visual perspective, while ignoring the semantics that are closer to human cognition. To that end, a self-supervised Video-based Action Recognition method with Disturbances (VARD), which extracts the principal visual and semantic information of an action, is proposed. Specifically, according to cognitive neuroscience research, human recognition ability is activated by visual and semantic attributes. An intuitive observation is that minor changes to the actor or scene in a video do not affect a person's recognition of the action. On the other hand, different people tend to reach consistent judgments when recognizing the same action video. In other words, for an action video, the information that remains constant despite disturbances in the visual video or the semantic encoding process is sufficient to represent the action. Therefore, to learn such information, we construct a positive clip/embedding for each action video. Compared to the original clip/embedding, the positive clip/embedding is disturbed visually/semantically by Video Disturbance and Embedding Disturbance. Our objective is to pull the positive closer to the original clip/embedding in the latent space. In this way, the network is driven to focus on the principal information of the action while the impact of sophisticated details and inconsequential variations is weakened. Notably, the proposed VARD does not require optical flow, negative samples, or pretext tasks. Extensive experiments conducted on the UCF101 and HMDB51 datasets demonstrate that the proposed VARD effectively improves a strong baseline and outperforms multiple classical and advanced self-supervised action recognition methods.
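The core training signal can be sketched as a negative-free consistency loss that pulls the disturbed (positive) embedding toward the original one in the latent space; the sketch below is a generic cosine-similarity formulation, not the exact VARD objective.

```python
import torch
import torch.nn.functional as F

def disturbance_consistency_loss(orig_emb: torch.Tensor,
                                 pos_emb: torch.Tensor) -> torch.Tensor:
    """Pull the disturbed (positive) embedding toward the original one;
    no negatives, optical flow, or pretext task involved (illustrative)."""
    orig = F.normalize(orig_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    return (1.0 - (orig * pos).sum(dim=-1)).mean()   # 1 - cosine similarity, averaged over the batch

# embeddings of original clips and of their visually/semantically disturbed versions
loss = disturbance_consistency_loss(torch.randn(4, 512), torch.randn(4, 512))
```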
6. Wu J, Sun W, Gan T, Ding N, Jiang F, Shen J, Nie L. Neighbor-Guided Consistent and Contrastive Learning for Semi-Supervised Action Recognition. IEEE Trans Image Process 2023;32:2215-2227. [PMID: 37040248] [DOI: 10.1109/tip.2023.3265261]
Abstract
Semi-supervised learning has been well established in the area of image classification but remains to be explored in video-based action recognition. FixMatch is a state-of-the-art semi-supervised method for image classification, but it does not work well when transferred directly to the video domain since it only utilizes the single RGB modality, which contains insufficient motion information. Moreover, it only leverages highly-confident pseudo-labels to explore consistency between strongly-augmented and weakly-augmented samples, resulting in limited supervised signals, long training time, and insufficient feature discriminability. To address the above issues, we propose neighbor-guided consistent and contrastive learning (NCCL), which takes both RGB and temporal gradient (TG) as input and is based on the teacher-student framework. Due to the limitation of labelled samples, we first incorporate neighbors information as a self-supervised signal to explore the consistent property, which compensates for the lack of supervised signals and the shortcoming of long training time of FixMatch. To learn more discriminative feature representations, we further propose a novel neighbor-guided category-level contrastive learning term to minimize the intra-class distance and enlarge the inter-class distance. We conduct extensive experiments on four datasets to validate the effectiveness. Compared with the state-of-the-art methods, our proposed NCCL achieves superior performance with much lower computational cost.
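The confidence-thresholded pseudo-labelling that this line of work inherits from FixMatch can be sketched as follows; the neighbor-guided consistency and category-level contrastive terms proposed in the paper are omitted, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def pseudo_label_consistency(teacher_logits_weak: torch.Tensor,
                             student_logits_strong: torch.Tensor,
                             threshold: float = 0.95) -> torch.Tensor:
    """Teacher predictions on weakly-augmented clips supervise the student on
    strongly-augmented clips, but only where the teacher is confident."""
    with torch.no_grad():
        probs = F.softmax(teacher_logits_weak, dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= threshold).float()           # keep only high-confidence pseudo-labels
    loss = F.cross_entropy(student_logits_strong, pseudo, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

loss = pseudo_label_consistency(torch.randn(16, 101), torch.randn(16, 101))
```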
7. Men Q, Zhao H, Drukker L, Papageorghiou AT, Noble JA. Towards Standard Plane Prediction of Fetal Head Ultrasound with Domain Adaption. Proc IEEE Int Symp Biomed Imaging 2023;28:1-5. [PMID: 39247516] [PMCID: PMC7616421] [DOI: 10.1109/isbi53787.2023.10230542]
Abstract
Fetal Standard Plane (SP) acquisition is a key step in ultrasound-based assessment of fetal health. The task is to identify an ultrasound (US) image containing predefined anatomy. However, acquiring a good SP in practice requires skill, and trainees and occasional users of ultrasound devices can find this challenging. In this work, we consider the task of automatically predicting the fetal head SP from the video approaching the SP. We adopt a domain transfer learning approach that maps the encoded spatial and temporal features of video in the source domain to the spatial representation of the desired SP image in the target domain, together with adversarial training to preserve the quality of the resulting image. Experimental results show that the predicted head plane is plausible and consistent with the anatomical features expected in a real SP. The proposed approach is motivated by supporting non-experts in finding and analysing a trans-ventricular (TV) plane, but it could also be generalized to other planes, trimesters, and ultrasound imaging tasks for which standard planes are defined.
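The adversarial part of a video-to-image domain transfer setup like this can be sketched with a standard GAN objective; the discriminator, loss form, and shapes below are generic stand-ins, not the authors' architecture.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def gan_losses(disc: nn.Module, real_sp: torch.Tensor, fake_sp: torch.Tensor):
    """Adversarial terms that keep the predicted standard-plane image realistic;
    `disc` is any discriminator returning one logit per image (illustrative)."""
    real_logit = disc(real_sp)
    fake_logit = disc(fake_sp.detach())              # do not backprop into the generator here
    d_loss = bce(real_logit, torch.ones_like(real_logit)) + \
             bce(fake_logit, torch.zeros_like(fake_logit))
    g_loss = bce(disc(fake_sp), torch.ones_like(real_logit))  # generator tries to fool the discriminator
    return d_loss, g_loss

disc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 1))     # toy discriminator over 64x64 images
d_loss, g_loss = gan_losses(disc, torch.randn(4, 64, 64), torch.randn(4, 64, 64))
```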
Affiliation(s)
- Qianhui Men
- Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, UK
- He Zhao
- Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, UK
- Lior Drukker
- Nuffield Department of Women's & Reproductive Health, University of Oxford, UK
- Department of Obstetrics and Gynecology, Tel-Aviv University, Israel
- J Alison Noble
- Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, UK
8. Zhao X, Shen Y, Wang S, Zhang H. Generating Diverse Augmented Attributes for Generalized Zero Shot Learning. Pattern Recognit Lett 2023. [DOI: 10.1016/j.patrec.2023.01.005]
9. Action Recognition Using Action Sequences Optimization and Two-Stream 3D Dilated Neural Network. Comput Intell Neurosci 2022;2022:6608448. [PMID: 35733557] [PMCID: PMC9208928] [DOI: 10.1155/2022/6608448]
Abstract
Effective extraction and representation of action information are critical in action recognition. Most existing methods fail to recognize actions accurately because background changes interfere when the proportion of high-activity action areas is not reinforced, and because they rely on the RGB modality alone or combined with optical flow. A novel recognition method using action sequence optimization and a two-stream fusion network with different modalities is proposed to solve these problems. The method is based on shot segmentation and dynamic weighted sampling, and it reconstructs the video by reinforcing the proportion of high-activity action areas, eliminating redundant intervals, and extracting long-range temporal information. A two-stream 3D dilated neural network that integrates RGB features and human skeleton information is also proposed. The human skeleton information strengthens the deep representation of humans for robust processing, alleviating the interference of background changes, while the dilated CNN enlarges the receptive field of feature extraction. Compared with existing approaches, the proposed method achieves superior or comparable classification accuracy on the benchmark datasets UCF101 and HMDB51.
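A toy version of the two-stream 3D dilated design is sketched below: each stream uses dilated 3D convolutions to enlarge the receptive field, and the RGB and skeleton streams are fused late. Channel counts, class count, and input shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DilatedStream(nn.Module):
    """A tiny 3D CNN stream with dilated convolutions to enlarge the receptive field."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 32, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, padding=4, dilation=4),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).flatten(1)                # (B, 64) clip descriptor

class TwoStreamClassifier(nn.Module):
    """Late fusion of an RGB stream and a skeleton-map stream (illustrative)."""
    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.rgb = DilatedStream(3)
        self.skeleton = DilatedStream(1)
        self.head = nn.Linear(64 + 64, num_classes)

    def forward(self, rgb_clip, skeleton_clip):
        feats = torch.cat([self.rgb(rgb_clip), self.skeleton(skeleton_clip)], dim=1)
        return self.head(feats)

logits = TwoStreamClassifier()(torch.randn(2, 3, 16, 56, 56), torch.randn(2, 1, 16, 56, 56))
```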
10. Liu Y, Wang K, Liu L, Lan H, Lin L. TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning. IEEE Trans Image Process 2022;31:1978-1993. [PMID: 35157584] [DOI: 10.1109/tip.2022.3147032]
Abstract
Video self-supervised learning is a challenging task, which requires significant expressive power from the model to leverage rich spatial-temporal knowledge and generate effective supervisory signals from large amounts of unlabeled videos. However, existing methods fail to increase the temporal diversity of unlabeled videos and ignore elaborately modeling multi-scale temporal dependencies in an explicit way. To overcome these limitations, we take advantage of the multi-scale temporal dependencies within videos and propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL), which jointly models the inter-snippet and intra-snippet temporal dependencies for temporal representation learning with a hybrid graph contrastive learning strategy. Specifically, a Spatial-Temporal Knowledge Discovering (STKD) module is first introduced to extract motion-enhanced spatial-temporal representations from videos based on the frequency domain analysis of discrete cosine transform. To explicitly model multi-scale temporal dependencies of unlabeled videos, our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG). Then, specific contrastive learning modules are designed to maximize the agreement between nodes in different graph views. To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module which leverages the relational knowledge among video snippets to learn the global context representation and recalibrate the channel-wise features adaptively. Experimental results demonstrate the superiority of our TCGL over the state-of-the-art methods on large-scale action recognition and video retrieval benchmarks. The code is publicly available at https://github.com/YangLiu9208/TCGL.
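The snippet-order supervisory signal (the spirit of the ASOP module above) can be sketched as a permutation-classification head over shuffled snippet features; this is a generic toy version, not the TCGL implementation.

```python
import itertools
import torch
import torch.nn as nn

class SnippetOrderPredictor(nn.Module):
    """Toy order-prediction head: given features of k shuffled snippets, classify
    which permutation was applied (a generic order pretext task)."""
    def __init__(self, feat_dim: int, k: int = 3):
        super().__init__()
        self.perms = list(itertools.permutations(range(k)))   # k! candidate orders
        self.head = nn.Sequential(
            nn.Linear(k * feat_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, len(self.perms)),
        )

    def forward(self, snippet_feats: torch.Tensor) -> torch.Tensor:
        # snippet_feats: (B, k, feat_dim) in shuffled order
        return self.head(snippet_feats.flatten(1))

logits = SnippetOrderPredictor(feat_dim=128)(torch.randn(4, 3, 128))  # (4, 6) permutation logits
```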
11. Liu Y, Wei YS, Yan H, Li GB, Lin L. Causal Reasoning Meets Visual Representation Learning: A Prospective Study. Mach Intell Res 2022;19:485-511. [PMCID: PMC9638478] [DOI: 10.1007/s11633-022-1362-z]
Abstract
Visual representation learning is ubiquitous in various real-world applications, including visual comprehension, video understanding, multi-modal analysis, human-computer interaction, and urban computing. Due to the emergence of huge amounts of multimodal heterogeneous spatial/temporal/spatial-temporal data in the big data era, the lack of interpretability, robustness, and out-of-distribution generalization are becoming the challenges of the existing visual models. The majority of the existing methods tend to fit the original data/variable distributions and ignore the essential causal relations behind the multi-modal knowledge, which lacks unified guidance and analysis about why modern visual representation learning methods easily collapse into data bias and have limited generalization and cognitive abilities. Inspired by the strong inference ability of human-level agents, recent years have therefore witnessed great effort in developing causal reasoning paradigms to realize robust representation and model learning with good cognitive ability. In this paper, we conduct a comprehensive review of existing causal reasoning methods for visual representation learning, covering fundamental theories, models, and datasets. The limitations of current methods and datasets are also discussed. Moreover, we propose some prospective challenges, opportunities, and future research directions for benchmarking causal reasoning algorithms in visual representation learning. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, bring to the forefront the urgency of developing novel causal reasoning methods, publicly available benchmarks, and consensus-building standards for reliable visual representation learning and related real-world applications more efficiently.
Affiliation(s)
- Yang Liu
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006 China
- Yu-Shen Wei
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006 China
- Hong Yan
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006 China
- Guan-Bin Li
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006 China
- Liang Lin
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006 China
12. Kitaguchi D, Takeshita N, Matsuzaki H, Igaki T, Hasegawa H, Ito M. Development and Validation of a 3-Dimensional Convolutional Neural Network for Automatic Surgical Skill Assessment Based on Spatiotemporal Video Analysis. JAMA Netw Open 2021;4:e2120786. [PMID: 34387676] [PMCID: PMC8363914] [DOI: 10.1001/jamanetworkopen.2021.20786]
Abstract
IMPORTANCE A high level of surgical skill is essential to prevent intraoperative problems. One important aspect of surgical education is surgical skill assessment, with pertinent feedback facilitating efficient skill acquisition by novices. OBJECTIVES To develop a 3-dimensional (3-D) convolutional neural network (CNN) model for automatic surgical skill assessment and to evaluate the performance of the model in classification tasks by using laparoscopic colorectal surgical videos. DESIGN, SETTING, AND PARTICIPANTS This prognostic study used surgical videos acquired prior to 2017. In total, 650 laparoscopic colorectal surgical videos were provided for study purposes by the Japan Society for Endoscopic Surgery, and 74 were randomly extracted. Every video had highly reliable scores based on the Endoscopic Surgical Skill Qualification System (ESSQS, range 1-100, with higher scores indicating greater surgical skill) established by the society. Data were analyzed June to December 2020. MAIN OUTCOMES AND MEASURES From the groups with scores less than the difference between the mean and 2 SDs, within the range spanning the mean and 1 SD, and greater than the sum of the mean and 2 SDs, 17, 26, and 31 videos, respectively, were randomly extracted. In total, 1480 video clips with a length of 40 seconds each were extracted for each surgical step (medial mobilization, lateral mobilization, inferior mesenteric artery transection, and mesorectal transection) and separated into 1184 training sets and 296 test sets. Automatic surgical skill classification was performed based on spatiotemporal video analysis using the fully automated 3-D CNN model, and classification accuracies and screening accuracies for the groups with scores less than the mean minus 2 SDs and greater than the mean plus 2 SDs were calculated. RESULTS The mean (SD) ESSQS score of all 650 intraoperative videos was 66.2 (8.6) points and for the 74 videos used in the study, 67.6 (16.1) points. The proposed 3-D CNN model automatically classified video clips into groups with scores less than the mean minus 2 SDs, within 1 SD of the mean, and greater than the mean plus 2 SDs with a mean (SD) accuracy of 75.0% (6.3%). The highest accuracy was 83.8% for the inferior mesenteric artery transection. The model also screened for the group with scores less than the mean minus 2 SDs with 94.1% sensitivity and 96.5% specificity and for group with greater than the mean plus 2 SDs with 87.1% sensitivity and 86.0% specificity. CONCLUSIONS AND RELEVANCE The results of this prognostic study showed that the proposed 3-D CNN model classified laparoscopic colorectal surgical videos with sufficient accuracy to be used for screening groups with scores greater than the mean plus 2 SDs and less than the mean minus 2 SDs. The proposed approach was fully automatic and easy to use for various types of surgery, and no special annotations or kinetics data extraction were required, indicating that this approach warrants further development for application to automatic surgical skill assessment.
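The score-based stratification used in the study can be made concrete with a small helper that applies the reported cohort mean (66.2) and SD (8.6) from the abstract; the function name and exact boundary handling are illustrative.

```python
def assign_skill_group(score: float, mean: float = 66.2, sd: float = 8.6) -> str:
    """Group an ESSQS score the way the study stratifies videos: below mean - 2 SD,
    within mean +/- 1 SD, or above mean + 2 SD (thresholds from the reported cohort)."""
    if score < mean - 2 * sd:
        return "low (< mean - 2 SD)"
    if score > mean + 2 * sd:
        return "high (> mean + 2 SD)"
    if abs(score - mean) <= sd:
        return "middle (within mean +/- 1 SD)"
    return "outside the three sampled groups"

print(assign_skill_group(45.0))  # -> low (< mean - 2 SD), since 45.0 < 66.2 - 2 * 8.6 = 49.0
```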
Affiliation(s)
- Daichi Kitaguchi
- Surgical Device Innovation Office, National Cancer Center Hospital East, Kashiwanoha, Kashiwa, Chiba, Japan
- Department of Colorectal Surgery, National Cancer Center Hospital East, Kashiwanoha, Kashiwa, Chiba, Japan
- Nobuyoshi Takeshita
- Surgical Device Innovation Office, National Cancer Center Hospital East, Kashiwanoha, Kashiwa, Chiba, Japan
- Department of Colorectal Surgery, National Cancer Center Hospital East, Kashiwanoha, Kashiwa, Chiba, Japan
- Hiroki Matsuzaki
- Surgical Device Innovation Office, National Cancer Center Hospital East, Kashiwanoha, Kashiwa, Chiba, Japan
- Takahiro Igaki
- Surgical Device Innovation Office, National Cancer Center Hospital East, Kashiwanoha, Kashiwa, Chiba, Japan
- Department of Colorectal Surgery, National Cancer Center Hospital East, Kashiwanoha, Kashiwa, Chiba, Japan
- Hiro Hasegawa
- Surgical Device Innovation Office, National Cancer Center Hospital East, Kashiwanoha, Kashiwa, Chiba, Japan
- Department of Colorectal Surgery, National Cancer Center Hospital East, Kashiwanoha, Kashiwa, Chiba, Japan
- Masaaki Ito
- Surgical Device Innovation Office, National Cancer Center Hospital East, Kashiwanoha, Kashiwa, Chiba, Japan
- Department of Colorectal Surgery, National Cancer Center Hospital East, Kashiwanoha, Kashiwa, Chiba, Japan
13. Liu Y, Wang K, Li G, Lin L. Semantics-Aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition. IEEE Trans Image Process 2021;30:5573-5588. [PMID: 34110991] [DOI: 10.1109/tip.2021.3086590]
Abstract
Existing vision-based action recognition is susceptible to occlusion and appearance variations, while wearable sensors can alleviate these challenges by capturing human motion with one-dimensional time-series signals (e.g. acceleration, gyroscope, and orientation). For the same action, the knowledge learned from vision sensors (videos or images) and wearable sensors, may be related and complementary. However, there exists a significantly large modality difference between action data captured by wearable-sensor and vision-sensor in data dimension, data distribution, and inherent information content. In this paper, we propose a novel framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in vision-sensor modality (videos) by adaptively transferring and distilling the knowledge from multiple wearable sensors. The SAKDN uses multiple wearable-sensors as teacher modalities and uses RGB videos as student modalities. To preserve the local temporal relationship and facilitate employing visual deep learning models, we transform one-dimensional time-series signals of wearable sensors to two-dimensional images by designing a gramian angular field based virtual image generation model. Then, we introduce a novel Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM) to adaptively fuse intermediate representation knowledge from different teacher networks. Finally, to fully exploit and transfer the knowledge of multiple well-trained teacher networks to the student network, we propose a novel Graph-guided Semantically Discriminative Mapping (GSDM) module, which utilizes graph-guided ablation analysis to produce a good visual explanation to highlight the important regions across modalities and concurrently preserve the interrelations of original data. Experimental results on Berkeley-MHAD, UTD-MHAD, and MMAct datasets well demonstrate the effectiveness of our proposed SAKDN for adaptive knowledge transfer from wearable-sensors modalities to vision-sensors modalities. The code is publicly available at https://github.com/YangLiu9208/SAKDN.
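The sensor-to-vision transfer rests on soft-target knowledge distillation; a minimal Hinton-style sketch is shown below, leaving out the SPAMFM fusion and GSDM mapping modules that the paper adds on top. Temperature, weighting, and class count are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 4.0,
                      alpha: float = 0.7) -> torch.Tensor:
    """Classic soft-target distillation: the student (video model) matches the
    teacher's (sensor model's) softened predictions, plus ordinary cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # rescale gradients by T^2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

loss = distillation_loss(torch.randn(8, 27), torch.randn(8, 27), torch.randint(0, 27, (8,)))
```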
14. Wang J, Cheng MM, Jiang J. Domain Shift Preservation for Zero-Shot Domain Adaptation. IEEE Trans Image Process 2021;30:5505-5517. [PMID: 34097610] [DOI: 10.1109/tip.2021.3084354]
Abstract
In learning-based image processing a model that is learned in one domain often performs poorly in another since the image samples originate from different sources and thus have different distributions. Domain adaptation techniques alleviate the problem of domain shift by learning transferable knowledge from the source domain to the target domain. Zero-shot domain adaptation (ZSDA) refers to a category of challenging tasks in which no target-domain sample for the task of interest is accessible for training. To address this challenge, we propose a simple but effective method that is based on the strategy of domain shift preservation across tasks. First, we learn the shift between the source domain and the target domain from an irrelevant task for which sufficient data samples from both domains are available. Then, we transfer the domain shift to the task of interest under the hypothesis that different tasks may share the domain shift for a specified pair of domains. Via this strategy, we can learn a model for the unseen target domain of the task of interest. Our method uses two coupled generative adversarial networks (CoGANs) to capture the joint distribution of data samples in dual-domains and another generative adversarial network (GAN) to explicitly model the domain shift. The experimental results on image classification and semantic segmentation demonstrate the satisfactory performance of our method in transferring various kinds of domain shifts across tasks.
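The underlying assumption, that a domain shift estimated on an irrelevant task can be re-applied to the task of interest, can be illustrated in its crudest form as a mean feature offset; the paper models the shift with coupled GANs rather than this simple translation, so treat the sketch as a statement of the hypothesis only.

```python
import torch

def transfer_domain_shift(irrelevant_src: torch.Tensor,
                          irrelevant_tgt: torch.Tensor,
                          interest_src: torch.Tensor) -> torch.Tensor:
    """Estimate the source-to-target shift on an irrelevant task (here, a mean
    feature offset) and re-apply it to the task of interest, for which no
    target-domain samples exist (crude illustration of shift preservation)."""
    shift = irrelevant_tgt.mean(dim=0) - irrelevant_src.mean(dim=0)
    return interest_src + shift   # synthesized stand-in for unseen target-domain features

synthetic_target = transfer_domain_shift(torch.randn(100, 64), torch.randn(100, 64), torch.randn(50, 64))
```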