1. Liao H, Yuan J, Liu C, Zhang J, Yang Y, Liang H, Liu H, Chen S, Li Y. One novel transfer learning-based CLIP model combined with self-attention mechanism for differentiating the tumor-stroma ratio in pancreatic ductal adenocarcinoma. La Radiologia Medica 2024;129:1559-1574. PMID: 39412688. DOI: 10.1007/s11547-024-01902-y.
Abstract
PURPOSE To develop a contrastive language-image pretraining (CLIP) model based on transfer learning and combined with a self-attention mechanism to predict the tumor-stroma ratio (TSR) of pancreatic ductal adenocarcinoma (PDAC) on preoperative contrast-enhanced CT images, in order to characterize tumor biology for risk stratification and to guide feature fusion in artificial intelligence-based models. MATERIALS AND METHODS This retrospective study collected 207 PDAC patients from three hospitals. TSR was assessed on surgical specimens by pathologists, and patients were divided into high-TSR and low-TSR groups. A novel CLIP-adapter model was developed that integrates the CLIP paradigm with a self-attention mechanism to better exploit features from multi-phase imaging, thereby improving the accuracy and reliability of TSR prediction. In addition, clinical-variable, traditional radiomics, and deep learning models (ResNet50, ResNet101, ViT_Base_32, ViT_Base_16) were constructed for comparison. RESULTS The models showed significant efficacy in predicting TSR in PDAC. The CLIP-adapter model based on multi-phase feature fusion outperformed models based on any single phase (arterial or venous). The CLIP-adapter model also outperformed the traditional radiomics and deep learning models, with CLIP-adapter_ViT_Base_32 performing best, achieving the highest AUC (0.978) and accuracy (0.921) in the test set. Kaplan-Meier survival analysis showed longer overall survival in patients with low TSR than in those with high TSR. CONCLUSION The CLIP-adapter model designed in this study provides a safe and accurate method for predicting the TSR in PDAC. The feature fusion module based on multi-modal (image and text) and multi-phase (arterial and venous) information significantly improves model performance.
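As a rough illustration of the architecture described in this abstract, the following is a minimal sketch of a CLIP-style adapter with self-attention fusion over two contrast phases and a text embedding. The module names, dimensions, and fusion order are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a CLIP-adapter with self-attention fusion over two CT phases.
# Module names, dimensions, and the fusion order are illustrative assumptions only.
import torch
import torch.nn as nn

class PhaseFusionAdapter(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8, num_classes=2):
        super().__init__()
        # Lightweight adapter applied to frozen CLIP image features of each phase
        self.adapter = nn.Sequential(
            nn.Linear(embed_dim, embed_dim // 4), nn.ReLU(),
            nn.Linear(embed_dim // 4, embed_dim),
        )
        # Self-attention over the (arterial, venous, text) token sequence
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(embed_dim, num_classes)  # high vs. low TSR

    def forward(self, arterial_feat, venous_feat, text_feat):
        # arterial_feat, venous_feat, text_feat: (B, embed_dim) from frozen CLIP encoders
        tokens = torch.stack(
            [self.adapter(arterial_feat), self.adapter(venous_feat), text_feat], dim=1
        )  # (B, 3, embed_dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.head(fused.mean(dim=1))  # TSR logits

# Example usage with random features standing in for CLIP encoder outputs
model = PhaseFusionAdapter()
a, v, t = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512)
logits = model(a, v, t)  # (4, 2)
```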
Affiliation(s)
- Hongfan Liao
- Department of Radiology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Jiang Yuan
- College of Computer and Information Science, Southwest University, Chongqing, 400715, China
- Chunhua Liu
- Department of Radiology, Daping Hospital, Army Medical University, Chongqing, China
- Jiao Zhang
- Department of Radiology, The Third Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Yaying Yang
- Department of Pathology, Molecular Medicine and Cancer Research Center, Chongqing Medical University, Chongqing, 400016, China
- Hongwei Liang
- Department of Radiology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Haotian Liu
- Department of Radiology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Shanxiong Chen
- College of Computer and Information Science, Southwest University, Chongqing, 400715, China
- Yongmei Li
- Department of Radiology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
2. Peng Y, Fang J, Li B. Collaborate decision network based on cross-modal attention for social media microblog recognition. Sci Rep 2024;14:25673. PMID: 39465342. PMCID: PMC11514264. DOI: 10.1038/s41598-024-77025-1.
Abstract
Recently, social media has become an important topic in the field of public opinion analysis. Among social media platforms, the microblog is one of the most important because it is short, convenient, mobile, and instantaneous. Social media microblog recognition reflects the attitudes, positive or negative, of a very large population toward a specific incident, and can be used for deriving competitive intelligence, shaping marketing strategies, detecting depression, and so on. However, existing methods usually use only the text or the image from the internet and do not take advantage of their complementary information, which limits the performance and robustness of the algorithms. In this paper, we present a collaborate decision network (CDN) based on cross-modal attention that deeply exploits the discriminative attributes of multiple modalities through a jointly data- and knowledge-driven strategy, further improving recognition performance. In addition, we collect and construct a visual-text microblog recognition dataset with 2854 samples to support subsequent research in related fields. Finally, experimental results on the collected dataset show the effectiveness and superiority of the proposed CDN.
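To make the cross-modal decision idea concrete, here is a minimal sketch of bidirectional cross-modal attention between text tokens and image regions, followed by per-modality classifiers whose logits are combined. The class names, dimensions, and the averaging rule are assumptions, not the paper's CDN.

```python
# Hypothetical sketch of cross-modal attention fusion for joint text-image
# microblog classification; names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalDecision(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_classes=2):
        super().__init__()
        # Text tokens attend to image regions, and image regions attend to text tokens
        self.text_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Each branch makes its own decision; the logits are then combined
        self.text_head = nn.Linear(dim, num_classes)
        self.image_head = nn.Linear(dim, num_classes)

    def forward(self, text_tokens, image_regions):
        # text_tokens: (B, Lt, dim); image_regions: (B, Lv, dim)
        t_att, _ = self.text_to_image(text_tokens, image_regions, image_regions)
        v_att, _ = self.image_to_text(image_regions, text_tokens, text_tokens)
        # Collaborative decision: average the per-modality logits
        return 0.5 * (self.text_head(t_att.mean(1)) + self.image_head(v_att.mean(1)))

# Example usage with random embeddings standing in for text and image encoders
model = CrossModalDecision()
logits = model(torch.randn(2, 20, 256), torch.randn(2, 49, 256))  # (2, 2)
```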
Affiliation(s)
- Yuxiang Peng
- School of Economics and Management, Xi'an University of Technology, Xi'an, 710048, China
- Jie Fang
- School of Telecommunication and Information Engineering, Xi'an University of Posts and Telecommunication, Xi'an, 710121, China
- Bingxiang Li
- School of Economics and Management, Xi'an University of Technology, Xi'an, 710048, China
3. Xuan S, Yang M, Zhang S. Adapting Vision-Language Models via Learning to Inject Knowledge. IEEE Transactions on Image Processing 2024;33:5798-5809. PMID: 39356597. DOI: 10.1109/tip.2024.3468884.
Abstract
Pre-trained vision-language models (VLMs) such as CLIP have demonstrated impressive zero-shot performance on various vision tasks. Trained on millions or even billions of image-text pairs, the text encoder has memorized a substantial amount of appearance knowledge. Such knowledge in a VLM is usually leveraged by learning specific task-oriented prompts, which may limit its performance on unseen tasks. This paper proposes a new knowledge injection framework to pursue a generalizable adaptation of VLMs to downstream vision tasks. Instead of learning task-specific prompts, we extract task-agnostic knowledge features and insert them into the features of input images or texts. The fused features hence gain better discriminative capability and robustness to intra-category variance. These knowledge features are generated by feeding learnable prompt sentences into the text encoder of the VLM and extracting its multi-layer features. A new knowledge injection module (KIM) is proposed to refine text features or visual features using the knowledge features. This knowledge injection framework enables both modalities to benefit from the rich knowledge memorized in the text encoder. Experiments show that our method outperforms recently proposed methods under few-shot learning, base-to-new class generalization, cross-dataset transfer, and domain generalization settings. For instance, it outperforms CoOp by 4.5% under the few-shot learning setting and CoCoOp by 4.4% under the base-to-new class generalization setting. Our code will be released.
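For orientation, the following is a minimal sketch of one way knowledge features could be injected into image or text features via cross-attention with a residual connection. The learnable knowledge tokens here merely stand in for the multi-layer text-encoder features the abstract describes; the exact formulation of KIM is not reproduced.

```python
# Hypothetical sketch of injecting task-agnostic "knowledge" features into
# image/text features; the cross-attention formulation is an assumption,
# not the paper's exact knowledge injection module (KIM).
import torch
import torch.nn as nn

class KnowledgeInjection(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_knowledge_tokens=16):
        super().__init__()
        # Learnable tokens stand in for knowledge features extracted from the
        # VLM text encoder fed with learnable prompt sentences
        self.knowledge = nn.Parameter(torch.randn(num_knowledge_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, features):
        # features: (B, L, dim) image or text features to be refined
        k = self.knowledge.unsqueeze(0).expand(features.size(0), -1, -1)
        refined, _ = self.cross_attn(features, k, k)  # features query the knowledge
        return self.norm(features + refined)          # residual injection

# Example usage on ViT-style patch features
kim = KnowledgeInjection()
out = kim(torch.randn(4, 197, 512))  # refined features, (4, 197, 512)
```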
4. Guo J, Qi L, Shi Y, Gao Y. SETA: Semantic-Aware Edge-Guided Token Augmentation for Domain Generalization. IEEE Transactions on Image Processing 2024;33:5622-5636. PMID: 39365722. DOI: 10.1109/tip.2024.3470517.
Abstract
Domain generalization (DG) aims to enhance model robustness against domain shifts without accessing target domains. A prevalent category of DG methods is data augmentation, which focuses on generating virtual samples to simulate domain shifts. However, existing augmentation techniques in DG are mainly tailored to convolutional neural networks (CNNs), with limited exploration in token-based architectures, i.e., vision transformer (ViT) and multi-layer perceptron (MLP) models. In this paper, we study the impact of prior CNN-based augmentation methods on token-based models, revealing that their performance is suboptimal because they do not incentivize the model to learn holistic shape information. To tackle this issue, we propose the Semantic-aware Edge-guided Token Augmentation (SETA) method. SETA transforms token features by perturbing local edge cues while preserving global shape features, thereby enhancing the model's learning of shape information. To further enhance the generalization ability of the model, we introduce two stylized variants of our method that combine it with two state-of-the-art (SOTA) style augmentation methods in DG. We provide theoretical insight into our method, demonstrating its effectiveness in reducing the generalization risk bound. Comprehensive experiments on five benchmarks show that our method achieves SOTA performance across various ViT and MLP architectures. Our code is available at https://github.com/lingeringlight/SETA.
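To give a feel for edge-guided token augmentation, here is a deliberately simplified sketch: tokens with the strongest local edge responses are swapped with tokens from other samples in the batch, while the remaining (shape-carrying) tokens are left untouched. The edge score, selection ratio, and mixing rule are assumptions and not the SETA algorithm itself.

```python
# Hypothetical, simplified sketch of edge-guided token augmentation; the edge
# score, ratio, and cross-sample swap are illustrative assumptions only.
import torch

def edge_guided_token_mix(tokens, edge_scores, ratio=0.3):
    """tokens: (B, L, D) patch tokens; edge_scores: (B, L) per-token edge strength."""
    B, L, D = tokens.shape
    k = max(1, int(L * ratio))
    # Indices of the k most edge-dominant tokens in each sample
    idx = edge_scores.topk(k, dim=1).indices                      # (B, k)
    perm = torch.randperm(B)                                      # shuffle the batch
    augmented = tokens.clone()
    batch_idx = torch.arange(B).unsqueeze(1).expand(-1, k)
    # Replace edge tokens with the corresponding tokens from another sample,
    # perturbing local edge cues while leaving the other tokens untouched
    augmented[batch_idx, idx] = tokens[perm][batch_idx, idx]
    return augmented

# Example usage on ViT-style patch tokens
tokens = torch.randn(8, 196, 384)
edge_scores = torch.rand(8, 196)
aug = edge_guided_token_mix(tokens, edge_scores)  # (8, 196, 384)
```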
5. Irfan B, Kuoppamäki S, Skantze G. Recommendations for designing conversational companion robots with older adults through foundation models. Front Robot AI 2024;11:1363713. PMID: 38860032. PMCID: PMC11163135. DOI: 10.3389/frobt.2024.1363713.
Abstract
Companion robots aim to mitigate loneliness and social isolation among older adults by providing social and emotional support in their everyday lives. However, older adults' expectations of conversational companionship might differ substantially from what current technologies can achieve, as well as from those of other age groups such as young adults. Thus, it is crucial to involve older adults in the development of conversational companion robots to ensure that these devices align with their unique expectations and experiences. Recent advances in foundation models, such as large language models, have taken a significant stride toward fulfilling those expectations, in contrast to prior literature that relied on humans controlling robots (i.e., Wizard of Oz) or on limited rule-based architectures that are not feasible to apply in the daily lives of older adults. Consequently, we conducted a participatory design (co-design) study with 28 older adults, demonstrating a companion robot using a large language model (LLM) together with design scenarios that represent situations from everyday life. The thematic analysis of the discussions around these scenarios shows that older adults expect a conversational companion robot to engage in conversation actively in isolation and passively in social settings, remember previous conversations and personalize interactions, protect privacy and provide control over learned data, give information and daily reminders, foster social skills and connections, and express empathy and emotions. Based on these findings, this article provides actionable recommendations for designing conversational companion robots for older adults with foundation models, such as LLMs and vision-language models, which can also be applied to conversational robots in other domains.
Affiliation(s)
- Bahar Irfan
- Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
- Sanna Kuoppamäki
- Division of Health Informatics and Logistics, KTH Royal Institute of Technology, Stockholm, Sweden
- Gabriel Skantze
- Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
6. Zhang L, Yan L, Li S, Li S. SASFF: A Video Synthesis Algorithm for Unstructured Array Cameras Based on Symmetric Auto-Encoding and Scale Feature Fusion. Sensors (Basel) 2023;24:5. PMID: 38202869. PMCID: PMC10780834. DOI: 10.3390/s24010005.
Abstract
To synthesize ultra-large-scene, ultra-high-resolution videos, high-quality video stitching and fusion are achieved through multi-scale unstructured array cameras. This paper proposes a network-based image feature point extraction algorithm built on symmetric auto-encoding and scale feature fusion. Using the principle of symmetric auto-encoding, hierarchical restoration of image feature location information is incorporated into the corresponding scale features, together with depthwise-separable-convolution feature extraction, which not only improves feature point detection performance but also significantly reduces the computational complexity of the network model. Based on the resulting high-precision feature point correspondences, a new image localization method is proposed based on area ratios and homography matrix scaling, which improves the speed and accuracy of scale alignment and positioning across the array camera images, realizes high-definition perception of local details in large scenes, and yields clearer large-scene synthesis and higher-quality stitched images. The proposed feature point extraction algorithm was compared with four typical algorithms on the HPatches dataset: feature point detection performance improved by an average of 4.9%, homography estimation performance improved by an average of 2.5%, the amount of computation was reduced by 18%, and the number of network model parameters was reduced by 47%. The synthesis of billion-pixel videos was also achieved, demonstrating practicality and robustness.
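As a rough illustration of the symmetric auto-encoding idea, the following is a minimal sketch of an encoder-decoder keypoint detector with depthwise-separable convolutions and a symmetric skip connection that restores location information at the matching scale. Layer counts, channel widths, and the heatmap head are assumptions, not the paper's network.

```python
# Hypothetical sketch of a symmetric encoder-decoder keypoint detector using
# depthwise-separable convolutions; layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

def ds_conv(cin, cout):
    # Depthwise-separable convolution: per-channel spatial filter + 1x1 pointwise mix
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, padding=1, groups=cin),
        nn.Conv2d(cin, cout, 1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class SymmetricKeypointNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = ds_conv(1, 32), ds_conv(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec2, self.dec1 = ds_conv(64, 32), ds_conv(32, 32)
        self.head = nn.Conv2d(32, 1, 1)  # keypoint probability heatmap

    def forward(self, x):
        e1 = self.enc1(x)                 # full resolution
        e2 = self.enc2(self.pool(e1))     # half resolution
        d2 = self.dec2(self.up(e2)) + e1  # symmetric skip restores location cues
        return torch.sigmoid(self.head(self.dec1(d2)))

# Example usage on a grayscale frame
net = SymmetricKeypointNet()
heatmap = net(torch.randn(1, 1, 256, 256))  # (1, 1, 256, 256)
```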
Affiliation(s)
- Linliang Zhang
- School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
- Shanxi Intelligent Transportation Institute Co., Ltd., Taiyuan 030036, China
- Lianshan Yan
- School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
- Shuo Li
- School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
- Saifei Li
- School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China