1. Huang Y, Li L, Chen P, Wu H, Lin W, Shi G. Multi-Modality Multi-Attribute Contrastive Pre-Training for Image Aesthetics Computing. IEEE Transactions on Pattern Analysis and Machine Intelligence 2025;47:1205-1218. PMID: 39504278. DOI: 10.1109/tpami.2024.3492259.
Abstract
In the Image Aesthetics Computing (IAC) field, most prior methods leverage off-the-shelf backbones pre-trained on the large-scale ImageNet database. While these pre-trained backbones have achieved notable success, they tend to overemphasize object-level semantics and fail to capture the high-level concepts of image aesthetics, which can lead to suboptimal performance. To tackle this long-neglected problem, we propose a multi-modality multi-attribute contrastive pre-training framework that aims to provide an alternative to ImageNet-based pre-training for IAC. The proposed framework consists of two main components. 1) We build a multi-attribute image description database with human feedback, leveraging the image understanding capability of a multi-modality large language model to generate rich aesthetic descriptions. 2) To better adapt models to aesthetics computing tasks, we integrate image-based visual features with attribute-based text features and map the integrated features into different embedding spaces, on which multi-attribute contrastive learning is performed to obtain more comprehensive aesthetic representations. To alleviate the distribution shift encountered when transitioning from the general visual domain to the aesthetic domain, we further propose a semantic affinity loss that constrains content information and enhances model generalization. Extensive experiments demonstrate that the proposed framework sets a new state of the art for IAC tasks.
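The multi-attribute contrastive learning described above pairs visual features with attribute-specific text features in separate embedding spaces. A minimal NumPy sketch of that idea, assuming an InfoNCE-style symmetric loss; the attribute names, projection heads, and random features below are illustrative stand-ins, not the authors' implementation:

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of a row-wise softmax against integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image/text rows are positives,
    every other row in the batch acts as a negative."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    labels = np.arange(logits.shape[0])
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

def multi_attribute_loss(visual_feats, text_feats_per_attr, heads):
    """Project shared visual features into one embedding space per
    aesthetic attribute, then average the per-attribute contrastive losses."""
    losses = [info_nce(visual_feats @ heads[attr], txt)
              for attr, txt in text_feats_per_attr.items()]
    return float(np.mean(losses))

rng = np.random.default_rng(0)
batch, dim, emb = 8, 32, 16
visual = rng.normal(size=(batch, dim))
attrs = ["composition", "color", "lighting"]   # hypothetical attribute names
heads = {a: rng.normal(size=(dim, emb)) for a in attrs}
texts = {a: rng.normal(size=(batch, emb)) for a in attrs}
loss = multi_attribute_loss(visual, texts, heads)
```

Averaging the per-attribute losses is one plausible way to combine the embedding spaces; the paper's exact objective, including the semantic affinity loss, may differ.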
2. Conwell C, Graham D, Boccagno C, Vessel EA. The perceptual primacy of feeling: Affectless visual machines explain a majority of variance in human visually evoked affect. Proc Natl Acad Sci U S A 2025;122:e2306025121. PMID: 39847334. PMCID: PMC11789064. DOI: 10.1073/pnas.2306025121.
Abstract
Looking at the world often involves not just seeing things, but feeling things. Modern feedforward machine vision systems that learn to perceive the world in the absence of active physiology, deliberative thought, or any form of feedback resembling human affective experience offer tools to demystify the relationship between seeing and feeling, and to assess how much of visually evoked affective experience may be a straightforward function of representation learning over natural image statistics. In this work, we deploy a diverse sample of 180 state-of-the-art deep neural network models trained only on canonical computer vision tasks to predict human ratings of arousal, valence, and beauty for images from multiple categories (objects, faces, landscapes, art) across two datasets. Importantly, we use the features of these models without additional learning, linearly decoding human affective responses from network activity in much the same way neuroscientists decode information from neural recordings. Aggregate analysis across our survey demonstrates that predictions from purely perceptual models explain a majority of the explainable variance in average ratings of arousal, valence, and beauty alike. Finer-grained analysis within our survey (e.g., comparisons between shallower and deeper layers, or between randomly initialized, category-supervised, and self-supervised models) points to rich, preconceptual abstraction (learned from diversity of visual experience) as a key driver of these predictions. Taken together, these results provide further computational evidence for an information-processing account of visually evoked affect linked directly to efficient representation learning over natural image statistics, and hint at a computational locus of affective and aesthetic valuation immediately proximate to perception.
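The linear decoding step, reading affective ratings out of frozen network features with a fitted linear map, can be sketched as a closed-form ridge regression. The features and ratings below are synthetic stand-ins for real network activations and human judgments:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 50))        # stand-in for frozen DNN features
w_true = rng.normal(size=50)
ratings = feats @ w_true + 0.1 * rng.normal(size=200)  # noisy "human ratings"

w = ridge_fit(feats[:150], ratings[:150])  # fit readout on 150 images
pred = feats[150:] @ w                     # decode the 50 held-out images
ss_res = float(((ratings[150:] - pred) ** 2).sum())
ss_tot = float(((ratings[150:] - ratings[150:].mean()) ** 2).sum())
r2 = 1.0 - ss_res / ss_tot                 # explained variance on held-out data
```

Because the ratings here are constructed to be a linear function of the features, the held-out R-squared is high; with real affect data the decodable fraction is what the study estimates.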
Affiliation(s)
- Colin Conwell, Department of Psychology, Harvard University, Cambridge, MA 02139
- Daniel Graham, Department of Psychological Science, Hobart and William Smith Colleges
- Chelsea Boccagno, Department of Psychiatry, Massachusetts General Hospital, Boston, MA 02114; Department of Epidemiology, Harvard T.H. Chan School of Public Health
- Edward A. Vessel, Department of Psychology, City College, City University of New York, New York, NY 10031
3. Levering A, Marcos D, Jacobs N, Tuia D. Prompt-guided and multimodal landscape scenicness assessments with vision-language models. PLoS One 2024;19:e0307083. PMID: 39348404. PMCID: PMC11441650. DOI: 10.1371/journal.pone.0307083.
Abstract
Recent advances in deep learning and Vision-Language Models (VLMs) enable efficient transfer to downstream tasks even when little labelled training data is available, and allow text to be compared directly with image content. These properties of VLMs open new opportunities for the annotation and analysis of images. We test the potential of VLMs for landscape scenicness prediction, i.e., assessing the aesthetic quality of a landscape, using zero- and few-shot methods. We experiment with few-shot learning by fine-tuning a single linear layer on a pre-trained VLM representation. We find that a model fitted to just a few hundred samples performs favourably compared to a model trained on hundreds of thousands of examples in a fully supervised way. We also explore the zero-shot prediction potential of contrastive prompting using positive and negative landscape aesthetic concepts. Our results show that this method outperforms a few-shot linear probe when only a small number of samples is available to tune the prompt configuration. We introduce Landscape Prompt Ensembling (LPE), an annotation method that acquires landscape scenicness ratings through rated text descriptions, without needing an image dataset during annotation. We demonstrate that LPE provides landscape scenicness assessments concordant with a dataset of image ratings. The success of zero- and few-shot methods, combined with their ability to use text-based annotations, highlights the potential of VLMs to provide efficient and flexible landscape scenicness assessments.
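The contrastive prompting idea, scoring an image by its similarity to positive versus negative aesthetic concepts, can be sketched as follows. The 4-dimensional "embeddings" and the prompt wordings in the comments are invented; a real system would use a VLM's image and text encoders:

```python
import numpy as np

def contrastive_prompt_score(img_emb, pos_prompts, neg_prompts):
    """Zero-shot scenicness score: mean cosine similarity to 'scenic'
    prompt embeddings minus mean similarity to 'unscenic' ones,
    with all embeddings L2-normalised first."""
    norm = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
    img = norm(img_emb)
    pos = float((norm(pos_prompts) @ img).mean())
    neg = float((norm(neg_prompts) @ img).mean())
    return pos - neg

# Toy 4-d embedding space. In practice these vectors would be encoder
# outputs for prompts like "a breathtaking landscape" (positive) and
# "a dull industrial site" (negative); the wordings are illustrative.
scenic_prompts = np.array([[1.0, 0.2, 0.0, 0.0], [0.9, 0.0, 0.1, 0.0]])
dull_prompts = np.array([[0.0, 0.0, 1.0, 0.3], [0.1, 0.0, 0.9, 0.2]])
mountain_img = np.array([0.95, 0.10, 0.05, 0.0])  # resembles scenic prompts
parking_img = np.array([0.05, 0.0, 0.95, 0.1])    # resembles dull prompts
```

A higher score then stands in for a more scenic landscape; ensembling many rated prompts, as in LPE, would average such scores over the prompt set.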
Affiliation(s)
- Alex Levering, Laboratory of Geo-Information Science and Remote Sensing, Wageningen University, Wageningen, the Netherlands; Instituut voor Milieuvraagstukken, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
- Diego Marcos, Inria, Université de Montpellier, Montpellier, France
- Nathan Jacobs, McKelvey School of Engineering, Washington University in St. Louis, St. Louis, MO, United States of America
- Devis Tuia, Ecole Polytechnique Fédérale de Lausanne, Environmental Computational Science and Earth Observation Laboratory, Sion, Switzerland
4. Fabijan A, Zawadzka-Fabijan A, Fabijan R, Zakrzewski K, Nowosławska E, Polis B. Artificial Intelligence in Medical Imaging: Analyzing the Performance of ChatGPT and Microsoft Bing in Scoliosis Detection and Cobb Angle Assessment. Diagnostics (Basel) 2024;14:773. PMID: 38611686. PMCID: PMC11011528. DOI: 10.3390/diagnostics14070773.
Abstract
Open-source artificial intelligence models (OSAIMs) are freely applied across industries, including information technology and medicine. Their clinical potential, especially in supporting diagnosis and therapy, is the subject of increasingly intensive research. Given the growing interest in artificial intelligence (AI) for diagnostic purposes, we conducted a study evaluating the capabilities of AI models, namely ChatGPT and Microsoft Bing, in diagnosing single-curve scoliosis from posturographic radiological images. Two independent neurosurgeons assessed the degree of spinal deformation, selecting 23 cases of severe single-curve scoliosis. Each posturographic image was submitted separately to each platform with a set of formulated questions, starting with 'What do you see in the image?' and ending with a request to determine the Cobb angle. In the responses, we focused on how these AI models identify and interpret spinal deformations, and how accurately they recognise the direction and type of scoliosis as well as vertebral rotation. The intraclass correlation coefficient (ICC) with a two-way model was used to assess the consistency of Cobb angle measurements, with confidence intervals determined using the F test. Differences in Cobb angle measurements between human assessments and ChatGPT were analysed using metrics such as RMSEA, MSE, MPE, MAE, RMSLE, and MAPE, allowing a comprehensive assessment of model performance from various statistical perspectives. ChatGPT achieved 100% effectiveness in detecting scoliosis in the X-ray images, whereas the Bing model detected no scoliosis at all. However, ChatGPT showed limited effectiveness (43.5%) in assessing Cobb angles, with significant inaccuracy and discrepancy compared to human assessments. It also had limited accuracy in determining the direction of spinal curvature, classifying the type of scoliosis, and detecting vertebral rotation. Overall, although ChatGPT demonstrated potential in detecting scoliosis, its abilities in assessing Cobb angles and other parameters were limited and inconsistent with expert assessments. These results underscore the need for comprehensive improvement of AI algorithms, including broader training on diverse X-ray images and advanced image processing techniques, before they can be considered auxiliary tools for specialists diagnosing scoliosis.
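Two of the agreement statistics used above, the two-way ICC for a single rater (here in the common ICC(2,1) form) and MAE, can be sketched from first principles. The two-rater Cobb angle measurements below are invented for illustration:

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` is an (n subjects x k raters) array."""
    n, k = ratings.shape
    grand = ratings.mean()
    rows = ratings.mean(axis=1, keepdims=True)     # per-subject means
    cols = ratings.mean(axis=0, keepdims=True)     # per-rater means
    msr = k * ((rows - grand) ** 2).sum() / (n - 1)        # between subjects
    msc = n * ((cols - grand) ** 2).sum() / (k - 1)        # between raters
    mse = ((ratings - rows - cols + grand) ** 2).sum() / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def mae(a, b):
    """Mean absolute error between two sets of measurements."""
    return float(np.abs(np.asarray(a) - np.asarray(b)).mean())

true_angles = np.linspace(15.0, 85.0, 20)   # invented Cobb angles (degrees)
rater_a = true_angles
rater_b = true_angles + 1.0                 # second rater with a 1-degree bias
icc = icc2_1(np.column_stack([rater_a, rater_b]))
```

With a small systematic bias and widely spread subjects, the ICC stays close to 1 while the MAE reports the bias directly, which is why the study combines both kinds of metric.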
Affiliation(s)
- Artur Fabijan, Department of Neurosurgery, Polish-Mother’s Memorial Hospital Research Institute, 93-338 Lodz, Poland
- Agnieszka Zawadzka-Fabijan, Department of Rehabilitation Medicine, Faculty of Health Sciences, Medical University of Lodz, 90-419 Lodz, Poland
- Krzysztof Zakrzewski, Department of Neurosurgery, Polish-Mother’s Memorial Hospital Research Institute, 93-338 Lodz, Poland
- Emilia Nowosławska, Department of Neurosurgery, Polish-Mother’s Memorial Hospital Research Institute, 93-338 Lodz, Poland
- Bartosz Polis, Department of Neurosurgery, Polish-Mother’s Memorial Hospital Research Institute, 93-338 Lodz, Poland
|
5
|
Fabijan A, Polis B, Fabijan R, Zakrzewski K, Nowosławska E, Zawadzka-Fabijan A. Artificial Intelligence in Scoliosis Classification: An Investigation of Language-Based Models. J Pers Med 2023; 13:1695. [PMID: 38138922 PMCID: PMC10744696 DOI: 10.3390/jpm13121695] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Revised: 12/03/2023] [Accepted: 12/07/2023] [Indexed: 12/24/2023] Open
Abstract
Open-source artificial intelligence models are freely applied across industries, including computer science and medicine. Their clinical potential, especially in assisting diagnosis and therapy, is the subject of increasingly intensive research. Given the growing interest in AI for diagnostics, we conducted a study evaluating the abilities of AI models, including ChatGPT, Microsoft Bing, and Scholar AI, to classify single-curve scoliosis from radiological descriptions. Fifty-six posturographic images depicting single-curve scoliosis were selected and assessed by two independent neurosurgery specialists, who classified them as mild, moderate, or severe based on Cobb angles. Descriptions accurately characterising the degree of spinal deformation were then developed from the measured Cobb angles and provided to the AI language models to assess their proficiency in diagnosing spinal pathologies. The models performed the classification using the provided data. Our study also sought to identify the specific information sources and criteria applied in the models' decision-making, aiming for a deeper understanding of the determinants influencing AI decision processes in scoliosis classification. Classification quality was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and balanced accuracy. The results strongly supported our hypothesis: among the four AI models, ChatGPT 4 and Scholar AI Premium excelled in classifying single-curve scoliosis, with perfect sensitivity and specificity, unmatched rater concordance, and excellent performance metrics. Comparing real and AI-generated scoliosis classifications, these models were precise on all posturographic images, achieving total accuracy (1.0, MAE = 0.0) and perfect inter-rater agreement (Fleiss' kappa = 1.0), consistently across cases with Cobb angles ranging from 11 to 92 degrees. Despite this high classification accuracy, each model used an incorrect angular range for the mild stage of scoliosis. Our findings highlight the immense potential of AI for analysing medical datasets; however, the varying competencies of the models indicate the need for further development before they can effectively meet specific needs in clinical practice.
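The performance metrics reported above follow directly from binary confusion-matrix counts. A short sketch, with invented counts for a "severe vs. not severe" decision:

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, NPV, accuracy, and balanced accuracy
    from binary confusion-matrix counts."""
    sens = tp / (tp + fn)                 # recall on true positives
    spec = tn / (tn + fp)                 # recall on true negatives
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),            # positive predictive value
        "npv": tn / (tn + fn),            # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "balanced_accuracy": (sens + spec) / 2,
    }

m = binary_metrics(tp=8, fp=1, tn=9, fn=2)   # hypothetical counts
```

A classifier with no false positives or false negatives, like the best models in the study, yields 1.0 on every one of these metrics.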
Affiliation(s)
- Artur Fabijan, Department of Neurosurgery, Polish-Mother’s Memorial Hospital Research Institute, 93-338 Lodz, Poland
- Bartosz Polis, Department of Neurosurgery, Polish-Mother’s Memorial Hospital Research Institute, 93-338 Lodz, Poland
- Krzysztof Zakrzewski, Department of Neurosurgery, Polish-Mother’s Memorial Hospital Research Institute, 93-338 Lodz, Poland
- Emilia Nowosławska, Department of Neurosurgery, Polish-Mother’s Memorial Hospital Research Institute, 93-338 Lodz, Poland
- Agnieszka Zawadzka-Fabijan, Department of Rehabilitation Medicine, Faculty of Health Sciences, Medical University of Lodz, 90-419 Lodz, Poland
|
6
|
Music A, Maerten AS, Wagemans J. Beautification of images by generative adversarial networks. J Vis 2023; 23:14. [PMID: 37733338 PMCID: PMC10528684 DOI: 10.1167/jov.23.10.14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2022] [Accepted: 08/14/2023] [Indexed: 09/22/2023] Open
Abstract
Finding the properties underlying beauty has always been a prominent yet difficult problem. New technological developments, however, have often aided scientific progress by expanding the scientist's toolkit, and currently in the spotlight of cognitive neuroscience and vision science are deep neural networks. In this study, we used a generative adversarial network (GAN) to generate images of increasing aesthetic value. We validated that this network was indeed able to increase the aesthetic value of an image by letting participants decide which of two presented images they considered more beautiful. Since the validation was successful, we were justified in using the generated images to extract low- and mid-level features contributing to their aesthetic value. We compared the brightness, contrast, sharpness, saturation, symmetry, colorfulness, and visual complexity of "low-aesthetic" images to those of "high-aesthetic" images. All of these features increased for the beautiful images, implying that they may play an important role in the aesthetic value of an image. This study provides further evidence for the potential value of GANs for research on beauty.
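Several of the low-level features compared above can be computed directly from pixel arrays. A sketch using mean intensity for brightness, RMS (standard-deviation) contrast, and the Hasler-Süsstrunk colorfulness metric; these are standard formulations and not necessarily the exact ones the authors used:

```python
import numpy as np

def brightness(img):
    """Mean pixel intensity of an RGB array with values in [0, 1]."""
    return float(img.mean())

def rms_contrast(img):
    """RMS contrast: standard deviation of the grayscale image."""
    gray = img.mean(axis=-1)
    return float(gray.std())

def colorfulness(img):
    """Hasler & Suesstrunk colorfulness: combines the spread and mean of
    the opponent channels rg = R - G and yb = 0.5(R + G) - B."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    rg = r - g
    yb = 0.5 * (r + g) - b
    return float(np.sqrt(rg.std() ** 2 + yb.std() ** 2)
                 + 0.3 * np.sqrt(rg.mean() ** 2 + yb.mean() ** 2))

rng = np.random.default_rng(1)
flat_gray = np.full((16, 16, 3), 0.5)   # featureless gray patch
vivid = rng.random((16, 16, 3))         # noisy, colorful patch
```

A uniform gray patch scores zero on contrast and colorfulness, so differences in these features between "low-aesthetic" and "high-aesthetic" images are straightforward to quantify.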
Affiliation(s)
- Amar Music, Department of Brain and Cognition, KU Leuven, Leuven, Belgium
- Johan Wagemans, Department of Brain and Cognition, KU Leuven, Leuven, Belgium
|
7
|
Fabijan A, Fabijan R, Zawadzka-Fabijan A, Nowosławska E, Zakrzewski K, Polis B. Evaluating Scoliosis Severity Based on Posturographic X-ray Images Using a Contrastive Language-Image Pretraining Model. Diagnostics (Basel) 2023; 13:2142. [PMID: 37443536 DOI: 10.3390/diagnostics13132142] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2023] [Revised: 06/10/2023] [Accepted: 06/20/2023] [Indexed: 07/15/2023] Open
Abstract
Assessing severe scoliosis requires the analysis of posturographic X-ray images. One way to analyse these images is with open-source artificial intelligence models (OSAIMs), such as the contrastive language-image pretraining (CLIP) system, which was designed to pair images with text. This study aims to determine whether the CLIP model can recognise visible severe scoliosis in posturographic X-ray images. We used 23 posturographic images of patients diagnosed with severe scoliosis, evaluated by two independent neurosurgery specialists. The X-ray images were then input into the CLIP system and subjected to a series of questions of varying difficulty and comprehension. The CLIP model's predictions, expressed as probabilities between 0 and 1, were compared with the actual data. To evaluate the quality of image recognition, true positives, false negatives, and sensitivity were determined. The results show that the CLIP system can perform a basic assessment of X-ray images showing visible severe scoliosis with high sensitivity. It can be assumed that, in the future, OSAIMs dedicated to image analysis may become commonly used to assess X-ray images, including those of scoliosis.
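CLIP-style recognition reduces to a softmax over scaled cosine similarities between an image embedding and candidate text embeddings. The 3-d vectors below are synthetic stand-ins for CLIP encoder outputs, and the prompt wordings in the comment are illustrative, not the study's exact questions:

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """CLIP-style zero-shot classification: softmax over scaled cosine
    similarities between one image embedding and candidate prompts."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (txt @ img)      # one logit per candidate prompt
    logits = logits - logits.max()          # numerically stable softmax
    exp = np.exp(logits)
    return exp / exp.sum()

# Stand-ins for encodings of prompts such as "an X-ray showing severe
# scoliosis" versus "an X-ray of a straight spine".
prompts = np.array([[0.9, 0.1, 0.0],
                    [0.0, 0.2, 0.9]])
xray = np.array([0.85, 0.15, 0.05])         # closer to the first prompt
probs = zero_shot_probs(xray, prompts)
```

The resulting probabilities in [0, 1] are what the study compares against the specialists' ground truth to count true positives and false negatives.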
Affiliation(s)
- Artur Fabijan, Department of Neurosurgery, Polish-Mother's Memorial Hospital Research Institute, 93-338 Lodz, Poland
- Emilia Nowosławska, Department of Neurosurgery, Polish-Mother's Memorial Hospital Research Institute, 93-338 Lodz, Poland
- Krzysztof Zakrzewski, Department of Neurosurgery, Polish-Mother's Memorial Hospital Research Institute, 93-338 Lodz, Poland
- Bartosz Polis, Department of Neurosurgery, Polish-Mother's Memorial Hospital Research Institute, 93-338 Lodz, Poland