1. Liu K, Moon S. Dynamic Parallel Pyramid Networks for Scene Recognition. IEEE Transactions on Neural Networks and Learning Systems 2023;34:6591-6601. PMID: 34882564. DOI: 10.1109/tnnls.2021.3129227.
Abstract
Scene recognition is considered a challenging task in image recognition, mainly due to the presence of multiscale information from the global layout and local objects in a given scene. Recent convolutional neural networks (CNNs) that can learn multiscale features have achieved remarkable progress in scene recognition. However, they have two limitations: 1) the receptive field (RF) size is fixed even though a scene may exhibit large-scale variations, and 2) they are compute- and memory-intensive, partly due to the way multiple scales are represented. To address these limitations, we propose a lightweight dynamic scene recognition approach based on a novel architectural unit, the dynamic parallel pyramid (DPP) block, which adaptively selects the RF size based on the multiscale information carried along the channel dimension of the input. We encode multiscale features by applying convolutional (CONV) kernels of different sizes to different channels of the input tensor and then dynamically merge their outputs using a group attention mechanism followed by channel shuffling to generate the parallel feature pyramid. DPP can easily be incorporated into existing CNNs to build new deep models, called DPP networks (DPP-Nets). Extensive experiments on the large-scale scene image datasets Places365-Standard, Places365-Challenge, MIT Indoor67, and SUN397 confirmed that the proposed method provides a significant performance improvement over current state-of-the-art (SOTA) approaches. We also verified its general applicability with compelling results for the lightweight models MobileNetV2 and ShuffleNetV2 on ImageNet-1k and on the object-centric benchmarks CIFAR-10 and CIFAR-100.
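As a concrete illustration of the mechanism this abstract describes, a minimal sketch of a DPP-style block follows: the input channels are split into groups, each group is convolved with a different kernel size, the branch outputs are reweighted by a learned group attention, and the channels are shuffled. The class name `DPPBlockSketch`, the kernel sizes, and the softmax attention head are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a DPP-style block (assumptions, not the paper's code).
import torch
import torch.nn as nn

class DPPBlockSketch(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        self.groups = len(kernel_sizes)
        gc = channels // self.groups  # channels per pyramid branch
        self.branches = nn.ModuleList(
            nn.Conv2d(gc, gc, k, padding=k // 2) for k in kernel_sizes
        )
        # Group attention: one scalar weight per branch from pooled features.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, self.groups, 1),
            nn.Softmax(dim=1),
        )

    def forward(self, x):
        n, c, h, w = x.shape
        gc = c // self.groups
        # Different CONV kernels on different channel groups.
        feats = [b(t) for b, t in zip(self.branches, x.split(gc, dim=1))]
        w_attn = self.attn(x)  # (n, groups, 1, 1)
        feats = [f * w_attn[:, i:i + 1] for i, f in enumerate(feats)]
        out = torch.cat(feats, dim=1)
        # Channel shuffle so information mixes across pyramid branches.
        return out.view(n, self.groups, gc, h, w).transpose(1, 2).reshape(n, c, h, w)

x = torch.randn(2, 64, 32, 32)
print(DPPBlockSketch(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```

A softmax over branch weights is one plausible reading of "adaptively select RF size"; the paper's group attention may be gated differently.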
2. Geller HA, Bartho R, Thömmes K, Redies C. Statistical image properties predict aesthetic ratings in abstract paintings created by neural style transfer. Frontiers in Neuroscience 2022;16:999720. PMID: 36312022. PMCID: PMC9606769. DOI: 10.3389/fnins.2022.999720.
Abstract
Artificial intelligence has emerged as a powerful computational tool for creating artworks. One application is Neural Style Transfer (NST), which transfers the style of one image, such as a painting, onto the content of another image, such as a photograph. In the present study, we ask how NST affects objective image properties and how beholders perceive the novel (style-transferred) stimuli. To focus on the subjective perception of artistic style, we minimized the confounding effect of cognitive processing by eliminating all representational content from the input images. To this end, we transferred the styles of 25 diverse abstract paintings onto 150 colored random-phase patterns with six different Fourier spectral slopes, resulting in 150 style-transferred stimuli. We then computed eight statistical image properties (complexity, self-similarity, edge-orientation entropy, variances of neural network features, and color statistics) for each image. In a rating study, we asked participants to evaluate the images along three aesthetic dimensions (Pleasing, Harmonious, and Interesting). The results demonstrate that not only objective image properties but also subjective aesthetic preferences transferred from the original artworks onto the style-transferred images. The image properties of the style-transferred images explain 50-69% of the variance in the ratings. In the multidimensional space of statistical image properties, participants considered style-transferred images to be more Pleasing and Interesting if they were closer to a "sweet spot" where traditional Western paintings (JenAesthetics dataset) are represented. We conclude that NST is a useful tool for creating novel artistic stimuli that preserve the image properties of the input style images. In the novel stimuli, we found a strong relationship between statistical image properties and subjective ratings, suggesting a prominent role of perceptual processing in the aesthetic evaluation of abstract images.
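One reproducible ingredient of the stimulus construction described here is the synthesis of random-phase patterns with a prescribed Fourier spectral slope. The sketch below, a grayscale simplification of the colored patterns used in the study, imposes a 1/f^alpha amplitude spectrum on uniformly random phases; the function and parameter names are illustrative assumptions.

```python
# A minimal sketch: a random-phase pattern whose amplitude spectrum falls
# off as 1/f^alpha (grayscale simplification; the study used colored patterns).
import numpy as np

def random_phase_pattern(size=256, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    f = np.sqrt(fx**2 + fy**2)
    f[0, 0] = 1.0                      # avoid division by zero at DC
    amplitude = f ** (-alpha)          # 1/f^alpha amplitude spectrum
    phase = rng.uniform(0, 2 * np.pi, (size, size))
    spectrum = amplitude * np.exp(1j * phase)
    img = np.real(np.fft.ifft2(spectrum))  # keep the real part
    return (img - img.min()) / (img.max() - img.min())  # rescale to [0, 1]

for slope in (0.5, 1.0, 1.5):
    print(slope, random_phase_pattern(alpha=slope).std())
```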
3. Choe S, Seong H, Kim E. Indoor Place Category Recognition for a Cleaning Robot by Fusing a Probabilistic Approach and Deep Learning. IEEE Transactions on Cybernetics 2022;52:7265-7276. PMID: 33600336. DOI: 10.1109/tcyb.2021.3052499.
Abstract
Indoor place category recognition for a cleaning robot is the problem of predicting the category of an indoor place from images captured by the robot. It is related to scene recognition in computer vision as well as to semantic mapping in robotics. Compared with scene recognition, the indoor place category recognition considered in this article differs as follows: 1) the indoor places contain typical home objects; 2) a sequence of images, rather than an isolated image, is available because the images are captured successively by the cleaning robot; and 3) the camera of a cleaning robot has a different viewpoint from cameras typically used by human beings. Compared with semantic mapping, indoor place category recognition can be considered a component of semantic SLAM. In this article, a new method combining a probabilistic approach and deep learning is proposed to address indoor place category recognition for a cleaning robot. For the probabilistic approach, a new place-object fusion method is proposed based on Bayesian inference. For deep learning, the proposed place-object fusion method is trained with a convolutional neural network in an end-to-end framework. Furthermore, a new recurrent neural network, called the Bayesian filtering network (BFN), is proposed to perform time-domain fusion. Finally, the proposed method is applied to a benchmark dataset and to a new dataset developed in this article, and its validity is demonstrated experimentally.
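The time-domain fusion that the BFN learns end-to-end can be illustrated, in a heavily simplified form, as a classical recursive Bayesian filter over place categories: a persistence step through a transition matrix followed by a measurement update with per-frame CNN scores. The transition matrix and scores below are illustrative assumptions, not the learned BFN.

```python
# A minimal sketch of recursive Bayesian fusion of per-frame class scores.
import numpy as np

def bayes_filter(frame_likelihoods, transition):
    n_classes = transition.shape[0]
    posterior = np.full(n_classes, 1.0 / n_classes)  # uniform prior
    for like in frame_likelihoods:
        predicted = transition.T @ posterior   # place-persistence step
        posterior = like * predicted           # measurement update
        posterior /= posterior.sum()
    return posterior

# Three categories; places tend to persist between consecutive frames.
T = np.full((3, 3), 0.05) + np.eye(3) * 0.85
frames = [np.array([0.5, 0.3, 0.2]), np.array([0.6, 0.3, 0.1])]
print(bayes_filter(frames, T))
```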
4. Zhang L, Shang Y, Li P, Luo H, Shao L. Community-Aware Photo Quality Evaluation by Deeply Encoding Human Perception. IEEE Transactions on Cybernetics 2022;52:3136-3146. PMID: 32735541. DOI: 10.1109/tcyb.2019.2937319.
Abstract
Computational photo quality evaluation is a useful technique in many tasks of computer vision and graphics, for example, photo retargeting, 3-D rendering, and fashion recommendation. Conventional photo quality models are designed by characterizing pictures from all communities (e.g., "architecture" and "colorful") indiscriminately, so community-specific features are not exploited explicitly. In this article, we develop a new community-aware photo quality evaluation framework. It uncovers the latent community-specific topics by a regularized latent topic model (LTM) and captures human visual quality perception by exploring multiple attributes. More specifically, given massive-scale online photographs from multiple communities, a novel ranking algorithm is proposed to measure the visual/semantic attractiveness of regions inside each photograph. Meanwhile, three attributes, namely: 1) photo quality scores; 2) weak semantic tags; and 3) inter-region correlations, are seamlessly and collaboratively incorporated during ranking. Subsequently, we construct the gaze shifting path (GSP) for each photograph by sequentially linking its top-ranking regions, and an aggregation-based CNN computes a deep representation for each GSP. Based on this, an LTM is proposed to model the GSP distribution from multiple communities in the latent space. To mitigate the overfitting caused by communities with very few photographs, a regularizer is incorporated into our LTM. Finally, given a test photograph, we obtain its deep GSP representation, and its quality score is determined by the posterior probability of the regularized LTM. Comparative studies on four image sets have shown the competitiveness of our method. Moreover, eye-tracking experiments have demonstrated that our ranking-based GSPs are highly consistent with real human gaze movements.
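The GSP construction step lends itself to a small sketch: given per-region attractiveness scores (which the paper derives from its multi-attribute ranking algorithm), the top-ranking regions are linked sequentially into a path. The scoring and region centroids below are stand-in assumptions.

```python
# A minimal sketch of gaze shifting path (GSP) construction.
import numpy as np

def gaze_shifting_path(centroids, scores, k=5):
    order = np.argsort(scores)[::-1][:k]   # top-k regions by score
    path = centroids[order]                # visit them in rank order
    hops = np.linalg.norm(np.diff(path, axis=0), axis=1)  # path segment lengths
    return path, hops

rng = np.random.default_rng(0)
cents = rng.uniform(0, 224, size=(12, 2))   # 12 candidate region centroids
scores = rng.random(12)                     # stand-in attractiveness scores
path, hops = gaze_shifting_path(cents, scores)
print(path.shape, hops.round(1))
```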
5. Zhang L, Ju X, Shang Y, Li X. Deeply Encoding Stable Patterns From Contaminated Data for Scenery Image Recognition. IEEE Transactions on Cybernetics 2021;51:5671-5680. PMID: 31794411. DOI: 10.1109/tcyb.2019.2951798.
Abstract
Effectively recognizing different sceneries with complex backgrounds and varied lighting conditions plays an important role in modern AI systems. Competitive performance has recently been achieved by deep scene categorization models. However, these models implicitly assume that the image-level labels are 100% correct, which is too restrictive. In practice, the image-level labels for massive-scale scenery sets are usually produced by external predictors such as ImageNet-trained CNNs, and these labels can easily become contaminated because no predictor is completely accurate. This article proposes a new deep architecture that predicts scene categories by hierarchically deriving stable templates, which are discovered using a generative model. Specifically, we first construct a semantic space by incorporating image-level labels using subspace embedding. In this semantic space, the superpixel distributions from identically labeled images remain unchanged regardless of image-level label noise. On the basis of this observation, a probabilistic generative model learns the stable templates for each scene category. To deeply represent each scenery category, a novel aggregation network is developed to statistically concatenate the CNN features learned from the scene annotations predicted by HSA. Finally, the learned deep representations are integrated into an image kernel, which is subsequently incorporated into a multiclass SVM for distinguishing scene categories. Thorough experiments have demonstrated the effectiveness of our method. As a byproduct, an empirical study of 33 SIFT-flow categories shows that the learned stable templates remain almost unchanged under a nearly 36% image-label contamination rate.
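The final classification stage, an image kernel fed to a multiclass SVM, can be sketched with standard tooling. Below, an RBF kernel over random vectors stands in for the learned deep representations; the kernel choice and feature dimensions are illustrative assumptions.

```python
# A minimal sketch: precomputed image kernel + multiclass SVM.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
feats = rng.normal(size=(60, 128))       # stand-in deep representations
labels = rng.integers(0, 3, size=60)     # three scene categories

K = rbf_kernel(feats, feats, gamma=1.0 / 128)   # image kernel over the set
clf = SVC(kernel="precomputed").fit(K, labels)  # one-vs-one multiclass SVM

# Prediction needs the kernel between test and training images.
print(clf.predict(rbf_kernel(feats[:5], feats, gamma=1.0 / 128)))
```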
6. Zhang L, Pan Z, Shao L. Semi-Supervised Perception Augmentation for Aerial Photo Topologies Understanding. IEEE Transactions on Image Processing 2021;30:7803-7814. PMID: 34003752. DOI: 10.1109/tip.2021.3079820.
Abstract
Intelligently understanding the sophisticated topological structures of aerial photographs is a useful technique in aerial image analysis. Conventional methods cannot fulfill this task due to the following challenges: 1) the number of topologies in an aerial photo increases exponentially with the topology size, which requires a fine-grained visual descriptor to represent each topology discriminatively; 2) visually/semantically salient topologies must be identified within each aerial photo in a weakly-labeled setting, since pixel-level annotation requires unaffordable human resources; and 3) a cross-domain knowledge transfer module is needed to augment aerial photo perception, since multi-resolution aerial photos are in practice taken asynchronously. To handle these problems, we propose a unified framework for understanding aerial photo topologies, which represents each aerial photo by a set of visually/semantically salient topologies based on human visual perception and further employs them for visual categorization. Specifically, we first extract multiple atomic regions from each aerial photo, and graphlets are then built to capture the topology of each photo. A weakly-supervised ranking algorithm then selects a few semantically salient graphlets by seamlessly encoding multiple image-level attributes. Toward a visualizable and perception-aware framework, we construct the gaze shifting path (GSP) by linking the top-ranking graphlets. Finally, we derive the deep GSP representation and formulate a semi-supervised, cross-domain SVM to partition each aerial photo into multiple categories. The SVM utilizes the global composition of low-resolution counterparts to enhance the deep GSP features of high-resolution aerial photos, which are only partially annotated. Extensive visualization results and categorization performance comparisons have demonstrated the competitiveness of our approach.
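The graphlet construction over atomic regions can be sketched as follows: regions become graph nodes, spatially close regions are joined by edges, and small connected induced subgraphs are enumerated as graphlets. The proximity threshold and graphlet size below are illustrative assumptions.

```python
# A minimal sketch of graphlet extraction from atomic regions.
import itertools
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
centroids = rng.uniform(0, 100, size=(8, 2))   # 8 atomic region centroids

G = nx.Graph()
for i, j in itertools.combinations(range(len(centroids)), 2):
    if np.linalg.norm(centroids[i] - centroids[j]) < 40:  # proximity edge
        G.add_edge(i, j)

# Enumerate 3-node graphlets (connected induced subgraphs).
graphlets = [
    nodes for nodes in itertools.combinations(G.nodes, 3)
    if nx.is_connected(G.subgraph(nodes))
]
print(len(graphlets), graphlets[:3])
```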
7. Zhang L, Liang R, Yin J, Zhang D, Shao L. Scene Categorization by Deeply Learning Gaze Behavior in a Semisupervised Context. IEEE Transactions on Cybernetics 2021;51:4265-4276. PMID: 31144650. DOI: 10.1109/tcyb.2019.2913016.
Abstract
Accurately recognizing different categories of sceneries with sophisticated spatial configurations is a useful technique in computer vision and intelligent systems, e.g., scene understanding and autonomous driving. Deep recognition models have recently achieved competitive accuracies. Nevertheless, these deep architectures cannot explicitly characterize human visual perception, that is, the sequence of gaze allocation and the subsequent cognitive processes that occur when viewing each scenery. In this paper, a novel spatially aware aggregation network is proposed for scene categorization, where human gaze behavior is discovered in a semisupervised setting. In particular, since semantically labeling a large quantity of scene images is labor-intensive, a semisupervised and structure-preserving non-negative matrix factorization (NMF) is proposed to detect a set of visually/semantically salient regions in each scenery. Afterward, the gaze shifting path (GSP) is engineered to characterize how humans perceive each scene picture. To deeply describe each GSP, a novel spatially aware CNN termed SA-Net is developed. It accepts input regions of various shapes and statistically aggregates all the salient regions along each GSP. Finally, the learned deep GSP features from the entire scene images are fused into an image kernel, which is subsequently integrated into a kernel SVM to categorize different sceneries. Comparative experiments on six scene image sets have shown the advantage of our method.
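As a rough, heavily simplified stand-in for the salient-region discovery step, the sketch below factorizes region descriptors with plain NMF and treats regions with strong component loadings as salient. The paper's semisupervised, structure-preserving NMF adds label and graph constraints that are omitted here.

```python
# A minimal, simplified stand-in: plain NMF over region descriptors.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((50, 64))                 # 50 candidate regions x 64-dim features

nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                 # region-to-component loadings
H = nmf.components_                      # component bases

# Regions loading strongly on any component are treated as salient.
saliency = W.max(axis=1)
print(np.argsort(saliency)[::-1][:8])    # indices of the top-8 regions
```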
8.
Abstract
Recently, convolutional neural networks (CNNs) have achieved great success in scene recognition. Compared with traditional hand-crafted features, CNNs can extract more robust and generalized features for scene recognition. However, existing CNN-based scene recognition methods do not sufficiently take into account the relationship between image regions and categories when choosing local regions, which results in many redundant local regions and degrades recognition accuracy. In this paper, we propose an effective method for discovering discriminative regions of a scene image. Our method uses the gradient-weighted class activation mapping (Grad-CAM) technique and weakly supervised information to generate the attention map (AM) of scene images, dubbed WS-AM (weakly supervised attention map). Regions where both the local mean and the local center value of the AM are large correspond to the discriminative regions helpful for scene recognition. We sample discriminative regions at multiple scales and extract the features of large-scale and small-scale regions with two different pre-trained CNNs, respectively. The features from the two scales are aggregated by improved vector of locally aggregated descriptors (VLAD) coding and max pooling, respectively. Finally, the pre-trained CNN is used to extract the global feature of the image in the fully-connected (FC) layer, and the local features are combined with the global feature to obtain the image representation. We validated the effectiveness of our method on three benchmark datasets: MIT Indoor 67, Scene 15, and UIUC Sports, obtaining 85.67%, 94.80%, and 95.12% accuracy, respectively. Compared with some state-of-the-art methods, WS-AM requires fewer local regions and therefore offers better real-time performance.
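The Grad-CAM step underlying WS-AM can be sketched with standard PyTorch hooks: capture the activations and gradients of the last convolutional block, average the gradients per channel, and form a ReLU-weighted sum of activations. The backbone choice and hook placement are illustrative assumptions (torchvision >= 0.13 is assumed for the `weights=None` argument); the WS-AM-specific local-mean/local-center-value sampling is not shown.

```python
# A minimal Grad-CAM sketch (assumed ResNet-18 backbone, random weights).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
acts, grads = {}, {}
layer = model.layer4                                   # last conv block
layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)
score = model(x)[0].max()                              # top-class logit
score.backward()                                       # populate gradients

weights = grads["v"].mean(dim=(2, 3), keepdim=True)    # per-channel grad average
cam = F.relu((weights * acts["v"]).sum(dim=1))         # weighted activation map
cam = cam / cam.max()                                  # normalize to [0, 1]
print(cam.shape)                                       # torch.Size([1, 7, 7])
```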