1
Song X, Tan Y, Li X, Hei X. GDVIFNet: A generated depth and visible image fusion network with edge feature guidance for salient object detection. Neural Netw 2025; 188:107445. [PMID: 40209304] [DOI: 10.1016/j.neunet.2025.107445]
Abstract
In recent years, despite significant advancements in salient object detection (SOD), performance in complex interference environments remains suboptimal. To address these challenges, additional modalities like depth (SOD-D) or thermal imaging (SOD-T) are often introduced. However, existing methods typically rely on specialized depth or thermal devices to capture these modalities, which can be costly and inconvenient. To address this limitation using only a single RGB image, we propose GDVIFNet, a novel approach that leverages Depth Anything to generate depth images. Since these generated depth images may contain noise and artifacts, we incorporate self-supervised techniques to generate edge feature information. During the process of generating image edge features, the noise and artifacts present in the generated depth images can be effectively removed. Our method employs a dual-branch architecture, combining CNN and Transformer-based branches for feature extraction. We designed the step trimodal interaction unit (STIU) to fuse the RGB features with the depth features from the CNN branch and the self-cross attention fusion (SCF) to integrate RGB features with depth features from the Transformer branch. Finally, guided by edge features from our self-supervised edge guidance module (SEGM), we employ the CNN-Edge-Transformer step fusion (CETSF) to fuse features from both branches. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple datasets. Code can be found at https://github.com/typist2001/GDVIFNet.
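Illustrative note: the edge-guidance idea above can be toy-prototyped by extracting edges directly from a generated depth map; the sketch below uses fixed Sobel kernels in PyTorch and is a generic illustration (all tensor names are hypothetical), not the paper's SEGM module or its Depth Anything pipeline.

```python
import torch
import torch.nn.functional as F

def sobel_edges(depth: torch.Tensor) -> torch.Tensor:
    """Edge-magnitude map from a (B, 1, H, W) depth tensor, normalized to [0, 1]."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3).to(depth)
    ky = torch.tensor([[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]).view(1, 1, 3, 3).to(depth)
    gx = F.conv2d(depth, kx, padding=1)
    gy = F.conv2d(depth, ky, padding=1)
    edges = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
    return edges / (edges.amax(dim=(2, 3), keepdim=True) + 1e-8)

# Hypothetical usage: `depth` would come from a monocular depth estimator.
depth = torch.rand(2, 1, 224, 224)
edge_map = sobel_edges(depth)            # (2, 1, 224, 224)
rgb_feat = torch.rand(2, 64, 224, 224)
edge_guided = rgb_feat * (1 + edge_map)  # simple multiplicative edge guidance
```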
Affiliation(s)
- Xiaogang Song
- Xi'an University of Technology, School of Computer Science and Engineering, Xi'an, 710048, China; Engineering Research Center of Human-machine integration intelligent robot, Universities of Shaanxi Province, Xi'an, 710048, China.
- Yuping Tan
- Xi'an University of Technology, School of Computer Science and Engineering, Xi'an, 710048, China.
- Xiaochang Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100, China.
- Xinhong Hei
- Xi'an University of Technology, School of Computer Science and Engineering, Xi'an, 710048, China; Engineering Research Center of Human-machine integration intelligent robot, Universities of Shaanxi Province, Xi'an, 710048, China.
2
Huang Y, Li L, Chen P, Wu H, Lin W, Shi G. Multi-Modality Multi-Attribute Contrastive Pre-Training for Image Aesthetics Computing. IEEE Trans Pattern Anal Mach Intell 2025; 47:1205-1218. [PMID: 39504278] [DOI: 10.1109/tpami.2024.3492259]
Abstract
In the Image Aesthetics Computing (IAC) field, most prior methods leveraged off-the-shelf backbones pre-trained on the large-scale ImageNet database. While these pre-trained backbones have achieved notable success, they often overemphasize object-level semantics and fail to capture the high-level concepts of image aesthetics, which leads to suboptimal performance. To tackle this long-neglected problem, we propose a multi-modality multi-attribute contrastive pre-training framework, aiming to construct an alternative to ImageNet-based pre-training for IAC. Specifically, the proposed framework consists of two main aspects. 1) We build a multi-attribute image description database with human feedback, leveraging the competent image understanding capability of the multi-modality large language model to generate rich aesthetic descriptions. 2) To better adapt models to aesthetic computing tasks, we integrate the image-based visual features with the attribute-based text features, and map the integrated features into different embedding spaces, based on which multi-attribute contrastive learning is proposed for obtaining a more comprehensive aesthetic representation. To alleviate the distribution shift encountered when transitioning from the general visual domain to the aesthetic domain, we further propose a semantic affinity loss to restrain the content information and enhance model generalization. Extensive experiments demonstrate that the proposed framework sets new state-of-the-art results for IAC tasks.
3
Zhou J, Ren F. Scene categorization by Hessian-regularized active perceptual feature selection. Sci Rep 2025; 15:739. [PMID: 39753661] [PMCID: PMC11698863] [DOI: 10.1038/s41598-024-84181-x]
Abstract
Decoding the semantic categories of complex sceneries is fundamental to numerous artificial intelligence (AI) infrastructures. This work presents an advanced selection of multi-channel perceptual visual features for recognizing scenic images with elaborate spatial structures, focusing on developing a deep hierarchical model dedicated to learning human gaze behavior. Utilizing the BING objectness measure, we efficiently localize objects or their details across varying scales within scenes. To emulate humans observing semantically or visually significant areas within scenes, we propose a robust deep active learning (RDAL) strategy. This strategy progressively generates gaze shifting paths (GSP) and calculates deep GSP representations within a unified architecture. A notable advantage of RDAL is the robustness to label noise, which is implemented by a carefully-designed sparse penalty term. This mechanism ensures that irrelevant or misleading deep GSP features are intelligently discarded. Afterward, a novel Hessian-regularized Feature Selector (HFS) is proposed to select high-quality features from the deep GSP features, wherein (i) the spatial composition of scenic patches can be optimally maintained, and (ii) a linear SVM is learned simultaneously. Empirical evaluations across six standard scenic datasets demonstrated our method's superior performance, highlighting its exceptional ability to differentiate various sophisticated scenery categories.
Affiliation(s)
- Junwu Zhou
- School of Higher Vocational and Technical College, Shanghai Dianji University, Shanghai, 201306, China
- Fuji Ren
- College of Computer Sciences, Anhui University, Hefei, 230039, China.
4
Jian Z, Song T, Zhang Z, Ai Z, Zhao H, Tang M, Liu K. Deep learning method for detecting fluorescence spots in cancer diagnostics via fluorescence in situ hybridization. Sci Rep 2024; 14:27231. [PMID: 39516673] [PMCID: PMC11549464] [DOI: 10.1038/s41598-024-78571-4]
Abstract
Fluorescence in Situ Hybridization (FISH) is a technique for macromolecule identification that utilizes the complementarity of DNA or DNA/RNA double strands. Probes, crafted from selected DNA strands tagged with fluorophore-coupled nucleotides, hybridize to complementary sequences within the cells and tissues under examination. These are subsequently visualized through fluorescence microscopy or imaging systems. However, the vast number of cells and disorganized nucleic acid sequences in FISH images present significant challenges. The manual processing and analysis of these images are not only time-consuming but also prone to human error due to visual fatigue. To overcome these challenges, we propose the integration of medical imaging with deep learning to develop an automated detection system for FISH images. This system features an algorithm capable of quickly detecting fluorescent spots and capturing their coordinates, which is crucial for evaluating cellular characteristics in cancer diagnosis. Traditional models struggle with the small size, low resolution, and noise prevalent in fluorescent points, leading to significant performance declines. This paper offers a detailed examination of these issues, providing insights into why traditional models falter. Comparative tests between the YOLO series models and our proposed method affirm the superior accuracy of our approach in identifying fluorescent dots in FISH images.
Affiliation(s)
- Zini Jian
- School of Electronic and Electrical Engineering, Wuhan Textile University, Wuhan, 430200, China.
- Tianxiang Song
- School of Electronic and Electrical Engineering, Wuhan Textile University, Wuhan, 430200, China.
- Zhihui Zhang
- School of Electronic and Electrical Engineering, Wuhan Textile University, Wuhan, 430200, China.
- Zhao Ai
- School of Electronic and Electrical Engineering, Wuhan Textile University, Wuhan, 430200, China.
- Heng Zhao
- School of Electronic and Electrical Engineering, Wuhan Textile University, Wuhan, 430200, China.
- Man Tang
- School of Electronic and Electrical Engineering, Wuhan Textile University, Wuhan, 430200, China.
- Kan Liu
- School of Electronic and Electrical Engineering, Wuhan Textile University, Wuhan, 430200, China.
5
Qiao M, Xu M, Jiang L, Lei P, Wen S, Chen Y, Sigal L. HyperSOR: Context-Aware Graph Hypernetwork for Salient Object Ranking. IEEE Trans Pattern Anal Mach Intell 2024; 46:5873-5889. [PMID: 38381637] [DOI: 10.1109/tpami.2024.3368158]
Abstract
Salient object ranking (SOR) aims to segment salient objects in an image and simultaneously predict their saliency rankings, according to the shifted human attention over different objects. The existing SOR approaches mainly focus on object-based attention, e.g., the semantics and appearance of objects. However, we find that the scene context plays a vital role in SOR, in which the saliency ranking of the same object varies considerably across different scenes. In this paper, we thus make the first attempt towards explicitly learning scene context for SOR. Specifically, we establish a large-scale SOR dataset of 24,373 images with rich context annotations, i.e., scene graphs, segmentation, and saliency rankings. Inspired by the data analysis on our dataset, we propose a novel graph hypernetwork, named HyperSOR, for context-aware SOR. In HyperSOR, an initial graph module is developed to segment objects and construct an initial graph by considering both geometry and semantic information. Then, a scene graph generation module with a multi-path graph attention mechanism is designed to learn semantic relationships among objects based on the initial graph. Finally, a saliency ranking prediction module dynamically adopts the learned scene context through a novel graph hypernetwork, for inferring the saliency rankings. Experimental results show that our HyperSOR can significantly improve the performance of SOR.
6
Zhu G, Li J, Guo Y. Supplement and Suppression: Both Boundary and Nonboundary Are Helpful for Salient Object Detection. IEEE Trans Neural Netw Learn Syst 2023; 34:6615-6627. [PMID: 34818196] [DOI: 10.1109/tnnls.2021.3127959]
Abstract
Current methods aggregate multilevel features from the backbone and introduce edge information to get more refined saliency maps. However, little attention is paid to how to suppress the regions with similar saliency appearances in the background. These regions usually exist in the vicinity of salient objects and have high contrast with the background, which makes them easy to misclassify as foreground. To solve this problem, we propose a gated feature interaction network (GFINet) to integrate multiple saliency features, which can utilize nonboundary features with background information to suppress pseudosalient objects and simultaneously apply boundary features to supplement edge details. Different from previous methods that only consider the complementarity between saliency and boundary, the proposed network introduces nonboundary features into the decoder to filter the pseudosalient objects. Specifically, GFINet consists of a global features aggregation branch (GFAB), a boundary and nonboundary features' perception branch (B&NFPB), and a gated feature interaction module (GFIM). Taking the global features generated by GFAB and the boundary and nonboundary features produced by B&NFPB, GFIM employs a gate structure to adaptively optimize the saliency information interchange between these features and thus predicts the final saliency maps. Besides, due to the imbalanced distribution between boundary and nonboundary pixels, the binary cross-entropy (BCE) loss struggles to predict the pixels near the boundary. Therefore, we design a border region aware (BRA) loss to further boost the quality of the boundary and nonboundary predictions, which can guide the network to focus more on the hard pixels near the boundary by assigning different weights to different positions. Compared with 12 counterparts, experimental results on five benchmark datasets show that our method has better generalization and improves the state-of-the-art approach by 4.85% on average in terms of regional and boundary evaluation measures. In addition, our model is more efficient, with an inference speed of 50.3 FPS when processing a 320×320 image. Code has been made available at https://github.com/lesonly/GFINet.
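Illustrative note: a boundary-weighted loss in the spirit of the BRA loss described above can be sketched by building a boundary band from the ground truth with max-pooling-based dilation/erosion and re-weighting BCE inside that band; the weighting scheme and hyperparameters below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def border_weighted_bce(pred, gt, band=5, alpha=4.0):
    """pred, gt: (B, 1, H, W) tensors in [0, 1]; pred holds probabilities.

    Pixels inside a band around the ground-truth boundary receive weight 1 + alpha,
    all remaining pixels receive weight 1.
    """
    pad = band // 2
    dilated = F.max_pool2d(gt, kernel_size=band, stride=1, padding=pad)
    eroded = -F.max_pool2d(-gt, kernel_size=band, stride=1, padding=pad)
    boundary_band = (dilated - eroded).clamp(0, 1)   # 1 near the object boundary
    weights = 1.0 + alpha * boundary_band
    return F.binary_cross_entropy(pred, gt, weight=weights)

pred = torch.rand(2, 1, 320, 320)
gt = (torch.rand(2, 1, 320, 320) > 0.5).float()
loss = border_weighted_bce(pred, gt)
```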
7
Li L, Zhi T, Shi G, Yang Y, Xu L, Li Y, Guo Y. Anchor-based Knowledge Embedding for Image Aesthetics Assessment. Neurocomputing 2023. [DOI: 10.1016/j.neucom.2023.03.058]
8
Zhao Z, Li X. Deformable Density Estimation via Adaptive Representation. IEEE Trans Image Process 2023; 32:1134-1144. [PMID: 37022433] [DOI: 10.1109/tip.2023.3240839]
Abstract
Crowd counting is the basic task of crowd analysis and is of great significance in the field of public safety; it has therefore received increasing attention recently. The common idea is to combine the crowd counting task with convolutional neural networks to predict the corresponding density map, which is generated by filtering the dot labels with specific Gaussian kernels. Although counting performance has been improved by newly proposed networks, they all share one common problem: due to the perspective effect, there is significant scale contrast among targets at different positions within one scene, and existing density maps cannot represent this scale change well. To address the prediction difficulties caused by target scale variation, we propose a scale-sensitive crowd density map estimation framework, which deals with target scale change at the density map generation, network design, and model training stages. It consists of the Adaptive Density Map (ADM), Deformable Density Map Decoder (DDMD), and an Auxiliary Branch. To be specific, the Gaussian kernel size varies adaptively based on target size to generate an ADM that contains scale information for each specific target. DDMD introduces deformable convolution to fit the Gaussian kernel variation and boosts the model's scale sensitivity. The Auxiliary Branch guides the learning of deformable convolution offsets during the training phase. Finally, we conduct experiments on different large-scale datasets. The results show the effectiveness of the proposed ADM and DDMD. Furthermore, the visualization demonstrates that deformable convolution learns the target scale variation.
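Illustrative note: the adaptive-kernel idea behind the ADM can be sketched by letting each annotated target's Gaussian sigma depend on an estimate of its scale (here, the mean distance to its nearest annotated neighbors, a common proxy); the parameters and the fallback sigma are assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import KDTree

def adaptive_density_map(points, shape, k=3, beta=0.3):
    """points: (N, 2) array of (row, col) annotations; shape: (H, W) of the output map."""
    density = np.zeros(shape, dtype=np.float32)
    if len(points) == 0:
        return density
    tree = KDTree(points)
    # query k+1 neighbors because the nearest neighbor of each point is itself
    dists, _ = tree.query(points, k=min(k + 1, len(points)))
    for (r, c), d in zip(points, np.atleast_2d(dists)):
        sigma = beta * d[1:].mean() if len(points) > 1 else 15.0
        delta = np.zeros(shape, dtype=np.float32)
        delta[int(r), int(c)] = 1.0
        density += gaussian_filter(delta, sigma)  # each target gets its own kernel size
    return density  # integrates approximately to the number of targets

pts = np.array([[40, 50], [42, 60], [200, 300]])
dmap = adaptive_density_map(pts, (480, 640))
print(dmap.sum())  # ~3.0
```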
9
Li W. Aesthetic Assessment of Packaging Design Based on Con-Transformer. Int J E-Collab 2023. [DOI: 10.4018/ijec.316873]
Abstract
Different from the traditional aesthetic assessment task for natural images, the aesthetic assessment of packaging design should pay attention not only to artistic beauty but also to functional beauty, that is, the attraction of the packaging design to consumers. In this paper, the authors propose a Con-Transformer packaging design aesthetic assessment method, which takes advantage of convolutional operations and self-attention mechanisms for enhanced representation learning, resulting in an effective aesthetic assessment of the packaging design images. Specifically, Con-Transformer integrates a convolution network branch and a transformer network branch to extract local and global representation features of the packaging design images, respectively. Finally, the fused representation features are used for aesthetic assessment. Experimental results show that the proposed method can not only effectively assess the aesthetics of packaging design images but can also be applied to the aesthetic assessment of natural images.
Affiliation(s)
- Wei Li
- Hefei Normal University, China
10
Pang Y, Zhao X, Zhang L, Lu H. CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection. IEEE Trans Image Process 2023; 32:892-904. [PMID: 37018701] [DOI: 10.1109/tip.2023.3234702]
Abstract
Most of the existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation constrains the performance of the convolution-based methods to a ceiling. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed cross-modal view-mixed transformer (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats the multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when it is equipped with the proposed components.
11
Yu H, Peng H, Huang Y, Fu J, Du H, Wang L, Ling H. Cyclic Differentiable Architecture Search. IEEE Trans Pattern Anal Mach Intell 2023; 45:211-228. [PMID: 35196225] [DOI: 10.1109/tpami.2022.3153065]
Abstract
Differentiable ARchiTecture Search, i.e., DARTS, has drawn great attention in neural architecture search. It tries to find the optimal architecture in a shallow search network and then measures its performance in a deep evaluation network. The independent optimization of the search and evaluation networks, however, leaves room for potential improvement by allowing interaction between the two networks. To address this optimization issue, we propose new joint optimization objectives and a novel Cyclic Differentiable ARchiTecture Search framework, dubbed CDARTS. Considering the structure difference, CDARTS builds a cyclic feedback mechanism between the search and evaluation networks with introspective distillation. First, the search network generates an initial architecture for evaluation, and the weights of the evaluation network are optimized. Second, the architecture weights in the search network are further optimized by the label supervision in classification, as well as the regularization from the evaluation network through feature distillation. Repeating the above cycle results in a joint optimization of the search and evaluation networks and thus enables the evolution of the architecture to fit the final evaluation network. The experiments and analysis on CIFAR, ImageNet and NATS-Bench demonstrate the effectiveness of the proposed approach over the state-of-the-art ones. Specifically, in the DARTS search space, we achieve 97.52% top-1 accuracy on CIFAR10 and 76.3% top-1 accuracy on ImageNet. In the chain-structured search space, we achieve 78.2% top-1 accuracy on ImageNet, which is 1.1% higher than EfficientNet-B0. Our code and models are publicly available at https://github.com/microsoft/Cream.
12
Liu JJ, Hou Q, Liu ZA, Cheng MM. PoolNet+: Exploring the Potential of Pooling for Salient Object Detection. IEEE Trans Pattern Anal Mach Intell 2023; 45:887-904. [PMID: 34982676] [DOI: 10.1109/tpami.2021.3140168]
Abstract
We explore the potential of pooling techniques on the task of salient object detection by expanding its role in convolutional neural networks. In general, two pooling-based modules are proposed. A global guidance module (GGM) is first built based on the bottom-up pathway of the U-shape architecture, which aims to guide the location information of the potential salient objects into layers at different feature levels. A feature aggregation module (FAM) is further designed to seamlessly fuse the coarse-level semantic information with the fine-level features in the top-down pathway. We can progressively refine the high-level semantic features with these two modules and obtain detail enriched saliency maps. Experimental results show that our proposed approach can locate the salient objects more accurately with sharpened details and substantially improve the performance compared with the existing state-of-the-art methods. Besides, our approach is fast and can run at a speed of 53 FPS when processing a 300 ×400 image. To make our approach better applied to mobile applications, we take MobileNetV2 as our backbone and re-tailor the structure of our pooling-based modules. Our mobile version model achieves a running speed of 66 FPS yet still performs better than most existing state-of-the-art methods. To verify the generalization ability of the proposed method, we apply it to the edge detection, RGB-D salient object detection, and camouflaged object detection tasks, and our method achieves better results than the corresponding state-of-the-art methods of these three tasks. Code can be found at http://mmcheng.net/poolnet/.
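Illustrative note: the pooling-based global guidance idea can be sketched as a pyramid of adaptive average pooling over the deepest backbone feature, fused back at full resolution; the bin sizes and channel counts below are illustrative assumptions, not the released PoolNet+ modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalGuidance(nn.Module):
    """Pyramid-pooling style block that summarizes global context for the decoder."""

    def __init__(self, in_ch, out_ch, bins=(1, 3, 5)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, 1),
                          nn.ReLU(inplace=True))
            for b in bins
        ])
        self.fuse = nn.Conv2d(in_ch + len(bins) * out_ch, out_ch, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x]
        for stage in self.stages:
            y = stage(x)                       # pooled global context at one scale
            feats.append(F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))

ggm = GlobalGuidance(in_ch=512, out_ch=128)
top_feature = torch.rand(1, 512, 20, 20)       # coarsest backbone feature
guidance = ggm(top_feature)                    # (1, 128, 20, 20), to be broadcast to decoder levels
```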
13
MENet: Lightweight Multimodality Enhancement Network for Detecting Salient Objects in RGB-Thermal Images. Neurocomputing 2023. [DOI: 10.1016/j.neucom.2023.01.024]
14
Wu YH, Liu Y, Xu J, Bian JW, Gu YC, Cheng MM. MobileSal: Extremely Efficient RGB-D Salient Object Detection. IEEE Trans Pattern Anal Mach Intell 2022; 44:10261-10269. [PMID: 34898430] [DOI: 10.1109/tpami.2021.3134684]
Abstract
The high computational cost of neural networks has prevented recent successes in RGB-D salient object detection (SOD) from benefiting real-world applications. Hence, this article introduces a novel network, MobileSal, which focuses on efficient RGB-D SOD using mobile networks for deep feature extraction. However, mobile networks are less powerful in feature representation than cumbersome networks. To this end, we observe that the depth information of color images can strengthen the feature representation related to SOD if leveraged properly. Therefore, we propose an implicit depth restoration (IDR) technique to strengthen the mobile networks' feature representation capability for RGB-D SOD. IDR is only adopted in the training phase and is omitted during testing, so it is computationally free. Besides, we propose compact pyramid refinement (CPR) for efficient multi-level feature aggregation to derive salient objects with clear boundaries. With IDR and CPR incorporated, MobileSal performs favorably against state-of-the-art methods on six challenging RGB-D SOD datasets with much faster speed (450fps for the input size of 320×320) and fewer parameters (6.5M). The code is released at https://mmcheng.net/mobilesal.
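Illustrative note: a training-only auxiliary depth objective in the spirit of IDR can be sketched as a tiny decoder that restores depth from the RGB encoder features, with an L1 term added to the saliency loss and the decoder dropped at inference; the module names, sizes, and loss weight are assumptions, not MobileSal's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyHead(nn.Module):
    """Small decoder head that maps encoder features to a 1-channel map."""

    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, feat, out_size):
        return F.interpolate(self.net(feat), size=out_size, mode="bilinear", align_corners=False)

def training_step(encoder, sal_head, depth_head, rgb, gt_sal, gt_depth, lam=0.3):
    feat = encoder(rgb)                                  # shared RGB features
    sal = sal_head(feat, rgb.shape[2:])
    sal_loss = F.binary_cross_entropy_with_logits(sal, gt_sal)
    depth_pred = depth_head(feat, rgb.shape[2:])         # auxiliary branch, training only
    idr_loss = F.l1_loss(depth_pred, gt_depth)
    return sal_loss + lam * idr_loss

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True))
sal_head = TinyHead(32)      # stand-in saliency head
depth_head = TinyHead(32)    # discarded at inference, so it adds no test-time cost
rgb = torch.rand(2, 3, 128, 128)
gt_sal = (torch.rand(2, 1, 128, 128) > 0.5).float()
gt_depth = torch.rand(2, 1, 128, 128)
loss = training_step(encoder, sal_head, depth_head, rgb, gt_sal, gt_depth)
```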
15
Xu C, Li Q, Zhou Q, Jiang X, Yu D, Zhou Y. Asymmetric cross-modal activation network for RGB-T salient object detection. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.110047]
16
Tong J, Zhang G, Kong P, Rao Y, Wei Z, Cui H, Guan Q. An interpretable approach for automatic aesthetic assessment of remote sensing images. Front Comput Neurosci 2022; 16:1077439. [PMID: 36507306] [PMCID: PMC9730413] [DOI: 10.3389/fncom.2022.1077439]
Abstract
The increase of remote sensing images in recent decades has resulted in their use in non-scientific fields such as environmental protection, education, and art. In this situation, we need to focus on the aesthetic assessment of remote sensing, which has received little attention in research. According to studies on the human brain's attention mechanism, certain areas of an image can trigger visual stimuli during aesthetic evaluation. Inspired by this, we use a convolutional neural network (CNN), a deep learning model resembling the human neural system, and propose an interpretable approach for the automatic aesthetic assessment of remote sensing images. Firstly, we created the Remote Sensing Aesthetics Dataset (RSAD). We collected remote sensing images from Google Earth, designed four evaluation criteria of remote sensing image aesthetic quality (color harmony, light and shadow, prominent theme, and visual balance), and then labeled the samples based on expert photographers' judgment on these criteria. Secondly, we feed RSAD into the ResNet-18 architecture for training. Experimental results show that the proposed method can accurately identify visually pleasing remote sensing images. Finally, we provide a visual explanation of the aesthetic assessment by adopting Gradient-weighted Class Activation Mapping (Grad-CAM) to highlight the important image areas that influenced the model's decision. Overall, this paper is the first to propose and realize automatic aesthetic assessment of remote sensing images, contributing to the non-scientific applications of remote sensing and demonstrating the interpretability of deep-learning-based image aesthetic evaluation.
Affiliation(s)
- Jingru Tong
- School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China
- Guo Zhang
- State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China (Correspondence: Guo Zhang)
- Peijie Kong
- School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China
- Yu Rao
- School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China
- Zhengkai Wei
- School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China
- Hao Cui
- State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China
- Qing Guan
- State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China
17
Risnandar. DeSa COVID-19: Deep salient COVID-19 image-based quality assessment. J King Saud Univ Comput Inf Sci 2022; 34:9501-9512. [PMID: 38620925] [PMCID: PMC8647162] [DOI: 10.1016/j.jksuci.2021.11.013]
Abstract
This study offers an advanced method to evaluate coronavirus disease 2019 (COVID-19) image quality. The salient COVID-19 image map is incorporated with a deep convolutional neural network (DCNN), namely DeSa COVID-19, which exploits the n-convex method for full-reference image quality assessment (FR-IQA). The results substantiate that DeSa COVID-19 and the recommended DCNN architecture achieve remarkable performance on the COVID-chestxray and COVID-CT datasets, respectively. The salient COVID-19 image map is also evaluated on small COVID-19 image patches. The exploratory results attest that the DeSa COVID-19 and recommended DCNN methods perform very well compared with other advanced methods on the COVID-chestxray and COVID-CT datasets, respectively. The recommended DCNN also achieves improved results against several advanced full-reference medical image quality assessment (FR-MIQA) techniques under fast fading (FF), blocking artifact (BA), white Gaussian noise (WG), JPEG, and JPEG2000 (JP2K) distortions on distorted and undistorted COVID-19 images. Spearman's rank order correlation coefficient (SROCC) and the linear correlation coefficient (LCC) are used to appraise the performance of the recommended DCNN and DeSa COVID-19 against recent FR-MIQA methods. DeSa COVID-19 scores 2.63% and 2.62% higher than the recommended DCNN, and 28.53% and 29.01% higher than all advanced FR-MIQA methods, on the SROCC and LCC measures, respectively. The shift-and-add operations of trigonometric, logarithmic, and exponential functions are reduced in the computational complexity of DeSa COVID-19 and the recommended DCNN. DeSa COVID-19 is superior to the recommended DCNN and also to other recent full-reference medical image quality assessment methods.
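For reference, the two reported correlation measures can be computed directly with SciPy; the sketch below uses made-up score arrays purely to show the calls.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical predicted quality scores and subjective (ground-truth) scores
predicted = np.array([0.71, 0.42, 0.88, 0.55, 0.63, 0.97])
subjective = np.array([0.68, 0.40, 0.91, 0.60, 0.58, 0.95])

srocc, _ = spearmanr(predicted, subjective)  # rank-order (monotonicity) agreement
lcc, _ = pearsonr(predicted, subjective)     # linear correlation
print(f"SROCC = {srocc:.4f}, LCC = {lcc:.4f}")
```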
Affiliation(s)
- Risnandar
- The Intelligent Systems Research Group, School of Computing, Telkom University, Jl. Telekomunikasi No. 1, Terusan Buahbatu-Dayeuhkolot, Bandung, West Java 40257 Indonesia
- The Computer Vision Research Group, the Research Center for Informatics, Indonesian Institute of Sciences (LIPI) and the National Research and Innovation Agency (BRIN), Republic of Indonesia, Jl. Sangkuriang/Cisitu No.21/154D LIPI Building 20th, 3rd Floor, Bandung, West Java, 40135 Indonesia
18
Audio–visual collaborative representation learning for Dynamic Saliency Prediction. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109675]
19
Transformers and CNNs Fusion Network for Salient Object Detection. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.10.081]
20
Bao W, Yang C, Wen S, Zeng M, Guo J, Zhong J, Xu X. A Novel Adaptive Deskewing Algorithm for Document Images. Sensors (Basel) 2022; 22:7944. [PMID: 36298294] [PMCID: PMC9610931] [DOI: 10.3390/s22207944]
Abstract
Document scanning often suffers from skewing, which may seriously influence the efficiency of Optical Character Recognition (OCR). Therefore, it is necessary to correct the skewed document before document image information analysis. In this article, we propose a novel adaptive deskewing algorithm for document images, which mainly includes Skeleton Line Detection (SKLD), Piecewise Projection Profile (PPP), Morphological Clustering (MC), and an image classification method. The image type is first determined based on the image's layout features; adaptive correction is then applied to deskew the image according to its type. Our method maintains high accuracy on the Document Image Skew Estimation Contest (DISEC'2013) and PubLayNet datasets, achieving 97.6% and 80.1% accuracy, respectively. Meanwhile, extensive experiments show the superiority of the proposed algorithm.
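Illustrative note: the classic global projection-profile idea that underlies skew estimation (a simplified baseline, not the paper's SKLD/PPP/MC pipeline) can be sketched as follows: rotate a binarized page over candidate angles and keep the angle whose horizontal projection profile has the largest variance.

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary_img, angle_range=15.0, step=0.5):
    """binary_img: 2D array with text pixels as 1 and background as 0."""
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-angle_range, angle_range + step, step):
        rotated = rotate(binary_img, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)      # horizontal projection profile
        score = profile.var()              # aligned text lines give a peaky profile
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Synthetic check: two horizontal "text lines" skewed by 3 degrees
doc = np.zeros((200, 300))
doc[50:55, 20:280] = 1
doc[100:105, 20:280] = 1
skewed = rotate(doc, 3, reshape=False, order=0)
print(estimate_skew(skewed))   # ~-3; rotating `skewed` by this angle deskews it
```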
Affiliation(s)
- Wuzhida Bao
- School of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China
- Cihui Yang
- School of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China
- Shiping Wen
- Australian AI Institute, University of Technology Sydney, Sydney, NSW 2007, Australia
- Mengjie Zeng
- School of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China
- Jianyong Guo
- School of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China
- Jingting Zhong
- School of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China
- Xingmiao Xu
- School of Information Engineering, Nanchang Hangkong University, Nanchang 330063, China
21
Exploring class-agnostic pixels for scribble-supervised high-resolution salient object detection. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07915-w]
22
Jia XZ, DongYe CL, Peng YJ, Zhao WX, Liu TD. MRBENet: A Multiresolution Boundary Enhancement Network for Salient Object Detection. Comput Intell Neurosci 2022; 2022:7780756. [PMID: 36262601] [PMCID: PMC9576351] [DOI: 10.1155/2022/7780756]
Abstract
Salient Object Detection (SOD) simulates the human visual perception in locating the most attractive objects in the images. Existing methods based on convolutional neural networks have proven to be highly effective for SOD. However, in some cases, these methods cannot satisfy the need of both accurately detecting intact objects and maintaining their boundary details. In this paper, we present a Multiresolution Boundary Enhancement Network (MRBENet) that exploits edge features to optimize the location and boundary fineness of salient objects. We incorporate a deeper convolutional layer into the backbone network to extract high-level semantic features and indicate the location of salient objects. Edge features of different resolutions are extracted by a U-shaped network. We designed a Feature Fusion Module (FFM) to fuse edge features and salient features. Feature Aggregation Module (FAM) based on spatial attention performs multiscale convolutions to enhance salient features. The FFM and FAM allow the model to accurately locate salient objects and enhance boundary fineness. Extensive experiments on six benchmark datasets demonstrate that the proposed method is highly effective and improves the accuracy of salient object detection compared with state-of-the-art methods.
Affiliation(s)
- Xing-Zhao Jia
- College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
- Chang-Lei DongYe
- College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
- Yan-Jun Peng
- College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
- Wen-Xiu Zhao
- College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
- Tian-De Liu
- College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
23
Boosting Few-shot visual recognition via saliency-guided complementary attention. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.08.028]
24
Yan T, Huang X, Zhao Q. Hierarchical Superpixel Segmentation by Parallel CRTrees Labeling. IEEE Trans Image Process 2022; 31:4719-4732. [PMID: 35797313] [DOI: 10.1109/tip.2022.3187563]
Abstract
This paper proposes a hierarchical superpixel segmentation by representing an image as a hierarchy of 1-nearest neighbor (1-NN) graphs with pixels/superpixels denoting the graph vertices. The 1-NN graphs are built from the pixel/superpixel adjacent matrices to ensure connectivity. To determine the next-level superpixel hierarchy, inspired by FINCH clustering, the weakly connected components (WCCs) of the 1-NN graph are labeled as superpixels. We reveal that the WCCs of a 1-NN graph consist of a forest of cycle-root-trees (CRTrees). The forest-like structure inspires us to propose a two-stage parallel CRTrees labeling which first links the child vertices to the cycle-roots and then labels all the vertices by the cycle-roots. We also propose an inter-inner superpixel distance penalization and a Lab color lightness penalization based on the property that the distance of a CRTree decreases monotonically from the child to root vertices. Experiments show that the parallel CRTrees labeling is several times faster than recent advanced sequential and parallel connected components labeling algorithms. The proposed hierarchical superpixel segmentation has comparable performance to the best-performing state-of-the-art method ETPS on the BSDS500, NYUV2, and Fash datasets. At the same time, it can achieve 200 FPS for 480P video streams.
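Illustrative note: the core FINCH-style step the paper builds on (link every vertex to its nearest neighbor, then label the weakly connected components of the resulting 1-NN graph) can be sketched with SciPy on generic feature vectors; this is the sequential baseline, not the parallel CRTrees labeling itself.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree

def one_nn_wcc_labels(features):
    """features: (N, D) array; returns the number of clusters and one label per vertex."""
    n = len(features)
    tree = cKDTree(features)
    _, nn = tree.query(features, k=2)            # column 0 is the point itself
    rows = np.arange(n)
    adj = csr_matrix((np.ones(n), (rows, nn[:, 1])), shape=(n, n))
    # 'weak' connectivity treats the directed 1-NN links as undirected edges
    return connected_components(adj, directed=True, connection="weak")

feats = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0]])
print(one_nn_wcc_labels(feats))   # 2 components: {0, 1, 2} and {3, 4}
```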
25
García-Pulido JA, Pajares G, Dormido S. UAV Landing Platform Recognition Using Cognitive Computation Combining Geometric Analysis and Computer Vision Techniques. Cognit Comput 2022. [DOI: 10.1007/s12559-021-09962-2]
Abstract
Unmanned aerial vehicles (UAVs) are excellent tools in extensive demand. During the last phase of landing, they require additional support beyond GPS. This can be achieved through the UAV's perception system, based on its on-board camera and intelligence, with which decisions can be made as to how to land on a platform (target). A cognitive computation approach is proposed to recognize this target; the approach has been specifically designed to translate human reasoning into computational procedures by computing two detection probabilities, which are combined under fuzzy set theory for proper decision-making. The platform design is based on: (1) spectral information in the visible range, using colors that are uncommon in the UAV's operating environments (indoors and outdoors), and (2) specific figures in the foreground, which allow partial perception of each figure. We exploit color image properties of the specifically colored figures embedded on the platform, which are identified by applying image processing and pattern recognition techniques, including Euclidean Distance Smart Geometric Analysis, to identify the platform in a very efficient and reliable manner. The test strategy uses 800 images captured with a smartphone onboard a quad-rotor UAV. The results verify that the proposed method outperforms existing strategies, especially those that do not use color information. Platform recognition is also possible even with only a partial view of the target, due to image capture under adverse conditions. This demonstrates the effectiveness and robustness of the proposed cognitive computing-based perception system.
26
Zhang Q, Shi Y, Zhang X, Zhang L. Residual attentive feature learning network for salient object detection. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.06.052]
27
Wang W, Lai Q, Fu H, Shen J, Ling H, Yang R. Salient Object Detection in the Deep Learning Era: An In-Depth Survey. IEEE Trans Pattern Anal Mach Intell 2022; 44:3239-3259. [PMID: 33434124] [DOI: 10.1109/tpami.2021.3051099]
Abstract
As an essential problem in computer vision, salient object detection (SOD) has attracted an increasing amount of research attention over the years. Recent advances in SOD are predominantly led by deep learning-based solutions (named deep SOD). To enable in-depth understanding of deep SOD, in this paper, we provide a comprehensive survey covering various aspects, ranging from algorithm taxonomy to unsolved issues. In particular, we first review deep SOD algorithms from different perspectives, including network architecture, level of supervision, learning paradigm, and object-/instance-level detection. Following that, we summarize and analyze existing SOD datasets and evaluation metrics. Then, we benchmark a large group of representative SOD models, and provide detailed analyses of the comparison results. Moreover, we study the performance of SOD algorithms under different attribute settings, which has not been thoroughly explored previously, by constructing a novel SOD dataset with rich attribute annotations covering various salient object types, challenging factors, and scene categories. We further analyze, for the first time in the field, the robustness of SOD models to random input perturbations and adversarial attacks. We also look into the generalization and difficulty of existing SOD datasets. Finally, we discuss several open issues of SOD and outline future research directions. All the saliency prediction maps, our constructed dataset with annotations, and codes for evaluation are publicly available at https://github.com/wenguanwang/SODsurvey.
28
Zeng H, Li L, Cao Z, Zhang L. Grid Anchor Based Image Cropping: A New Benchmark and An Efficient Model. IEEE Trans Pattern Anal Mach Intell 2022; 44:1304-1319. [PMID: 32931429] [DOI: 10.1109/tpami.2020.3024207]
Abstract
Image cropping aims to improve the composition as well as aesthetic quality of an image by removing extraneous content from it. Most of the existing image cropping databases provide only one or several human-annotated bounding boxes as the groundtruths, which can hardly reflect the non-uniqueness and flexibility of image cropping in practice. The employed evaluation metrics such as intersection-over-union cannot reliably reflect the real performance of a cropping model, either. This work revisits the problem of image cropping, and presents a grid anchor based formulation by considering the special properties and requirements (e.g., local redundancy, content preservation, aspect ratio) of image cropping. Our formulation reduces the searching space of candidate crops from millions to no more than ninety. Consequently, a grid anchor based cropping benchmark is constructed, where all crops of each image are annotated and more reliable evaluation metrics are defined. To meet the practical demands of robust performance and high efficiency, we also design an effective and lightweight cropping model. By simultaneously considering the region of interest and region of discard, and leveraging multi-scale information, our model can robustly output visually pleasing crops for images of different scenes. With less than 2.5M parameters, our model runs at a speed of 200 FPS on one single GTX 1080Ti GPU and 12 FPS on one i7-6800K CPU. The code is available at: https://github.com/HuiZeng/Grid-Anchor-based-Image-Cropping-Pytorch.
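Illustrative note: grid-anchor style candidate generation can be sketched by restricting crop corners to a coarse lattice and filtering by area and aspect-ratio constraints, which is how the search space collapses from millions of free-form boxes to a small set; the grid size and thresholds below are illustrative, not the paper's exact settings.

```python
from itertools import product

def grid_anchor_candidates(width, height, grid=4, min_area_ratio=0.5,
                           aspect_range=(0.5, 2.0)):
    """Return candidate crops (x1, y1, x2, y2) whose corners lie on a (grid+1)^2 lattice."""
    xs = [round(i * width / grid) for i in range(grid + 1)]
    ys = [round(j * height / grid) for j in range(grid + 1)]
    candidates = []
    for x1, x2 in product(xs, xs):
        for y1, y2 in product(ys, ys):
            w, h = x2 - x1, y2 - y1
            if w <= 0 or h <= 0:
                continue
            area_ok = (w * h) / (width * height) >= min_area_ratio  # content preservation
            aspect_ok = aspect_range[0] <= w / h <= aspect_range[1]
            if area_ok and aspect_ok:
                candidates.append((x1, y1, x2, y2))
    return candidates

crops = grid_anchor_candidates(640, 480)
print(len(crops))   # a few dozen candidates instead of millions of boxes
```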
29
Sun B, Ren Y, Lu X. Semisupervised Consistent Projection Metric Learning for Person Reidentification. IEEE Trans Cybern 2022; 52:738-747. [PMID: 32310811] [DOI: 10.1109/tcyb.2020.2979262]
Abstract
Person reidentification is a hot topic in the computer vision field. Much effort has been devoted to modeling a discriminative distance metric. However, existing metric-learning-based methods lack generalization. In this article, the poor generalization of the metric model is attributed to a biased estimation problem in which the independent and identically distributed hypothesis is not valid. A verification experiment shows that there is a sharp difference between the training and test samples in the metric subspace. A semisupervised consistent projection metric-learning method is proposed to ease the biased estimation problem by learning a consistency-constrained metric subspace in which the identified pairs are forced to follow the distribution of the positive training pairs. First, a semisupervised method is proposed to generate potential matching pairs from the k-nearest neighbors of test samples. The potential matching pairs are used to estimate the distribution center of distances for the positive test pairs. Second, the metric subspace is improved by forcing this estimation to be close to the center of the positive training pairs. Finally, extensive experiments are conducted on five datasets, and the results demonstrate that the proposed method reaches the best performance, especially on the rank-1 identification rate.
30
Ye M, Shen J, Zhang X, Yuen PC, Chang SF. Augmentation Invariant and Instance Spreading Feature for Softmax Embedding. IEEE Trans Pattern Anal Mach Intell 2022; 44:924-939. [PMID: 32750841] [DOI: 10.1109/tpami.2020.3013379]
Abstract
Deep embedding learning plays a key role in learning discriminative feature representations, where the visually similar samples are pulled closer and dissimilar samples are pushed away in the low-dimensional embedding space. This paper studies the unsupervised embedding learning problem by learning such a representation without using any category labels. This task faces two primary challenges: mining reliable positive supervision from highly similar fine-grained classes, and generalizing to unseen testing categories. To approximate the positive concentration and negative separation properties in category-wise supervised learning, we introduce a data augmentation invariant and instance spreading feature using instance-wise supervision. We also design two novel domain-agnostic augmentation strategies to further extend the supervision in feature space, which simulate large-batch training using a small batch size and the augmented features. To learn such a representation, we propose a novel instance-wise softmax embedding, which directly performs the optimization over the augmented instance features with binary discrimination softmax encoding. It significantly accelerates the learning speed with much higher accuracy than existing methods, under both seen and unseen testing categories. The unsupervised embedding performs well even without a pre-trained network, including over samples from fine-grained categories. We also develop a variant using category-wise supervision, namely category-wise softmax embedding, which achieves competitive performance with the state of the art, without using any auxiliary information or restrictive sample mining.
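Illustrative note: the instance-wise softmax idea can be sketched at batch level, where each augmented view must be classified as its own source instance; this simplified variant omits the paper's memory and augmentation strategies and uses hypothetical names throughout.

```python
import torch
import torch.nn.functional as F

def instance_softmax_loss(feat, feat_aug, temperature=0.1):
    """feat, feat_aug: (B, D) embeddings of original samples and their augmented views."""
    f = F.normalize(feat, dim=1)
    fa = F.normalize(feat_aug, dim=1)
    logits = fa @ f.t() / temperature           # (B, B) similarity of each view to every instance
    targets = torch.arange(feat.size(0), device=feat.device)
    return F.cross_entropy(logits, targets)     # each augmented view must match its own instance

feat = torch.randn(8, 128, requires_grad=True)
feat_aug = feat + 0.05 * torch.randn_like(feat)   # stand-in for an augmented embedding
loss = instance_softmax_loss(feat, feat_aug)
loss.backward()
print(loss.item())
```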
31
Wang J, Yang Q, Yang S, Chai X, Zhang W. Dual-path Processing Network for High-resolution Salient Object Detection. Appl Intell 2022. [DOI: 10.1007/s10489-021-02971-6]
32
Review of Visual Saliency Prediction: Development Process from Neurobiological Basis to Deep Models. Appl Sci (Basel) 2021. [DOI: 10.3390/app12010309]
Abstract
The human attention mechanism can be understood and simulated by closely associating the saliency prediction task to neuroscience and psychology. Furthermore, saliency prediction is widely used in computer vision and interdisciplinary subjects. In recent years, with the rapid development of deep learning, deep models have made amazing achievements in saliency prediction. Deep learning models can automatically learn features, thus solving many drawbacks of the classic models, such as handcrafted features and task settings, among others. Nevertheless, the deep models still have some limitations, for example in tasks involving multi-modality and semantic understanding. This study focuses on summarizing the relevant achievements in the field of saliency prediction, including the early neurological and psychological mechanisms and the guiding role of classic models, followed by the development process and data comparison of classic and deep saliency prediction models. This study also discusses the relationship between the model and human vision, as well as the factors that cause the semantic gaps, the influences of attention in cognitive research, the limitations of the saliency model, and the emerging applications, to provide new saliency predictions for follow-up work and the necessary help and advice.
33
Jiang Y, Xu S, Fan H, Qian J, Luo W, Zhen S, Tao Y, Sun J, Lin H. ALA-Net: Adaptive Lesion-Aware Attention Network for 3D Colorectal Tumor Segmentation. IEEE Trans Med Imaging 2021; 40:3627-3640. [PMID: 34197319] [DOI: 10.1109/tmi.2021.3093982]
Abstract
Accurate and reliable segmentation of colorectal tumors and surrounding colorectal tissues on 3D magnetic resonance images has critical importance in preoperative prediction, staging, and radiotherapy. Previous works simply combine multilevel features without aggregating representative semantic information and without compensating for the loss of spatial information caused by down-sampling. Therefore, they are vulnerable to noise from complex backgrounds and suffer from misclassification and target incompleteness-related failures. In this paper, we address these limitations with a novel adaptive lesion-aware attention network (ALA-Net) which explicitly integrates useful contextual information with spatial details and captures richer feature dependencies based on 3D attention mechanisms. The model comprises two parallel encoding paths. One of these is designed to explore global contextual features and enlarge the receptive field using a recurrent strategy. The other captures sharper object boundaries and the details of small objects that are lost in repeated down-sampling layers. Our lesion-aware attention module adaptively captures long-range semantic dependencies and highlights the most discriminative features, improving semantic consistency and completeness. Furthermore, we introduce a prediction aggregation module to combine multiscale feature maps and to further filter out irrelevant information for precise voxel-wise prediction. Experimental results show that ALA-Net outperforms state-of-the-art methods and inherently generalizes well to other 3D medical images segmentation tasks, providing multiple benefits in terms of target completeness, reduction of false positives, and accurate detection of ambiguous lesion regions.
34
35
Xiao N, Zhang L, Xu X, Guo T, Ma H. Label Disentangled Analysis for unsupervised visual domain adaptation. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107309]
36
Learning spatial-channel regularization jointly with correlation filter for visual tracking. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.04.146]
37
38
Lin G, Zhao S, Shen J. Video person re-identification with global statistic pooling and self-attention distillation. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.05.111]
39
Multi-level dictionary learning for fine-grained images categorization with attention model. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.07.147]
40
Salient object segmentation for image composition: A case study of group dinner photo. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.06.127]
41
42
43
Wang W, Shen J, Lu X, Hoi SCH, Ling H. Paying Attention to Video Object Pattern Understanding. IEEE Trans Pattern Anal Mach Intell 2021; 43:2413-2428. [PMID: 31940522] [DOI: 10.1109/tpami.2020.2966453]
Abstract
This paper conducts a systematic study of the role of visual attention in video object pattern understanding. By elaborately annotating three popular video segmentation datasets (DAVIS 16, Youtube-Objects, and SegTrack V2) with dynamic eye-tracking data in the unsupervised video object segmentation (UVOS) setting, we quantitatively verified, for the first time, the high consistency of visual attention behavior among human observers, and found a strong correlation between human attention and explicit primary object judgments during dynamic, task-driven viewing. These novel observations provide in-depth insight into the underlying rationale behind video object patterns. Inspired by these findings, we decouple UVOS into two sub-tasks: UVOS-driven Dynamic Visual Attention Prediction (DVAP) in the spatiotemporal domain, and Attention-Guided Object Segmentation (AGOS) in the spatial domain. Our UVOS solution enjoys three major advantages: 1) modular training without expensive video segmentation annotations; instead, more affordable dynamic fixation data are used to train the initial video attention module, and existing fixation-segmentation paired static/image data are used to train the subsequent segmentation module; 2) comprehensive foreground understanding through multi-source learning; and 3) additional interpretability from the biologically inspired and assessable attention. Experiments on four popular benchmarks show that, even without expensive video object mask annotations, our model achieves compelling performance compared with state-of-the-art methods and enjoys fast processing speed (10 fps on a single GPU). Our collected eye-tracking data and algorithm implementations have been made publicly available at https://github.com/wenguanwang/AGS.
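The decoupling into attention prediction (DVAP) and attention-guided segmentation (AGOS) can be illustrated with two cascaded modules; the tiny networks below are deliberately minimal stand-ins under assumed shapes, meant only to show the data flow, not the published AGS architecture.

```python
import torch
import torch.nn as nn

class DynamicAttentionNet(nn.Module):
    """Stage 1 (DVAP): predicts a per-frame fixation map from a short clip.
    Trainable from eye-tracking data alone, without segmentation masks."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, clip):           # clip: (B, 3, T, H, W)
        return self.net(clip)          # (B, 1, T, H, W) fixation probability

class AttentionGuidedSegNet(nn.Module):
    """Stage 2 (AGOS): segments the primary object in a single frame,
    conditioned on the predicted attention map (frame + map = 4 channels)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, frame, attention):
        return self.net(torch.cat([frame, attention], dim=1))

clip = torch.randn(1, 3, 5, 64, 64)
attention = DynamicAttentionNet()(clip)
mask = AttentionGuidedSegNet()(clip[:, :, 2], attention[:, :, 2])
print(attention.shape, mask.shape)     # (1, 1, 5, 64, 64) and (1, 1, 64, 64)
```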
Collapse
|
44
|
Jiang PT, Zhang CB, Hou Q, Cheng MM, Wei Y. LayerCAM: Exploring Hierarchical Class Activation Maps for Localization. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:5875-5888. [PMID: 34156941 DOI: 10.1109/tip.2021.3089943] [Citation(s) in RCA: 90] [Impact Index Per Article: 22.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Class activation maps, generated from the final convolutional layer of a CNN, can highlight discriminative object regions for the class of interest, and these discovered object regions have been widely used for weakly-supervised tasks. However, due to the small spatial resolution of the final convolutional layer, such class activation maps often locate only coarse regions of the target objects, limiting the performance of weakly-supervised tasks that need pixel-accurate object locations. We therefore aim to generate more fine-grained object localization information from the class activation maps to locate the target objects more accurately. In this paper, by rethinking the relationships between the feature maps and their corresponding gradients, we propose a simple yet effective method, called LayerCAM. It can produce reliable class activation maps for different layers of a CNN. This property enables us to collect object localization information from coarse (rough spatial localization) to fine (precise fine-grained details) levels. We further integrate them into a high-quality class activation map in which the object-related pixels are better highlighted. To evaluate the quality of the class activation maps produced by LayerCAM, we apply them to weakly-supervised object localization and semantic segmentation. Experiments demonstrate that the class activation maps generated by our method are more effective and reliable than those produced by existing attention methods. The code will be made publicly available.
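To make the per-layer weighting idea concrete, here is a minimal PyTorch sketch: each hooked layer's activations are weighted element-wise by the ReLU of their gradients, summed over channels, upsampled, and fused. The choice of VGG-16 stages, the min-max normalization, and the element-wise-max fusion are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def layer_cams(model, layers, image, class_idx):
    """Weight each hooked layer's activations by the ReLU of their gradients,
    sum over channels, upsample, and fuse the per-layer maps."""
    acts, grads, hooks = {}, {}, []
    for name, module in layers.items():
        hooks.append(module.register_forward_hook(
            lambda m, i, o, n=name: acts.__setitem__(n, o)))
        hooks.append(module.register_full_backward_hook(
            lambda m, gi, go, n=name: grads.__setitem__(n, go[0])))

    model.zero_grad()
    model(image)[0, class_idx].backward()            # gradient of the class score
    for h in hooks:
        h.remove()

    cams = []
    for name in layers:
        w = F.relu(grads[name])                      # element-wise positive gradients
        cam = F.relu((w * acts[name]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        cams.append(cam)
    return torch.stack(cams).max(dim=0).values       # simple coarse-to-fine fusion

model = models.vgg16(weights=None).eval()
stages = {"stage3": model.features[16], "stage4": model.features[23],
          "stage5": model.features[30]}
image = torch.randn(1, 3, 224, 224)
print(layer_cams(model, stages, image, class_idx=243).shape)  # (1, 1, 224, 224)
```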
Collapse
|
45
|
Zhang D, Zheng Z, Li M, Liu R. CSART: Channel and spatial attention-guided residual learning for real-time object tracking. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.11.046] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
47
|
Yang F, Li X, Shen J. MSB-FCN: Multi-Scale Bidirectional FCN for Object Skeleton Extraction. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:2301-2312. [PMID: 33226943 DOI: 10.1109/tip.2020.3038483] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
The performance of state-of-the-art object skeleton detection (OSD) methods has been greatly boosted by Convolutional Neural Networks (CNNs). However, most existing CNN-based OSD methods rely on a 'skip-layer' structure where low-level and high-level features are combined to gather multi-level contextual information. Unfortunately, as shallow features tend to be noisy and lack semantic knowledge, they introduce errors and inaccuracy. Therefore, in order to improve the accuracy of object skeleton detection, we propose a novel network architecture, the Multi-Scale Bidirectional Fully Convolutional Network (MSB-FCN), to better gather and enhance multi-scale high-level contextual information. The advantage is that only deep features are used to construct multi-scale feature representations, along with a bidirectional structure for better capturing contextual knowledge. This enables the proposed MSB-FCN to learn semantic-level information from different sub-regions. Moreover, we introduce dense connections into the bidirectional structure to ensure that the learning process at each scale can directly encode information from all other scales. An attention pyramid is also integrated into our MSB-FCN to dynamically control information propagation and reduce unreliable features. Extensive experiments on various benchmarks demonstrate that the proposed MSB-FCN achieves significant improvements over state-of-the-art algorithms.
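A toy sketch of the core idea of building several scales from deep features only and fusing them in both directions with accumulated (densely shared) context is given below; the pooling scheme, scale count, and fusion by addition are assumptions for illustration rather than the published MSB-FCN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalDeepFusion(nn.Module):
    """Build several scales from a single deep feature map, then fuse them
    coarse-to-fine and fine-to-coarse so each scale sees all the others."""
    def __init__(self, channels=256, num_scales=4):
        super().__init__()
        self.num_scales = num_scales
        self.fuse_up = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_scales)])
        self.fuse_down = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_scales)])

    def forward(self, deep_feat):
        # multi-scale copies of the *deep* feature map (no noisy shallow features)
        scales = [F.adaptive_avg_pool2d(
            deep_feat, (deep_feat.shape[-2] // 2 ** i, deep_feat.shape[-1] // 2 ** i))
            for i in range(self.num_scales)]
        size = deep_feat.shape[-2:]
        up = [F.interpolate(s, size=size, mode="bilinear", align_corners=False)
              for s in scales]
        # coarse-to-fine pass: each step also carries the sum of all coarser steps
        c2f, acc = [], 0
        for f, conv in zip(reversed(up), self.fuse_up):
            acc = conv(f + acc)
            c2f.append(acc)
        # fine-to-coarse pass over the already-enriched maps
        out, acc = 0, 0
        for f, conv in zip(reversed(c2f), self.fuse_down):
            acc = conv(f + acc)
            out = out + acc
        return out / self.num_scales

feat = torch.randn(1, 256, 32, 32)                   # deepest backbone features
print(BidirectionalDeepFusion()(feat).shape)         # torch.Size([1, 256, 32, 32])
```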
Collapse
|
48
|
Khalili A, Bouchachia H. An Information Theory Approach to Aesthetic Assessment of Visual Patterns. ENTROPY 2021; 23:e23020153. [PMID: 33513789 PMCID: PMC7912568 DOI: 10.3390/e23020153] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/05/2020] [Revised: 01/08/2021] [Accepted: 01/17/2021] [Indexed: 12/03/2022]
Abstract
The question of beauty has inspired philosophers and scientists for centuries. Today, the study of aesthetics is an active research topic in fields as diverse as computer science, neuroscience, and psychology. Measuring the aesthetic appeal of images is beneficial for many applications. In this paper, we study the aesthetic assessment of simple visual patterns. The proposed approach suggests that, for the same amount of energy, aesthetically appealing patterns are more likely to deliver a higher amount of information over multiple levels than less appealing patterns. The approach is evaluated on two datasets; the results show that it classifies aesthetically appealing patterns more accurately than related approaches that use different complexity measures.
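One plausible reading of the multi-level information idea is sketched below: the Shannon entropy of a binary pattern is accumulated over progressively coarser views and normalized by the number of active cells as a crude stand-in for "energy". This is an illustrative toy under those assumptions, not the authors' exact measure.

```python
import numpy as np

def shannon_entropy(cells):
    """Entropy (in bits) of the empirical distribution of cell values."""
    _, counts = np.unique(cells, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def multilevel_information(pattern, levels=3):
    """Accumulate entropy over progressively coarser views of the pattern and
    normalize by the number of active cells (a crude stand-in for 'energy')."""
    total, p = 0.0, pattern.astype(float)
    for _ in range(levels):
        total += shannon_entropy(p.round().astype(int))
        h, w = p.shape
        p = p[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return total / max(int(pattern.sum()), 1)

rng = np.random.default_rng(0)
noise = rng.integers(0, 2, (16, 16))          # unstructured pattern
checker = np.indices((16, 16)).sum(0) % 2     # regular pattern with the same density
print(multilevel_information(noise), multilevel_information(checker))
```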
Collapse
|
49
|
Wang W, Shen J, Xie J, Cheng MM, Ling H, Borji A. Revisiting Video Saliency Prediction in the Deep Learning Era. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2021; 43:220-237. [PMID: 31247542 DOI: 10.1109/tpami.2019.2924417] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, relatively less effort has been spent on understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during dynamic scene free-viewing, addressing a long-standing need in this field. DHF1K consists of 1K high-quality, elaborately selected video sequences annotated by 17 observers using an eye-tracker device. The videos span a wide range of scenes, motions, object types, and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), that augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF sports. An attribute-based analysis of previous saliency models and cross-dataset generalization are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms other contenders and has a fast processing speed (40 fps using a single GPU). Our code and all the results are available at https://github.com/wenguanwang/DHF1K.
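A heavily simplified sketch of a CNN-LSTM saliency predictor with a supervised static-attention branch, in the spirit described above, is shown below; the VGG-16 truncation, channel sizes, and the residual way the attention map modulates features are assumptions for illustration, not the published ACLNet.

```python
import torch
import torch.nn as nn
from torchvision import models

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, 3, padding=1)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g
        h = o * c.tanh()
        return h, c

class AttentiveSaliencyNet(nn.Module):
    """Per-frame CNN features, modulated by a static attention map that can be
    supervised with image fixation data, then aggregated over time by a ConvLSTM."""
    def __init__(self, hid_ch=128):
        super().__init__()
        self.backbone = models.vgg16(weights=None).features[:23]  # up to conv4_3
        self.attention = nn.Conv2d(512, 1, 1)                     # static saliency prior
        self.lstm = ConvLSTMCell(512, hid_ch)
        self.readout = nn.Conv2d(hid_ch, 1, 1)

    def forward(self, clip):                       # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        h = c = None
        maps, priors = [], []
        for step in range(t):
            feat = self.backbone(clip[:, step])
            att = torch.sigmoid(self.attention(feat))
            priors.append(att)                     # supervised with static fixations
            feat = feat * (1.0 + att)              # residual attention modulation
            if h is None:
                h = feat.new_zeros(b, self.lstm.hid_ch, *feat.shape[-2:])
                c = torch.zeros_like(h)
            h, c = self.lstm(feat, h, c)
            maps.append(torch.sigmoid(self.readout(h)))
        return torch.stack(maps, dim=1), torch.stack(priors, dim=1)

clip = torch.randn(2, 4, 3, 224, 224)
saliency, static_prior = AttentiveSaliencyNet()(clip)
print(saliency.shape, static_prior.shape)          # (2, 4, 1, 28, 28) for both
```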
Collapse
|
50
|
Li Z, Lang C, Liew JH, Li Y, Hou Q, Feng J. Cross-Layer Feature Pyramid Network for Salient Object Detection. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2021; 30:4587-4598. [PMID: 33872147 DOI: 10.1109/tip.2021.3072811] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/22/2023]
Abstract
Feature pyramid network (FPN) based models, which fuse semantics and salient details in a progressive manner, have proven highly effective in salient object detection. However, these models often generate saliency maps with incomplete object structures or unclear object boundaries, because indirect information propagation among distant layers makes such a fusion structure less effective. In this work, we propose a novel Cross-layer Feature Pyramid Network (CFPN), in which direct cross-layer communication is enabled to improve progressive fusion in salient object detection. Specifically, the proposed network first aggregates multi-scale features from different layers into feature maps that have access to both high- and low-level information. It then distributes the aggregated features to all the involved layers to gain access to richer context. In this way, the distributed features of each layer contain both semantics and salient details from all other layers simultaneously, and suffer reduced loss of important information during progressive feature fusion. Finally, CFPN fuses the distributed features of each layer stage by stage; the high-level features that contain context useful for locating complete objects are preserved until the final output layer, and the low-level features that contain spatial structure details are embedded into each layer. Extensive experimental results on six widely used salient object detection benchmarks and with three popular backbones clearly demonstrate that CFPN can accurately locate fairly complete salient regions and effectively segment object boundaries.
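The aggregate-then-distribute scheme can be sketched compactly; the PyTorch module below is only a rough illustration under assumed channel widths, bilinear resizing, and 1x1/3x3 convolutions, not the published CFPN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFusion(nn.Module):
    """Aggregate multi-scale features into one map, then redistribute that
    aggregate back to every level so each layer sees all the others."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), mid=256):
        super().__init__()
        self.squeeze = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in in_channels])
        self.aggregate = nn.Conv2d(mid * len(in_channels), mid, 1)
        self.refine = nn.ModuleList(
            [nn.Conv2d(mid * 2, mid, 3, padding=1) for _ in in_channels])

    def forward(self, feats):
        # 1) bring every level to a common channel width
        feats = [s(f) for s, f in zip(self.squeeze, feats)]
        # 2) aggregate: resize all levels to the finest resolution and fuse
        size = feats[0].shape[-2:]
        agg = self.aggregate(torch.cat(
            [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
             for f in feats], dim=1))
        # 3) distribute: push the aggregate back to every level's resolution
        out = []
        for f, refine in zip(feats, self.refine):
            a = F.interpolate(agg, size=f.shape[-2:], mode="bilinear",
                              align_corners=False)
            out.append(refine(torch.cat([f, a], dim=1)))
        return out  # per-level features that now carry cross-layer context

feats = [torch.randn(1, c, s, s) for c, s in
         [(256, 64), (512, 32), (1024, 16), (2048, 8)]]
print([f.shape for f in CrossLayerFusion()(feats)])
```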
Collapse
|