1. Wang K, Chen Y, Bo D, Wang S. A novel multi-user collaborative cognitive radio spectrum sensing model: Based on a CNN-LSTM model. PLoS One 2025; 20:e0316291. PMID: 39813223; PMCID: PMC11734992; DOI: 10.1371/journal.pone.0316291.
Abstract
Cognitive Radio (CR) technology enables wireless devices to learn about their surrounding spectrum environment through sensing capabilities, thereby facilitating efficient spectrum utilization without interfering with the normal operation of licensed users. This study aims to enhance spectrum sensing in multi-user cooperative cognitive radio systems by leveraging a hybrid model that combines Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. A novel multi-user cooperative spectrum sensing model is developed, utilizing CNN's local feature extraction capability and LSTM's advantage in handling sequential data to optimize sensing accuracy and efficiency. Furthermore, a multi-head self-attention mechanism is incorporated to improve information flow, enhancing the model's adaptability and robustness in dynamic and complex environments. Simulation experiments were conducted to quantitatively evaluate the performance of the proposed model. The results demonstrate that the CNN-LSTM model achieves low sensing error rates across various numbers of secondary users (16, 24, 32, 40, 48), with a particularly low sensing error of 9.9658% under the 32-user configuration. Additionally, when comparing the sensing errors of different deep learning models, the proposed model consistently outperformed others, showing a 12% lower sensing error under low-power conditions (100 mW). This study successfully develops a CNN-LSTM-based cooperative spectrum sensing model for multi-user cognitive radio systems, significantly improving sensing accuracy and efficiency. By integrating CNN and LSTM technologies, the model not only enhances sensing performance but also improves the handling of long-term dependencies in time-series data, offering a novel technical approach and theoretical support for cognitive radio research. Moreover, the introduction of the multi-head self-attention mechanism further optimizes the model's adaptability to complex environments, demonstrating significant potential for practical applications.
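As a point of reference for the architecture described in this abstract, the sketch below shows one plausible way to chain a CNN front end, an LSTM, and multi-head self-attention for cooperative sensing decisions. It is not the authors' model; the input shape (per-user energy samples over time), layer widths, and two-class output head are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): CNN extracts local features from
# each sensing snapshot, an LSTM models the temporal sequence, and multi-head
# self-attention re-weights time steps before the occupancy decision.
import torch
import torch.nn as nn

class CnnLstmSensing(nn.Module):
    def __init__(self, n_users=32, hidden=64, heads=4):
        super().__init__()
        # 1-D convolution over the secondary-user dimension of each snapshot
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),
        )
        self.lstm = nn.LSTM(input_size=32 * 8, hidden_size=hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=heads, batch_first=True)
        self.head = nn.Linear(hidden, 2)      # channel occupied / idle

    def forward(self, x):                     # x: (batch, time, n_users) energy samples
        b, t, u = x.shape
        f = self.cnn(x.reshape(b * t, 1, u)).reshape(b, t, -1)
        h, _ = self.lstm(f)                   # temporal modelling
        a, _ = self.attn(h, h, h)             # multi-head self-attention over time
        return self.head(a[:, -1])            # decision from the last time step

logits = CnnLstmSensing()(torch.randn(4, 10, 32))
print(logits.shape)                           # torch.Size([4, 2])
```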
Affiliation(s)
- Kai Wang, School of Electronic Information Engineering, Inner Mongolia University, Hohhot, Inner Mongolia, China
- Yangyang Chen, School of Electronic Information Engineering, Inner Mongolia University, Hohhot, Inner Mongolia, China
- Dan Bo, School of Electronic Information Engineering, Inner Mongolia University, Hohhot, Inner Mongolia, China
- Shubin Wang, School of Electronic Information Engineering, Inner Mongolia University, Hohhot, Inner Mongolia, China
2. Merkle P, Winken M, Pfaff J, Schwarz H, Marpe D, Wiegand T. Spatio-Temporal Convolutional Neural Network for Enhanced Inter Prediction in Video Coding. IEEE Transactions on Image Processing 2024; 33:4738-4752. PMID: 39186411; DOI: 10.1109/tip.2024.3446228.
Abstract
This paper presents a convolutional neural network (CNN)-based enhancement to inter prediction in Versatile Video Coding (VVC). Our approach aims at improving the prediction signal of inter blocks with a residual CNN that incorporates spatial and temporal reference samples. It is motivated by the theoretical consideration that neural network-based methods have a higher degree of signal adaptivity than conventional signal processing methods and that spatially neighboring reference samples have the potential to improve the prediction signal by adapting it to the reconstructed signal in its immediate vicinity. We show that adding a polyphase decomposition stage to the CNN results in a significantly better trade-off between computational complexity and coding performance. Incorporating spatial reference samples in the inter prediction process is challenging: The fact that the input of the CNN for one block may depend on the output of the CNN for preceding blocks prohibits parallel processing. We solve this by introducing a novel signal plane that contains specifically constrained reference samples, enabling parallel decoding while maintaining a high compression efficiency. Overall, experimental results show average bit rate savings of 4.07% and 3.47% for the random access (RA) and low-delay B (LB) configurations of the JVET common test conditions, respectively.
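For intuition about the polyphase decomposition stage mentioned above, the following hedged sketch (not the paper's network) uses PixelUnshuffle/PixelShuffle so a residual CNN operates on sub-sampled phases of the prediction block together with neighbouring spatial reference samples; channel counts and the 2x decomposition factor are assumptions.

```python
# Minimal sketch: a residual CNN refines the inter-prediction signal while
# running its convolutions on a polyphase (sub-sampled) decomposition.
import torch
import torch.nn as nn

class PolyphaseResidualCnn(nn.Module):
    def __init__(self, channels=2, feat=32):
        super().__init__()
        self.down = nn.PixelUnshuffle(2)            # polyphase decomposition
        self.body = nn.Sequential(
            nn.Conv2d(channels * 4, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, channels * 4, 3, padding=1),
        )
        self.up = nn.PixelShuffle(2)

    def forward(self, pred, spatial_ref):
        # pred: motion-compensated prediction; spatial_ref: neighbouring samples
        x = torch.cat([pred, spatial_ref], dim=1)   # (B, 2, H, W)
        residual = self.up(self.body(self.down(x)))
        return pred + residual[:, :1]               # refined prediction signal

refined = PolyphaseResidualCnn()(torch.randn(1, 1, 32, 32), torch.randn(1, 1, 32, 32))
print(refined.shape)                                # torch.Size([1, 1, 32, 32])
```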
3. Banerjee S, Mandal S, Jesubalan NG, Jain R, Rathore AS. NIR spectroscopy-CNN-enabled chemometrics for multianalyte monitoring in microbial fermentation. Biotechnol Bioeng 2024; 121:1803-1819. PMID: 38390805; DOI: 10.1002/bit.28681.
Abstract
As the biopharmaceutical industry looks to implement Industry 4.0, the need for rapid and robust analytical characterization of analytes has become a pressing priority. Spectroscopic tools, like near-infrared (NIR) spectroscopy, are finding increasing use for real-time quantitative analysis. Yet detection of multiple low-concentration analytes in microbial and mammalian cell cultures remains an ongoing challenge, requiring the selection of carefully calibrated, resilient chemometrics for each analyte. The convolutional neural network (CNN) is a powerful tool for processing complex data, making it a potential approach for automatic multivariate spectral processing. This work proposes an inception module-based two-dimensional (2D) CNN approach (I-CNN) for calibrating multiple analytes using NIR spectral data. The I-CNN model, coupled with orthogonal partial least squares (PLS) preprocessing, converts the NIR spectral data into a 2D data matrix, after which the critical features are extracted, leading to model development for multiple analytes. Escherichia coli fermentation broth was taken as a case study, where calibration models were developed for 23 analytes, including 20 amino acids, glucose, lactose, and acetate. The I-CNN results showed an average prediction R2 of 0.90, an external-validation R2 of 0.86, and significantly lower root mean square errors of prediction (∼0.52) compared with conventional regression models such as PLS. Preprocessing steps were applied to the I-CNN models to evaluate any improvement in prediction performance. Finally, the model reliability was assessed via real-time process monitoring and comparison with offline analytics. The proposed I-CNN method is systematic and novel in extracting distinctive spectral features from a multianalyte bioprocess data set and could be adapted to other complex cell culture systems requiring rapid quantification using spectroscopy.
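To illustrate the inception-module idea the I-CNN builds on, here is a minimal, hypothetical block that runs parallel convolutions with different kernel sizes over the 2D spectral matrix and concatenates the branches. Branch widths and the input size are assumptions; a regression head over the pooled features would then predict the analyte concentrations.

```python
# Hypothetical inception-style block over a 2-D spectral matrix (not the
# authors' I-CNN): parallel 1x1 / 3x3 / 5x5 / pooled branches, concatenated.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch=1, branch=8):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch, kernel_size=5, padding=2)
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, branch, kernel_size=1))

    def forward(self, x):                   # x: (batch, 1, H, W) spectral matrix
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

features = InceptionBlock()(torch.randn(2, 1, 32, 32))
print(features.shape)                       # torch.Size([2, 32, 32, 32])
```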
Affiliation(s)
- Shantanu Banerjee, Department of Chemical Engineering, Indian Institute of Technology Delhi, New Delhi, Delhi, India
- Shyamapada Mandal, Department of Chemical Engineering, Indian Institute of Technology Delhi, New Delhi, Delhi, India
- Naveen G Jesubalan, School of Interdisciplinary Research, Indian Institute of Technology Delhi, New Delhi, Delhi, India
- Rijul Jain, Department of Chemical Engineering, Indian Institute of Technology Delhi, New Delhi, Delhi, India
- Anurag S Rathore, Department of Chemical Engineering and School of Interdisciplinary Research, Indian Institute of Technology Delhi, New Delhi, Delhi, India
4. Choi YJ, Lee YW, Kim J, Jeong SY, Choi JS, Kim BG. Attention-Based Bi-Prediction Network for Versatile Video Coding (VVC) over 5G Network. Sensors (Basel) 2023; 23:2631. PMID: 36904838; PMCID: PMC10007134; DOI: 10.3390/s23052631.
Abstract
As the demands of various network-dependent services such as Internet of Things (IoT) applications, autonomous driving, and augmented and virtual reality (AR/VR) increase, the fifth-generation (5G) network is expected to become a key communication technology. The latest video coding standard, versatile video coding (VVC), can contribute to providing high-quality services by achieving superior compression performance. In video coding, inter bi-prediction serves to improve the coding efficiency significantly by producing a precise fused prediction block. Although block-wise methods, such as bi-prediction with CU-level weight (BCW), are applied in VVC, it is still difficult for the linear fusion-based strategy to represent diverse pixel variations inside a block. In addition, a pixel-wise method called bi-directional optical flow (BDOF) has been proposed to refine the bi-prediction block. However, the non-linear optical flow equation in BDOF mode is applied under assumptions, so this method is still unable to accurately compensate for various kinds of bi-prediction blocks. In this paper, we propose an attention-based bi-prediction network (ABPN) to substitute for the existing bi-prediction methods as a whole. The proposed ABPN is designed to learn efficient representations of the fused features by utilizing an attention mechanism. Furthermore, a knowledge distillation (KD)-based approach is employed to compress the size of the proposed network while keeping its output comparable to that of the large model. The proposed ABPN is integrated into the VTM-11.0 NNVC-1.0 standard reference software. When compared with the VTM anchor, it is verified that the BD-rate reduction of the lightweight ABPN can be up to 5.89% and 4.91% on the Y component under random access (RA) and low delay B (LDB) configurations, respectively.
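The following sketch conveys the general idea of attention-weighted bi-prediction fusion, replacing a fixed linear blend with a learned per-pixel weight map. It is a simplified stand-in for the ABPN, with all layer sizes assumed.

```python
# Illustrative attention-weighted bi-prediction: a small CNN predicts a
# per-pixel weight map used to fuse the two motion-compensated reference blocks.
import torch
import torch.nn as nn

class AttentionBiPrediction(nn.Module):
    def __init__(self, feat=16):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(2, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 1, 3, padding=1), nn.Sigmoid(),  # weight in [0, 1]
        )

    def forward(self, ref0, ref1):            # two motion-compensated blocks
        w = self.attn(torch.cat([ref0, ref1], dim=1))
        return w * ref0 + (1.0 - w) * ref1    # pixel-wise fused prediction

fused = AttentionBiPrediction()(torch.randn(1, 1, 16, 16), torch.randn(1, 1, 16, 16))
print(fused.shape)                            # torch.Size([1, 1, 16, 16])
```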
Affiliation(s)
- Young-Ju Choi, Department of IT Engineering, Sookmyung Women’s University, Seoul 04310, Republic of Korea
- Young-Woon Lee, Department of Computer Engineering, Sunmoon University, Asan 31460, Republic of Korea
- Jongho Kim, Media Coding Research Section, Electronics and Telecommunications Research Institute, Daejeon 34129, Republic of Korea
- Se Yoon Jeong, Media Coding Research Section, Electronics and Telecommunications Research Institute, Daejeon 34129, Republic of Korea
- Jin Soo Choi, Media Coding Research Section, Electronics and Telecommunications Research Institute, Daejeon 34129, Republic of Korea
- Byung-Gyu Kim, Department of IT Engineering, Sookmyung Women’s University, Seoul 04310, Republic of Korea
5. Paul S, Norkin A, Bovik AC. Self-Supervised Learning of Perceptually Optimized Block Motion Estimates for Video Compression. IEEE Transactions on Image Processing 2023; 32:617-630. PMID: 37015500; DOI: 10.1109/tip.2022.3231082.
Abstract
Block based motion estimation is integral to inter prediction processes performed in hybrid video codecs. Prevalent block matching based methods that are used to compute block motion vectors (MVs) rely on computationally intensive search procedures. They also suffer from the aperture problem, which tends to worsen as the block size is reduced. Moreover, the block matching criteria used in typical codecs do not account for the resulting levels of perceptual quality of the motion compensated pictures that are created upon decoding. Towards achieving the elusive goal of perceptually optimized motion estimation, we propose a search-free block motion estimation framework using a multi-stage convolutional neural network, which is able to conduct motion estimation on multiple block sizes simultaneously, using a triplet of frames as input. This composite block translation network (CBT-Net) is trained in a self-supervised manner on a large database that we created from publicly available uncompressed video content. We deploy the multi-scale structural similarity (MS-SSIM) loss function to optimize the perceptual quality of the motion compensated predicted frames. Our experimental results highlight the computational efficiency of our proposed model relative to conventional block matching based motion estimation algorithms, for comparable prediction errors. Further, when used to perform inter prediction in AV1, the MV predictions of the perceptually optimized model result in average Bjøntegaard-delta rate (BD-rate) improvements of -1.73% and -1.31% with respect to the MS-SSIM and Video Multi-Method Assessment Fusion (VMAF) quality metrics, respectively, as compared to the block matching based motion estimation system employed in the SVT-AV1 encoder.
6. Sun W, He X, Ren C, Xiong S, Chen H. A quality enhancement network with coding priors for constant bit rate video coding. Knowl Based Syst 2022. DOI: 10.1016/j.knosys.2022.110010.
7. Zhang Y, Xiao M, Zheng WX, Cao J. Large-Scale Neural Networks With Asymmetrical Three-Ring Structure: Stability, Nonlinear Oscillations, and Hopf Bifurcation. IEEE Transactions on Cybernetics 2022; 52:9893-9904. PMID: 34587105; DOI: 10.1109/tcyb.2021.3109566.
Abstract
A large number of experiments have shown that the ring structure is a common phenomenon in neural networks. Nevertheless, only a few works have been devoted to studying the neurodynamics of networks with a single ring, and little is known about the dynamics of neural networks with multiple rings. Consequently, the study of neural networks with a multiring structure is of greater practical significance. In this article, a class of high-dimensional neural networks with three rings and multiple delays is proposed. Such a network has an asymmetric structure, meaning that each ring has a different number of neurons, while the three rings share a common node. Selecting the time delay as the bifurcation parameter, the stability switches are ascertained and a sufficient condition for Hopf bifurcation is derived. It is further revealed that both the number of neurons in each ring and the total number of neurons have an obvious influence on the stability and bifurcation of the neural network. Ultimately, some numerical simulations are given to illustrate the qualitative results and to underpin the discussion.
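As a toy illustration of delay-induced oscillation in a ring of neurons (a much simpler, single-ring model than the asymmetric three-ring network analyzed in the paper), the script below Euler-integrates x_i'(t) = -x_i(t) + w*tanh(x_{i-1}(t - tau)); all parameter values are arbitrary assumptions.

```python
# Toy numerical sketch (not the paper's model): Euler integration of a small
# delayed ring of Hopfield-type neurons, to show how a sufficiently large delay
# and coupling gain can turn a stable equilibrium into sustained oscillations.
import numpy as np

def simulate_ring(n=6, w=-1.8, tau=2.0, dt=0.01, t_end=60.0):
    steps, delay = int(t_end / dt), int(tau / dt)
    x = np.zeros((steps + delay, n))
    x[:delay] = 0.1 * np.random.randn(delay, n)        # small random history
    for k in range(delay, steps + delay - 1):
        x_delayed = x[k - delay]
        coupling = w * np.tanh(np.roll(x_delayed, 1))  # previous neuron in the ring
        x[k + 1] = x[k] + dt * (-x[k] + coupling)
    return x[delay:]

traj = simulate_ring()
print(traj[-1])   # with large enough tau the states keep oscillating instead of decaying
```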
8. Lei J, Zhang Z, Pan Z, Liu D, Liu X, Chen Y, Ling N. Disparity-Aware Reference Frame Generation Network for Multiview Video Coding. IEEE Transactions on Image Processing 2022; 31:4515-4526. PMID: 35727785; DOI: 10.1109/tip.2022.3183436.
Abstract
Multiview video coding (MVC) aims to compress the multiview video through the elimination of video redundancies, where the quality of the reference frame directly affects the compression efficiency. In this paper, we propose a deep virtual reference frame generation method based on a disparity-aware reference frame generation network (DAG-Net) to transform the disparity relationship between different viewpoints and generate a more reliable reference frame. The proposed DAG-Net consists of a multi-level receptive field module, a disparity-aware alignment module, and a fusion reconstruction module. First, a multi-level receptive field module is designed to enlarge the receptive field, and extract the multi-scale deep features of the temporal and inter-view reference frames. Then, a disparity-aware alignment module is proposed to learn the disparity relationship, and perform disparity shift on the inter-view reference frame to align it with the temporal reference frame. Finally, a fusion reconstruction module is utilized to fuse the complementary information and generate a more reliable virtual reference frame. Experiments demonstrate that the proposed reference frame generation method achieves superior performance for multiview video coding.
9. Liu C, Sun H, Katto J, Zeng X, Fan Y. QA-Filter: A QP-Adaptive Convolutional Neural Network Filter for Video Coding. IEEE Transactions on Image Processing 2022; 31:3032-3045. PMID: 35385382; DOI: 10.1109/tip.2022.3152627.
Abstract
Convolutional neural network (CNN)-based filters have achieved great success in video coding. However, in most previous works, individual models were needed for each quantization parameter (QP) band, which is impractical due to limited storage resources. To address this, our work consists of two parts. First, we propose a frequency and spatial QP-adaptive mechanism (FSQAM), which can be directly applied to the (vanilla) convolution to help any CNN filter handle different quantization noise. From the frequency domain, an FQAM that introduces the quantization step (Qstep) into the convolution is proposed, so that as the quantization noise increases, the ability of the CNN filter to suppress noise improves. Moreover, an SQAM is further designed to compensate for the FQAM from the spatial domain. Second, based on FSQAM, a QP-adaptive CNN filter called QA-Filter that can be used under a wide range of QPs is proposed. By factorizing the mixed features into high-frequency and low-frequency parts with a pair of pooling and upsampling operations, the QA-Filter and FQAM can promote each other to obtain better performance. Compared to the H.266/VVC baseline, QA-Filter achieves average BD-rate reductions of 5.25% and 3.84% for luma with the default all-intra (AI) and random-access (RA) configurations, respectively. Additionally, a BD-rate reduction of up to 9.16% is achieved on the luma of the sequence BasketballDrill. Besides, FSQAM achieves measurably better BD-rate performance than the previous QP-map method.
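The sketch below illustrates one way a quantization step derived from the QP could modulate a convolution so that a single set of weights serves many QPs, in the spirit of FQAM. The exact modulation used by QA-Filter is not reproduced here; the gating form and layer sizes are assumptions.

```python
# Hypothetical QP-adaptive convolution: the Qstep derived from the QP drives a
# learned per-channel gate that scales the convolution output.
import torch
import torch.nn as nn

class QpAdaptiveConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.gate = nn.Sequential(nn.Linear(1, out_ch), nn.Sigmoid())

    def forward(self, x, qp):
        qstep = 2.0 ** ((qp - 4.0) / 6.0)                # HEVC/VVC-style Qstep from QP
        scale = self.gate(torch.log(qstep).view(-1, 1))  # per-channel modulation
        return self.conv(x) * scale.view(-1, self.conv.out_channels, 1, 1)

out = QpAdaptiveConv(1, 16)(torch.randn(1, 1, 64, 64), torch.tensor([32.0]))
print(out.shape)  # torch.Size([1, 16, 64, 64])
```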
10. A nonlocal HEVC in-loop filter using CNN-based compression noise estimation. Appl Intell 2022. DOI: 10.1007/s10489-022-03259-z.
11.
12. Schiopu I, Munteanu A. Deep Learning Post-Filtering Using Multi-Head Attention and Multiresolution Feature Fusion for Image and Intra-Video Quality Enhancement. Sensors (Basel) 2022; 22:1353. PMID: 35214252; PMCID: PMC8963040; DOI: 10.3390/s22041353.
Abstract
The paper proposes a novel post-filtering method based on convolutional neural networks (CNNs) for quality enhancement of RGB/grayscale images and video sequences. The lossy images are encoded using common image codecs, such as JPEG and JPEG2000. The video sequences are encoded using previous and ongoing video coding standards, high-efficiency video coding (HEVC) and versatile video coding (VVC), respectively. A novel deep neural network architecture is proposed to estimate fine refinement details for full-, half-, and quarter-patch resolutions. The proposed architecture is built using a set of efficient processing blocks designed based on the following concepts: (i) the multi-head attention mechanism for refining the feature maps, (ii) the weight sharing concept for reducing the network complexity, and (iii) novel block designs of layer structures for multiresolution feature fusion. The proposed method provides substantial performance improvements compared with both common image codecs and video coding standards. Experimental results on high-resolution images and standard video sequences show that the proposed post-filtering method provides average BD-rate savings of 31.44% over JPEG and 54.61% over HEVC (x265) for RGB images, Y-BD-rate savings of 26.21% over JPEG and 15.28% over VVC (VTM) for grayscale images, and 15.47% over HEVC and 14.66% over VVC for video sequences.
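For orientation, here is a hedged sketch of refining convolutional feature maps with multi-head attention inside a post-filter block, treating spatial positions as a token sequence. It is not the proposed architecture; the channel count, head count, and patch size are assumptions.

```python
# Generic attention-refinement block for decoded-patch features (illustrative).
import torch
import torch.nn as nn

class AttentionRefineBlock(nn.Module):
    def __init__(self, channels=32, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = self.conv(x).flatten(2).transpose(1, 2)    # (B, H*W, C) token sequence
        refined, _ = self.attn(tokens, tokens, tokens)      # self-attention over positions
        return x + refined.transpose(1, 2).reshape(b, c, h, w)  # residual refinement

y = AttentionRefineBlock()(torch.randn(1, 32, 16, 16))
print(y.shape)  # torch.Size([1, 32, 16, 16])
```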
13. Ding D, Wang W, Tong J, Gao X, Liu Z, Fang Y. Biprediction-Based Video Quality Enhancement via Learning. IEEE Transactions on Cybernetics 2022; 52:1207-1220. PMID: 32554335; DOI: 10.1109/tcyb.2020.2998481.
Abstract
Convolutional neural network (CNN)-based video quality enhancement generally employs optical flow for pixelwise motion estimation and compensation, followed by utilizing the motion-compensated frames and jointly exploring the spatiotemporal correlation across frames to facilitate the enhancement. This method, called the optical-flow-based method (OPT), usually achieves high accuracy at the expense of high computational complexity. In this article, we develop a new framework, referred to as biprediction-based multiframe video enhancement (PMVE), to achieve a one-pass enhancement procedure. PMVE designs two networks, the prediction network (Pred-net) and the frame-fusion network (FF-net), to implement the two steps of synthesization and fusion, respectively. Specifically, the Pred-net leverages frame pairs to synthesize so-called virtual frames (VFs) for the low-quality frames (LFs) through biprediction. Afterward, the slowly fused FF-net takes the VFs as input to extract the correlation across the VFs and the related LFs, obtaining an enhanced version of those LFs. Such a framework allows PMVE to leverage the cross-correlation between successive frames for enhancement and hence to achieve high accuracy. Meanwhile, PMVE effectively avoids explicit motion estimation and compensation, greatly reducing the complexity compared to OPT. The experimental results demonstrate that the peak signal-to-noise ratio (PSNR) performance of PMVE is fully on par with that of OPT while its computational complexity is only 1% of that of OPT. Compared with other state-of-the-art methods in the literature, PMVE is also confirmed to achieve superior performance in both objective quality and visual quality at a reasonable complexity level. For instance, PMVE can surpass its best counterpart method by up to 0.42 dB in PSNR.
14. Ding D, Gao X, Tang C, Ma Z. Neural Reference Synthesis for Inter Frame Coding. IEEE Transactions on Image Processing 2021; 31:773-787. PMID: 34932476; DOI: 10.1109/tip.2021.3134465.
Abstract
This work proposes neural reference synthesis (NRS) to generate high-fidelity reference blocks for motion estimation and motion compensation (MEMC) in inter frame coding. The NRS comprises two submodules: one for reconstruction enhancement and the other for reference generation. Although numerous methods have been developed in the past for these two submodules, using either handcrafted rules or deep convolutional neural network (CNN) models, they basically deal with them separately, resulting in limited coding gains. By contrast, the NRS optimizes them collaboratively. It first develops two CNN-based models, namely EnhNet and GenNet. The EnhNet only uses spatial correlations within the current frame for reconstruction enhancement, and the GenNet is then augmented by further aggregating temporal correlations across multiple frames for reference synthesis. However, a direct concatenation of EnhNet and GenNet without considering the complex temporal reference dependency across inter frames would implicitly induce iterative CNN processing and cause data overfitting, leading to visually disturbing artifacts and oversmoothed pixels. To tackle this problem, the NRS applies a new training strategy to coordinate the EnhNet and GenNet for more robust and generalizable models, and also devises a lightweight multi-level R-D (rate-distortion) selection policy for the encoder to adaptively choose reference blocks generated from the proposed NRS model or the conventional coding process. Our NRS not only offers state-of-the-art coding gains, e.g., >10% BD-rate (Bjøntegaard Delta Rate) reduction against the High Efficiency Video Coding (HEVC) anchor for a variety of common test video sequences encoded over a wide bit range in both low-delay and random access settings, but also greatly reduces the complexity relative to existing learning-based methods by utilizing more lightweight DNNs. All models are made publicly accessible at https://github.com/IVC-Projects/NRS for reproducible research.
15. Xu Q, Jiang X, Sun T, Kot AC. Detection of HEVC double compression with non-aligned GOP structures via inter-frame quality degradation analysis. Neurocomputing 2021. DOI: 10.1016/j.neucom.2021.04.092.
16. Huang Z, Sun J, Guo X, Shang M. Adaptive Deep Reinforcement Learning-Based In-Loop Filter for VVC. IEEE Transactions on Image Processing 2021; 30:5439-5451. PMID: 34081581; DOI: 10.1109/tip.2021.3084345.
Abstract
Deep learning-based in-loop filters have recently demonstrated great improvement in both coding efficiency and subjective quality in video coding. However, most existing deep learning-based in-loop filters rely on a sophisticated model in exchange for good performance and apply a single network structure to all reconstructed samples, which lacks sufficient adaptiveness to varied video content and limits their performance to some extent. In contrast, this paper proposes an adaptive deep reinforcement learning-based in-loop filter (ARLF) for versatile video coding (VVC). Specifically, we treat the filtering as a decision-making process and employ an agent to select an appropriate network by leveraging recent advances in deep reinforcement learning. To this end, we develop a lightweight backbone and use it to design a network set S containing networks with different complexities. A simple but efficient agent network is then designed to predict the optimal network from S, which makes the model adaptive to various video contents. To improve the robustness of our model, a two-stage training scheme is further proposed to train the agent and tune the network set. The coding tree unit (CTU) is treated as the basic unit for the in-loop filtering process, and a CTU-level control flag is applied in the sense of rate-distortion optimization (RDO). Extensive experimental results show that our ARLF approach obtains average BD-rate reductions of 2.17%, 2.65%, 2.58%, and 2.51% under the all-intra, low-delay P, low-delay, and random access configurations, respectively. Compared with other deep learning-based methods, the proposed approach achieves better performance with low computational complexity.
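A conceptual sketch of the decision step is given below: a lightweight agent inspects a reconstructed CTU and selects one filter network from a candidate set of different complexities. The network sizes, the 128x128 CTU, and the greedy selection are assumptions; the reinforcement learning training loop and the RDO-based CTU flag are omitted.

```python
# Illustrative agent-selects-filter step in the spirit of ARLF (not the paper's code).
import torch
import torch.nn as nn

def make_filter(width):
    return nn.Sequential(nn.Conv2d(1, width, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(width, 1, 3, padding=1))

filters = nn.ModuleList([make_filter(w) for w in (8, 16, 32)])   # candidate set S
agent = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                      nn.Linear(64, len(filters)))               # picks one network

ctu = torch.randn(1, 1, 128, 128)                                # reconstructed CTU
choice = agent(ctu).argmax(dim=1).item()                         # greedy action
filtered = ctu + filters[choice](ctu)                            # residual filtering
print(choice, filtered.shape)
```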
17. Attention Networks for the Quality Enhancement of Light Field Images. Sensors (Basel) 2021; 21:3246. PMID: 34067191; PMCID: PMC8125823; DOI: 10.3390/s21093246.
Abstract
In this paper, we propose a novel filtering method based on deep attention networks for the quality enhancement of light field (LF) images captured by plenoptic cameras and compressed using the High Efficiency Video Coding (HEVC) standard. The proposed architecture was built using efficient complex processing blocks and novel attention-based residual blocks. The network takes advantage of the macro-pixel (MP) structure, specific to LF images, and processes each reconstructed MP in the luminance (Y) channel. The input patch is represented as a tensor that collects, from an MP neighbourhood, four Epipolar Plane Images (EPIs) at four different angles. The experimental results on a common LF image database showed high improvements over HEVC in terms of the structural similarity index (SSIM), with an average Y-Bjøntegaard Delta (BD)-rate savings of 36.57%, and an average Y-BD-PSNR improvement of 2.301 dB. Increased performance was achieved when the HEVC built-in filtering methods were skipped. The visual results illustrate that the enhanced image contains sharper edges and more texture details. The ablation study provides two robust solutions to reduce the inference time by 44.6% and the network complexity by 74.7%. The results demonstrate the potential of attention networks for the quality enhancement of LF images encoded by HEVC.
18. Sun W, He X, Chen H, Sheriff RE, Xiong S. A quality enhancement framework with noise distribution characteristics for high efficiency video coding. Neurocomputing 2020. DOI: 10.1016/j.neucom.2020.06.048.
19. Duan LY, Liu J, Yang W, Huang T, Gao W. Video Coding for Machines: A Paradigm of Collaborative Compression and Intelligent Analytics. IEEE Transactions on Image Processing 2020; PP:8680-8695. PMID: 32857694; DOI: 10.1109/tip.2020.3016485.
Abstract
Video coding, which targets compressing and reconstructing the whole frame, and feature compression, which only preserves and transmits the most critical information, stand at two ends of the scale: one offers compactness and efficiency to serve machine vision, and the other offers full fidelity, bowing to human perception. The recent endeavors in imminent trends of video compression, e.g. deep learning based coding tools and end-to-end image/video coding, and MPEG-7 compact feature descriptor standards, i.e. Compact Descriptors for Visual Search and Compact Descriptors for Video Analysis, promote the sustainable and fast development in their own directions, respectively. In this paper, thanks to booming AI technology, e.g. prediction and generation models, we carry out exploration in the new area, Video Coding for Machines (VCM), arising from the emerging MPEG standardization efforts. Towards collaborative compression and intelligent analytics, VCM attempts to bridge the gap between feature coding for machine vision and video coding for human vision. Aligning with the rising Analyze then Compress instance, Digital Retina, the definition, formulation, and paradigm of VCM are given first. Meanwhile, we systematically review state-of-the-art techniques in video compression and feature compression from the unique perspective of MPEG standardization, which provides the academic and industrial evidence to realize the collaborative compression of video and feature streams in a broad range of AI applications. Finally, we come up with potential VCM solutions, and the preliminary results have demonstrated the performance and efficiency gains. Further directions are discussed as well.
20. Paul S, Norkin A, Bovik AC. Speeding up VP9 Intra Encoder with Hierarchical Deep Learning Based Partition Prediction. IEEE Transactions on Image Processing 2020; PP:8134-8148. PMID: 32746243; DOI: 10.1109/tip.2020.3011270.
Abstract
In the VP9 video codec, the sizes of blocks are decided during encoding by recursively partitioning 64×64 superblocks using rate-distortion optimization (RDO). This process is computationally intensive because of the combinatorial search space of possible partitions of a superblock. Here, we propose a deep learning based alternative framework to predict the intra-mode superblock partitions in the form of a four-level partition tree, using a hierarchical fully convolutional network (H-FCN). We created a large database of VP9 superblocks and the corresponding partitions to train an H-FCN model, which was subsequently integrated with the VP9 encoder to reduce the intra-mode encoding time. The experimental results establish that our approach speeds up intra-mode encoding by 69.7% on average, at the expense of a 1.71% increase in the Bjøntegaard-Delta bitrate (BD-rate). While VP9 provides several built-in speed levels, which are designed to provide faster encoding at the expense of decreased rate-distortion performance, we find that our model is able to outperform the fastest recommended speed level of the reference VP9 encoder for the good-quality intra encoding configuration, in terms of both speedup and BD-rate.
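To make the hierarchical prediction idea concrete, the following rough sketch (not the H-FCN itself) runs a shared fully convolutional trunk over a 64x64 superblock and attaches one head per partition level, each predicting split decisions on a progressively finer grid; widths and head design are assumptions.

```python
# Hypothetical hierarchical partition predictor: shared trunk, one head per level.
import torch
import torch.nn as nn

class HierarchicalPartitionNet(nn.Module):
    def __init__(self, feat=16):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(1, feat, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        # one head per level: 1x1, 2x2, 4x4 and 8x8 grids of 4-way split decisions
        self.heads = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(g), nn.Conv2d(feat, 4, 1))
            for g in (1, 2, 4, 8)
        ])

    def forward(self, superblock):                 # (B, 1, 64, 64) luma superblock
        f = self.trunk(superblock)
        return [head(f) for head in self.heads]    # per-level partition logits

levels = HierarchicalPartitionNet()(torch.randn(1, 1, 64, 64))
print([t.shape for t in levels])
```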
21. Klopp JP, Chen LG, Chien SY. Utilising Low Complexity CNNs to Lift Non-Local Redundancies in Video Coding. IEEE Transactions on Image Processing 2020:1-1. PMID: 32386156; DOI: 10.1109/tip.2020.2991525.
Abstract
Digital media is ubiquitous and produced in ever-growing quantities. This necessitates a constant evolution of compression techniques, especially for video, in order to maintain efficient storage and transmission. In this work, we aim at exploiting non-local redundancies in video data that remain difficult to erase for conventional video codecs. We design convolutional neural networks with a particular emphasis on low memory and computational footprint. The parameters of those networks are trained on the fly, at encoding time, to predict the residual signal from the decoded video signal. After the training process has converged, the parameters are compressed and signalled as part of the code of the underlying video codec. The method can be applied to any existing video codec to increase coding gains, while its low computational footprint allows for application under resource-constrained conditions. Building on top of High Efficiency Video Coding, we achieve coding gains similar to those of pretrained denoising CNNs while only requiring about 1% of their computational complexity. Through extensive experiments, we provide insights into the effectiveness of our network design decisions. In addition, we demonstrate that our algorithm delivers stable performance under conditions met in practical video compression: it performs without significant performance loss on very long random access segments (up to 256 frames) and, with moderate performance drops, can even be applied to single frames in high-resolution low-delay settings.
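A toy sketch of the train-at-encoding-time idea follows: a tiny CNN is overfitted to one video segment to predict the coding residual from the decoded signal, after which its parameters would be compressed and signalled. The network size, optimizer settings, and the random tensors standing in for real frames are all assumptions.

```python
# Toy on-the-fly training at the encoder (illustrative, not the paper's setup).
import torch
import torch.nn as nn

tiny = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(8, 1, 3, padding=1))
opt = torch.optim.Adam(tiny.parameters(), lr=1e-3)

decoded = torch.randn(16, 1, 64, 64)      # decoded frames of one segment (placeholder)
residual = torch.randn(16, 1, 64, 64)     # original minus decoded (placeholder)

for _ in range(200):                      # overfit the tiny CNN to this segment
    opt.zero_grad()
    loss = nn.functional.mse_loss(tiny(decoded), residual)
    loss.backward()
    opt.step()

# After convergence, tiny's parameters would be quantised and written to the bitstream.
print(loss.item())
```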
22.