1. Hadizadeh H, Bajić IV. Learned scalable video coding for humans and machines. EURASIP Journal on Image and Video Processing 2024; 2024:41. PMID: 39553832; PMCID: PMC11564357; DOI: 10.1186/s13640-024-00657-w.
Abstract
Video coding has traditionally been developed to support services such as video streaming, videoconferencing, digital TV, and so on. The main intent was to enable human viewing of the encoded content. However, with the advances in deep neural networks (DNNs), encoded video is increasingly being used for automatic video analytics performed by machines. In applications such as automatic traffic monitoring, analytics such as vehicle detection, tracking and counting, would run continuously, while human viewing could be required occasionally to review potential incidents. To support such applications, a new paradigm for video coding is needed that will facilitate efficient representation and compression of video for both machine and human use in a scalable manner. In this manuscript, we introduce an end-to-end learnable video codec that supports a machine vision task in its base layer, while its enhancement layer, together with the base layer, supports input reconstruction for human viewing. The proposed system is constructed based on the concept of conditional coding to achieve better compression gains. Comprehensive experimental evaluations conducted on four standard video datasets demonstrate that our framework outperforms both state-of-the-art learned and conventional video codecs in its base layer, while maintaining comparable performance on the human vision task in its enhancement layer.
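The base/enhancement split described above can be illustrated with a toy sketch. This is purely illustrative: the function names, the channel split, and the stand-in "networks" are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def split_latent(latent, base_channels):
    """Split a latent tensor (C, H, W) into a base subset (machine task)
    and an enhancement subset (needed only for human viewing)."""
    return latent[:base_channels], latent[base_channels:]

def decode_for_machine(base):
    # Stand-in for a task network operating on the base layer alone.
    return base.mean(axis=(1, 2))

def decode_for_human(base, enhancement):
    # Reconstruction is conditioned on the base layer: both subsets are
    # combined, so the base bits are never transmitted twice.
    return np.concatenate([base, enhancement], axis=0)

rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 4, 4))
base, enh = split_latent(latent, base_channels=3)
task_features = decode_for_machine(base)      # always available
reconstruction = decode_for_human(base, enh)  # fetched on demand
```

The point of the layered design is that the base bitstream alone drives continuous analytics, while the enhancement bitstream is fetched only when a human needs to review the content.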
Affiliation(s)
- Hadi Hadizadeh
- School of Engineering Science, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada
- Ivan V. Bajić
- School of Engineering Science, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada
2. Azizian B, Bajic IV. Privacy-Preserving Autoencoder for Collaborative Object Detection. IEEE Transactions on Image Processing 2024; 33:4937-4951. PMID: 39236122; DOI: 10.1109/tip.2024.3451938.
Abstract
Privacy is a crucial concern in collaborative machine vision where a part of a Deep Neural Network (DNN) model runs on the edge, and the rest is executed on the cloud. In such applications, the machine vision model does not need the exact visual content to perform its task. Taking advantage of this potential, private information could be removed from the data insofar as it does not significantly impair the accuracy of the machine vision system. In this paper, we present an autoencoder-style network integrated within an object detection pipeline, which generates a latent representation of the input image that preserves task-relevant information while removing private information. Our approach employs an adversarial training strategy that not only removes private information from the bottleneck of the autoencoder but also promotes improved compression efficiency for feature channels coded by conventional codecs like VVC-Intra. We assess the proposed system using a realistic evaluation framework for privacy, directly measuring face and license plate recognition accuracy. Experimental results show that our proposed method is able to reduce the bitrate significantly at the same object detection accuracy compared to coding the input images directly, while keeping the face and license plate recognition accuracy on the images recovered from the bottleneck features low, implying strong privacy protection. Our code is available at https://github.com/bardia-az/ppa-code.
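A minimal sketch of the kind of adversarial objective such a system might use; the function name, loss terms, and weights here are hypothetical illustrations, not taken from the paper. The encoder is rewarded for task accuracy and compressibility, and penalized when an adversarial reconstructor succeeds in recovering private content.

```python
def privacy_bottleneck_loss(task_loss, rate_bits, adversary_recon_loss,
                            lambda_rate=0.01, lambda_privacy=0.1):
    """Toy composite objective for a privacy-preserving bottleneck.

    task_loss:            object-detection loss on the latent features
    rate_bits:            proxy for the coding cost of the bottleneck
    adversary_recon_loss: how badly an attacker's decoder reconstructs
                          private content (higher = more private)
    Minimizing this total *maximizes* the adversary's reconstruction
    error. All weights are illustrative.
    """
    return task_loss + lambda_rate * rate_bits - lambda_privacy * adversary_recon_loss
```

In adversarial training, this encoder objective alternates with updates of the adversary itself, which tries to drive its own reconstruction loss down.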
3. Tian Y, Lu G, Yan Y, Zhai G, Chen L, Gao Z. A Coding Framework and Benchmark Towards Low-Bitrate Video Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024; 46:5852-5872. PMID: 38376963; DOI: 10.1109/tpami.2024.3367879.
Abstract
Video compression is indispensable to most video analysis systems. Although it saves transmission bandwidth, it also degrades downstream video understanding tasks, especially at low-bitrate settings. To systematically investigate this problem, we first thoroughly review the previous methods, revealing that three principles, i.e., task-decoupled, label-free, and data-emerged semantic prior, are critical to a machine-friendly coding framework but are not fully satisfied so far. In this paper, we propose a traditional-neural mixed coding framework that simultaneously fulfills all these principles by taking advantage of both traditional codecs and neural networks (NNs). On one hand, traditional codecs can efficiently encode the pixel signal of videos but may distort the semantic information. On the other hand, highly non-linear NNs are proficient in condensing video semantics into a compact representation. The framework is optimized by ensuring that a transmission-efficient semantic representation of the video is preserved with respect to the coding procedure; this representation is learned from unlabeled data in a self-supervised manner. The videos collaboratively decoded from the two streams (codec and NN) are semantically rich as well as visually photo-realistic, empirically boosting the performance of several mainstream downstream video analysis tasks without any post-adaptation procedure. Furthermore, by introducing an attention mechanism and an adaptive modeling scheme, the video semantic modeling ability of our approach is further enhanced. Finally, we build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach. All codes, data, and models will be open-sourced to facilitate future research.
4. Sheng X, Li L, Liu D, Li H. VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024; 46:4579-4596. PMID: 38252583; DOI: 10.1109/tpami.2024.3356548.
Abstract
Almost all digital videos are coded into compact representations before being transmitted. Such compact representations need to be decoded back to pixels before being displayed to humans and - as usual - before being enhanced/analyzed by machine vision algorithms. Intuitively, it is more efficient to enhance/analyze the coded representations directly without decoding them into pixels. Therefore, we propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis, thereby being versatile for both human and machine vision. Our VNVC framework has a feature-based compression loop. In the loop, one frame is encoded into compact representations and decoded to an intermediate feature that is obtained before performing reconstruction. The intermediate feature can be used as reference in motion compensation and motion estimation through feature-based temporal context mining and cross-domain motion encoder-decoder to compress the following frames. The intermediate feature is directly fed into video reconstruction, video enhancement, and video analysis networks to evaluate its effectiveness. The evaluation shows that our framework with the intermediate feature achieves high compression efficiency for video reconstruction and satisfactory task performances with lower complexities.
5. Huang CH, Wu JL. Unveiling the Future of Human and Machine Coding: A Survey of End-to-End Learned Image Compression. Entropy 2024; 26:357. PMID: 38785606; PMCID: PMC11120525; DOI: 10.3390/e26050357.
Abstract
End-to-end learned image compression codecs have notably emerged in recent years. These codecs have demonstrated superiority over conventional methods, showcasing remarkable flexibility and adaptability across diverse data domains while supporting new distortion losses. Despite challenges such as computational complexity, learned image compression methods inherently align with learning-based data processing and analytic pipelines due to their well-suited internal representations. The concept of Video Coding for Machines has garnered significant attention from both academic researchers and industry practitioners. This concept reflects the growing need to integrate data compression with computer vision applications. In light of these developments, we present a comprehensive survey and review of lossy image compression methods. Additionally, we provide a concise overview of two prominent international standards, MPEG Video Coding for Machines and JPEG AI. These standards are designed to bridge the gap between data compression and computer vision, catering to practical industry use cases.
Affiliation(s)
- Chen-Hsiu Huang
- Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan
6. Duan Z, Hossain AF, He J, Zhu F. Balancing the Encoder and Decoder Complexity in Image Compression for Classification. Research Square 2024: rs.3.rs-4002168 (preprint). PMID: 38746384; PMCID: PMC11092870; DOI: 10.21203/rs.3.rs-4002168/v1.
Abstract
This paper presents a study on the computational complexity of coding for machines, with a focus on image coding for classification. We first conduct a comprehensive set of experiments to analyze the size of the encoder (which encodes images to bitstreams), the size of the decoder (which decodes bitstreams and predicts class labels), and their impact on the rate-accuracy trade-off in compression for classification. Through empirical investigation, we demonstrate a complementary relationship between the encoder size and the decoder size, i.e., it is better to employ a large encoder with a small decoder and vice versa. Motivated by this relationship, we introduce a feature compression-based method for efficient image compression for classification. By compressing features at various layers of a neural network-based image classification model, our method achieves adjustable rate, accuracy, and encoder (or decoder) size using a single model. Experimental results on ImageNet classification show that our method achieves competitive results with existing methods while being much more flexible. The code will be made publicly available.
Affiliation(s)
- Zhihao Duan
- Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, U.S.A.
- Adnan Faisal Hossain
- Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, U.S.A.
- Jiangpeng He
- Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, U.S.A.
- Fengqing Zhu
- Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, U.S.A.
7. Charpenay N, Le Treust M, Roumy A. Side Information Design in Zero-Error Coding for Computing. Entropy 2024; 26:338. PMID: 38667892; PMCID: PMC11049120; DOI: 10.3390/e26040338.
Abstract
We investigate zero-error coding for computing with encoder side information. An encoder has access to a source X and is furnished with side information g(Y). It communicates with a decoder that possesses side information Y and aims to retrieve f(X,Y) with zero probability of error, where f and g are assumed to be deterministic functions. In previous work, we determined a condition that yields an analytic expression for the optimal rate R*(g); in particular, it covers the case where the joint distribution P(X,Y) has full support. In this article, we review this result and study the side information design problem, which consists of finding the best trade-offs between the quality of the encoder's side information g(Y) and R*(g). We construct two greedy algorithms, based on partition refining and coarsening, that give an achievable set of points for the side information design problem. One of them runs in polynomial time.
Affiliation(s)
- Maël Le Treust
- Univ. Rennes, CNRS, Inria, IRISA UMR 6074, F-35000 Rennes, France
- Aline Roumy
- INRIA Rennes, Campus de Beaulieu, F-35000 Rennes cedex, France
8. Duan Z, Ma Z, Zhu F. Unified Architecture Adaptation for Compressed Domain Semantic Inference. IEEE Transactions on Circuits and Systems for Video Technology 2023; 33:4108-4121. PMID: 37547669; PMCID: PMC10403241; DOI: 10.1109/tcsvt.2023.3240391.
Abstract
Advances in both lossy image compression and semantic content understanding have been greatly fueled by deep learning techniques, yet these two tasks have been developed separately for the past decades. In this work, we address the problem of directly executing semantic inference from quantized latent features in the deep compressed domain, without pixel reconstruction. Although different methods have been proposed for this problem setting, they either are restricted to a specific architecture or are sub-optimal in terms of compressed domain task accuracy. In contrast, we propose a lightweight, plug-and-play solution that is generally compatible with popular learned image coders and deep vision models, making it attractive for a wide range of applications. Our method adapts prevalent pixel domain neural models deployed for various vision tasks to directly accept quantized latent features rather than pixels. We further suggest training the compressed domain model by transferring knowledge from its pixel domain counterpart. Experiments show that our method is compatible with popular learned image coders and vision task models. Under fair comparison, our approach outperforms a baseline method by a) more than 3% top-1 accuracy for compressed domain classification, and b) more than 7% mIoU for compressed domain semantic segmentation, at various data rates.
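The adaptation idea can be sketched as swapping a vision model's pixel-domain stem for a latent-domain stem while the downstream backbone is reused unchanged. All shapes, names, and the trivial stand-in computations below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def pixel_stem(image):
    # Original stem: RGB image (3, H, W) -> feature map (16, H/4, W/4).
    c, h, w = image.shape
    return np.ones((16, h // 4, w // 4)) * image.mean()

def latent_stem(latent):
    # Adapted stem: quantized codec latents (192, H/16, W/16) mapped to
    # the same feature shape the backbone expects, so pixel
    # reconstruction is skipped entirely.
    c, h, w = latent.shape
    return np.ones((16, h * 4, w * 4)) * latent.mean()

def backbone(features):
    # Shared downstream layers, identical in both domains; in the
    # compressed-domain setting they would be refined by transferring
    # knowledge from the pixel-domain model.
    return features.sum(axis=(1, 2))

image = np.random.default_rng(1).random((3, 64, 64))
latent = np.random.default_rng(2).random((192, 4, 4))
pixel_logits = backbone(pixel_stem(image))    # conventional path
latent_logits = backbone(latent_stem(latent)) # compressed-domain path
```

Because only the stem changes, the same backbone weights serve both pixel inputs and quantized latents, which is what makes the approach plug-and-play.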
Affiliation(s)
- Zhihao Duan
- Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana 47907, U.S.A.
- Zhan Ma
- School of Electronic Science and Engineering, Nanjing University, Nanjing, Jiangsu 210093, China
- Fengqing Zhu
- Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana 47907, U.S.A.
9. Measuring self-regulated learning and the role of AI: Five years of research using multimodal multichannel data. Computers in Human Behavior 2023. DOI: 10.1016/j.chb.2022.107540.
10. Momin MS, Sufian A, Barman D, Dutta P, Dong M, Leo M. In-Home Older Adults' Activity Pattern Monitoring Using Depth Sensors: A Review. Sensors 2022; 22:9067. PMID: 36501769; PMCID: PMC9735577; DOI: 10.3390/s22239067.
Abstract
The global population is aging due to many factors, including longer life expectancy through better healthcare, changing diet, physical activity, etc. We are also witnessing frequent epidemics as well as pandemics. The existing healthcare system has failed to deliver the care and support needed by our older adults (seniors) during these frequent outbreaks. Sophisticated sensor-based in-home care systems may offer an effective solution to this global crisis. The monitoring system is the key component of any in-home care system. The evidence indicates that such systems are most useful when implemented in a non-intrusive manner through different visual and audio sensors. Artificial Intelligence (AI) and Computer Vision (CV) techniques may be ideal for this purpose. Since RGB imagery-based CV techniques may compromise privacy, people often hesitate to utilize in-home care systems that use this technology. Depth, thermal, and audio-based CV techniques could be meaningful substitutes here. Given the need to monitor larger areas, this review article presents a systematic discussion of the state of the art using depth sensors as the primary data-capturing technique. We mainly focus on fall detection and other health-related physical patterns. As gait parameters may help to detect these activities, we also consider depth sensor-based gait parameters separately. The article discusses the relevant terminology, prior reviews, popular datasets, and future scope.
Affiliation(s)
- Md Sarfaraz Momin
- Department of Computer Science, Kaliachak College, University of Gour Banga, Malda 732101, India
- Department of Computer & System Sciences, Visva-Bharati University, Bolpur 731235, India
- Abu Sufian
- Department of Computer Science, University of Gour Banga, Malda 732101, India
- Debaditya Barman
- Department of Computer & System Sciences, Visva-Bharati University, Bolpur 731235, India
- Paramartha Dutta
- Department of Computer & System Sciences, Visva-Bharati University, Bolpur 731235, India
- Mianxiong Dong
- Department of Science and Informatics, Muroran Institute of Technology, Muroran 050-8585, Hokkaido, Japan
- Marco Leo
- National Research Council of Italy, Institute of Applied Sciences and Intelligent Systems, 73100 Lecce, Italy
11. Hu Y, Yang W, Ma Z, Liu J. Learning End-to-End Lossy Image Compression: A Benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:4194-4211. PMID: 33705308; DOI: 10.1109/tpami.2021.3065339.
Abstract
Image compression is one of the most fundamental techniques and commonly used applications in the image and video processing field. Earlier methods built a well-designed pipeline, and efforts were made to improve all modules of the pipeline by handcrafted tuning. Later, tremendous contributions were made, especially when data-driven methods revitalized the domain with their excellent modeling capacities and flexibility in incorporating newly designed modules and constraints. Despite great progress, a systematic benchmark and comprehensive analysis of end-to-end learned image compression methods are lacking. In this paper, we first conduct a comprehensive literature survey of learned image compression methods. The literature is organized based on several aspects to jointly optimize the rate-distortion performance with a neural network, i.e., network architecture, entropy model and rate control. We describe milestones in cutting-edge learned image-compression methods, review a broad range of existing works, and provide insights into their historical development routes. With this survey, the main challenges of image compression methods are revealed, along with opportunities to address the related issues with recent advanced learning methods. This analysis provides an opportunity to take a further step towards higher-efficiency image compression. By introducing a coarse-to-fine hyperprior model for entropy estimation and signal reconstruction, we achieve improved rate-distortion performance, especially on high-resolution images. Extensive benchmark experiments demonstrate the superiority of our model in rate-distortion performance and time complexity on multi-core CPUs and GPUs.
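The joint rate-distortion optimization these learned codecs train with is the standard Lagrangian objective L = R + λD; a minimal sketch (variable names are my own):

```python
def rd_loss(rate_bpp, distortion_mse, lam=0.01):
    """Lagrangian rate-distortion objective for an end-to-end codec.

    rate_bpp:       estimated bits per pixel from the entropy model
    distortion_mse: reconstruction error (e.g., mean squared error)
    lam:            trade-off weight; a larger lam favors quality over
                    bitrate, and each lam traces one point on the
                    rate-distortion curve.
    """
    return rate_bpp + lam * distortion_mse
```

Training one model per λ value is how learned codecs typically populate a rate-distortion curve for benchmarking against conventional codecs.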
12. Object Detection-Based Video Compression. Applied Sciences 2022. DOI: 10.3390/app12094525.
Abstract
Video compression is designed to provide good subjective image quality even at a high compression ratio, and video quality metrics have shown that results can maintain a high Peak Signal-to-Noise Ratio (PSNR) even under heavy compression. However, the low image quality caused by high compression makes object recognition on the decoder side difficult. Accordingly, to make object detection usable in a video decoder, good image quality must be provided for the detected objects within the given total bitrate. In this paper, object detection-based video compression by the encoder and decoder is proposed that allocates lower quantization parameters to the detected-object regions and higher quantization parameters to the background, so that better image quality is obtained for the detected objects on the decoder side. Object detection-based video compression combines two components: Versatile Video Coding (VVC) and object detection. The decoder performs decompression by receiving bitstreams in both the object-detection decoder and the VVC decoder, and in the proposed method, the VVC encoder and decoder operate on the information obtained from object detection. In a random access (RA) configuration, the average Bjøntegaard Delta (BD)-rates of Y, Cb, and Cr increased by 2.33%, 2.67%, and 2.78%, respectively. In an All Intra (AI) configuration, the average BD-rates of Y, Cb, and Cr increased by 0.59%, 1.66%, and 1.42%, respectively. In the RA configuration, the average ΔY-PSNR, ΔCb-PSNR, and ΔCr-PSNR for the object-detected areas improved by 0.17%, 0.23%, and 0.04%, respectively. In the AI configuration, the average ΔY-PSNR, ΔCb-PSNR, and ΔCr-PSNR for the object-detected areas improved by 0.71%, 0.30%, and 0.30%, respectively. Subjective image quality was also improved in the object-detected areas.
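The region-based bit allocation described above can be sketched as a quantization-parameter (QP) map: lower QP (higher quality) inside detected-object boxes, higher QP elsewhere. The QP values and per-pixel granularity below are illustrative only; an actual VVC encoder signals QP offsets at block granularity.

```python
def build_qp_map(width, height, boxes, base_qp=37, roi_qp=27):
    """Assign a lower QP (better quality) inside detected-object boxes.

    boxes: list of (x0, y0, x1, y1) in pixel coordinates, half-open.
    Returns a height x width QP map as a list of rows.
    """
    qp = [[base_qp] * width for _ in range(height)]
    for (x0, y0, x1, y1) in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                qp[y][x] = roi_qp  # detected region gets finer quantization
    return qp

qp_map = build_qp_map(8, 8, [(2, 2, 5, 5)])
```

The overall BD-rate can rise (more bits overall) while the detected-object regions gain quality, which matches the trade-off the abstract reports.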
13. Choi H, Bajic IV. Scalable Image Coding for Humans and Machines. IEEE Transactions on Image Processing 2022; 31:2739-2754. PMID: 35324440; DOI: 10.1109/tip.2022.3160602.
Abstract
At present, and increasingly so in the future, much of the captured visual content will not be seen by humans. Instead, it will be used for automated machine vision analytics and may require occasional human viewing. Examples of such applications include traffic monitoring, visual surveillance, autonomous navigation, and industrial machine vision. To address such requirements, we develop an end-to-end learned image codec whose latent space is designed to support scalability from simpler to more complicated tasks. The simplest task is assigned to a subset of the latent space (the base layer), while more complicated tasks make use of additional subsets of the latent space, i.e., both the base and enhancement layer(s). For the experiments, we establish a 2-layer and a 3-layer model, each of which offers input reconstruction for human vision, plus machine vision task(s), and compare them with relevant benchmarks. The experiments show that our scalable codecs offer 37%-80% bitrate savings on machine vision tasks compared to best alternatives, while being comparable to state-of-the-art image codecs in terms of input reconstruction.
14. Yu WY, Po LM, Xiong J, Zhao Y, Xian P. ShaTure: Shape and Texture Deformation for Human Pose and Attribute Transfer. IEEE Transactions on Image Processing 2022; 31:2541-2556. PMID: 35275819; DOI: 10.1109/tip.2022.3157146.
Abstract
In this paper, we present a novel end-to-end pose transfer framework to transform a source person image to an arbitrary pose with controllable attributes. Due to the spatial misalignment caused by occlusions and multi-viewpoints, maintaining high-quality shape and texture appearance is still a challenging problem for pose-guided person image synthesis. Without considering the deformation of shape and texture, existing solutions on controllable pose transfer still cannot generate high-fidelity texture for the target image. To solve this problem, we design a new image reconstruction decoder - ShaTure which formulates shape and texture in a braiding manner. It can interchange discriminative features in both feature-level space and pixel-level space so that the shape and texture can be mutually fine-tuned. In addition, we develop a new bottleneck module - Adaptive Style Selector (AdaSS) Module which can enhance the multi-scale feature extraction capability by self-recalibration of the feature map through channel-wise attention. Both quantitative and qualitative results show that the proposed framework has superiority compared with the state-of-the-art human pose and attribute transfer methods. Detailed ablation studies report the effectiveness of each contribution, which proves the robustness and efficacy of the proposed framework.
15. Thomos N, Maugey T, Toni L. Machine Learning for Multimedia Communications. Sensors 2022; 22:819. PMID: 35161566; PMCID: PMC8840624; DOI: 10.3390/s22030819.
Abstract
Machine learning is revolutionizing the way multimedia information is processed and transmitted to users. After intensive and powerful training, some impressive efficiency/accuracy improvements have been made all over the transmission pipeline. For example, the high model capacity of the learning-based architectures enables us to accurately model the image and video behavior such that tremendous compression gains can be achieved. Similarly, error concealment, streaming strategy or even user perception modeling have widely benefited from the recent learning-oriented developments. However, learning-based algorithms often imply drastic changes to the way data are represented or consumed, meaning that the overall pipeline can be affected even though a subpart of it is optimized. In this paper, we review the recent major advances that have been proposed all across the transmission chain, and we discuss their potential impact and the research challenges that they raise.
Affiliation(s)
- Nikolaos Thomos
- School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, UK
- Laura Toni
- Department of Electronic & Electrical Engineering, University College London (UCL), London WC1E 6AE, UK
16. Yan N, Gao C, Liu D, Li H, Li L, Wu F. SSSIC: Semantics-to-Signal Scalable Image Coding With Learned Structural Representations. IEEE Transactions on Image Processing 2021; 30:8939-8954. PMID: 34699359; DOI: 10.1109/tip.2021.3121131.
Abstract
We address the requirement of image coding for joint human-machine vision, i.e., the decoded image serves both human observation and machine analysis/understanding. Previously, human vision and machine vision have been extensively studied by image (signal) compression and (image) feature compression, respectively. Recently, for joint human-machine vision, several studies have been devoted to joint compression of images and features, but the correlation between images and features is still unclear. We identify the deep network as a powerful toolkit for generating structural image representations. From the perspective of information theory, the deep features of an image naturally form an entropy decreasing series: a scalable bitstream is achieved by compressing the features backward from a deeper layer to a shallower layer until culminating with the image signal. Moreover, we can obtain learned representations by training the deep network for a given semantic analysis task or multiple tasks and acquire deep features that are related to semantics. With the learned structural representations, we propose SSSIC, a framework to obtain an embedded bitstream that can be either partially decoded for semantic analysis or fully decoded for human vision. We implement an exemplar SSSIC scheme using coarse-to-fine image classification as the driven semantic analysis task. We also extend the scheme for object detection and instance segmentation tasks. The experimental results demonstrate the effectiveness of the proposed SSSIC framework and establish that the exemplar scheme achieves higher compression efficiency than separate compression of images and features.
17. Liu K, Liu D, Li L, Yan N, Li H. Semantics-to-Signal Scalable Image Compression with Learned Revertible Representations. International Journal of Computer Vision 2021. DOI: 10.1007/s11263-021-01491-7.