Towards image compression with perfect realism at ultra-low bitrates
Authors: Marlene Careil, Matthew J. Muckley, Jakob Verbeek, Stéphane Lathuilière
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We dub our model PerCo for perceptual compression, and compare it to state-of-the-art codecs at rates from 0.1 down to 0.003 bits per pixel. The latter rate is more than an order of magnitude smaller than those considered in most prior work, compressing a 512 × 768 Kodak image with less than 153 bytes. Despite this ultra-low bitrate, our approach maintains the ability to reconstruct realistic images. We find that our model leads to reconstructions with state-of-the-art visual quality as measured by FID and KID. As predicted by rate-distortion-perception theory, visual quality is less dependent on the bitrate than previous methods. Experimentally we observe significantly improved FID and KID scores compared to competing methods. Moreover, we find that FID and KID are much more stable across bitrates than other methods, which aligns with our goal of image compression with perfect realism. In sum, our contributions are as follows: We develop a novel diffusion model called PerCo for image compression that is conditioned on both a vector-quantized latent image representation and a textual image description. We obtain realistic reconstructions at bitrates as low as 0.003 bits per pixel, significantly improving over previous work, see Figure 1. We obtain state-of-the-art FID and KID performance on the MS-COCO 30k dataset; and observe no significant degradation of FID when reducing the bitrate. Datasets. For evaluation, we use the Kodak dataset (Franzen, 1999) as well as MS-COCO 30k. On COCO we evaluate at resolution 256 × 256 by selecting the same images from the 2014 validation set (Lin et al., 2014) as Hoogeboom et al. (2023) and Agustsson et al. (2023). We evaluate at resolution 512 × 512 on the 2017 training set (Caesar et al., 2018), which is the same resolution used for evaluation by Lei et al. (2023), and use captions and label maps for some metrics. Metrics. To quantify image quality we use FID (Heusel et al., 2017) and KID (Bińkowski et al., 2018), which match feature distributions between sets of original images and their reconstructions. (A byte-budget check, metric-usage sketch, and latent-quantization sketch are given after the table below.) |
| Researcher Affiliation | Collaboration | Marlene Careil1,2, Matthew J. Muckley1, Jakob Verbeek1, Stéphane Lathuilière2 1Meta AI 2LTCI, Télécom Paris, IP Paris {marlenec,mmuckley,jjverbeek}@meta.com stephane.lathuiliere@telecom-paris.fr |
| Pseudocode | No | The paper includes a diagram (Figure 2: Overview of PerCo) but no structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers to several third-party libraries and external codebases (e.g., 'https://github.com/tensorflow/compression/tree/master/models/hific', 'https://github.com/lucidrains/vector-quantize-pytorch', 'https://github.com/GaParmar/clean-fid', 'https://github.com/richzhang/PerceptualSimilarity', 'https://github.com/facebookresearch/NeuralCompression'), but there is no explicit statement or link providing the source code for the Per Co model described in this paper. |
| Open Datasets | Yes | We train the hyper encoder and finetune the diffusion model on Open Images (Kuznetsova et al., 2020), similar to Lei et al. (2023) and Muckley et al. (2023). For evaluation, we use the Kodak dataset (Franzen, 1999) as well as MS-COCO 30k. On COCO we evaluate at resolution 256 × 256 by selecting the same images from the 2014 validation set (Lin et al., 2014) as Hoogeboom et al. (2023) and Agustsson et al. (2023). We evaluate at resolution 512 × 512 on the 2017 training set (Caesar et al., 2018), which is the same resolution used for evaluation by Lei et al. (2023), and use captions and label maps for some metrics. |
| Dataset Splits | Yes | On COCO we evaluate at resolution 256 × 256 by selecting the same images from the 2014 validation set (Lin et al., 2014) as Hoogeboom et al. (2023) and Agustsson et al. (2023). |
| Hardware Specification | Yes | We run benchmarking on A100 GPUs using 20 denoising steps, and 5 denoising steps for bitrates higher than 0.05 bits per pixel. |
| Software Dependencies | No | The paper mentions using several libraries (e.g., BLIP-2, IDEFICS, zlib, vector-quantize-pytorch, clean-fid, Perceptual Similarity, Neural Compression), but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | We use random 512 × 512 crops, and instead of finetuning the full U-Net we found it beneficial to only finetune the linear layers of the diffusion model, representing around 15% of all weights. We train our hyper-encoder for 5 epochs with the AdamW optimizer at a peak learning rate of 1e-4 and a weight decay of 0.01, and a batch size of 160. We apply linear warmup for the first 10k training iterations. For bitrates higher than 0.05, we found it beneficial to add an LPIPS loss in image space to reconstruct images more faithfully. To finetune our latent diffusion model we used a grid of 50 timesteps. At inference time, we use 20 denoising steps to obtain good performance for the lowest bitrates. For bitrates of 0.05 and higher, we use five denoising steps as this is enough to obtain optimal performance, see Fig. 8 from the main paper. Besides, we sample images with a deterministic DDIM sampling schedule, i.e. by fixing σt = 0 for all t ∈ [0, T], where T is the number of denoising steps, see Song et al. (2021). (Freezing, optimizer-warmup, LPIPS, and DDIM-step sketches follow the table below.) |
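
The byte budget quoted for the lowest rate follows directly from the definition of bits per pixel. A minimal arithmetic check (paths and names here are illustrative, not from the paper):

```python
# Bits-per-pixel to byte budget for a 512 x 768 Kodak image at 0.003 bpp.
width, height = 512, 768
bpp = 0.003

total_bits = width * height * bpp   # 393216 pixels * 0.003 = 1179.6 bits
total_bytes = total_bits / 8        # ~147.5 bytes, consistent with "less than 153 bytes"
print(f"{total_bytes:.1f} bytes")
```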
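
The paper does not release evaluation code, but it does list the clean-fid repository among the third-party codebases. A hedged sketch of how FID and KID could be computed between originals and reconstructions with that package (directory names are hypothetical):

```python
# Sketch: distribution-matching metrics between original and reconstructed images.
# Assumes `pip install clean-fid` and two folders of images at the same resolution.
from cleanfid import fid

orig_dir = "coco30k/originals"          # hypothetical paths
recon_dir = "coco30k/reconstructions"

fid_score = fid.compute_fid(orig_dir, recon_dir)   # Fréchet Inception Distance
kid_score = fid.compute_kid(orig_dir, recon_dir)   # Kernel Inception Distance
print(f"FID: {fid_score:.2f}  KID: {kid_score:.4f}")
```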
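
The model is conditioned on a vector-quantized latent image representation, and the paper points to the vector-quantize-pytorch repository. A hedged sketch of quantizing a spatial latent with that library; the dimensions and codebook size here are illustrative, not the paper's configuration:

```python
# Sketch: vector-quantizing a latent feature map (pip install vector-quantize-pytorch).
import torch
from vector_quantize_pytorch import VectorQuantize

vq = VectorQuantize(dim=64, codebook_size=256)   # illustrative hyper-parameters

latent = torch.randn(1, 8 * 8, 64)               # an 8x8 spatial grid flattened to a sequence
quantized, indices, commit_loss = vq(latent)     # `indices` are the discrete codes a codec would transmit
```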
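
The setup row states that only the linear layers of the diffusion U-Net are finetuned, around 15% of all weights. A minimal PyTorch sketch of that freezing pattern, with the U-Net itself left abstract (the helper name is hypothetical):

```python
import torch.nn as nn

def freeze_all_but_linear(unet: nn.Module) -> None:
    """Freeze every parameter except those belonging to nn.Linear layers."""
    for p in unet.parameters():
        p.requires_grad = False
    for module in unet.modules():
        if isinstance(module, nn.Linear):
            for p in module.parameters():
                p.requires_grad = True
    trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
    total = sum(p.numel() for p in unet.parameters())
    print(f"Trainable fraction: {trainable / total:.1%}")  # ~15% according to the paper
```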
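
The hyper-encoder recipe (AdamW, peak learning rate 1e-4, weight decay 0.01, 10k-step linear warmup) maps onto standard PyTorch components. A self-contained sketch with a stand-in model and dummy data; the real hyper-encoder architecture and training loss are not specified here:

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

hyper_encoder = nn.Sequential(                 # stand-in module, not the paper's architecture
    nn.Conv2d(4, 64, 3, padding=1), nn.GELU(), nn.Conv2d(64, 4, 3, padding=1)
)

optimizer = AdamW(hyper_encoder.parameters(), lr=1e-4, weight_decay=0.01)
warmup_steps = 10_000
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(100):                        # the paper trains 5 epochs at batch size 160
    x = torch.randn(4, 4, 64, 64)              # dummy latent batch
    loss = (hyper_encoder(x) - x).pow(2).mean()  # placeholder objective
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```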
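
The LPIPS term added for bitrates above 0.05 bpp corresponds to the PerceptualSimilarity repository listed among the dependencies. A hedged usage sketch; the choice of VGG backbone is an assumption, as the paper does not state which network was used:

```python
# Sketch: LPIPS reconstruction loss in image space (pip install lpips).
import torch
import lpips

loss_fn = lpips.LPIPS(net="vgg")                       # backbone choice is an assumption

original = torch.rand(1, 3, 512, 512) * 2 - 1          # images scaled to [-1, 1]
reconstruction = torch.rand(1, 3, 512, 512) * 2 - 1

perceptual_loss = loss_fn(original, reconstruction).mean()
```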
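
Fixing σt = 0 makes the DDIM reverse process deterministic, which is why five to twenty steps suffice at inference. A minimal sketch of a single DDIM update step under that choice, following Song et al. (2021); the noise-prediction network and schedule are left abstract:

```python
import torch

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, sigma_t=0.0):
    """One reverse DDIM step; sigma_t = 0 gives the deterministic sampler used in the paper."""
    # Predicted clean latent from the current noise estimate.
    x0_pred = (x_t - (1 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_bar_t ** 0.5
    # Direction pointing towards x_{t-1}.
    dir_xt = (1 - alpha_bar_prev - sigma_t ** 2) ** 0.5 * eps_pred
    noise = sigma_t * torch.randn_like(x_t)   # vanishes when sigma_t = 0
    return alpha_bar_prev ** 0.5 * x0_pred + dir_xt + noise
```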