VCT: A Video Compression Transformer

Authors: Fabian Mentzer, George D Toderici, David Minnen, Sergi Caelles, Sung Jin Hwang, Mario Lucic, Eirikur Agustsson

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on synthetic data show that our model learns to handle complex motion patterns such as panning, blurring and fading purely from data. Our approach is easy to implement, and we release code to facilitate future research."
Researcher Affiliation | Industry | Fabian Mentzer (Google Research, mentzer@google.com); George Toderici (Google Research, gtoderici@google.com); David Minnen (Google Research, dminnen@google.com); Sung Jin Hwang (Google Research, sjhwang@google.com); Sergi Caelles (Google Research, scaelles@google.com); Mario Lucic (Google Research, lucic@google.com); Eirikur Agustsson (Google Research, eirikur@google.com)
Pseudocode | No | The paper describes its methods textually and with diagrams (Fig. 1, 2, 3) but does not include any pseudocode or formal algorithm blocks.
Open Source Code | No | "We cannot release training data but will release code if the paper is published."
Open Datasets | Yes | "We evaluate on two common benchmark data sets: (1) MCL-JCV [36, MIT Licence] made up of thirty 1080p videos captured at either 25 or 30 FPS and averaging 137 frames per video, and (2) UVG [25, CC-BY-NC Licence] containing twelve 1080p 120 FPS videos with either 300 or 600 frames each."
Dataset Splits | No | The paper describes training on 'one million Internet video clips' and evaluating on MCL-JCV and UVG, but does not explicitly provide train/validation/test splits or percentages.
Hardware Specification | Yes | "We train all models on 4 Google Cloud TPUv4 chips. To obtain runtimes of the transformers (T_sep, T_joint, T_cur) and the decoder (D), we employ a Google Cloud TPU v4 (single core) using Flax [16], which has an efficient implementation for autoregressive transformers. We use TensorFlow Compression to measure time spent entropy coding (EC) on an Intel Skylake CPU core." (A timing sketch follows the table.)
Software Dependencies | No | The paper mentions using Flax and TensorFlow Compression but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "To train, we use random spatio-temporal crops of (B, N_F, 256, 256, 3) pixels, where B is the batch size and N_F the number of frames (values are given in Tab. 1). We use a linearly decaying learning rate (LR) schedule with warmup, where we warm up for 10k steps and then linearly decay from the LR shown in the table to 1e-5. Stage I is trained using λ = 0.01. To navigate the rate-distortion trade-off and obtain results for multiple rates, we fine-tune 9 models in Stage III, using λ = 0.01 · 2^i, i ∈ {−3, ..., 5}." (A sketch of this schedule and λ sweep follows the table.)
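
The hardware row above describes the timing setup only at the device level. As a minimal sketch of how such runtimes might be reproduced for a compiled Flax/JAX model, the snippet below times a function with explicit device synchronization; the name time_fn and its arguments are illustrative, not from the paper.

    import time
    import jax

    def time_fn(fn, *args, n_warmup=3, n_runs=10):
        # The first calls trigger XLA compilation; run and discard them.
        for _ in range(n_warmup):
            jax.block_until_ready(fn(*args))
        start = time.perf_counter()
        for _ in range(n_runs):
            # block_until_ready forces JAX's asynchronous dispatch to finish
            # before the clock stops, so we measure actual device time.
            jax.block_until_ready(fn(*args))
        return (time.perf_counter() - start) / n_runs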
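
The quoted training schedule is concrete enough to sketch. Assuming optax (the paper does not state which optimizer library was used), a warmup-then-linear-decay schedule and the Stage III λ sweep could look as follows; base_lr and total_steps stand in for the per-model values of Tab. 1, which are not reproduced here.

    import optax

    def make_lr_schedule(base_lr, total_steps, warmup_steps=10_000):
        # Linear warmup from 0 to base_lr over the first 10k steps,
        # then linear decay from base_lr down to 1e-5 as quoted above.
        warmup = optax.linear_schedule(
            init_value=0.0, end_value=base_lr, transition_steps=warmup_steps)
        decay = optax.linear_schedule(
            init_value=base_lr, end_value=1e-5,
            transition_steps=total_steps - warmup_steps)
        return optax.join_schedules([warmup, decay], boundaries=[warmup_steps])

    # Stage III rate-distortion sweep: lambda = 0.01 * 2^i for i in {-3, ..., 5},
    # i.e. nine fine-tuned models spanning rates from 0.00125 to 0.32.
    lambdas = [0.01 * 2.0**i for i in range(-3, 6)]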