MIMT: Masked Image Modeling Transformer for Video Compression

Authors: Jinxi Xiang, Kuan Tian, Jun Zhang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that the proposed MIMT framework, equipped with the new transformer entropy model, achieves state-of-the-art performance on the HEVC, UVG, and MCL-JCV datasets, generally outperforming VVC in terms of PSNR and SSIM.
Researcher Affiliation | Industry | Tencent AI Lab, Shenzhen. {jinxixiang,kuantian,junejzhang,haroldhan,willyang}@tencent.com
Pseudocode | Yes | Algorithm 1: MIMT Iterative Decoding (a hedged sketch of this style of decoding follows the table).
Open Source Code | No | The paper states: 'We obtain the open-source code of DVC, SSF, DCVC, and DMC for decoding efficiency comparison.' However, it does not state that source code for the proposed MIMT model itself is publicly available.
Open Datasets | Yes | We use Vimeo-90k (Xue et al., 2019) for training. The test videos include the HEVC Class B, UVG (Mercat et al., 2020), and MCL-JCV (Wang et al., 2016) datasets.
Dataset Splits | No | The paper mentions 'The test videos include HEVC Class B, UVG (Mercat et al., 2020), and MCL-JCV (Wang et al., 2016) datasets.' but does not explicitly define training, validation, and test splits with percentages or counts.
Hardware Specification | Yes | We set the batch size as 8, using the Adam optimizer on a single V100 GPU.
Software Dependencies | No | The paper describes the model architecture and training process but does not specify software dependencies with version numbers (e.g., specific Python, library, or framework versions).
Experiment Setup | Yes | We set the GOP size as 32 for all datasets and use a learned model (Cheng et al., 2020) for I-frame compression. We train four models with different λ values {256, 512, 1024, 2048}. By default, models are trained with the MSE loss; when using the MS-SSIM metric, the model is fine-tuned with the MS-SSIM loss. Training uses multiple frames (up to 7) and 256×256 patches. In the first stage, two consecutive frames are used to train the model for 1M steps. Then the MIMT entropy model is added and trained for 1M steps. Finally, the sequence length is extended to 7 frames for 300K steps. The learning rate is set as 5e-5. We set the batch size as 8, using the Adam optimizer on a single V100 GPU. (A configuration sketch collecting these settings follows below.)