Deep Hierarchical Video Compression
Authors: Ming Lu, Zhihao Duan, Fengqing Zhu, Zhan Ma
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section headings "Experimental Results" and "Evaluation"; "All the evaluation experiments are performed under the low-delay configuration."; "Figure 3: Compression efficiency comparison using rate-distortion (R-D) curves." |
| Researcher Affiliation | Academia | Ming Lu (1,2), Zhihao Duan (3), Fengqing Zhu (3), and Zhan Ma (1, corresponding). (1) School of Electronic Science and Engineering, Nanjing University; (2) Interdisciplinary Research Center for Future Intelligent Chips (Chip-X), Nanjing University; (3) Elmore Family School of Electrical and Computer Engineering, Purdue University |
| Pseudocode | No | The paper does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a direct statement about releasing source code for their proposed method or a link to a code repository. |
| Open Datasets | Yes | Datasets: We use the popular Vimeo-90K (Xue et al. 2019) dataset to train our model, which consists of 64,612 video samples. Commonly used test datasets, i.e., the UVG (Mercat, Viitanen, and Vanne 2020), MCL-JCV (Wang et al. 2016), and HEVC Class B, C, D, and E (Bossen et al. 2013), are used for evaluation. |
| Dataset Splits | No | The paper mentions using Vimeo-90K for training and several other datasets for evaluation, but it does not specify explicit training/validation/test splits, percentages, or sample counts for these datasets. |
| Hardware Specification | Yes | Our model is trained using two Nvidia RTX 3090, and the batch size is fixed at eight. We perform evaluations on a single RTX 3090-24G GPU. |
| Software Dependencies | No | The paper mentions using Adam as the optimizer but does not specify software dependencies like programming language, libraries, or frameworks with their version numbers (e.g., Python version, PyTorch/TensorFlow version). |
| Experiment Setup | Yes | We set λ from {256, 512, 1024, 2048} and {4, 8, 16, 32} for respective MSE and MS-SSIM optimized models to cover wide rate ranges. Adam (Kingma and Ba 2014) is the optimizer with the learning rate at 10⁻⁴. Our model is trained using two Nvidia RTX 3090, and the batch size is fixed at eight. We progressively train our model for fast convergence. First, the model is trained to encode a single frame independently for 2M iterations by setting the temporal prior at each scale level as a learnable bias. Then, we train the aforementioned model for 500K steps using three successive frames, with temporal priors hierarchically generated from previously-decoded frames. In the end, another 100K steps are applied to fine-tune the model using five successive frames, for which it better captures long-term temporal dependence (Liu et al. 2020a). Training batches comprise sequential frames that are randomly cropped to the size of 256×256. The first 96 frames of each video are used for evaluation, and the group of pictures (GOP) is set at 32. |
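The progressive training schedule extracted above can be captured as a small configuration sketch. This is an illustrative reconstruction under stated assumptions: the stage names, `Stage` structure, and `total_iterations` helper are hypothetical and not taken from the authors' code; only the numbers (λ sets, frame counts, iteration counts) come from the quoted setup.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One phase of the progressive training schedule (structure is an assumption)."""
    name: str
    frames: int      # successive frames per training sample
    iterations: int  # optimizer steps in this stage

# Rate-distortion trade-off weights reported in the paper.
LAMBDAS_MSE = [256, 512, 1024, 2048]
LAMBDAS_MSSSIM = [4, 8, 16, 32]

# Progressive schedule as described: intra-only warm-up, then inter coding
# with hierarchical temporal priors, then a long-sequence fine-tune.
SCHEDULE = [
    Stage("intra_only", frames=1, iterations=2_000_000),     # temporal prior = learnable bias
    Stage("inter_3frame", frames=3, iterations=500_000),     # priors from decoded frames
    Stage("finetune_5frame", frames=5, iterations=100_000),  # long-term temporal dependence
]

def total_iterations(schedule):
    """Total optimizer steps across all progressive stages."""
    return sum(s.iterations for s in schedule)
```

A training loop would iterate over `SCHEDULE` in order, resuming the model weights from the previous stage and resampling clips at the new frame count; learning rate (10⁻⁴, Adam), batch size (8), and 256×256 crops stay fixed throughout per the quoted setup.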