Video Diffusion Models with Local-Global Context Guidance

Authors: Siyuan Yang, Lu Zhang, Yu Liu, Zhizhuo Jiang, You He

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments demonstrate that the proposed method achieves favorable performance on video prediction, interpolation, and unconditional video generation." "Our experiments demonstrate that the proposed method achieves state-of-the-art performance on video prediction, as well as favorable performance on interpolation and unconditional video generation."
Researcher Affiliation | Academia | Siyuan Yang¹, Lu Zhang², Yu Liu¹, Zhizhuo Jiang¹ and You He¹; ¹Tsinghua University, ²Dalian University of Technology; yang-sy21@mails.tsinghua.edu.cn, zhangluu@dlut.edu.cn, {liuyu77360132, heyou_f}@126.com, jiangzhizhuo@sz.tsinghua.edu.cn
Pseudocode | No | The paper does not contain a clearly labeled "Pseudocode" or "Algorithm" block; the method is described in text and with equations.
Open Source Code | Yes | "We release code at https://github.com/exisas/LGC-VD."
Open Datasets | Yes | "Cityscapes [Cordts et al., 2016] is a large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities. BAIR Robot Pushing [Ebert et al., 2017] is a common benchmark in the video literature, which consists of roughly 44000 movies of robot pushing motions at 64x64 spatial resolution."
Dataset Splits | Yes | "This package includes a training set of 2975 videos, a validation set of 500 videos, and a test set of 1525 videos, each with 30 frames."
Hardware Specification | Yes | "All of our models are trained with Adam on 4 NVIDIA Tesla V100s with a learning rate of 1e-4 and a batch size of 32 for Cityscapes and 192 for BAIR."
Software Dependencies | No | The paper mentions using "v-prediction [Salimans and Ho, 2022]" to overcome a problem, but it does not provide version numbers for any software dependencies such as libraries or frameworks (e.g., PyTorch, TensorFlow). (The v-prediction parameterization is sketched after the table.)
Experiment Setup | Yes | "All of our models are trained with Adam on 4 NVIDIA Tesla V100s with a learning rate of 1e-4 and a batch size of 32 for Cityscapes and 192 for BAIR. We use the cosine noise schedule in the training phase and set the diffusion step T to 1000. For both datasets, we set the total video length L to 14, the video length N for each stage to 8, and the number of conditional frames K to 2. At testing, we sample 100 steps using DDPM." (A sketch of this configuration and the cosine schedule follows the table.)
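
Note on v-prediction: the parameterization cited in the Software Dependencies row is from Salimans and Ho (2022). A minimal statement, assuming the usual variance-preserving forward process with alpha_t^2 + sigma_t^2 = 1 (the paper under review does not restate these equations):

```latex
z_t = \alpha_t x + \sigma_t \epsilon, \qquad
v \equiv \alpha_t \epsilon - \sigma_t x, \qquad
\hat{x} = \alpha_t z_t - \sigma_t \hat{v}_\theta(z_t)
```

The network \hat{v}_\theta predicts v instead of the noise \epsilon, a target that Salimans and Ho found to be better behaved at high noise levels.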
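
For reference, the snippet below collects the stated hyperparameters and implements the cosine noise schedule, assuming it is the schedule of Nichol and Dhariwal (2021), which is the usual reading of "cosine noise schedule". The function name `cosine_alpha_bar`, the `config` dictionary layout, and the offset s = 0.008 are our assumptions, not taken from the paper or its released code.

```python
import numpy as np

def cosine_alpha_bar(T: int = 1000, s: float = 0.008) -> np.ndarray:
    """Cumulative product alpha_bar_t of the cosine noise schedule
    (Nichol & Dhariwal, 2021), evaluated at t = 0, ..., T."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

# Training/sampling hyperparameters as reported in the paper
# (the dictionary layout itself is ours).
config = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "batch_size": {"Cityscapes": 32, "BAIR": 192},
    "gpus": "4x NVIDIA Tesla V100",
    "diffusion_steps_T": 1000,
    "total_video_length_L": 14,
    "per_stage_length_N": 8,
    "conditional_frames_K": 2,
    "test_sampling_steps": 100,  # DDPM sampling at test time
}

alpha_bar = cosine_alpha_bar(config["diffusion_steps_T"])
# Per-step variances beta_t = 1 - alpha_bar_t / alpha_bar_{t-1},
# clipped at 0.999 as in the reference implementation.
betas = np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
```

Sampling 100 steps from a model trained with T = 1000 typically means the sampler runs on a strided subset of the training timesteps; the paper does not detail its respacing scheme.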