Seer: Language Instructed Video Prediction with Latent Diffusion Models
Authors: Xianfan Gu, Chuan Wen, Weirui Ye, Jiaming Song, Yang Gao
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results on Something-Something V2 (SSv2), Bridge Data and Epic Kitchens-100 datasets demonstrate our superior video prediction performance with around 480 GPU-hours, versus CogVideo with over 12,480 GPU-hours: a 31% FVD improvement over the current SOTA model on SSv2 and an 83.7% average preference in the human evaluation. Our project is available at https://seervideodiffusion.github.io/ |
| Researcher Affiliation | Collaboration | Xianfan Gu (3), Chuan Wen (1,2,3), Weirui Ye (1,2,3), Jiaming Song (4), Yang Gao (1,2,3); (1) IIIS, Tsinghua University, (2) Shanghai Artificial Intelligence Laboratory, (3) Shanghai Qi Zhi Institute, (4) NVIDIA |
| Pseudocode | No | The paper includes figures illustrating the model pipeline and modules but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our project is available at https://seervideodiffusion.github.io/ |
| Open Datasets | Yes | We conduct experiments on three text-video datasets: Something Something-V2 (SSv2) (Goyal et al., 2017), which contains videos of human daily behaviors with language instructions; Bridge Data (Ebert et al., 2021), which is rendered by a photo-realistic kitchen simulator with text prompts; and Epic Kitchens-100 (Damen et al., 2021) (Epic100), which collects human daily kitchen activities from an egocentric view with multi-language narrations. |
| Dataset Splits | Yes | For Bridge Data, we split the dataset into an 80% training set and a 20% validation set for evaluation. We evaluate FVD and KVD on 2,048 SSv2 samples, 5,558 Bridge Data samples and 9,342 Epic100 samples in the validation sets. (A minimal split sketch follows the table.) |
| Hardware Specification | Yes | Compared to over 480 hours with 13 × 8 A100 GPUs in CogVideo (Hong et al., 2023), the experiments show the high efficiency of our method: 120 hours with 4 RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'Stable Diffusion v1.5' and 'CLIP text encoder' but does not provide specific version numbers for these or other ancillary software components (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We train the models with an image resolution of 256 × 256 on Something Something-V2 for 200k training steps, and on Epic Kitchens-100 and Bridge Data for 80k training steps. In the evaluation stage, we speed up the sampling process with the fast DDIM sampler (Song et al., 2020) and conditional guidance of 7.5 for 30 timesteps. See more details in Appendix C. (A hedged sketch of this sampling configuration follows the table.) |
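
The "Dataset Splits" row reports an 80%/20% train/validation split for Bridge Data. The snippet below is a minimal sketch of such a split with PyTorch's `random_split`; the placeholder dataset and the fixed seed are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical 80/20 train/validation split, as described for Bridge Data.
# The dummy dataset and the seed are placeholders; the paper specifies neither.
import torch
from torch.utils.data import random_split

full_set = list(range(100))  # stand-in for the list of Bridge Data clips
n_train = int(0.8 * len(full_set))
train_set, val_set = random_split(
    full_set,
    [n_train, len(full_set) - n_train],
    generator=torch.Generator().manual_seed(0),  # assumed seed for a reproducible split
)
```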
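The "Experiment Setup" row reports sampling with DDIM for 30 timesteps and a conditional-guidance scale of 7.5. The sketch below shows what such a loop typically looks like with the `diffusers` `DDIMScheduler`; the `video_unet` stub, the latent shape, and the embedding tensors are hypothetical placeholders standing in for Seer's inflated video U-Net and CLIP text conditioning, not the released code.

```python
# Minimal sketch of 30-step DDIM sampling with classifier-free guidance 7.5,
# assuming a Stable-Diffusion-style latent video model. `video_unet` and the
# embedding tensors are dummy placeholders, NOT the released Seer model.
import torch
from diffusers import DDIMScheduler

def video_unet(latents, t, cond):
    # placeholder noise predictor; Seer's U-Net also conditions on reference frames
    return torch.randn_like(latents)

text_emb = torch.randn(1, 77, 768)   # CLIP text embedding of the instruction (placeholder)
null_emb = torch.zeros(1, 77, 768)   # empty-prompt embedding for the unconditional branch

scheduler = DDIMScheduler()          # default schedule; the paper's exact config may differ
scheduler.set_timesteps(30)          # fast DDIM sampling: 30 denoising steps
guidance_scale = 7.5                 # conditional guidance weight reported in the paper

latents = torch.randn(1, 4, 12, 32, 32)  # (batch, channels, frames, H/8, W/8) for 256x256 frames

for t in scheduler.timesteps:
    with torch.no_grad():
        eps_cond = video_unet(latents, t, text_emb)    # text-conditioned noise estimate
        eps_uncond = video_unet(latents, t, null_emb)  # unconditional noise estimate
    # classifier-free guidance: extrapolate away from the unconditional prediction
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    latents = scheduler.step(eps, t, latents).prev_sample  # one DDIM denoising step
```

After the loop, the final latents would be decoded to frames with the Stable Diffusion VAE decoder; that step is omitted here.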