Video Diffusion Models

Authors: Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet

NeurIPS 2022

Reproducibility assessment. Each entry below gives the reproducibility variable, the assessed result, and the LLM response quoted as evidence.

Research Type: Experimental
LLM Response: "We report our results on video diffusion models for unconditional video generation (Section 4.1), conditional video generation (Section 4.2), and text-conditioned video generation (Section 4.3). We evaluate our models using standard metrics such as FVD [54], FID [19], and IS [43]; details on evaluation are provided below alongside each benchmark." (A minimal sketch of the Fréchet-distance computation behind FID and FVD follows this table.)

Researcher Affiliation: Industry
LLM Response: "Jonathan Ho jonathanho@google.com, Tim Salimans salimans@google.com, Alexey Gritsenko agritsenko@google.com, William Chan williamchan@google.com, Mohammad Norouzi mnorouzi@google.com, David J. Fleet davidfleet@google.com"

Pseudocode: No
LLM Response: "The paper includes a diagram of the 3D U-Net architecture (Figure 1) but no pseudocode or algorithm blocks." (A hedged sketch of the factorized space-time attention pattern that Figure 1 depicts follows this table.)

Open Source Code: No
LLM Response: "As with prior work in generative modeling, however, our methods have the potential for causing harmful impact and could enhance malicious or unethical uses of generative models, such as fake content generation, harassment, and misinformation spread, and thus we have decided not to release our models."

Open Datasets: Yes
LLM Response: "We use the data loader provided by TensorFlow Datasets [1] without further processing, and we train on all 13,320 videos. ... We evaluate video prediction performance on BAIR Robot Pushing [17]... We additionally evaluate video prediction performance on the Kinetics-600 benchmark [27, 9]." (A sketch of loading UCF101 through TensorFlow Datasets follows this table.)

Dataset Splits: Yes
LLM Response: "For FID and FVD, we report two numbers which are measured against the training and validation sets, respectively. ... We train unconditional models on this dataset at the 64×64 resolution and evaluate on 50 thousand randomly sampled videos from the test set. ... Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]"

Hardware Specification: No
LLM Response: "Architecture hyperparameters, training details, and compute resources are listed in Appendix A." The main paper does not explicitly detail specific hardware.

Software Dependencies: No
LLM Response: "We use the data loader provided by TensorFlow Datasets [1] without further processing... We use the C3D network [51] for calculating FID and IS... we condition the diffusion model on captions in the form of BERT-large embeddings [15]." No version numbers are given for these software components. (A sketch of computing BERT-large caption embeddings follows this table.)

Experiment Setup: Yes
LLM Response: "Table 5 reports results that verify the effectiveness of classifier-free guidance [20] on text-to-video generation. As expected, there is clear improvement in the Inception Score-like metrics with higher guidance weight, while the FID-like metrics improve and then degrade with increasing guidance weight. Similar findings have been reported on text-to-image generation [36]." [Table 5 fragment: frameskip 1, guidance weights 1.0, 2.0, and 5.0; the metric values did not survive extraction.] (A sampling-time sketch of classifier-free guidance follows this table.)

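As context for the evaluation metrics quoted above: FID [19] and FVD [54] both compute the Fréchet distance between Gaussians fitted to feature activations of real and generated samples (Inception features for FID, video-network features for FVD). A minimal sketch, assuming (N, D) arrays of precomputed activations; the function name is illustrative, not from the paper's code:

```python
# Hedged sketch of the Frechet distance underlying FID and FVD: the distance
# between two Gaussians fitted to feature activations of real vs. generated
# samples.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """feats_*: (N, D) arrays of precomputed network activations."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```
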
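Since the paper provides no pseudocode, the following is only a hedged sketch of the factorized space-time attention pattern the 3D U-Net diagram (Figure 1) describes: spatial attention within each frame, then temporal attention across frames at each spatial position. Module and shape choices here are assumptions, not the authors' implementation:

```python
# Hedged sketch of factorized space-time attention for a 3D U-Net block.
import torch
import torch.nn as nn

class FactorizedSpaceTimeAttention(nn.Module):
    def __init__(self, channels, num_heads=8):
        super().__init__()
        # channels must be divisible by num_heads.
        self.spatial = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, height, width, channels)
        b, t, h, w, c = x.shape
        # Spatial attention: attend over the h*w positions within each frame.
        xs = x.reshape(b * t, h * w, c)
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        # Temporal attention: attend over the t frames at each spatial position.
        xt = xs.reshape(b, t, h, w, c).permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)
        xt = xt + self.temporal(xt, xt, xt, need_weights=False)[0]
        return xt.reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)
```
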
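For the Open Datasets entry, a minimal sketch of pulling UCF101 through TensorFlow Datasets, as the quoted passage reports doing. The split and feature names follow current tfds conventions and should be checked against your installed version:

```python
# Hedged sketch of loading UCF101 via TensorFlow Datasets. Depending on the
# tfds version, a builder config may need to be named explicitly; inspect
# tfds.builder("ucf101") to see what is available.
import tensorflow_datasets as tfds

ds = tfds.load("ucf101", split="train", shuffle_files=True)
for example in ds.take(1):
    print(example["video"].shape)  # (num_frames, height, width, 3), uint8
```
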
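For the text-conditioned models, captions are encoded as BERT-large embeddings [15]. The paper does not name a specific library; the sketch below assumes Hugging Face transformers purely for illustration, and "bert-large-uncased" is one plausible checkpoint, not necessarily the one the authors used:

```python
# Hedged sketch of caption conditioning via BERT-large embeddings.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased").eval()

tokens = tokenizer("a robot arm pushes a small object", return_tensors="pt")
with torch.no_grad():
    embeddings = model(**tokens).last_hidden_state  # (1, seq_len, 1024)
```
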
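Finally, for the guidance sweep in Table 5, a minimal sampling-time sketch of classifier-free guidance [20], under the common convention where w = 1.0 recovers the plain conditional model (consistent with 1.0 as the baseline weight in the table). `denoise` is a hypothetical model call, not the authors' API:

```python
# Hedged sketch of classifier-free guidance at sampling time. w = 1.0 gives
# the plain conditional prediction; w > 1.0 pushes samples toward the
# conditional mode, trading diversity for fidelity.
def guided_eps(denoise, z_t, t, caption_emb, w):
    eps_cond = denoise(z_t, t, caption_emb)  # conditioned on the caption
    eps_uncond = denoise(z_t, t, None)       # conditioning dropped
    return eps_uncond + w * (eps_cond - eps_uncond)
```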