Video Diffusion Models
Authors: Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report our results on video diffusion models for unconditional video generation (Section 4.1), conditional video generation (Section 4.2), and text-conditioned video generation (Section 4.3). We evaluate our models using standard metrics such as FVD [54], FID [19], and IS [43]; details on evaluation are provided below alongside each benchmark. *(See the FID/FVD distance sketch after the table.)* |
| Researcher Affiliation | Industry | Jonathan Ho (jonathanho@google.com), Tim Salimans (salimans@google.com), Alexey Gritsenko (agritsenko@google.com), William Chan (williamchan@google.com), Mohammad Norouzi (mnorouzi@google.com), David J. Fleet (davidfleet@google.com) |
| Pseudocode | No | The paper includes a diagram of the 3D U-Net architecture (Figure 1) but no pseudocode or algorithm blocks. |
| Open Source Code | No | As with prior work in generative modeling, however, our methods have the potential for causing harmful impact and could enhance malicious or unethical uses of generative models, such as fake content generation, harassment, and misinformation spread, and thus we have decided not to release our models. |
| Open Datasets | Yes | We use the data loader provided by TensorFlow Datasets [1] without further processing, and we train on all 13,320 videos. ... We evaluate video prediction performance on BAIR Robot Pushing [17]... We additionally evaluate video prediction performance on the Kinetics-600 benchmark [27, 9]. *(See the TFDS loading sketch after the table.)* |
| Dataset Splits | Yes | For FID and FVD, we report two numbers which are measured against the training and validation sets, respectively. ... We train unconditional models on this dataset at the 64×64 resolution and evaluate on 50 thousand randomly sampled videos from the test set. ... Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] |
| Hardware Specification | No | Architecture hyperparameters, training details, and compute resources are listed in Appendix A. The main paper does not explicitly detail specific hardware. |
| Software Dependencies | No | We use the data loader provided by TensorFlow Datasets [1] without further processing... We use the C3D network [51] for calculating FID and IS... we condition the diffusion model on captions in the form of BERT-large embeddings [15]. No version numbers are given for these software components. |
| Experiment Setup | Yes | Table 5 reports results that verify the effectiveness of classifier-free guidance [20] on text-to-video generation. As expected, there is clear improvement in the Inception Score-like metrics with higher guidance weight, while the FID-like metrics improve and then degrade with increasing guidance weight. Similar findings have been reported on text-to-image generation [36]. [Table 5 residue: "Frameskip" and "Guidance weight" columns, with frameskip 1 and guidance weights 1.0, 2.0, and 5.0; the metric values themselves are not recoverable.] *(See the guidance sketch after the table.)* |
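
The Research Type row cites FID [19] and FVD [54]; both are Fréchet distances between Gaussian fits of feature activations (Inception features for FID, I3D features for FVD, and the paper additionally uses a C3D network for its video FID/IS metrics). As a minimal sketch of the distance computation itself, assuming feature vectors have already been extracted, and with `frechet_distance` as a hypothetical helper name rather than the paper's code:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussian fits of two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim), e.g.
    Inception activations (FID) or I3D activations (FVD).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```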
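
The Open Datasets row quotes the paper's use of TensorFlow Datasets for UCF101. A minimal sketch of that loading path, assuming the public `ucf101` builder from the TFDS catalog (the exact config, and any preprocessing beyond "without further processing", are not stated in the paper):

```python
import tensorflow_datasets as tfds

# Load UCF101 through TensorFlow Datasets; builder name and feature
# layout follow the public TFDS catalog, not the authors' code.
ds = tfds.load('ucf101', split='train', shuffle_files=True)

for example in ds.take(1):
    video = example['video']  # uint8 tensor, (num_frames, height, width, 3)
    label = example['label']  # integer action class
    print(video.shape, int(label))
```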
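
The Experiment Setup row refers to classifier-free guidance [20], which adjusts the diffusion model's noise prediction by extrapolating from the unconditional toward the conditional prediction. A minimal NumPy sketch of the guidance rule follows; names are illustrative, and how the paper's reported guidance weights map onto `w` depends on its convention:

```python
import numpy as np

def classifier_free_guidance(eps_cond: np.ndarray,
                             eps_uncond: np.ndarray,
                             w: float) -> np.ndarray:
    """Adjusted noise prediction: (1 + w) * eps_cond - w * eps_uncond.

    w = 0 recovers the purely conditional model; larger w strengthens
    the conditioning signal at the cost of sample diversity.
    """
    return (1.0 + w) * eps_cond - w * eps_uncond

# Toy usage with dummy predictions for a 16-frame 64x64 RGB video.
eps_c = np.random.randn(16, 64, 64, 3)
eps_u = np.random.randn(16, 64, 64, 3)
eps_guided = classifier_free_guidance(eps_c, eps_u, w=2.0)
```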