Flexible Diffusion Modeling of Long Videos
Authors: William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, Frank Wood
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform our main comparisons on the video completion task. In keeping with Saxena et al. [26], we condition on the first 36 frames of each video and sample the remainder. We present results on three datasets: GQN-Mazes [10], in which videos are 300 frames long; MineRL Navigate [13, 26] (which we will from now on refer to as simply MineRL), in which videos are 500 frames long; and the CARLA Town01 dataset we release, for which videos are 1000 frames long. |
| Researcher Affiliation | Collaboration | William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, Frank Wood; Department of Computer Science, University of British Columbia, Vancouver, Canada; {wsgh,saeidnp,vadmas,weilbach,fwood}@cs.ubc.ca. Frank Wood is also affiliated with the Montréal Institute for Learning Algorithms (Mila) and Inverted AI. |
| Pseudocode | Yes | Algorithm 1: Sample a video v given a sampling scheme [(X_s, Y_s)]_{s=1}^S. Algorithm 2: Sampling training tasks X, Y ~ u(·) given N, K. (Hedged sketches of both algorithms appear below the table.) |
| Open Source Code | Yes | We include our released source code but leave its exposition to the supplementary material. and https://github.com/plai-group/flexible-video-diffusion-modeling |
| Open Datasets | Yes | We additionally release a new video modeling dataset and semantically meaningful metrics based on videos generated in the CARLA autonomous driving simulator. and We release the CARLA Town01 dataset along with code and our trained regression model to allow future comparisons. and We present results on three datasets: GQN-Mazes [10], in which videos are 300 frames long; MineRL Navigate [13, 26] (which we will from now on refer to as simply MineRL), in which videos are 500 frames long; and the CARLA Town01 dataset we release, for which videos are 1000 frames long. |
| Dataset Splits | No | The paper mentions '408 training and 100 test videos' for the CARLA Town01 dataset, but it does not explicitly state the use of a separate validation set or specify validation splits for any of the datasets. |
| Hardware Specification | No | The paper states that computational resources were provided by WestGrid, Compute Canada, and Advanced Research Computing at the University of British Columbia, and mentions that details are in the appendix, but it does not provide specific hardware models (e.g., GPU/CPU types) in the main text. |
| Software Dependencies | No | The paper mentions that the work is implemented in PyTorch and provides a link to its source code, but it does not explicitly list specific version numbers for ancillary software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We train FDM in all cases with the maximum number of represented frames K = 20. |
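
For concreteness, here is a minimal Python sketch of Algorithm 1 as quoted above: a long video is generated stage by stage, where each stage s conditions on previously generated frames at indices X_s and jointly samples frames at indices Y_s. The `model.sample(cond, cond_idx, target_idx)` interface and the tensor shapes are our assumptions for illustration, not the paper's actual API.

```python
import torch

def sample_video(model, n_frames, sampling_scheme, frame_shape=(3, 64, 64)):
    # Hypothetical sketch of Algorithm 1. `sampling_scheme` is a list of
    # (X_s, Y_s) index pairs: X_s are indices of already-generated frames
    # to condition on, Y_s are indices of frames to sample at stage s.
    video = torch.zeros(n_frames, *frame_shape)
    for X_s, Y_s in sampling_scheme:               # S sequential stages
        cond = video[X_s]                          # observed frames (may be empty)
        video[Y_s] = model.sample(cond, X_s, Y_s)  # assumed conditional diffusion call
    return video
```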
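Similarly, a hedged sketch of Algorithm 2's training-task sampling, using the K = 20 frame budget from the Experiment Setup row. The uniform choices below are an illustrative stand-in for the paper's actual distribution u(·) over tasks:

```python
import random

def sample_training_task(n_frames, k_max=20):
    # Hedged sketch of Algorithm 2: draw a training task (X, Y) ~ u(.)
    # given video length N and frame budget K (the paper trains FDM with
    # K = 20).  The uniform splits here are an assumption, not the
    # paper's task distribution.
    frames = sorted(random.sample(range(n_frames), k_max))  # K represented frames
    n_obs = random.randint(0, k_max - 1)                    # leave >= 1 latent frame
    X = sorted(random.sample(frames, n_obs))                # observed indices
    Y = [i for i in frames if i not in X]                   # indices to denoise
    return X, Y
```

At training time each task (X, Y) determines which of the K represented frames the model conditions on and which it denoises; at sampling time, a sequence of such tasks forms the sampling scheme consumed by the first sketch.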