VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation

Authors: Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, Durk Kingma

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results show that VideoFlow achieves results that are competitive with the state-of-the-art in stochastic video prediction on the action-free BAIR dataset, with quantitative results that rival the best VAE-based models. VideoFlow also produces excellent qualitative results, and avoids many of the common artifacts of models that use pixel-level mean-squared error for training (e.g., blurry predictions), without the challenges associated with training adversarial models. We use VideoFlow to model the Stochastic Movement Dataset used in (Babaeizadeh et al., 2017). We compare our model with two state-of-the-art stochastic video generation models, SV2P and SAVP-VAE (Babaeizadeh et al., 2017; Lee et al., 2018), using their Tensor2Tensor implementation (Vaswani et al., 2018). We assess the quality of the generated videos using a real vs. fake Amazon Mechanical Turk test. We train the baseline models, SAVP-VAE, SV2P and SVG-LP, to generate 10 target frames, conditioned on 3 input frames. We extract random temporal patches of 4 frames, and train VideoFlow to maximize the log-likelihood of the 4th frame given a context of 3 past frames (a minimal sketch of this patch-sampling setup appears after the table). We evaluate VideoFlow using the recently proposed Fréchet Video Distance (FVD) metric (Unterthiner et al., 2018).
Researcher Affiliation | Industry | Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, Durk Kingma; Google Research, Brain Team; {mechcoder,mbz,dumitru,chelseaf,slevine,laurentdinh,durk}@google.com
Pseudocode | No | The paper describes the model architecture and its components with equations and textual descriptions (e.g., Sections 4, 4.1, and 4.2, and Appendix D), but it does not include any clearly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | We open-source the implementation of our code in the Tensor2Tensor codebase. We additionally open-source various components of our trained VideoFlow model, to evaluate log-likelihood, to generate frames and compute latent codes, as reusable TFHub modules (see the TFHub loading sketch below the table).
Open Datasets | Yes | We use VideoFlow to model the Stochastic Movement Dataset used in (Babaeizadeh et al., 2017). We use the action-free version of the BAIR robot pushing dataset (Ebert et al., 2017), which contains videos of a Sawyer robotic arm at 64x64 resolution. Similar to the Stochastic Movement Dataset as described in Section 5.1, we extract random temporal patches of 2 frames on the Moving MNIST dataset (Srivastava et al., 2015). We model the Human3.6M dataset (Ionescu et al., 2014).
Dataset Splits | No | The paper mentions training on "random temporal patches", evaluating on a holdout BAIR action-free dataset, and tuning on a "validation set" (e.g., "optimal temperature tuned on the validation set using VGG similarity metrics"). However, it does not provide specific percentages or counts for the training, validation, or test splits of any of the datasets used.
Hardware Specification | Yes | We generate 64x64 videos of 20 frames in less than 3.5 seconds on an NVIDIA P100 GPU, as compared to the fastest autoregressive model for video (Reed et al., 2017), which generates a frame every 3 seconds (a quick arithmetic check of this speedup appears below the table).
Software Dependencies | No | The paper mentions using the "Tensor2Tensor codebase" and the "Adam optimizer", but it does not specify version numbers for these or any other software components, such as programming languages or libraries.
Experiment Setup | Yes | H.1 Quantitative bits-per-pixel: Flow levels: 3; Flow steps per level: 24; Coupling: Affine; Coupling layer channels: 512; Optimizer: Adam; Batch size: 40; Learning rate: 3e-4; 3-D residual blocks: 5; 3-D residual channels: 256; Training steps: 600K. H.2 Qualitative experiments: Flow levels: 3; Flow steps per level: 24; Coupling: Additive; Coupling layer channels: 392; Optimizer: Adam; Batch size: 40; Learning rate: 3e-4; 3-D residual blocks: 5; 3-D residual channels: 256; Training steps: 500K. (Both configurations are transcribed as a config dict below the table.)
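To make the training setup quoted under Research Type concrete, here is a minimal NumPy sketch, under assumed array shapes, of sampling a random 4-frame temporal patch and splitting it into 3 context frames plus the target frame whose conditional log-likelihood VideoFlow maximizes. The function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def sample_patch(video: np.ndarray, patch_len: int = 4):
    """video: (T, H, W, C) array; returns (context, target)."""
    t0 = np.random.randint(0, video.shape[0] - patch_len + 1)
    patch = video[t0:t0 + patch_len]           # 4 consecutive frames
    context, target = patch[:-1], patch[-1]    # 3 past frames, 1 target frame
    return context, target

# Training objective (schematic): maximize log p(target | context),
# i.e. minimize the negative log-likelihood in bits per pixel:
#   bpp = -log2 p(target | context) / (H * W * C)
video = np.random.randint(0, 256, size=(30, 64, 64, 3), dtype=np.uint8)
context, target = sample_patch(video)
print(context.shape, target.shape)  # (3, 64, 64, 3) (64, 64, 3)
```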
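The Open Source Code entry mentions reusable TFHub modules for evaluating log-likelihood, generating frames, and computing latent codes. Below is a hedged sketch of how such a module would be loaded with the TF1-era tensorflow_hub API that Tensor2Tensor targets; the module handle, input shape, and output signature are placeholders, not the published ones.

```python
import tensorflow as tf          # TF1-style graph API, as used by Tensor2Tensor
import tensorflow_hub as hub

# Placeholder handle: the actual published module path is not reproduced here.
module = hub.Module("<videoflow-tfhub-handle>")

# Hypothetical input: a batch of 3 conditioning frames at 64x64x3.
frames = tf.placeholder(tf.float32, shape=[None, 3, 64, 64, 3])
outputs = module(frames)         # the real signature depends on the released module

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # sess.run(outputs, feed_dict={frames: ...})
```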
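A quick arithmetic check of the Hardware Specification numbers, assuming the comparison is per frame:

```python
# 20 frames in under 3.5 s for VideoFlow vs. one frame every 3 s for the
# autoregressive baseline (Reed et al., 2017).
videoflow_s_per_frame = 3.5 / 20                     # 0.175 s per frame
baseline_s_per_frame = 3.0
print(baseline_s_per_frame / videoflow_s_per_frame)  # ~17x faster per frame
```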
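Finally, the two Experiment Setup configurations transcribed as a plain Python dict for readability; the key names are illustrative, not actual Tensor2Tensor hparam names.

```python
# H.1 quantitative bits-per-pixel configuration, as listed in the table above.
quantitative_bpp = dict(
    flow_levels=3, flow_steps_per_level=24, coupling="affine",
    coupling_channels=512, optimizer="adam", batch_size=40,
    learning_rate=3e-4, residual_blocks_3d=5, residual_channels_3d=256,
    training_steps=600_000,
)

# H.2 qualitative configuration: identical except for the coupling type,
# number of coupling channels, and training steps.
qualitative = dict(
    quantitative_bpp, coupling="additive", coupling_channels=392,
    training_steps=500_000,
)
```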