MCVD - Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation

Authors: Vikram Voleti, Alexia Jolicoeur-Martineau, Chris Pal

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our approach yields SOTA results across standard video prediction and interpolation benchmarks.
Researcher Affiliation | Collaboration | Vikram Voleti (Mila, University of Montreal, Canada; vikram.voleti@umontreal.ca); Alexia Jolicoeur-Martineau* (Mila, University of Montreal, Canada; alexia.jolicoeur-martineau@mail.mcgill.ca); Christopher Pal (Mila, Polytechnique Montreal, Canada; CIFAR AI Chair; ServiceNow Research)
Pseudocode | No | The paper includes network architecture diagrams (Figure 3) and mathematical formulations but no explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://mask-cond-video-diffusion.github.io/
Open Datasets | Yes | We show the results of our video prediction experiments on test data that was never seen during training in Tables 1-4 for Stochastic Moving MNIST (SMMNIST) [2], KTH [3], BAIR [4], and Cityscapes [5], respectively. We present unconditional generation results for BAIR in Table 5 and UCF-101 [6] in Table 6, and interpolation results for SMMNIST, KTH, and BAIR in Table 7.
Dataset Splits | Yes | For UCF-101, each video clip is center-cropped at 240×240 and resized to 64×64, taking care to maintain the train-test splits. (See the preprocessing sketch below the table.)
Hardware Specification | No | Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using 4 GPUs. ...we were limited to 4 GPUs for our work here.
Software Dependencies | No | The paper does not explicitly list software dependencies with specific version numbers.
Experiment Setup | Yes | Unless otherwise specified, we set the mask probability to 0.5 when masking was used. For sampling, we report results using the sampling methods DDPM [Ho et al., 2020] or DDIM [Song et al., 2020] with only 100 sampling steps, though our models were trained with 1000, to make sampling faster. ...all our models are trained to predict only 4-5 current frames at a time. (See the masking and sampling sketch below the table.)
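As a reading aid for the dataset-splits quote above, here is a minimal sketch (assuming PyTorch; not the authors' code) of the described UCF-101 preprocessing: center-crop each frame to 240×240, then resize to 64×64. The function name, tensor layout, and the dummy clip are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def preprocess_clip(frames: torch.Tensor, crop: int = 240, size: int = 64) -> torch.Tensor:
    """Center-crop each frame to crop x crop, then resize to size x size."""
    t, h, w, c = frames.shape                                   # (T, H, W, C), uint8
    top, left = (h - crop) // 2, (w - crop) // 2
    cropped = frames[:, top:top + crop, left:left + crop, :]    # (T, crop, crop, C)
    chw = cropped.permute(0, 3, 1, 2).float() / 255.0           # (T, C, crop, crop) in [0, 1]
    return F.interpolate(chw, size=(size, size), mode="bilinear", align_corners=False)

# Example: a dummy 16-frame clip at UCF-101's native 240x320 resolution.
clip = torch.randint(0, 256, (16, 240, 320, 3), dtype=torch.uint8)
print(preprocess_clip(clip).shape)  # torch.Size([16, 3, 64, 64])
```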
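The experiment-setup quote mentions a mask probability of 0.5 and DDIM sampling with 100 steps from a model trained with 1000. Below is a minimal, hedged sketch of how such conditioning-frame masking and timestep subsampling can look in practice; the helper name, tensor shapes, and the independent past/future masking are assumptions for illustration, not the authors' implementation.

```python
import torch

def mask_conditioning(past: torch.Tensor, future: torch.Tensor, p_mask: float = 0.5):
    """Independently zero out the past/future conditioning frames with prob p_mask,
    so a single model can be trained for prediction (past kept only),
    unconditional generation (neither kept), and interpolation (both kept)."""
    b = past.shape[0]
    keep_past = (torch.rand(b, 1, 1, 1, 1, device=past.device) >= p_mask).float()
    keep_future = (torch.rand(b, 1, 1, 1, 1, device=future.device) >= p_mask).float()
    return past * keep_past, future * keep_future

# Example: 8 clips with 2 past and 2 future 64x64 RGB conditioning frames (B, T, C, H, W).
past = torch.randn(8, 2, 3, 64, 64)
future = torch.randn(8, 2, 3, 64, 64)
masked_past, masked_future = mask_conditioning(past, future)

# "100 sampling steps, though our models were trained with 1000": one common
# DDIM-style choice is to run the reverse process only on an evenly spaced
# subset of the 1000 training timesteps.
train_steps, sample_steps = 1000, 100
timesteps = torch.linspace(0, train_steps - 1, steps=sample_steps).long()
print(timesteps[:5])  # tensor([ 0, 10, 20, 30, 40])
```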