MCVD - Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation
Authors: Vikram Voleti, Alexia Jolicoeur-Martineau, Chris Pal
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our approach yields SOTA results across standard video prediction and interpolation benchmarks |
| Researcher Affiliation | Collaboration | Vikram Voleti (Mila, University of Montreal, Canada, vikram.voleti@umontreal.ca); Alexia Jolicoeur-Martineau* (Mila, University of Montreal, Canada, alexia.jolicoeur-martineau@mail.mcgill.ca); Christopher Pal (Mila, Polytechnique Montreal, Canada CIFAR AI Chair, ServiceNow Research) |
| Pseudocode | No | The paper includes network architecture diagrams (Figure 3) and mathematical formulations but no explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://mask-cond-video-diffusion.github.io/ |
| Open Datasets | Yes | We show the results of our video prediction experiments on test data that was never seen during training in Tables 1-4 for Stochastic Moving MNIST (SMMNIST) [2], KTH [3], BAIR [4], and Cityscapes [5], respectively. We present unconditional generation results for BAIR in Table 5 and UCF-101 [6] in Table 6, and interpolation results for SMMNIST, KTH, and BAIR in Table 7. |
| Dataset Splits | Yes | For UCF101, each video clip is center-cropped at 240×240 and resized to 64×64, taking care to maintain the train-test splits. (A preprocessing sketch is given after the table.) |
| Hardware Specification | No | Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using 4 GPUs. ...we were limited to 4 GPUs for our work here. |
| Software Dependencies | No | The paper does not explicitly list software dependencies with specific version numbers. |
| Experiment Setup | Yes | Unless otherwise specified, we set the mask probability to 0.5 when masking was used. For sampling, we report results using the sampling methods DDPM [Ho et al., 2020] or DDIM [Song et al., 2020] with only 100 sampling steps, though our models were trained with 1000, to make sampling faster. ...all our models are trained to predict only 4-5 current frames at a time. (A masking sketch is given after the table.) |
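
The Dataset Splits row above quotes the UCF-101 preprocessing (center-crop to 240×240, then resize to 64×64). A minimal per-frame sketch of that step using standard torchvision transforms is shown below; it is an illustrative reconstruction under those two stated operations, not the authors' released pipeline.

```python
import torchvision.transforms as T

# Hypothetical reconstruction of the UCF-101 frame preprocessing quoted above:
# center-crop each frame to 240x240, then resize to 64x64 (not the authors' code).
preprocess = T.Compose([
    T.CenterCrop(240),  # square crop from the 320x240 source frames
    T.Resize(64),       # downsample the square crop to 64x64
    T.ToTensor(),       # HWC uint8 -> CHW float32 in [0, 1]
])
# Usage (per frame): tensor_frame = preprocess(pil_frame)
```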
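
The Experiment Setup row mentions a mask probability of 0.5 on the conditioning frames, which is what lets one model cover prediction, generation, and interpolation. The sketch below shows one plausible way such masking could be implemented in PyTorch; the function name, tensor layout (B, T, C, H, W), and zero-out convention are assumptions rather than details taken from the released code.

```python
import torch

def mask_conditioning(past, future, p_mask=0.5):
    """Hypothetical MCVD-style masking of conditioning frames (not the authors' code).

    Past and future conditioning blocks of shape (B, T, C, H, W) are each zeroed
    out independently with probability p_mask, so the same model sees prediction
    (future masked), unconditional generation (both masked), and interpolation
    (neither masked) during training.
    """
    b = past.shape[0]
    keep_past = (torch.rand(b, 1, 1, 1, 1, device=past.device) >= p_mask).float()
    keep_future = (torch.rand(b, 1, 1, 1, 1, device=future.device) >= p_mask).float()
    return past * keep_past, future * keep_future
```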