Clockwork Variational Autoencoders
Authors: Vaibhav Saxena, Jimmy Ba, Danijar Hafner
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the benefits of both hierarchical latents and temporal abstraction on 4 diverse video prediction datasets with sequences of up to 1000 frames, where CW-VAE outperforms top video prediction models. Additionally, we propose a Minecraft benchmark for long-term video prediction. We conduct several experiments to gain insights into CW-VAE and confirm that slower levels learn to represent objects that change more slowly in the video, and faster levels learn to represent faster objects. |
| Researcher Affiliation | Collaboration | Vaibhav Saxena, University of Toronto, vaibhav@cs.toronto.edu; Jimmy Ba, University of Toronto, jba@cs.toronto.edu; Danijar Hafner, University of Toronto and Google Research, Brain Team, mail@danijar.com |
| Pseudocode | No | To implement this distribution and its inference model, CW-VAE utilizes the following components, for $l \in [1, L]$ and $t \in \mathcal{T}_l$: Encoder $e^l_t = e(x_{t:t+k^{l-1}-1})$; Posterior transition $q^l_t = q(s^l_t \mid s^l_{t-1}, s^{l+1}_t, e^l_t)$; Prior transition $p^l_t = p(s^l_t \mid s^l_{t-1}, s^{l+1}_t)$; Decoder $p(x_t \mid s^1_t)$. Training objective: Because we cannot compute the likelihood of the training data under the model in closed form, we use the ELBO as our training objective. (A toy sketch of these components follows below the table.) |
| Open Source Code | Yes | All code is publicly available at https://github.com/vaibhavsaxena11/cwvae. |
| Open Datasets | Yes | The MineRL Navigate dataset (available under the CC Attribution-NonCommercial-ShareAlike 4.0 license) was crowdsourced by Guss et al. (2019) for reinforcement learning applications. ... The KTH Action video prediction dataset (Schuldt et al., 2004) (available under the CC Attribution-NonCommercial license) ... GQN Mazes (Eslami et al., 2018) (available under the Apache License 2.0) ... For Moving MNIST (Srivastava et al., 2015) |
| Dataset Splits | No | We train all models on all datasets for 300 epochs on training sequences of 100 frames of size 64×64 pixels. |
| Hardware Specification | Yes | A 3-level CW-VAE with abstraction factor 6 takes 2.5 days to train on one Nvidia Titan Xp GPU. |
| Software Dependencies | No | We use one GRU (Cho et al., 2014) per level to update the deterministic variable at every active step. All components of Equation 4 jointly optimize Equation 5 by stochastic backprop with reparameterized sampling (Kingma and Welling, 2013; Rezende et al., 2014). (A toy GRU sketch follows below the table.) |
| Experiment Setup | Yes | We train all models on all datasets for 300 epochs on training sequences of 100 frames of size 64×64 pixels. For the baselines, we tune the learning rate in the range $[10^{-4}, 10^{-3}]$ and the decoder stddev in the range $[0.1, 1]$. We use a temporal abstraction factor of 6 per level for CW-VAE, unless stated otherwise. Refer to Appendix D for hyperparameters and training durations. (The active-step schedule this factor implies is illustrated in the last snippet below.) |
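
The Pseudocode row above quotes the paper's component list rather than runnable pseudocode. Below is a minimal NumPy sketch of how those components fit together over one training sequence, assuming toy fixed-weight linear maps in place of the real encoder, transition, and decoder networks, a unit-stddev Gaussian decoder, and small shapes (3 levels, factor 2, 4-dim states) chosen purely for readability; the released TensorFlow code at https://github.com/vaibhavsaxena11/cwvae is the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, k, T, D = 3, 2, 8, 4             # levels, abstraction factor, frames, state dim
frames = rng.normal(size=(T, D))    # toy stand-in for 64x64 video frames

def dense(x, out_dim, seed):        # fixed random linear map as a toy "network"
    w = np.random.default_rng(seed).normal(size=(x.size, out_dim)) / np.sqrt(x.size)
    return x.reshape(-1) @ w

def gauss(x, seed):                 # mean and stddev of a diagonal Gaussian
    h = dense(x, 2 * D, seed)
    return h[:D], np.exp(np.clip(h[D:], -5.0, 5.0))

def kl(mu_q, s_q, mu_p, s_p):       # closed-form KL between diagonal Gaussians
    return np.sum(np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5)

state = {l: np.zeros(D) for l in range(1, L + 1)}   # s^l_{t-1}, zero-initialized
kl_total = recon = 0.0
for t in range(T):
    for l in range(L, 0, -1):                       # top-down: slow levels first
        stride = k ** (l - 1)
        if t % stride:                              # level l holds its state between ticks
            continue
        e = dense(frames[t:t + stride], D, seed=l)  # encoder e^l_t over the level's window
        above = state[l + 1] if l < L else np.zeros(D)
        ctx = np.concatenate([state[l], above])
        mu_p, s_p = gauss(ctx, seed=10 + l)                       # prior p(s^l_t | s^l_{t-1}, s^{l+1}_t)
        mu_q, s_q = gauss(np.concatenate([ctx, e]), seed=20 + l)  # posterior additionally sees e^l_t
        state[l] = mu_q + s_q * rng.normal(size=D)                # reparameterized sample
        kl_total += kl(mu_q, s_q, mu_p, s_p)
    x_hat = dense(state[1], D, seed=30)             # decoder p(x_t | s^1_t), unit stddev
    recon += -0.5 * np.sum((frames[t] - x_hat) ** 2)  # Gaussian log-likelihood up to a constant

print(f"toy ELBO = reconstruction - KL = {recon - kl_total:.2f}")
```

The structure to note: a level with stride $k^{l-1}$ only samples a new state at its active steps and holds it otherwise, the posterior conditions on everything the prior sees plus the embedding $e^l_t$, and the ELBO is the reconstruction term minus the KL terms accumulated across all levels and active steps.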
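The Software Dependencies row notes one GRU (Cho et al., 2014) per level, updated only at that level's active steps. The sketch below makes that pattern concrete, again in NumPy; the weight shapes and random inputs are illustrative assumptions, since the actual model feeds each cell the level's stochastic state and context.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h, x, W, U, b):
    """One GRU update (Cho et al., 2014): update gate z, reset gate r, candidate n."""
    z = sigmoid(x @ W[0] + h @ U[0] + b[0])
    r = sigmoid(x @ W[1] + h @ U[1] + b[1])
    n = np.tanh(x @ W[2] + (r * h) @ U[2] + b[2])
    return z * h + (1.0 - z) * n                  # interpolate old state and candidate

rng = np.random.default_rng(0)
L, k, T, D = 3, 2, 8, 4                           # levels, abstraction factor, frames, state dim
params = [(0.1 * rng.normal(size=(3, D, D)),      # one (W, U, b) triple, i.e. one GRU, per level
           0.1 * rng.normal(size=(3, D, D)),
           np.zeros((3, D))) for _ in range(L)]
h = [np.zeros(D) for _ in range(L)]

for t in range(T):
    for l in range(L):
        if t % (k ** l) == 0:                     # level l+1 is active every k**l frames
            x = rng.normal(size=D)                # stand-in for the level's inputs
            h[l] = gru_step(h[l], x, *params[l])
    print(t, ["%+.2f" % v[0] for v in h])         # higher levels update less often
```

Printing the first unit of each deterministic state per frame shows the clockwork behavior directly: the bottom level changes every frame while the top level changes only every $k^2$ frames.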
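Finally, to make the Experiment Setup row's "temporal abstraction factor of 6 per level" concrete: level $l$ is active every $6^{l-1}$ frames, so a 3-level model covers $6^2 = 36$ frames per top-level step. A few lines of arithmetic, using the 100-frame training sequences quoted above:

```python
factor, levels, seq_len = 6, 3, 100   # setup quoted in the Experiment Setup row
for l in range(1, levels + 1):
    stride = factor ** (l - 1)        # level l ticks every factor**(l-1) frames
    ticks = range(0, seq_len, stride)
    print(f"level {l}: active every {stride:>2} frames -> {len(ticks)} ticks per {seq_len}-frame sequence")
```

This prints 100, 17, and 3 active steps for levels 1, 2, and 3 respectively, which is why the slowest level can carry information across most of a training sequence.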