Clockwork Variational Autoencoders

Authors: Vaibhav Saxena, Jimmy Ba, Danijar Hafner

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the benefits of both hierarchical latents and temporal abstraction on 4 diverse video prediction datasets with sequences of up to 1000 frames, where CW-VAE outperforms top video prediction models. Additionally, we propose a Minecraft benchmark for long-term video prediction. We conduct several experiments to gain insights into CW-VAE and confirm that slower levels learn to represent objects that change more slowly in the video, and faster levels learn to represent faster objects."
Researcher Affiliation | Collaboration | Vaibhav Saxena (University of Toronto, vaibhav@cs.toronto.edu); Jimmy Ba (University of Toronto, jba@cs.toronto.edu); Danijar Hafner (University of Toronto; Google Research, Brain Team; mail@danijar.com)
Pseudocode | No | "To implement this distribution and its inference model, CW-VAE utilizes the following components, for $l \in [1, L]$, $t \in \mathcal{T}_l$: Encoder: $e^l_t = e(x_{t:t+k^{l-1}-1})$; Posterior transition $q^l_t$: $q(s^l_t \mid s^l_{t-1}, s^{l+1}_t, e^l_t)$; Prior transition $p^l_t$: $p(s^l_t \mid s^l_{t-1}, s^{l+1}_t)$; Decoder: $p(x_t \mid s^1_t)$. Training objective: Because we cannot compute the likelihood of the training data under the model in closed form, we use the ELBO as our training objective." (A sketch of these components follows the table.)
Open Source Code | Yes | "All code is publicly available at https://github.com/vaibhavsaxena11/cwvae."
Open Datasets | Yes | "The MineRL Navigate dataset (available under the CC Attribution-NonCommercial-ShareAlike 4.0 license) was crowdsourced by Guss et al. (2019) for reinforcement learning applications. ... The KTH Action video prediction dataset (Schuldt et al., 2004) (available under the CC Attribution-NonCommercial license) ... GQN Mazes (Eslami et al., 2018) (available under the Apache License 2.0) ... For Moving MNIST (Srivastava et al., 2015)"
Dataset Splits | No | "We train all models on all datasets for 300 epochs on training sequences of 100 frames of size 64 × 64 pixels."
Hardware Specification | Yes | "A 3-level CW-VAE with abstraction factor 6 takes 2.5 days to train on one Nvidia Titan Xp GPU."
Software Dependencies | No | "We use one GRU (Cho et al., 2014) per level to update the deterministic variable at every active step. All components of Equation 4 jointly optimize Equation 5 by stochastic backprop with reparameterized sampling (Kingma and Welling, 2013; Rezende et al., 2014)." (A sketch of this per-level step follows the table.)
Experiment Setup | Yes | "We train all models on all datasets for 300 epochs on training sequences of 100 frames of size 64 × 64 pixels. For the baselines, we tune the learning rate in the range $[10^{-4}, 10^{-3}]$ and the decoder stddev in the range $[0.1, 1]$. We use a temporal abstraction factor of 6 per level for CW-VAE, unless stated otherwise. Refer to Appendix D for hyperparameters and training durations." (An illustration of this schedule follows the table.)
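
The components quoted in the Pseudocode row map onto a small amount of code. Below is a minimal PyTorch sketch, not the authors' implementation (the released code at the repository above is the reference): GaussianHead, CWVAELevel, neg_elbo, the layer sizes, and the diagonal-Gaussian parameterization are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class GaussianHead(nn.Module):
    """Map conditioning features to a diagonal Gaussian over a latent state."""
    def __init__(self, in_dim, state_dim):
        super().__init__()
        self.net = nn.Linear(in_dim, 2 * state_dim)

    def forward(self, feats):
        mean, log_std = self.net(feats).chunk(2, dim=-1)
        return D.Normal(mean, log_std.exp())

class CWVAELevel(nn.Module):
    """One level l: prior p(s_t^l | s_{t-1}^l, s_t^{l+1}) and
    posterior q(s_t^l | s_{t-1}^l, s_t^{l+1}, e_t^l)."""
    def __init__(self, state_dim, embed_dim):
        super().__init__()
        self.prior_head = GaussianHead(2 * state_dim, state_dim)
        self.post_head = GaussianHead(2 * state_dim + embed_dim, state_dim)

    def step(self, s_prev, s_above, embed):
        prior = self.prior_head(torch.cat([s_prev, s_above], dim=-1))
        post = self.post_head(torch.cat([s_prev, s_above, embed], dim=-1))
        s = post.rsample()                         # reparameterized sample
        kl = D.kl_divergence(post, prior).sum(-1)  # one KL term of the ELBO
        return s, kl

def neg_elbo(decoder_dist, x, kl_terms):
    """Negative ELBO for one frame: reconstruction NLL under the decoder
    p(x_t | s_t^1) plus the KL terms accumulated across levels."""
    recon = decoder_dist.log_prob(x).sum(dim=(-3, -2, -1))
    return -(recon - sum(kl_terms))
```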
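The Software Dependencies row describes one GRU per level updating a deterministic variable at every active step, with the ELBO optimized by stochastic backprop through reparameterized samples. A minimal PyTorch sketch of that per-level step follows; the class and dimension names (LevelCell, stoch_dim, deter_dim) are hypothetical.

```python
import torch
import torch.nn as nn

class LevelCell(nn.Module):
    """One GRU per level advances the deterministic variable h_t^l; the
    stochastic variable is drawn with the reparameterization trick so that
    gradients flow through the sampling step."""
    def __init__(self, stoch_dim, deter_dim):
        super().__init__()
        self.gru = nn.GRUCell(stoch_dim, deter_dim)
        self.to_stats = nn.Linear(deter_dim, 2 * stoch_dim)

    def forward(self, s_prev, h_prev):
        h = self.gru(s_prev, h_prev)                  # deterministic update
        mean, log_std = self.to_stats(h).chunk(2, -1)
        eps = torch.randn_like(mean)                  # eps ~ N(0, I)
        s = mean + log_std.exp() * eps                # reparameterized sample
        return s, h

# Usage on a batch of one, with illustrative sizes:
cell = LevelCell(stoch_dim=20, deter_dim=200)
s, h = torch.zeros(1, 20), torch.zeros(1, 200)
s, h = cell(s, h)
```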
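The temporal abstraction factor of 6 from the Experiment Setup row induces a simple clock: level l updates only at timesteps divisible by 6^(l-1), which is why slower levels end up representing slower-changing content. A small self-contained illustration (the function name active_levels is ours):

```python
def active_levels(t, num_levels=3, k=6):
    """Return the levels (1-indexed) that update at timestep t: level l is
    active when t is a multiple of k**(l - 1)."""
    return [l for l in range(1, num_levels + 1) if t % (k ** (l - 1)) == 0]

# For a 3-level CW-VAE with abstraction factor 6: level 1 updates every step,
# level 2 every 6 steps, level 3 every 36 steps.
for t in [0, 1, 6, 36, 72]:
    print(t, active_levels(t))
# 0 [1, 2, 3]
# 1 [1]
# 6 [1, 2]
# 36 [1, 2, 3]
# 72 [1, 2, 3]
```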