Scaling Autoregressive Video Models
Authors: Dirk Weissenborn, Oscar Täckström, Jakob Uszkoreit
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We obtain strong results on popular benchmarks (Section 4.2, Appendix A) and produce high fidelity video continuations on the BAIR robot pushing dataset (Ebert et al., 2017) exhibiting plausible object interactions. Furthermore, our model achieves an almost 50% reduction in perplexity compared to prior work on autoregressive models on another robot pushing dataset. |
| Researcher Affiliation | Industry | Dirk Weissenborn (Google Research, diwe@google.com); Oscar Täckström (Sana Labs, oscar@sanalabs.com); Jakob Uszkoreit (Google Research, usz@google.com) |
| Pseudocode | No | The paper describes the architecture and process flow using text and a diagram (Figure 1), but it does not include formal pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper provides a link for sample videos, not the source code for the methodology: 'Sample video strips of each model and dataset can be found in Appendix F and sample videos at https://bit.ly/2Zb017f.' |
| Open Datasets | Yes | We focus our evaluation on the BAIR Robot Pushing and Kinetics datasets. Additional results on Moving MNIST and another robot pushing dataset are provided in Appendix A for reference. |
| Dataset Splits | Yes | BAIR Robot Pushing (Ebert et al., 2017) shows a robotic arm pushing and grasping objects in a box. It consists of roughly 40K training and 256 test videos. ... Moving MNIST (Srivastava et al., 2015) consists of 100K training and 10K validation/test videos ... Robotic Pushing (Finn et al., 2016a) ... roughly 50K training videos and 1500 test videos... |
| Hardware Specification | Yes | Our formulation can be implemented efficiently on Tensor Processing Units, or TPUs (Jouppi et al., 2017). ... We are able to sample a batch of four 30x64x64 videos in acceptable time (approx. 8 minutes) with our large models on a Nvidia Tesla V100. ... Furthermore, for our large models we scale the batch size to 256 by training in parallel on 128 TPU v3 instances for 1M steps. |
| Software Dependencies | No | The paper mentions 'RMSProp (Tieleman & Hinton, 2012)' as the optimizer and 'ReLU activation', but does not provide specific version numbers for any software libraries, frameworks (like TensorFlow or PyTorch), or programming languages used. |
| Experiment Setup | Yes | Unless specified otherwise, we model video slices of 4 frames with a spatial resolution of 32x32. Both the encoder and decoder consist of 8 layers... We apply block-local self-attention with the following block sizes (t, h, w). Layers 1-4: (4, 8, 4); (4, 4, 8); (1, 32, 4); and (1, 4, 32). ... There are n_a = 8 attention heads, each with hidden size d_a = 128. Our base models are trained with embedding size d_e = 128 and hidden size of d = 512 (46M parameters). ... All models are trained with RMSProp (Tieleman & Hinton, 2012) with a fixed learning rate of 2×10⁻⁵, decay of 0.95 and momentum of 0.9. We use a batch size of 64 video slices... The smaller models are trained for 300K steps and the larger ones for 1M steps. |
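
The block sizes quoted in the Experiment Setup row are easier to parse with a concrete sketch. Since the paper releases no code, the NumPy sketch below is not the authors' implementation; it only illustrates how block-local self-attention partitions a (T, H, W) feature volume into (t, h, w) blocks and attends within each block independently. The decoder's causal mask is omitted, learned projections are stubbed with random weights, and all function and variable names are hypothetical.

```python
import numpy as np

def block_local_attention(x, block, num_heads=8, d_head=128):
    """Hypothetical sketch: self-attention restricted to (bt, bh, bw) blocks.

    x:     (T, H, W, D) feature volume for one video slice.
    block: (bt, bh, bw), e.g. (4, 8, 4) as in layers 1-4 of the paper.
    Omits the decoder's causal mask; random weights stand in for learned ones.
    """
    T, H, W, D = x.shape
    bt, bh, bw = block
    d_model = num_heads * d_head

    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(0, D ** -0.5, (D, d_model)) for _ in range(3))
    Wo = rng.normal(0, d_model ** -0.5, (d_model, D))

    # Partition the volume into non-overlapping (bt, bh, bw) blocks.
    xb = x.reshape(T // bt, bt, H // bh, bh, W // bw, bw, D)
    xb = xb.transpose(0, 2, 4, 1, 3, 5, 6)       # (nT, nH, nW, bt, bh, bw, D)
    xb = xb.reshape(-1, bt * bh * bw, D)         # (blocks, block_len, D)

    def heads(a):                                # (B, L, d_model) -> (B, nh, L, dh)
        B, L, _ = a.shape
        return a.reshape(B, L, num_heads, d_head).transpose(0, 2, 1, 3)

    q, k, v = (heads(xb @ W) for W in (Wq, Wk, Wv))
    logits = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                # softmax within each block
    out = (w @ v).transpose(0, 2, 1, 3).reshape(-1, bt * bh * bw, d_model) @ Wo

    # Undo the block partitioning.
    out = out.reshape(T // bt, H // bh, W // bw, bt, bh, bw, D)
    return out.transpose(0, 3, 1, 4, 2, 5, 6).reshape(T, H, W, D)

# A 4-frame, 32x32 slice with hidden size d = 512, as in the base configuration.
y = block_local_attention(np.zeros((4, 32, 32, 512)), block=(4, 8, 4))
print(y.shape)  # (4, 32, 32, 512)
```

Each of the four block shapes trades temporal extent against spatial extent, so stacking layers 1-4 lets information propagate across time, rows, and columns without ever paying for full (T·H·W)² attention.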
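The optimizer settings are also concrete enough to transcribe. The paper does not name a framework (see the Software Dependencies row), so the mapping below onto Keras's RMSprop is an assumption; in particular, the paper's "decay of 0.95" is interpreted as the squared-gradient moving-average coefficient, which Keras calls `rho`.

```python
import tensorflow as tf

# Assumed mapping of the paper's hyperparameters onto Keras's RMSprop;
# the paper itself does not name a framework.
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=2e-5,  # fixed learning rate; no schedule is mentioned
    rho=0.95,            # the paper's "decay of 0.95" (assumed to mean rho)
    momentum=0.9,
)

BATCH_SIZE = 64        # video slices per step; 256 across 128 TPU v3 for large models
TRAIN_STEPS = 300_000  # smaller models; 1_000_000 for the larger ones
```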