Temporally Consistent Transformers for Video Generation
Authors: Wilson Yan, Danijar Hafner, Stephen James, Pieter Abbeel
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a comprehensive evaluation of current models and observe their limitations in temporal consistency. Moreover, we introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time. By compressing its input sequence into fewer embeddings, applying a temporal transformer, and expanding back using a spatial MaskGit, TECO outperforms existing models across many metrics. (A minimal architecture sketch follows the table.) |
| Researcher Affiliation | Collaboration | 1 UC Berkeley, 2 University of Toronto, 3 DeepMind, 4 Dyson Robotics Lab. |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | No | Videos are available on the website: https://wilson1yan.github.io/teco. This website hosts videos and the paper, but does not explicitly provide a link to the open-source code for the described methodology. |
| Open Datasets | Yes | We introduce three challenging video datasets to better measure long-range consistency in video prediction, centered around 3D environments in DMLab (Beattie et al., 2016), Minecraft (Guss et al., 2019), and Habitat (Savva et al., 2019)... Kinetics-600 (Carreira & Zisserman, 2017) is a highly complex real-world video dataset... |
| Dataset Splits | No | The paper mentions that for Kinetics-600, "392k videos that are split for training and evaluation," but does not provide specific percentages or counts for training, validation, and test splits across any of the datasets used. |
| Hardware Specification | Yes | All models are trained for 1 million iterations under fixed compute budgets allocated for each dataset (measured in TPU-v3 days) on TPU-v3 instances ranging from v3-8 to v3-128 pods (roughly equivalent to 4 to 64 V100 GPUs), with training times of roughly 3-5 days. Our VQ-GANs are trained on 8 A5000 GPUs, taking about 2-4 days for each dataset... |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as Python or library versions (e.g., PyTorch, TensorFlow, or CUDA versions). |
| Experiment Setup | Yes | Appendix N, titled "Hyperparameters," provides extensive details on the experimental setup, including: TPU-v3 days, parameter count, resolution, batch size, learning rate, LR schedule, warmup steps, total training steps, drop-loss rate, encoder depths/blocks, codebook size and embedding dimension for VQ codes, decoder depths/blocks, temporal transformer downsample factor/hidden dim/feedforward dim/heads/layers/dropout, and MaskGit schedule/hidden dim/feedforward dim/heads/layers/dropout. For example, for DMLab it lists a learning rate of 1e-4, batch size 32, sequence length 300, and 1M total training steps. (A configuration sketch follows the table.) |
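
The Research Type row describes TECO's pipeline only in prose: compress the input sequence into fewer embeddings, run a temporal transformer over them, and expand back to per-frame codes with a spatial MaskGit. Below is a minimal, hedged sketch of that compress → temporal transformer → expand flow. The PyTorch framework, module names, default dimensions, and the use of plain linear layers for compression and expansion are illustrative assumptions; the paper's model uses a VQ-GAN encoder, a causal temporal transformer, and an iterative MaskGit decoder whose details are not reproduced here.

```python
import torch
import torch.nn as nn

class TECOSketch(nn.Module):
    """Illustrative compress -> temporal transformer -> expand pipeline (not the official TECO code)."""

    def __init__(self, vocab_size=1024, embed_dim=256,
                 tokens_per_frame=64, compressed_tokens=8):
        super().__init__()
        # Input is assumed to be discrete VQ-GAN codes per frame.
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        # Compress each frame's code embeddings into fewer embeddings.
        self.compress = nn.Linear(tokens_per_frame * embed_dim,
                                  compressed_tokens * embed_dim)
        # Temporal transformer over the shortened token sequence.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=8)
        # Expansion back to per-frame code logits; in the paper this role
        # is played by a spatial MaskGit that fills in codes iteratively.
        self.expand = nn.Linear(compressed_tokens * embed_dim,
                                tokens_per_frame * vocab_size)
        self.compressed_tokens = compressed_tokens
        self.embed_dim = embed_dim
        self.vocab_size = vocab_size

    def forward(self, codes):
        # codes: (batch, time, tokens_per_frame) integer VQ-GAN indices.
        B, T, K = codes.shape
        x = self.token_embed(codes).reshape(B, T, -1)     # (B, T, K*D)
        x = self.compress(x)                              # (B, T, C*D)
        x = x.reshape(B, T * self.compressed_tokens, self.embed_dim)
        x = self.temporal(x)        # causal masking omitted for brevity
        x = x.reshape(B, T, -1)
        logits = self.expand(x)                           # (B, T, K*V)
        return logits.reshape(B, T, K, self.vocab_size)

codes = torch.randint(0, 1024, (2, 10, 64))
logits = TECOSketch()(codes)        # shape: (2, 10, 64, 1024)
```

The point of the compression step is that the temporal transformer attends over far fewer tokens per frame, which is what lets TECO model long sequences (e.g., 300 frames) while keeping sampling cheaper than frame-level autoregressive models.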
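
For the Experiment Setup row, only a few DMLab values are quoted verbatim. The following configuration sketch collects them in a plain Python dict; the dict layout is an assumption, and only the learning rate, batch size, sequence length, and total training steps come from the text above, with the remaining Appendix N fields left to the paper.

```python
# DMLab training values quoted in the table; remaining Appendix N fields
# (warmup steps, codebook size, transformer depths, MaskGit schedule, ...)
# are intentionally omitted rather than guessed.
dmlab_config = {
    "learning_rate": 1e-4,
    "batch_size": 32,
    "sequence_length": 300,
    "total_training_steps": 1_000_000,
}
```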