Temporally Consistent Transformers for Video Generation

Authors: Wilson Yan, Danijar Hafner, Stephen James, Pieter Abbeel

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform a comprehensive evaluation of current models and observe their limitations in temporal consistency. Moreover, we introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time. By compressing its input sequence into fewer embeddings, applying a temporal transformer, and expanding back using a spatial MaskGit, TECO outperforms existing models across many metrics. (A hypothetical sketch of this compress/transform/expand pipeline appears after the table.)
Researcher Affiliation | Collaboration | 1 UC Berkeley, 2 University of Toronto, 3 DeepMind, 4 Dyson Robotics Lab.
Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper.
Open Source Code | No | Videos are available on the website: https://wilson1yan.github.io/teco. This website hosts videos and the paper, but does not explicitly provide a link to open-source code for the described methodology.
Open Datasets | Yes | We introduce three challenging video datasets to better measure long-range consistency in video prediction, centered around 3D environments in DMLab (Beattie et al., 2016), Minecraft (Guss et al., 2019), and Habitat (Savva et al., 2019)... Kinetics-600 (Carreira & Zisserman, 2017) is a highly complex real-world video dataset...
Dataset Splits | No | The paper mentions that Kinetics-600 has "392k videos that are split for training and evaluation," but it does not provide specific percentages or counts for training, validation, and test splits for any of the datasets used.
Hardware Specification | Yes | All models are trained for 1 million iterations under fixed compute budgets allocated for each dataset (measured in TPU-v3 days) on TPU-v3 instances ranging from v3-8 to v3-128 TPU pods (similar to 4 V100s to 64 V100s), with training times of roughly 3-5 days. Our VQ-GANs are trained on 8 A5000 GPUs, taking about 2-4 days for each dataset...
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as Python or library versions (e.g., PyTorch, TensorFlow, or CUDA versions).
Experiment Setup | Yes | Appendix N, titled "Hyperparameters," provides extensive details on the experimental setup, including: TPU-v3 Days, Params, Resolution, Batch Size, LR, LR Schedule, Warmup Steps, Total Training Steps, Drop Loss Rate, Encoder Depths/Blocks, Codebook Size/Embedding Dim for VQ codes, Decoder Depths/Blocks, Temporal Transformer Downsample Factor/Hidden Dim/Feedforward Dim/Heads/Layers/Dropout, and Mask Schedule/Hidden Dim/Feedforward Dim/Heads/Layers/Dropout. For example, for DMLab it lists LR 1e-4, Batch Size 32, Sequence Length 300, and Total Training Steps 1M (a hypothetical config echoing these values is sketched below).
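
The Research Type row above quotes the paper's high-level description of TECO: compress the input sequence into fewer embeddings, run a temporal transformer over the shortened sequence, and expand back using a spatial MaskGit. The sketch below is a minimal, hypothetical PyTorch-style illustration of that pipeline, not the authors' implementation; the module names, shapes, per-frame pooling, and the transposed convolution standing in for the MaskGit decoder are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class TECOSketch(nn.Module):
    """Illustrative compress -> temporal transformer -> expand pipeline (not the official model)."""

    def __init__(self, code_dim=256, hidden_dim=512, n_layers=8, n_heads=8):
        super().__init__()
        # 1) Compress each frame's VQ-GAN code map into a smaller grid of embeddings.
        self.compress = nn.Conv2d(code_dim, hidden_dim, kernel_size=4, stride=4)
        # 2) Temporal transformer over one pooled embedding per frame.
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=n_heads, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        # 3) Expand back to full spatial resolution. The paper uses a MaskGit-style
        #    iterative decoder; a transposed convolution stands in for it here.
        self.expand = nn.ConvTranspose2d(hidden_dim, code_dim, kernel_size=4, stride=4)

    def forward(self, frame_codes):
        # frame_codes: (batch, time, code_dim, H, W) embeddings of per-frame VQ codes.
        b, t, c, h, w = frame_codes.shape
        z = self.compress(frame_codes.reshape(b * t, c, h, w))   # (b*t, D, H/4, W/4)
        _, d, hh, ww = z.shape
        tokens = z.flatten(2).mean(-1).reshape(b, t, d)           # one token per frame
        context = self.temporal(tokens)                           # (b, t, D)
        grid = context.reshape(b * t, d, 1, 1).repeat(1, 1, hh, ww)
        out = self.expand(grid)                                   # (b*t, code_dim, H, W)
        return out.reshape(b, t, c, h, w)


# Example: 2 clips of 16 frames with 16x16 code maps of dimension 256.
x = torch.randn(2, 16, 256, 16, 16)
y = TECOSketch()(x)
assert y.shape == x.shape
```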
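
For the Experiment Setup row, a hypothetical configuration dictionary echoing the DMLab values quoted from Appendix N is shown below. Only the four quoted values are filled in; field names are illustrative, and the remaining hyperparameters listed in the appendix (warmup steps, dropout, codebook size, etc.) are left as placeholders because their values are not reproduced here.

```python
# Hypothetical DMLab training config; only values quoted in the row above are filled in.
dmlab_config = {
    "lr": 1e-4,
    "batch_size": 32,
    "sequence_length": 300,
    "total_training_steps": 1_000_000,
    # Remaining Appendix N fields (warmup steps, drop loss rate, codebook size,
    # transformer dims, mask schedule, ...) are omitted here.
}
```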