VideoPoet: A Large Language Model for Zero-Shot Video Generation

Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Joshua V. Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A Ross, Bryan Seybold, Lu Jiang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting the generation of high-fidelity motions.
Researcher Affiliation | Collaboration | 1Google, 2Carnegie Mellon University.
Pseudocode | No | The paper provides architectural diagrams and descriptions but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions a 'Project page: https://sites.research.google/videopoet/' but does not explicitly state that source code for the described methodology is open-source or provided at this link.
Open Datasets | Yes | We train on a total of 1B image-text pairs and 270M videos (100M with paired text, of which 50M are used for high-quality finetuning, and 170M with paired audio) from the public internet and other sources, i.e. around 2 trillion tokens across all modalities. The data has been filtered to remove egregious content and sampled to improve contextual and demographic diversity. Evaluation protocol. We employ a zero-shot generation evaluation protocol, as the model has not been trained on the training data of target benchmarks. Specifically, the evaluation benchmark includes two text-to-video generation datasets, MSR-VTT (Xu et al., 2016) and UCF-101 (Soomro et al., 2012), as well as the frame prediction task on Kinetics 600 (K600) (Carreira et al., 2018), in which the first 5 frames are provided as the condition to predict the next 11 frames. We also include inpainting and outpainting tasks (Yu et al., 2023a) on Something-Something V2 (SSv2) (Goyal et al., 2017).
Dataset Splits | No | The paper describes data sources and fine-tuning subsets, but it does not specify explicit, reproducible training/validation/test splits (e.g., percentages or exact counts) for its main internal dataset.
Hardware Specification | Yes | For a batch size of 4 videos and generating 17 frames at 8fps using TPUv5p (4 chips) accelerators, our base model runs in 34s, the detokenizer (converting tokens to pixels) requires 1.3s and super-resolution is 6.8s; thus, amortized run time is about 5 seconds per second of output video.
Software Dependencies | No | The paper mentions various models and tokenizers (e.g., T5 XL encoder, MAGVIT-v2 tokenizer, SoundStream tokenizer), but it does not list specific versions for general software dependencies like Python, PyTorch, or CUDA, which are essential for reproducibility.
Experiment Setup | Yes | All task combinations are trained using a learning rate of 10^-3 for the same number of steps (300k) with a batch size of 1024. We devise a two-stage pretraining strategy, where we augment our sampling weights to sample image data 90% of the time and video data 10% of the time for the first 25% iterations of training. We then switch to training on video 90% and image 10% for the remaining iterations. During inference, we use the sampling algorithm of MAGVIT-v2 (Yu et al., 2024) with 24 sampling steps for each stage and classifier-free guidance scale (Ho & Salimans, 2022; Brooks et al., 2023) of 4.0/8.0 for the text condition and 1.0/2.0 for the low-resolution condition, in the first/second stage.
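
The "about 5 seconds per second of output video" figure quoted in the Hardware Specification row can be sanity-checked from the numbers in that row alone. The sketch below is a back-of-the-envelope calculation under the assumption that all three stages run once for the quoted batch of 4 clips of 17 frames at 8 fps; it is our own arithmetic, not code from the authors.

```python
# Back-of-the-envelope check of the "about 5 seconds per second of output
# video" figure from the Hardware Specification row. Timings and batch shape
# are the ones quoted there; the amortization is our own arithmetic
# (assumption: all three stages run once per batch of 4 videos).

base_model_s = 34.0        # autoregressive token generation for a batch of 4 videos
detokenizer_s = 1.3        # converting tokens to pixels
super_resolution_s = 6.8   # super-resolution stage

batch_size = 4
frames_per_video = 17
fps = 8

total_wall_clock_s = base_model_s + detokenizer_s + super_resolution_s  # 42.1 s
output_video_s = batch_size * frames_per_video / fps                    # 8.5 s of video

print(f"{total_wall_clock_s / output_video_s:.2f} s of compute per second of video")  # ~4.95
```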
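
As a reading aid for the Experiment Setup row, the quoted hyperparameters can be collected into a single configuration sketch. The dictionaries below are hypothetical (the key names are ours, not from a released codebase); only the numeric values come from the paper's quoted text.

```python
# Hypothetical configuration summarizing the hyperparameters quoted in the
# Experiment Setup row. Key names are illustrative; values are as reported.

pretraining = {
    "learning_rate": 1e-3,
    "train_steps": 300_000,
    "batch_size": 1024,
    # Two-stage modality sampling: image-heavy for the first 25% of steps,
    # then video-heavy for the remaining 75%.
    "stage_1": {"step_fraction": 0.25, "image_weight": 0.9, "video_weight": 0.1},
    "stage_2": {"step_fraction": 0.75, "image_weight": 0.1, "video_weight": 0.9},
}

inference = {
    # MAGVIT-v2 sampling algorithm with 24 steps per stage.
    "sampling_steps_per_stage": 24,
    # Classifier-free guidance scales, reported as first-stage/second-stage.
    "cfg_text": {"stage_1": 4.0, "stage_2": 8.0},
    "cfg_low_res_condition": {"stage_1": 1.0, "stage_2": 2.0},
}
```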