VideoPoet: A Large Language Model for Zero-Shot Video Generation
Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Joshua V. Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A Ross, Bryan Seybold, Lu Jiang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting the generation of high-fidelity motions. |
| Researcher Affiliation | Collaboration | 1Google 2Carnegie Mellon University. |
| Pseudocode | No | The paper provides architectural diagrams and descriptions but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions a 'Project page: https:// sites.research.google/videopoet/' but does not explicitly state that source code for the described methodology is open-source or provided at this link. |
| Open Datasets | Yes | We train on a total of 1B image-text pairs and 270M videos (100M with paired text, of which 50M are used for high-quality finetuning, and 170M with paired audio) from the public internet and other sources, i.e., around 2 trillion tokens across all modalities. The data has been filtered to remove egregious content and sampled to improve contextual and demographic diversity. Evaluation protocol. We employ a zero-shot generation evaluation protocol, as the model has not been trained on the training data of target benchmarks. Specifically, the evaluation benchmark includes two text-to-video generation datasets, MSR-VTT (Xu et al., 2016) and UCF-101 (Soomro et al., 2012), as well as the frame prediction task on Kinetics 600 (K600) (Carreira et al., 2018), in which the first 5 frames are provided as the condition to predict the next 11 frames. We also include inpainting and outpainting tasks (Yu et al., 2023a) on Something-Something V2 (SSv2) (Goyal et al., 2017). |
| Dataset Splits | No | The paper describes data sources and fine-tuning subsets, but it does not specify explicit, reproducible training/validation/test splits (e.g., percentages or exact counts) for its main internal dataset. |
| Hardware Specification | Yes | For a batch size of 4 videos and generating 17 frames at 8fps using TPUv5p (4 chips) accelerators, our base model runs in 34s, the detokenizer (converting tokens to pixels) requires 1.3s, and super-resolution takes 6.8s; thus, the amortized run time is about 5 seconds per second of output video. (A worked breakdown of this arithmetic follows the table.) |
| Software Dependencies | No | The paper mentions various models and tokenizers (e.g., T5 XL encoder, MAGVIT-v2 tokenizer, SoundStream tokenizer), but it does not list specific versions for general software dependencies such as Python, PyTorch, or CUDA, which are essential for reproducibility. |
| Experiment Setup | Yes | All task combinations are trained using a learning rate of 10⁻³ for the same number of steps (300k) with a batch size of 1024. We devise a two-stage pretraining strategy, where we augment our sampling weights to sample image data 90% of the time and video data 10% of the time for the first 25% of training iterations. We then switch to training on video 90% and image 10% for the remaining iterations. During inference, we use the sampling algorithm of MAGVIT-v2 (Yu et al., 2024) with 24 sampling steps for each stage and a classifier-free guidance scale (Ho & Salimans, 2022; Brooks et al., 2023) of 4.0/8.0 for the text condition and 1.0/2.0 for the low-resolution condition, in the first/second stage. (A configuration sketch of these hyperparameters follows the table.) |
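
The amortized figure in the Hardware Specification row follows directly from the quoted numbers. The sketch below is a reader's reconstruction of that arithmetic, not an additional measurement from the paper; the variable names are illustrative.

```python
# Reconstruction of the amortized-runtime arithmetic quoted in the
# Hardware Specification row (TPUv5p, 4 chips). Variable names are
# illustrative; the numeric values come from the quoted text.
frames_per_video = 17
fps = 8
batch_size = 4

# Total output video produced per batch: 4 * 17 / 8 = 8.5 seconds.
seconds_of_output = batch_size * frames_per_video / fps

base_model_s = 34.0        # autoregressive token generation
detokenizer_s = 1.3        # converting tokens to pixels
super_resolution_s = 6.8   # second-stage upsampling

total_wall_clock_s = base_model_s + detokenizer_s + super_resolution_s  # 42.1 s
amortized = total_wall_clock_s / seconds_of_output                      # ~4.95 s/s

print(f"~{amortized:.1f} s of compute per second of output video")
```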
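
Similarly, the Experiment Setup row can be restated as a minimal configuration sketch. Only the numeric values below come from the paper; the function and dictionary names are hypothetical and do not correspond to the VideoPoet codebase.

```python
# Illustrative restatement of the quoted training/inference hyperparameters.
# Names are hypothetical; only the numbers are taken from the paper.
TOTAL_STEPS = 300_000
LEARNING_RATE = 1e-3
BATCH_SIZE = 1024

def modality_sampling_weights(step: int) -> dict:
    """Two-stage pretraining schedule: image-heavy for the first 25% of
    steps, then video-heavy for the remaining iterations."""
    if step < 0.25 * TOTAL_STEPS:
        return {"image": 0.9, "video": 0.1}
    return {"image": 0.1, "video": 0.9}

# Inference-time settings (MAGVIT-v2 sampling algorithm), given as
# (first-stage, second-stage) values.
INFERENCE = {
    "sampling_steps_per_stage": 24,
    "cfg_scale_text": (4.0, 8.0),     # classifier-free guidance on the text condition
    "cfg_scale_low_res": (1.0, 2.0),  # guidance on the low-resolution condition
}
```

Under this schedule, the image-heavy stage covers the first 75k of the 300k training steps.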