VideoPoet: A Large Language Model for Zero-Shot Video Generation

Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Joshua V. Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A Ross, Bryan Seybold, Lu Jiang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting the generation of high-fidelity motions.
Researcher Affiliation | Collaboration | 1Google, 2Carnegie Mellon University.
Pseudocode | No | The paper provides architectural diagrams and descriptions but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions a 'Project page: https://sites.research.google/videopoet/' but does not explicitly state that source code for the described methodology is open-source or provided at this link.
Open Datasets | Yes | We train on a total of 1B image-text pairs and 270M videos (100M with paired text, of which 50M are used for high-quality finetuning, and 170M with paired audio) from the public internet and other sources, i.e. around 2 trillion tokens across all modalities. The data has been filtered to remove egregious content and sampled to improve contextual and demographic diversity. Evaluation protocol. We employ a zero-shot generation evaluation protocol, as the model has not been trained on the training data of target benchmarks. Specifically, the evaluation benchmark includes two text-to-video generation datasets, MSR-VTT (Xu et al., 2016) and UCF-101 (Soomro et al., 2012), as well as the frame prediction task on Kinetics 600 (K600) (Carreira et al., 2018), in which the first 5 frames are provided as the condition to predict the next 11 frames. We also include inpainting and outpainting tasks (Yu et al., 2023a) on Something-Something V2 (SSv2) (Goyal et al., 2017).
Dataset Splits | No | The paper describes data sources and fine-tuning subsets, but it does not specify explicit, reproducible training/validation/test splits (e.g., percentages or exact counts) for its main internal dataset.
Hardware Specification | Yes | For a batch size of 4 videos and generating 17 frames at 8fps using TPUv5p (4 chips) accelerators, our base model runs in 34s, the detokenizer (converting tokens to pixels) requires 1.3s and super-resolution is 6.8s; thus, amortized run time is about 5 seconds per second of output video.
Software Dependencies | No | The paper mentions various models and tokenizers (e.g., T5 XL encoder, MAGVIT-v2 tokenizer, SoundStream tokenizer), but it does not list specific versions for general software dependencies like Python, PyTorch, or CUDA, which are essential for reproducibility.
Experiment Setup | Yes | All task combinations are trained using a learning rate of 10^-3 for the same number of steps (300k) with a batch size of 1024. We devise a two-stage pretraining strategy, where we augment our sampling weights to sample image data 90% of the time and video data 10% of the time for the first 25% iterations of training. We then switch to training on video 90% and image 10% for the remaining iterations. During inference, we use the sampling algorithm of MAGVIT-v2 (Yu et al., 2024) with 24 sampling steps for each stage and classifier-free guidance scale (Ho & Salimans, 2022; Brooks et al., 2023) of 4.0/8.0 for the text condition and 1.0/2.0 for the low-resolution condition, in the first/second stage.
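
The "about 5 seconds per second of output video" figure quoted in the Hardware Specification row can be sanity-checked from the numbers in that row alone. The sketch below is a back-of-the-envelope calculation under the assumption that all three stages run once for the quoted batch of 4 clips of 17 frames at 8 fps; it is our own arithmetic, not code from the authors.

```python
# Back-of-the-envelope check of the "about 5 seconds per second of output
# video" figure from the Hardware Specification row. Timings and batch shape
# are the ones quoted there; the amortization is our own arithmetic
# (assumption: all three stages run once per batch of 4 videos).

base_model_s = 34.0        # autoregressive token generation for a batch of 4 videos
detokenizer_s = 1.3        # converting tokens to pixels
super_resolution_s = 6.8   # super-resolution stage

batch_size = 4
frames_per_video = 17
fps = 8

total_wall_clock_s = base_model_s + detokenizer_s + super_resolution_s  # 42.1 s
output_video_s = batch_size * frames_per_video / fps                    # 8.5 s of video

print(f"{total_wall_clock_s / output_video_s:.2f} s of compute per second of video")  # ~4.95
```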
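
As a reading aid for the Experiment Setup row, the quoted hyperparameters can be collected into a single configuration sketch. The dictionaries below are hypothetical (the key names are ours, not from a released codebase); only the numeric values come from the paper's quoted text.

```python
# Hypothetical configuration summarizing the hyperparameters quoted in the
# Experiment Setup row. Key names are illustrative; values are as reported.

pretraining = {
    "learning_rate": 1e-3,
    "train_steps": 300_000,
    "batch_size": 1024,
    # Two-stage modality sampling: image-heavy for the first 25% of steps,
    # then video-heavy for the remaining 75%.
    "stage_1": {"step_fraction": 0.25, "image_weight": 0.9, "video_weight": 0.1},
    "stage_2": {"step_fraction": 0.75, "image_weight": 0.1, "video_weight": 0.9},
}

inference = {
    # MAGVIT-v2 sampling algorithm with 24 steps per stage.
    "sampling_steps_per_stage": 24,
    # Classifier-free guidance scales, reported as first-stage/second-stage.
    "cfg_text": {"stage_1": 4.0, "stage_2": 8.0},
    "cfg_low_res_condition": {"stage_1": 1.0, "stage_2": 2.0},
}
```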