Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

Authors: Hanyu Wang, Saksham Suri, Yixuan Ren, Hao Chen, Abhinav Shrivastava

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments demonstrate LARP's strong performance, achieving state-of-the-art FVD on the UCF-101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs).
Researcher Affiliation | Academia | Hanyu Wang, Saksham Suri, Yixuan Ren, Hao Chen, Abhinav Shrivastava. University of Maryland, College Park.
Pseudocode | No | The paper describes the methodology and model architecture through textual descriptions and figures (e.g., Figure 1, Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Project page: https://hywang66.github.io/larp/. The paper provides a project page URL, which is a general overview page, but no explicit statement about code release or a direct link to a code repository.
Open Datasets | Yes | We conduct video reconstruction and generation experiments using the Kinetics-600 (K600) (Carreira et al., 2018) and UCF-101 (Soomro, 2012) datasets.
Dataset Splits | No | Our default AR generative model consists of 632M parameters, as specified in Table 1. It is trained on the training split of the UCF-101 dataset for 1000 epochs with a batch size of 32.
Hardware Specification | No | The authors acknowledge UMD's supercomputing resources made available for conducting this research.
Software Dependencies | No | The Adam optimizer (Kingma, 2014) is used with a base learning rate of 1e-4, β1 = 0.9, and β2 = 0.95, following a warm-up cosine learning rate schedule. The AdamW optimizer (Loshchilov, 2017) is used with β1 = 0.9, β2 = 0.95, a weight decay of 0.05, and a base learning rate of 6e-4, following a warm-up cosine learning rate schedule.
Experiment Setup | Yes | In all experiments, the patch sizes are set to f_T = 4, f_H = 8, and f_W = 8, respectively. As a result, a 16×128×128 video clip is split into 4×16×16 = 1024 video patches, which are projected into 1024 continuous patch embeddings in the first layer of LARP. For the SVQ quantizer, we utilize a factorized codebook with a size of 8192 and a dimension of d = 8, following the recommendations of Yu et al. (2021). The softmax normalization in Equation (6) is applied with a temperature of 0.03. The AR prior model in LARP is adapted from a small GPT-2 model (Radford et al., 2019), consisting of only 21.7M parameters. Scheduled sampling for the AR prior model employs a linear warm-up for the mixing rate, starting from 0 and reaching a peak of 0.5 at 30% of the total training steps. We set the AR prior loss weight α = 0.06 in our main experiments, and use a learning rate multiplier of 50. Our default AR generative model consists of 632M parameters, as specified in Table 1. It is trained on the training split of the UCF-101 dataset for 1000 epochs with a batch size of 32. The model used in the last row of Table 1, which also has 632M parameters, is trained for 3000 epochs on UCF-101 with a batch size of 64. When generating videos, we apply a small Classifier-Free Guidance (CFG) scale of 1.25 (Ho & Salimans, 2022).
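
The patch-count arithmetic quoted in the Experiment Setup row can be checked with a few lines. This is a sketch, not the authors' code; it only reproduces the division of a 16×128×128 clip by the stated patch sizes:

```python
# Verify the tokenization arithmetic quoted in the paper: a clip of
# 16 frames at 128x128 resolution, with patch sizes f_T=4, f_H=8, f_W=8,
# is split into (16/4) x (128/8) x (128/8) non-overlapping patches.
T, H, W = 16, 128, 128       # clip: frames x height x width
f_T, f_H, f_W = 4, 8, 8      # patch sizes along each axis

num_patches = (T // f_T) * (H // f_H) * (W // f_W)
print(num_patches)  # 4 * 16 * 16 = 1024, matching the 1024 patch embeddings
```

Each of these 1024 patches becomes one continuous embedding in LARP's first layer, which fixes the sequence length seen by the transformer.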
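
The warm-up cosine learning rate schedule named in the Software Dependencies row is a standard recipe; the paper quotes base learning rates but not the warm-up length or step counts, so those arguments below are illustrative assumptions:

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr, warmup_steps):
    """Common warm-up cosine schedule: linear ramp from 0 to base_lr over
    warmup_steps, then cosine decay from base_lr to 0. The warm-up length
    and total_steps are assumptions, not values from the paper."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# e.g., the 1e-4 base rate quoted for the Adam-trained component,
# with a hypothetical 100-step warm-up over 1000 total steps
lr_mid = warmup_cosine_lr(step=500, total_steps=1000, base_lr=1e-4, warmup_steps=100)
```

At `step == warmup_steps` the schedule reaches exactly `base_lr`, and it decays to 0 at `total_steps`.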
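
The CFG scale of 1.25 mentioned at the end of the Experiment Setup row refers to classifier-free guidance (Ho & Salimans, 2022). The paper quotes only the scale; applying the guidance combination at the logit level, as sketched below, is a common choice for AR token models and is assumed here:

```python
def cfg_combine(cond, uncond, scale=1.25):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one, element-wise.
    scale=1.0 recovers plain conditional sampling; scale>1 strengthens
    class conditioning. Applying this to AR logits is an assumption;
    the paper only states the scale value 1.25."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

guided = cfg_combine([2.0, -1.0], [1.0, 0.0])  # -> [2.25, -1.25]
```

A small scale such as 1.25 nudges samples toward the conditional distribution without the over-saturation that large guidance scales can cause.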