Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
Authors: Hanyu Wang, Saksham Suri, Yixuan Ren, Hao Chen, Abhinav Shrivastava
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments demonstrate LARP's strong performance, achieving state-of-the-art FVD on the UCF-101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs). |
| Researcher Affiliation | Academia | Hanyu Wang, Saksham Suri, Yixuan Ren, Hao Chen, Abhinav Shrivastava; University of Maryland, College Park |
| Pseudocode | No | The paper describes the methodology and model architecture through textual descriptions and figures (e.g., Figure 1, Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project page: https://hywang66.github.io/larp/. The paper provides a project page URL, which is a general overview page, but no explicit statement about code release or a direct link to a code repository. |
| Open Datasets | Yes | We conduct video reconstruction and generation experiments using the Kinetics-600 (K600) (Carreira et al., 2018) and UCF-101 (Soomro, 2012) datasets. |
| Dataset Splits | No | Our default AR generative model consists of 632M parameters, as specified in Table 1. It is trained on the training split of the UCF-101 dataset for 1000 epochs with a batch size of 32. |
| Hardware Specification | No | The authors acknowledge UMD's supercomputing resources made available for conducting this research. |
| Software Dependencies | No | The Adam optimizer (Kingma, 2014) is used with a base learning rate of 1e-4, β1 = 0.9, and β2 = 0.95, following a warm-up cosine learning rate schedule. The AdamW optimizer (Loshchilov, 2017) is used with β1 = 0.9, β2 = 0.95, a weight decay of 0.05, and a base learning rate of 6e-4, following a warm-up cosine learning rate schedule. |
| Experiment Setup | Yes | In all experiments, the patch sizes are set to f_T = 4, f_H = 8, and f_W = 8, respectively. As a result, a 16×128×128 video clip is split into 4×16×16 = 1024 video patches, which are projected into 1024 continuous patch embeddings in the first layer of LARP. For the SVQ quantizer, we utilize a factorized codebook with a size of 8192 and a dimension of d = 8, following the recommendations of Yu et al. (2021). The softmax normalization in Equation (6) is applied with a temperature of 0.03. The AR prior model in LARP is adapted from a small GPT-2 model (Radford et al., 2019), consisting of only 21.7M parameters. Scheduled sampling for the AR prior model employs a linear warm-up for the mixing rate, starting from 0 and reaching a peak of 0.5 at 30% of the total training steps. We set AR prior loss weight α = 0.06 in our main experiments, and use a learning rate multiplier of 50. Our default AR generative model consists of 632M parameters, as specified in Table 1. It is trained on the training split of the UCF-101 dataset for 1000 epochs with a batch size of 32. The model used in the last row of Table 1, which also has 632M parameters, is trained for 3000 epochs on UCF-101 with a batch size of 64. When generating videos, we apply a small Classifier-Free Guidance (CFG) scale of 1.25 (Ho & Salimans, 2022). |
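The patch-count arithmetic quoted in the Experiment Setup row can be checked with a short sketch. This is an illustrative helper, not code from the LARP repository; the variable names are our own.

```python
def num_patches(T, H, W, f_T, f_H, f_W):
    """Number of non-overlapping spatiotemporal patches when a
    T x H x W clip is tiled with f_T x f_H x f_W patches."""
    assert T % f_T == 0 and H % f_H == 0 and W % f_W == 0
    return (T // f_T) * (H // f_H) * (W // f_W)

# The setting quoted above: a 16x128x128 clip with f_T=4, f_H=8, f_W=8.
print(num_patches(16, 128, 128, 4, 8, 8))  # 4 * 16 * 16 = 1024
```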
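Both optimizers in the Software Dependencies row follow a "warm-up cosine learning rate schedule." A minimal sketch of that common schedule is below; the warm-up and total step counts are assumed for illustration and are not reported in the quoted text.

```python
import math

def lr_at(step, base_lr=1e-4, warmup_steps=1000, total_steps=100_000):
    """Linear warm-up from 0 to base_lr, then cosine decay to 0.
    warmup_steps and total_steps are assumed values."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warm-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

At the end of warm-up the rate peaks at `base_lr` and then decays smoothly toward zero at `total_steps`.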
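The scheduled-sampling mixing rate described above (linear warm-up from 0 to a peak of 0.5 at 30% of training) can likewise be sketched. Holding the rate constant after the peak is an assumption; the quoted text does not state the behavior beyond that point.

```python
def mixing_rate(step, total_steps, peak=0.5, warmup_frac=0.3):
    """Scheduled-sampling mixing rate: linear warm-up from 0,
    reaching `peak` at warmup_frac * total_steps, then held
    constant (assumed behavior after the peak)."""
    warmup_steps = warmup_frac * total_steps
    return min(peak, peak * step / warmup_steps)
```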