Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

VaporTok: RL-Driven Adaptive Video Tokenizer with Prior & Task Awareness

Authors: Minghao Yang, Zechen Bai, Jing Lin, Haoqian Wang, Alex Jinpeng Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on standard video generation benchmarks confirm our analysis, showing that our adaptive approach matches or outperforms fixed-rate baselines and naive taildrop while using fewer tokens.
Researcher Affiliation | Academia | Minghao Yang¹, Zechen Bai², Jing Lin³, Haoqian Wang¹, Alex Jinpeng Wang⁴; ¹Tsinghua University, ²National University of Singapore, ³Nanyang Technological University, ⁴Central South University
Pseudocode | No | The paper describes methods using mathematical equations and textual explanations but does not include explicit pseudocode blocks or algorithms.
Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the code in supplementary.
Open Datasets | Yes | We conduct video reconstruction and generation experiments using the Kinetics-600 [4] and UCF-101 [41] datasets.
Dataset Splits | No | The paper mentions using a "UCF101 validation set" in tables and descriptions, implying standard splits are used, but does not explicitly define the percentages or sample counts for training, validation, and test sets. It does not provide specific split information or cite a resource defining the exact splits used.
Hardware Specification | Yes | Due to the high computational cost of training, we trained for 30 epochs on the UCF101 and K600 datasets using the pretrained model provided by LARP [48], which required 90 hours on 8 A100 GPUs. (...) The GRPO training process uses the UCF101 dataset for a single epoch, which takes 3 hours on a single A100 GPU. (...) The generation task is trained on the UCF101 dataset for 3000 epochs, which takes 40 hours on 8 A100 GPUs.
Software Dependencies | No | The paper mentions specific models (e.g., "GPT-2 backbone", "LLaMA-style transformer") and hyperparameters but does not list specific software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Implementation details. VaporTok first patchifies the input video into a sequence of tokens. In all experiments, we set the patch sizes to f_T = 4, f_H = 8, f_W = 8, so that a 16 × 128 × 128 video clip is split into 4 × 16 × 16 = 1024 patches. The number of encoder query tokens is set to k = 1024. The quantizer and prior model is set as same as [48], where the factorized codebook is employed of size 8192 with embedding dimension d_codebook = 8 and prior model is adapted from a small GPT-2 backbone [35]. For taildrop probability query module, we set the number of transformer blocks as I = 2, and the softmax temperature is set to 1.8. Due to the high computational cost of training, we trained for 30 epochs on the UCF101 and K600 datasets using the pretrained model provided by LARP [48], which required 90 hours on 8 A100 GPUs. For parallel sample GRPO, we set the group size G = 8, the KL penalty weight β = 0.1, the number of inner iterations µ = 2, and the clipping bounds to ϵ_low = 0.2 and ϵ_high = 0.28 as in [64]. The default reward weights for efficiency, penalty, diversity, reconstruction, and generation are set to 1:1:1:1:1. The GRPO training process uses the UCF101 dataset for a single epoch, which takes 3 hours on a single A100 GPU. For AR generative model, we adopt a LLaMA-style transformer [42]. In the class-conditional generation task on UCF-101 we prepend a [cls] token to represent the category, and a [stop] token to cease the generation process when encountering it. The generation task is trained on the UCF101 dataset for 3000 epochs, which takes 40 hours on 8 A100 GPUs.
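
As a sanity check on the figures quoted in the Experiment Setup and Hardware Specification rows, the sketch below recomputes the patch count and adds up GPU-hours per training stage. It is an illustration only: the names PatchConfig, TRAINING_STAGES, and gpu_hours are hypothetical and do not come from the paper or its code, and multiplying wall-clock hours by GPU count is an assumed way of aggregating compute that the paper itself does not report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchConfig:
    """Clip shape and patch sizes as quoted in the Experiment Setup row.

    Hypothetical container for illustration; not part of the paper's code.
    """
    frames: int = 16
    height: int = 128
    width: int = 128
    f_t: int = 4  # temporal patch size f_T
    f_h: int = 8  # vertical patch size f_H
    f_w: int = 8  # horizontal patch size f_W

    def grid(self):
        """Return the number of patches along (time, height, width)."""
        if self.frames % self.f_t or self.height % self.f_h or self.width % self.f_w:
            raise ValueError("clip shape must be divisible by the patch sizes")
        return (self.frames // self.f_t,
                self.height // self.f_h,
                self.width // self.f_w)

    def num_patches(self):
        """Total number of patches in one clip."""
        t, h, w = self.grid()
        return t * h * w


# (wall-clock hours, number of A100 GPUs) per stage, as reported in the rows above.
TRAINING_STAGES = {
    "tokenizer": (90, 8),
    "grpo": (3, 1),
    "generation": (40, 8),
}


def gpu_hours(stages):
    """Assumed aggregation: wall-clock hours multiplied by GPU count."""
    return {name: hours * gpus for name, (hours, gpus) in stages.items()}


if __name__ == "__main__":
    cfg = PatchConfig()
    print(cfg.grid())         # (4, 16, 16)
    print(cfg.num_patches())  # 1024, matching k = 1024 query tokens
    per_stage = gpu_hours(TRAINING_STAGES)
    print(per_stage)          # {'tokenizer': 720, 'grpo': 3, 'generation': 320}
    print(sum(per_stage.values()))  # 1043
```

Under these assumptions the three reported stages come to roughly 1043 A100 GPU-hours in total, with tokenizer training accounting for most of it.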