Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption

Authors: Tianxiong Zhong, Xingye Tian, Boyuan Jiang, Xuebo Wang, Xin Tao, Pengfei Wan, Zhiwei Zhang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4 Experiments: Training details. Datasets. Metrics. 4.2 Video Reconstruction. Table 1: Comparison of reconstruction performance across multiple datasets for different tokenizers. 4.3 Video Generation. Table 2: Comparison of unconditional and CFG video generation in terms of gFVD and TFLOPs. Figure 7: Convergence speed of different tokenizers on UCF101. 4.4 Ablation Study. Table 4: Effectiveness of Partial RoPE. Table 5: Symmetric and asymmetric reconstruction performance on Adobe240fps dataset. 4.5 Video Frame Interpolation.
Researcher Affiliation | Collaboration | (1) Beijing Institute of Technology, (2) Kling Team, Kuaishou Technology
Pseudocode | No | The paper describes the architecture and methods in sections 3.1, 3.2, and 3.3. It presents mathematical equations (1), (2), (3), (4), (6) and a flow diagram in Figure 4 illustrating the training strategy. However, it does not include any explicit figure, block, or section labeled 'Pseudocode', 'Algorithm', or structured steps formatted like code.
Open Source Code | Yes | The code and weights are released at: https://github.com/KwaiVGI/VFRTok.
Open Datasets | Yes | First, it is initialized on the ImageNet-1K [8] for 30,000 steps with a batch size of 512. Then, it is pre-trained on the K600 dataset [3] for 200,000 steps with a batch size of 64, employing asymmetric FPS training $f^E_s = f^D_s \in \{12, 18, 24, 30\}$. Finally, 22 sequences of 120 FPS data from the BVI-HFR dataset [7] are added for 100,000 steps with a batch size of 16, using FPS settings $f^E_s, f^D_s \in \{12 + 6k \mid k = 0, 1, \ldots, 18\}$. Reconstruction evaluation is performed on the K600 [3] validation set and the UCF101 [26] dataset. ... To demonstrate the advantages of VFRTok in high frame rate video generation, the DiT model was also trained on 60 FPS data from the LAVIB [27] dataset. ... For fair comparison, we adopt Adobe240fps [29] for evaluation. ... We evaluate VFRTok on sequences exhibiting large motion from the public UVG [23] dataset.
Dataset Splits | Yes | Reconstruction evaluation is performed on the K600 [3] validation set and the UCF101 [26] dataset. ... The DiT models [36] are trained and evaluated on the K600 [3] and UCF101 [26] datasets, respectively, using label-based Classifier-Free Guidance [12] (CFG). ... For fair comparison, we adopt Adobe240fps [29] for evaluation.
Hardware Specification | Yes | VFRTok-L and VFRTok-S are trained on a single node equipped with 8 H800 GPUs, requiring 4 days, respectively. ... The training cost correlates with the number of latent tokens, ranging from 5 hours using 8 H800 GPUs (VFRTok-S) to 3 days using 16 H800 GPUs (Cosmos-L [1]).
Software Dependencies | No | The paper mentions several components and optimizers used, such as ViT, 3D RoPE, AdamW, DINOv2-S, and Euler diffusion sampler, and refers to the PyTorch Image Models library. However, it does not provide specific version numbers for the programming language (e.g., Python), deep learning frameworks (e.g., PyTorch), CUDA, or any other general software dependencies needed for replication.
Experiment Setup | Yes | $L = L_{\text{recon}} + \lambda_1 L_{\text{percept}} + \lambda_2 \lambda L_{\text{adv}}$, $\lambda = \frac{\nabla_\phi (L_{\text{recon}} + \lambda_1 L_{\text{percept}})}{\nabla_\phi L_{\text{adv}}}$, where $L_{\text{recon}}$, $L_{\text{percept}}$, and $L_{\text{adv}}$ are the L1 reconstruction loss, perceptual loss [13, 17], and adversarial loss [10], respectively, and $\lambda$ represents adaptive weight. The hyperparameters are set to $\lambda_1 = 1$ and $\lambda_2 = 0.2$. The patch size of VFRTok is set to $p_F \times p_H \times p_W = 4 \times 8 \times 8$. The Partial RoPE ratio is set to $\tau_{\text{RoPE}} = 0.5$, indicating that 6 of the 12 attention heads employ RoPE. ... $Z_L \in \mathbb{R}^{512 \times 32}$ and $Z_S \in \mathbb{R}^{128 \times 128}$. ... VFRTok is trained in a three-stage manner. First, it is initialized on the ImageNet-1K [8] for 30,000 steps with a batch size of 512. Then, it is pre-trained on the K600 dataset [3] for 200,000 steps with a batch size of 64, employing asymmetric FPS training $f^E_s = f^D_s \in \{12, 18, 24, 30\}$. Finally, 22 sequences of 120 FPS data from the BVI-HFR dataset [7] are added for 100,000 steps with a batch size of 16, using FPS settings $f^E_s, f^D_s \in \{12 + 6k \mid k = 0, 1, \ldots, 18\}$. ... The DiT models [36] are trained and evaluated on the K600 [3] and UCF101 [26] datasets, respectively, using label-based Classifier-Free Guidance [12] (CFG). To demonstrate the advantages of VFRTok in high frame rate video generation, the DiT model was also trained on 60 FPS data from the LAVIB [27] dataset. DiTs [36] are trained for 100,000 steps with a batch size of 128 on UCF101 [26] and K600 [3] datasets, and 50,000 steps with a batch size of 48 on LAVIB [27] dataset, respectively. ... The detailed configuration of VFRTok and LightningDiT are shown in Table 6 and Table 7.
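
The arithmetic in the Experiment Setup entry can be checked with a short Python sketch. It is an illustration only and is not taken from the VFRTok repository: the function names (stage3_fps_settings, rope_head_count, total_loss) are hypothetical, and the adaptive weight is passed in as a plain number rather than computed from gradients.

```python
# Illustrative sketch; all function names are hypothetical, not from the VFRTok code.

def stage3_fps_settings(base=12, step=6, k_max=18):
    """Enumerate the stage-3 frame rates {12 + 6k | k = 0, 1, ..., 18}."""
    return [base + step * k for k in range(k_max + 1)]


def rope_head_count(num_heads=12, rope_ratio=0.5):
    """Number of attention heads that use RoPE for a given Partial RoPE ratio."""
    return int(num_heads * rope_ratio)


def total_loss(l_recon, l_percept, l_adv, adaptive_weight,
               lambda1=1.0, lambda2=0.2):
    """Combine the terms as L = L_recon + lambda1*L_percept + lambda2*lambda*L_adv.

    `adaptive_weight` stands in for the adaptive weight lambda; how it is
    obtained in practice is outside this sketch.
    """
    return l_recon + lambda1 * l_percept + lambda2 * adaptive_weight * l_adv


if __name__ == "__main__":
    fps = stage3_fps_settings()
    # Lowest rate, highest rate, and number of distinct settings.
    print(fps[0], fps[-1], len(fps))
    print(rope_head_count())
```

The enumeration yields 19 frame rates from 12 to 120 FPS, so the top setting coincides with the 120 FPS BVI-HFR sequences, and a ratio of 0.5 over 12 heads gives the 6 RoPE heads stated in the entry.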