Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Training-Free Efficient Video Generation via Dynamic Token Carving

Authors: Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, Jiaya Jia

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experimental results demonstrate that Jenga achieves substantial speedups across multiple state-of-the-art video diffusion models while maintaining comparable generation quality (8.83× speedup with 0.01% performance drop on VBench). |
| Researcher Affiliation | Collaboration | (1) CUHK, (2) HKUST, (3) Kuaishou Technology, (4) SmartMore |
| Pseudocode | Yes | Algorithm 1: Progressive Resolution Framework for Jenga Video Generation; Algorithm 2: Block-Sparse Attention with Conditional Enhancement; Algorithm 3: Build Block-wise Attention Mask; Algorithm 4: Block-Sparse Attention with Text Amplification Kernel |
| Open Source Code | No | We will open-source the code as committed. |
| Open Datasets | Yes | For qualitative evaluation, we employ the widely adopted CLIP-based metric CLIPScore [61] to measure text-video alignment, and utilize the comprehensive benchmark suites VBench [35] and VBench-I2V [62] with their original full-set prompts. |
| Dataset Splits | Yes | We conducted a user study employing the standard win-rate methodology to evaluate our approach. Questionnaires were constructed, each containing 12 randomly selected videos generated using Sora prompts [63]. The videos were presented in randomized order, and participants were asked to evaluate them along three dimensions: visual, semantic, and overall quality. |
| Hardware Specification | Yes | Unless specified, all experiments are performed on one NVIDIA H800 GPU. |
| Software Dependencies | No | The paper mentions software like Triton [58] and FlashAttention2 [17], but does not provide specific version numbers for these or other key libraries used in implementation. |
| Experiment Setup | Yes | Our experiments are primarily conducted on the HunyuanVideo [12] architecture with a 50-step configuration. All generated HunyuanVideo videos maintain a resolution of 125 × 720 × 1280, corresponding to a patchified video latent size of t × h × w = 32 × 45 × 80, approximately 115K tokens. For Attention Carving block partitioning, we employ Generalized Hilbert [54] as G(·) with a block size of m = 128. We implement the Attention Carving kernel using Triton [58] and adopt a progressive top-K selection strategy when computing the importance mask: k = 0.3 at stage 1, and k = 0.2 for subsequent stages. The probability threshold is set to p = 0.3. When calculating the adjacency mask B_adja, it incorporates a 26-neighborhood in 3D latent space. For ProRes stages, we provide two basic configurations, Base and Turbo, corresponding to implementations using 1 stage (straight 720P) and 2 stages (starting with 540P, 50% steps each stage). The balancing factor of the text-attention amplifier is set to ρ = 0.5. After timestep skipping, 23 of the original 50 timesteps are retained, while additional steps will be added after the stage-switch process. |
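The figures quoted in the Experiment Setup entry can be sanity-checked with a few lines of arithmetic. The sketch below is not the authors' code: the function names are hypothetical, and it assumes that the top-K value k is a fraction of key-value blocks kept per query block, which the summary does not state explicitly. It only reproduces the token count, derives a block count from the stated block size, and enumerates a 26-neighborhood in a 3D grid.

```python
from itertools import product

# Values quoted in the Experiment Setup entry.
LATENT_T, LATENT_H, LATENT_W = 32, 45, 80   # patchified latent size t x h x w
BLOCK_SIZE = 128                            # block size m
TOP_K = {"stage 1": 0.3, "later stages": 0.2}


def token_count(t: int, h: int, w: int) -> int:
    """Number of video tokens in a t x h x w latent grid."""
    return t * h * w


def block_count(num_tokens: int, block_size: int) -> int:
    """Number of blocks needed to cover all tokens (ceiling division)."""
    return -(-num_tokens // block_size)


def neighborhood_26() -> list[tuple[int, int, int]]:
    """Offsets of the 26 cells surrounding a cell in a 3D grid."""
    return [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]


def kept_blocks(num_blocks: int, fraction: float) -> int:
    """Blocks kept under a top-K fraction.

    Assumption: k is read as a fraction of key-value blocks retained
    per query block; the summary does not spell this out.
    """
    return round(num_blocks * fraction)


tokens = token_count(LATENT_T, LATENT_H, LATENT_W)
blocks = block_count(tokens, BLOCK_SIZE)

print(f"tokens: {tokens}")
print(f"blocks of size {BLOCK_SIZE}: {blocks}")
print(f"neighbor offsets: {len(neighborhood_26())}")
for stage, k in TOP_K.items():
    print(f"{stage}: keep {kept_blocks(blocks, k)} of {blocks} blocks")
```

The token count comes out to 115,200, which matches the "approximately 115K tokens" in the quoted setup, and a block size of 128 divides it evenly into 900 blocks. Under the stated assumption about k, the importance mask would retain 270 blocks at stage 1 and 180 at later stages, before any blocks added by the adjacency mask or the probability threshold.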