Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models
Authors: Lvmin Zhang, Shengqu Cai, Muyang Li, Gordon Wetzstein, Maneesh Agrawala
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Ablation studies validate the effectiveness of the anti-drifting methods in both single-directional video streaming and bi-directional video generation. We show that existing video diffusion models can be finetuned with FramePack, and analyze the differences between different packing schedules. |
| Researcher Affiliation | Academia | Lvmin Zhang¹, Shengqu Cai¹, Muyang Li², Gordon Wetzstein¹, Maneesh Agrawala¹; ¹Stanford University, ²MIT |
| Pseudocode | No | The paper describes methods in paragraph text and conceptual diagrams (Fig. 1 and Fig. 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: This work provides open access. |
| Open Datasets | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: This work provides open access. |
| Dataset Splits | No | The paper mentions test input sizes ('512 real user prompts for text-to-video and 512 image-prompt pairs for image-to-video tasks') and video lengths ('30 seconds for long videos and 5 seconds for short videos'), but does not specify training, validation, or test dataset splits (e.g., percentages or sample counts) within the main text. |
| Hardware Specification | Yes | We conduct all experiments using H100 GPU clusters with training details in the supplementary. Note that FramePack achieves a batch size of about 64 on a single 8×A100-80G node with the 13B HunyuanVideo model at 480p resolution LoRA training with window size 2 or 3 (or batch size 32 of window size 4 or 5), making FramePack suitable for personal or laboratory-scale training and experimentation. We also show that the efficient implementations of FramePack can process thousands of frames with 13B models even on laptops (e.g., 6GB or 8GB GPU memory). |
| Software Dependencies | No | The paper mentions implementing FramePack with 'Wan' and 'HunyuanVideo' and using a '13B HunyuanVideo model', but does not provide specific version numbers for any software libraries, frameworks, or dependencies used. |
| Experiment Setup | Yes | FramePack achieves a batch size of about 64 on a single 8×A100-80G node with the 13B HunyuanVideo model at 480p resolution LoRA training with window size 2 or 3 (or batch size 32 of window size 4 or 5), making FramePack suitable for personal or laboratory-scale training and experimentation. In our tests, K = 128 gives strong drift reduction with relatively minimal training difficulties. |