Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning World Models for Interactive Video Generation

Authors: Taiye Chen, Xun Hu, Zihan Ding, Chi Jin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments. For training, we collected 1000 long Minecraft gameplay videos (17 hours total) using MineRL [62]. All videos have a fixed resolution of 640 × 360 pixels. Each sequence spans 1200 frames, annotated with action vectors (forward/backward movement, jumping, camera rotation) and world coordinates (x, y, z positions and yaw angle). For evaluation, we assembled two distinct test sets: (1) for compounding error evaluation, we use 20 long videos of 1200 frames with randomized actions and locations, and (2) for world coherence, we use 60 carefully curated 300-frame video sequences designed to systematically assess spatiotemporal consistency.
Researcher Affiliation | Academia | Taiye Chen¹, Xun Hu², Zihan Ding³, Chi Jin³; ¹Peking University, ²University of Oxford, ³Princeton University
Pseudocode | No | The paper describes methodologies in text and mathematical equations, such as in Section 3, but does not present a distinct pseudocode block or algorithm box.
Open Source Code | No | Justification: Code will be available later on GitHub with sufficient instructions to reproduce the results.
Open Datasets | Yes | For training, we collected 1000 long Minecraft gameplay videos (17 hours total) using MineRL [62]. All videos have a fixed resolution of 640 × 360 pixels. Each sequence spans 1200 frames, annotated with action vectors (forward/backward movement, jumping, camera rotation) and world coordinates (x, y, z positions and yaw angle).
Dataset Splits | Yes | For training, we collected 1000 long Minecraft gameplay videos (17 hours total) using MineRL [62]. All videos have a fixed resolution of 640 × 360 pixels. Each sequence spans 1200 frames, annotated with action vectors (forward/backward movement, jumping, camera rotation) and world coordinates (x, y, z positions and yaw angle). For evaluation, we assembled two distinct test sets: (1) for compounding error evaluation, we use 20 long videos of 1200 frames with randomized actions and locations, and (2) for world coherence, we use 60 carefully curated 300-frame video sequences designed to systematically assess spatiotemporal consistency.
Hardware Specification | Yes | All models are trained for 3 epochs on the dataset, with a batch size of 32 across 8 A100 GPUs.
Software Dependencies | No | The paper mentions technologies like Diffusion Transformer (DiT) and VAE, but does not specify software dependencies with specific version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x).
Experiment Setup | Yes | A consistent window size of 20 frames is applied for both model training and evaluation for fair comparison. For vanilla Diffusion Forcing, we additionally train a variant with a window size of 10 frames for context length evaluation. For our VRAG method, we combine 10 retrieved frames with 10 current frames for both training and inference. We represent the agent's state using a global state vector s = [x, y, z, yaw] during training, which can be extended to incorporate a full 3D pose representation when needed. To facilitate training convergence, these values are normalized relative to the initial state, thereby reducing the complexity of the diffusion process. The YaRN implementation extends the vanilla model (window size 20) by replacing position embeddings with YaRN and stretching factor 4, followed by fine-tuning for 10⁴ steps on 80-frame sequences. During evaluation of YaRN, we use a 40-frame window. The Infini-attention with neural memory employs a sliding window of size 20 and stride 10, using the first 10 frames for memory state updates and the last 10 for local attention computation. The History Buffer method maintains a 124-frame buffer partitioned into 5 exponentially decreasing segments (L1 = 2, α = 2), sampling 2 frames per segment to form 10 historical frames that are concatenated with the 10 current frames. All models are trained for 3 epochs on the dataset, with a batch size of 32 across 8 A100 GPUs. We use a uniform learning rate of 8 × 10⁻⁵ during training. For Infini-Attention, we apply a learning rate of 3 × 10⁻³ specifically to the global weight parameter to effectively balance global and local attention contributions while maintaining stable convergence.
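The Experiment Setup row describes two small preprocessing steps that are easier to follow with the arithmetic written out: normalizing the state vector s = [x, y, z, yaw] relative to the initial state, and picking 10 historical frames from a 124-frame History Buffer. The sketch below is illustrative only and is not the authors' code. All function names are hypothetical. It assumes that "relative to the initial state" means subtracting the first state, that yaw is in degrees and wrapped to [-180, 180), that the five segments have lengths 64, 32, 16, 8, 4 (the only ratio-2 split of 124 into 5 parts, oldest segment longest), and that the 2 frames per segment are taken at evenly spaced positions. How the reported L1 = 2 maps onto these lengths is not stated in the text above, so that parameter is not used here.

```python
from typing import List, Sequence, Tuple

State = Tuple[float, float, float, float]  # (x, y, z, yaw)


def normalize_states(states: Sequence[State]) -> List[State]:
    """Express every state relative to the first one (assumed: plain subtraction)."""
    x0, y0, z0, yaw0 = states[0]
    out = []
    for x, y, z, yaw in states:
        # Assumption: yaw is in degrees; wrap the difference into [-180, 180).
        dyaw = (yaw - yaw0 + 180.0) % 360.0 - 180.0
        out.append((x - x0, y - y0, z - z0, dyaw))
    return out


def segment_lengths(buffer_len: int, num_segments: int, ratio: int) -> List[int]:
    """Geometric segment lengths, longest (oldest) first, summing to buffer_len."""
    weights = [ratio ** k for k in range(num_segments)]  # 1, 2, 4, ...
    total = sum(weights)
    if buffer_len % total != 0:
        raise ValueError("buffer_len is not a multiple of the geometric sum")
    base = buffer_len // total
    return [base * w for w in reversed(weights)]


def sample_history_indices(
    buffer_len: int = 124,
    num_segments: int = 5,
    ratio: int = 2,
    per_segment: int = 2,
) -> List[int]:
    """Pick per_segment evenly spaced buffer indices from each segment (0 = oldest)."""
    indices = []
    start = 0
    for length in segment_lengths(buffer_len, num_segments, ratio):
        for k in range(per_segment):
            # Centre of the k-th of per_segment equal sub-bins (assumed sampling rule).
            indices.append(start + (length * (2 * k + 1)) // (2 * per_segment))
        start += length
    return indices


if __name__ == "__main__":
    print(segment_lengths(124, 5, 2))
    print(sample_history_indices())
    print(normalize_states([(10.0, 64.0, -5.0, 170.0), (12.0, 64.0, -4.0, -170.0)]))
```

Under these assumptions the sampler returns 10 indices, denser toward the most recent end of the buffer, which would then be concatenated with the 10 current frames to fill the 20-frame window used throughout the experiments.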