Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Video World Models with Long-term Spatial Memory

Authors: Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation. Our evaluations show that the quality and 3D consistency of our approach surpasses that of relevant baselines. Section 4 (Experiments): Implementation Details; Metrics and Baselines; Quantitative Evaluation; Qualitative Evaluation; User Study; Ablation Study.
Researcher Affiliation Collaboration (1) Stanford University; (2) Shanghai Jiao Tong University; (3) The Chinese University of Hong Kong; (4) Shanghai Artificial Intelligence Laboratory; (5) S-Lab, Nanyang Technological University
Pseudocode No The paper describes the methodology in prose and through figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We will make code and the pretrained model available.
Open Datasets Yes Our dataset can be downloaded here [1]. [1] https://huggingface.co/datasets/ysmikey/spmem_megadata
Dataset Splits Yes Dataset Construction. We build our dataset from raw videos collected from MiraData [42], segmenting each video into multiple 97-frame clips. For each clip, the first 49 frames serve as the source sequence and the remaining 48 as the target, with a shared transition frame between the source and target sequence to preserve temporal continuity. Our test set includes 500 randomly selected video sequences from MiraData, which are not seen during training.
Hardware Specification Yes We trained for 6,000 iterations with a learning rate of 2 × 10⁻⁵, using a mini-batch size of 8, on eight NVIDIA A100 GPUs.
Software Dependencies No Our prototype implementation builds on CogVideoX [79], which adopts this two-stage framework. While a specific base model is mentioned, no version numbers for this model or any other software libraries are provided.
Experiment Setup Yes Implementation Details. We implement our conditional video diffusion model based on the CogVideoX-5B-I2V [79] architecture, pretrained from DaS [26]. During training, we set the video length to 49 frames with a resolution of 480 × 720. We trained for 6,000 iterations with a learning rate of 2 × 10⁻⁵, using a mini-batch size of 8, on eight NVIDIA A100 GPUs. At inference time, we adopt the latest 5 historical frames from the recent sequence to enable smooth motion prediction.
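
The clip layout quoted in the Dataset Splits row can be illustrated with a short sketch. This is one possible reading of that description, not the authors' code: it assumes the shared transition frame is the last source frame, reused as the first target frame, so that both sequences are 49 frames long (matching the 49-frame training length in the Experiment Setup row). The names `segment_video` and `split_clip` are hypothetical, and dropping the incomplete tail of a video is also an assumption.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

CLIP_LEN = 97    # frames per clip, as quoted in the Dataset Splits row
SOURCE_LEN = 49  # the first 49 frames form the source sequence


def segment_video(frames: Sequence[T]) -> List[Sequence[T]]:
    """Cut a video into non-overlapping 97-frame clips.

    Assumption: frames left over at the end (fewer than 97) are dropped.
    """
    last_start = len(frames) - CLIP_LEN
    return [frames[i:i + CLIP_LEN] for i in range(0, last_start + 1, CLIP_LEN)]


def split_clip(clip: Sequence[T]) -> Tuple[Sequence[T], Sequence[T]]:
    """Split one clip into (source, target) sharing one transition frame.

    Assumption: the target starts at the last source frame, so it holds
    that shared frame plus the remaining 48 frames of the clip.
    """
    if len(clip) != CLIP_LEN:
        raise ValueError(f"expected {CLIP_LEN} frames, got {len(clip)}")
    source = clip[:SOURCE_LEN]
    target = clip[SOURCE_LEN - 1:]
    return source, target


if __name__ == "__main__":
    video = list(range(200))  # stand-in for decoded frames
    clips = segment_video(video)
    source, target = split_clip(clips[0])
    print(len(clips), len(source), len(target), source[-1], target[0])
```

Under this reading, each 97-frame clip yields two 49-frame sequences that overlap in exactly one frame, which is consistent with the counts quoted above (49 + 48 = 97).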