Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

WorldMem: Long-term Consistent World Simulation with Memory

Authors: Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, Xingang Pan

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments. Datasets. We use MineDojo (Fan et al., 2022) to create diverse training and evaluation datasets in Minecraft, configuring diverse environments (e.g., plains, savannas, ice plains, and deserts), agent actions, and interactions. For real-world scenes, we utilize RealEstate10K (Zhou et al., 2018) with camera pose annotations to evaluate long-term world consistency. Metrics. For quantitative evaluation, we employ reconstruction metrics, where the method of obtaining ground truth (GT) varies by specific settings. We then assess the consistency and quality of the generated videos using PSNR, LPIPS (Zhang et al., 2018), and reconstruction FID (rFID) (Heusel et al., 2017), which collectively measure pixel-level fidelity, perceptual similarity, and overall realism. Experimental details. For our experiments on Minecraft (Fan et al., 2022), we utilize the Oasis (Decart et al., 2024) as the base model. Our model is trained using the Adam optimizer with a fixed learning rate of $2 \times 10^{-5}$. Training is conducted at a resolution of 640 × 360, where frames are first encoded into a latent space via a VAE at a resolution of 32 × 18, then further patchified to 16 × 9. Our training dataset comprises approximately 12K long videos, each containing 1500 frames, generated from Fan et al. (2022). During training, we employ an 8-frame temporal context window alongside an 8-frame memory window. The model is trained for approximately 500K steps using 4 GPUs, with a batch size of 4 per GPU. For the hyperparameters specified in Algorithm 1 of the main paper, we set the similarity threshold $t_r$ to 0.9, $w_o$ to 1, and $w_t$ to $0.2/t_c$. For the noise levels in Eq. (5) and Eq. (6), we set $k_{\min}$ to 15 and $k_{\max}$ to 1000.
Researcher Affiliation | Academia | Zeqi Xiao¹, Yushi Lan¹, Yifan Zhou¹, Wenqi Ouyang¹, Shuai Yang², Yanhong Zeng³, Xingang Pan¹. ¹S-Lab, Nanyang Technological University; ²Wangxuan Institute of Computer Technology, Peking University; ³Shanghai AI Laboratory.
Pseudocode | Yes | Algorithm 1: Memory Retrieval Algorithm. Input: memory bank of $N$ historical states $\{(x^m_i, p_i, t_i)\}_{i=1}^{N}$; current state $(x^c, p_c, t_c)$; memory condition length $L_M$; similarity threshold $t_r$; weights $w_o$, $w_t$. Output: a list of selected state indices $S$. Compute confidence score: compute the FOV overlap ratio $o$ via Monte Carlo sampling; compute the time difference $d = \mathrm{Concat}(\{|t_i - t_c|\}_{i=1}^{n})$; compute the confidence $\alpha = o \cdot w_o - d \cdot w_t$. Selection with similarity filtering: initialize $S = \varnothing$; for $m = 1$ to $L_M$ do: select $i$ with the highest $\alpha_i$; append $i$ to $S$; remove all $j$ where $\mathrm{similarity}(i, j) > t_r$. Return $S$.
Open Source Code | No | Project page at https://xizaoqu.github.io/worldmem. ... Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No]. Justification: We do not provide open access to the data and code in the supplemental material, but will release codes and datasets publicly.
Open Datasets | Yes | Datasets. We use MineDojo (Fan et al., 2022) to create diverse training and evaluation datasets in Minecraft, configuring diverse environments (e.g., plains, savannas, ice plains, and deserts), agent actions, and interactions. For real-world scenes, we utilize RealEstate10K (Zhou et al., 2018) with camera pose annotations to evaluate long-term world consistency.
Dataset Splits | Yes | Our training dataset comprises approximately 12K long videos, each containing 1500 frames, generated from Fan et al. (2022). ... We evaluate both settings on 300 test videos. ... The RealEstate10K dataset provides a training set of approximately 65K short video clips. Training is conducted at a resolution of 256 × 256, with frames patchified to 128 × 128. The model is trained for approximately 50K steps using 4 GPUs, with a batch size of 8 per GPU. ... We design 5 evaluation trajectories, each starting and ending at the same pose, across 100 scenes. The trajectory lengths range from 37 to 60 frames, exceeding the training lengths of all baselines (maximum 25 frames).
Hardware Specification | No | The model is trained for approximately 500K steps using 4 GPUs, with a batch size of 4 per GPU. ... The model is trained for approximately 50K steps using 4 GPUs, with a batch size of 8 per GPU.
Software Dependencies | No | Our model is trained using the Adam optimizer with a fixed learning rate of $2 \times 10^{-5}$.
Experiment Setup | Yes | Experimental details. For our experiments on Minecraft (Fan et al., 2022), we utilize the Oasis (Decart et al., 2024) as the base model. Our model is trained using the Adam optimizer with a fixed learning rate of $2 \times 10^{-5}$. Training is conducted at a resolution of 640 × 360, where frames are first encoded into a latent space via a VAE at a resolution of 32 × 18, then further patchified to 16 × 9. Our training dataset comprises approximately 12K long videos, each containing 1500 frames, generated from Fan et al. (2022). During training, we employ an 8-frame temporal context window alongside an 8-frame memory window. The model is trained for approximately 500K steps using 4 GPUs, with a batch size of 4 per GPU. For the hyperparameters specified in Algorithm 1 of the main paper, we set the similarity threshold $t_r$ to 0.9, $w_o$ to 1, and $w_t$ to $0.2/t_c$. For the noise levels in Eq. (5) and Eq. (6), we set $k_{\min}$ to 15 and $k_{\max}$ to 1000. For our experiments on RealEstate10K (Zhou et al., 2018), we adopt DFoT (Song et al., 2025) as the base model. The RealEstate10K dataset provides a training set of approximately 65K short video clips. Training is conducted at a resolution of 256 × 256, with frames patchified to 128 × 128. The model is trained for approximately 50K steps using 4 GPUs, with a batch size of 8 per GPU.
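
The greedy selection quoted in the Pseudocode row can be illustrated with a short Python sketch. This is not the authors' implementation (the page reports that no code was released). It assumes the confidence rewards FOV overlap and penalizes temporal distance, i.e. α = o·w_o − d·w_t, and it takes the overlap ratios and the pairwise similarity matrix as precomputed inputs rather than estimating them by Monte Carlo sampling. The function name `retrieve_memory`, its parameter names, and the guard for a zero current timestamp are all hypothetical.

```python
from typing import Dict, List, Sequence


def retrieve_memory(
    overlap: Sequence[float],
    timestamps: Sequence[float],
    similarity: Sequence[Sequence[float]],
    t_current: float,
    memory_length: int,
    sim_threshold: float = 0.9,
    w_o: float = 1.0,
    w_t_scale: float = 0.2,
) -> List[int]:
    """Greedy memory selection in the spirit of Algorithm 1 (illustrative only)."""
    # w_t = 0.2 / t_c as quoted; the zero guard is an assumption of this sketch.
    w_t = w_t_scale / t_current if t_current > 0 else 0.0

    # Assumed confidence: reward FOV overlap, penalize temporal distance.
    confidence: Dict[int, float] = {
        i: overlap[i] * w_o - abs(timestamps[i] - t_current) * w_t
        for i in range(len(overlap))
    }

    selected: List[int] = []
    for _ in range(memory_length):
        if not confidence:
            break  # fewer usable candidates than requested
        best = max(confidence, key=confidence.get)
        selected.append(best)
        # Drop the winner and every remaining candidate too similar to it.
        confidence = {
            j: a
            for j, a in confidence.items()
            if j != best and similarity[best][j] <= sim_threshold
        }
    return selected


# Toy example: states 0 and 1 are near-duplicates (similarity 0.95).
overlap = [0.9, 0.85, 0.2, 0.6]
timestamps = [10, 11, 50, 90]
similarity = [
    [1.00, 0.95, 0.10, 0.30],
    [0.95, 1.00, 0.10, 0.30],
    [0.10, 0.10, 1.00, 0.20],
    [0.30, 0.30, 0.20, 1.00],
]
print(retrieve_memory(overlap, timestamps, similarity, t_current=100, memory_length=2))  # [0, 3]
```

In the toy example, state 1 has the second-highest confidence but is filtered out once state 0 is chosen, so the second slot goes to state 3. That is the purpose of the similarity threshold: it keeps the memory window from filling with redundant views.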
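
The training figures quoted in the Experiment Setup row imply a few derived quantities that are useful when estimating compute for a reproduction. The sketch below only does that arithmetic. It assumes no gradient accumulation, that "patchified to 16 × 9" denotes a 16 × 9 token grid per frame, and that one training sample holds the 8 context frames plus the 8 memory frames. The class and variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSetup:
    steps: int
    gpus: int
    batch_per_gpu: int

    @property
    def global_batch(self) -> int:
        # Assumes plain data parallelism without gradient accumulation.
        return self.gpus * self.batch_per_gpu

    @property
    def samples_seen(self) -> int:
        return self.steps * self.global_batch


# Values quoted in the table (step counts are approximate in the source).
minecraft = TrainSetup(steps=500_000, gpus=4, batch_per_gpu=4)
realestate = TrainSetup(steps=50_000, gpus=4, batch_per_gpu=8)

# Minecraft token grid: 640 x 360 pixels -> 32 x 18 latent -> 16 x 9 patches.
vae_downsample = (640 // 32, 360 // 18)    # spatial reduction per axis
patch_size = (32 // 16, 18 // 9)           # latent cells per patch, per axis
tokens_per_frame = 16 * 9
frames_per_sample = 8 + 8                  # context window + memory window
tokens_per_sample = tokens_per_frame * frames_per_sample

print(minecraft.global_batch, minecraft.samples_seen)    # 16 8000000
print(realestate.global_batch, realestate.samples_seen)  # 32 1600000
print(vae_downsample, patch_size, tokens_per_sample)     # (20, 20) (2, 2) 2304
```

Under these assumptions the Minecraft run processes roughly 8 million samples and the RealEstate10K run roughly 1.6 million. The GPU model is not reported, so wall-clock time cannot be derived from these numbers.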