Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Learning 3D Persistent Embodied World Models

Authors: Siyuan Zhou, Yilun Du, Yuncong Yang, Lei Han, Peihao Chen, Dit-Yan Yeung, Chuang Gan

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results show significant enhancement for both the visual quality and consistency of video generation, highlighting the effectiveness of our proposed model and its associated memory mechanisms. In particular, these results confirm that incorporating a 3D memory promotes 3D persistence of embodied video generation, ensuring consistency with both previously generated and observed frames. Additionally, this persistence offers substantial advantages for the downstream robotic applications, including ranking the sampled action trajectories, planning with model predictive control, and policy training in the video simulators. Section 3: Experiment, 3.1 Dataset, 3.2 Video Generation (Table 1, Figure 3), 3.3 Planning with World Models (Table 2, Figure 6), B Additional Experimental Results (Table 5, Table 6).
Researcher Affiliation | Collaboration | Siyuan Zhou, Yilun Du, Yuncong Yang, Lei Han, Peihao Chen, Dit-Yan Yeung, Chuang Gan; HKUST, UMass Amherst, MIT, Independent researcher
Pseudocode | Yes | Algorithm 1 Persistent Embodied World Model. (a) Video Generation
Open Source Code | No | We will release the code and dataset in the future.
Open Datasets | Yes | We collect our dataset in the Habitat Simulation [32] with about 1,000 scenes from HM3D [24]. We split the scenes into training scenes and test scenes.
Dataset Splits | No | We split the scenes into training scenes and test scenes.
Hardware Specification | Yes | We utilized 8 H100 GPUs for training video diffusion models in approximately 3 days.
Software Dependencies | No | We use CogVideoX [39] as our backbone.
Experiment Setup | Yes | We train our models with frame skip, where training video clips are subsampled by a specific stride. We use various frame stride from 1 to 3 to help the model learn various camera poses. We use Adam optimizer, with linear warmup and a learning rate of 1e-4. Additionally, we utilize bf16 precision for computational efficiency and clip gradients to a maximum norm of 1.0 to stabilize training. We utilized 8 H100 GPUs for training video diffusion models in approximately 3 days. We adopt v-prediction [27] and use the DDIM sampler [25]. The inference sampling step is set to 50, and the inference time for generating the videos of 9 frames is 5 seconds. The shape of 3D grid map is 256 × 32 × 256 × 384 and the size of each grid is 0.25 m × 1 m × 0.25 m.
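
To make three of the reported training settings concrete (frame stride from 1 to 3, linear warmup to a learning rate of 1e-4, gradient clipping to a maximum norm of 1.0), here is a minimal pure-Python sketch. It is not the authors' code, which has not been released. The function names, the 1,000-step warmup length, holding the rate constant after warmup, and the 9-frame training clip are all assumptions made for illustration; the source only reports 9 frames for inference.

```python
import math
import random


def subsample_clip(frames, clip_len, stride):
    """Take clip_len frames spaced by stride, starting at a random valid index."""
    span = (clip_len - 1) * stride + 1  # number of source frames the clip covers
    if span > len(frames):
        raise ValueError("video too short for this clip length and stride")
    start = random.randrange(len(frames) - span + 1)
    return frames[start:start + span:stride]


def linear_warmup_lr(step, base_lr=1e-4, warmup_steps=1000):
    """Ramp linearly up to base_lr, then hold it.

    warmup_steps and the constant phase are assumptions; the source only
    states 'linear warmup and a learning rate of 1e-4'.
    """
    return base_lr * min(1.0, (step + 1) / warmup_steps)


def clip_grad_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


if __name__ == "__main__":
    video = list(range(40))        # stand-in for 40 decoded frames
    stride = random.randint(1, 3)  # stride sampled from the reported range
    print(stride, subsample_clip(video, clip_len=9, stride=stride))
    print(linear_warmup_lr(0), linear_warmup_lr(999))
    print(clip_grad_norm([3.0, 4.0]))
```

In a real training loop these roles would normally be filled by the deep learning framework's own optimizer, scheduler, and clipping utilities; the sketch only shows what each reported setting does numerically.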