Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

DINO-WM: World Models on Pre-trained Visual Features Enable Zero-shot Planning

Authors: Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | DINO-WM is experimentally evaluated on six environment suites spanning maze navigation, sliding manipulation, robotic arm control, and deformable object manipulation. The experiments show that DINO-WM produces high-quality future world modeling, as measured by improved visual reconstruction from trained decoders; on LPIPS for the hardest tasks, it improves upon the prior state of the art by 56% (see Section 4.7).
Researcher Affiliation | Collaboration | ¹Courant Institute, New York University; ²Meta AI. Correspondence to: Gaoyue Zhou <EMAIL>.
Pseudocode | No | The paper describes its planning optimization procedures (Model Predictive Control with the Cross-Entropy Method and with gradient descent) as numbered bullet points (a), b), c)). These are descriptive steps, not structured pseudocode: there is no labeled Algorithm environment and no algorithmic keywords such as loops or conditionals.
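The CEM planning loop the paper describes in prose can be sketched generically. This is a minimal NumPy sketch of the Cross-Entropy Method over an action sequence, not the authors' implementation; `cost_fn` is a hypothetical stand-in for rolling the candidate actions through the world model and measuring latent-space distance to the goal.

```python
import numpy as np

def cem_plan(cost_fn, horizon, action_dim, n_samples=64,
             n_elites=8, n_iters=5, seed=0):
    """Cross-Entropy Method over an (horizon, action_dim) action sequence.

    cost_fn: hypothetical callable mapping an action sequence to a scalar
    cost (e.g. distance between predicted and goal latents in a world model).
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # a) sample candidate action sequences from the current Gaussian
        samples = rng.normal(mu, sigma, size=(n_samples, horizon, action_dim))
        # b) score each candidate (in DINO-WM this would be a model rollout)
        costs = np.array([cost_fn(s) for s in samples])
        # c) refit the Gaussian to the lowest-cost (elite) candidates
        elites = samples[np.argsort(costs)[:n_elites]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # mean of the final distribution = planned action sequence

# Toy usage: find a 5-step action sequence whose actions sum to 3.0.
plan = cem_plan(lambda a: abs(a.sum() - 3.0), horizon=5, action_dim=1)
```

The same loop underlies MPC: only the first planned action is executed, then the plan is recomputed from the new state.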
Open Source Code | Yes | "Code and models for DINO-WM are open-sourced to ensure reproducibility and videos of planning are made available on our project website: https://dino-wm.github.io/."
Open Datasets | Yes | "...some of which are drawn from standard robotics benchmarks, such as D4RL (Fu et al., 2021) and the DeepMind Control Suite (Tassa et al., 2018)"
Dataset Splits | No | The paper describes how training data was generated for each environment, for example: "We generate 2000 fully random trajectories to train our world models." and "For the fixed wall setting, we train on a fully random dataset of 1920 trajectories each with 50 time steps." It also mentions evaluating on "randomly sampled goal state[s]" and a "validation set", but it never specifies explicit train/validation/test splits (as percentages or exact counts), so the data partitioning cannot be reproduced in a standardized way.
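Because the paper does not report its partitioning, a reproduction must choose its own. A minimal deterministic split over trajectory indices might look like the following; the 10% validation fraction is an assumed placeholder, not a value from the paper.

```python
import random

def split_trajectories(n_trajs, val_frac=0.1, seed=0):
    """Deterministic train/val split over trajectory indices.

    val_frac is an assumed placeholder; the paper does not report a split.
    """
    idx = list(range(n_trajs))
    random.Random(seed).shuffle(idx)  # fixed seed -> reproducible split
    n_val = int(n_trajs * val_frac)
    return idx[n_val:], idx[:n_val]  # (train indices, val indices)

# 2000 fully random trajectories, as generated in the paper
train_idx, val_idx = split_trajectories(2000)
```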
Hardware Specification | Yes | Table 11 reports, on an NVIDIA A6000 GPU, the time for a single inference step, the environment rollout time for advancing one simulator step, and the overall planning time for generating an optimal action sequence with the Cross-Entropy Method (CEM).
Software Dependencies | No | The paper mentions several software components and links to their GitHub repositories (DINOv2, Dreamer V3, AVDC, R3M, vit-pytorch) and also mentions NVIDIA FleX, but it provides no version numbers for any of these dependencies.
Experiment Setup | Yes | Table 13 (shared hyperparameters for DINO-WM training): image size 224; optimizer AdamW; decoder lr 3e-4; predictor lr 5e-5; action encoder lr 5e-4; action embedding dim 10; epochs 100; batch size 32.
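For convenience, the Table 13 values can be collected into a single config object; the key names below are illustrative and are not taken from the DINO-WM codebase.

```python
# Hypothetical reconstruction of Table 13 as a config dict.
# Key names are illustrative, not from the DINO-WM repository.
DINO_WM_TRAIN_CONFIG = {
    "image_size": 224,
    "optimizer": "AdamW",
    "lr": {                      # per-module learning rates
        "decoder": 3e-4,
        "predictor": 5e-5,
        "action_encoder": 5e-4,
    },
    "action_emb_dim": 10,
    "epochs": 100,
    "batch_size": 32,
}
```

Note the three distinct learning rates: in a framework like PyTorch these would typically be passed to the optimizer as separate parameter groups rather than a single global rate.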