Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
DINO-WM: World Models on Pre-trained Visual Features Enable Zero-shot Planning
Authors: Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | DINO-WM is experimentally evaluated on six environment suites spanning maze navigation, sliding manipulation, robotic arm control, and deformable object manipulation tasks. Our experiments reveal the following findings: DINO-WM produces high-quality future world modeling that can be measured by improved visual reconstruction from trained decoders. On LPIPS metrics for our hardest tasks, this improves upon prior state-of-the-art work by 56% (See Section 4.7). |
| Researcher Affiliation | Collaboration | 1Courant Institute, New York University 2Meta AI. Correspondence to: Gaoyue Zhou <EMAIL>. |
| Pseudocode | No | The paper describes its planning optimization procedures (Model Predictive Control with the Cross-Entropy Method and with Gradient Descent) as numbered prose steps (a), b), c)). These are descriptive steps, not a structured pseudocode block with control-flow keywords (for, if) or a clearly labeled Algorithm environment. |
| Open Source Code | Yes | Code and models for DINO-WM are open-sourced to ensure reproducibility and videos of planning are made available on our project website: https://dino-wm.github.io/. |
| Open Datasets | Yes | Some of the evaluation environments are drawn from standard robotics benchmarks, such as D4RL (Fu et al., 2021) and the DeepMind Control Suite (Tassa et al., 2018). |
| Dataset Splits | No | The paper describes how training data was generated for each environment, e.g., "We generate 2000 fully random trajectories to train our world models." and "For the fixed wall setting, we train on a fully random dataset of 1920 trajectories each with 50 time steps." It also mentions evaluating on "randomly sampled goal state[s]" and a "validation set", but it does not specify explicit train/validation/test splits (as percentages or exact counts from a larger dataset) needed to reproduce the data partitioning in a standardized way. |
| Hardware Specification | Yes | Table 11 reports the time required on an NVIDIA A6000 GPU for a single inference step, the environment rollout time for advancing one step in the simulator, and the overall planning time for generating an optimal action sequence using the Cross-Entropy Method (CEM). |
| Software Dependencies | No | The paper mentions several software components and links to their GitHub repositories, such as DINOv2, Dreamer V3, AVDC, R3M, and vit-pytorch. It also mentions Nvidia Flex. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Table 13 (shared hyperparameters for DINO-WM training): image size 224; optimizer AdamW; decoder lr 3e-4; predictor lr 5e-5; action encoder lr 5e-4; action embedding dim 10; epochs 100; batch size 32. |
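The planning procedure noted in the Pseudocode row (Model Predictive Control with the Cross-Entropy Method) can be sketched in a few lines. This is a minimal, generic CEM planner, not the authors' implementation: the names (`plan_cem`, `world_model_step`, `cost_fn`) and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def plan_cem(world_model_step, cost_fn, z0, action_dim, horizon=5,
             n_samples=64, n_elites=8, n_iters=3, seed=0):
    """Optimize an action sequence by repeatedly refitting a Gaussian
    to the lowest-cost sampled rollouts (generic CEM, not DINO-WM's code)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # a) sample candidate action sequences from the current Gaussian
        actions = rng.normal(mu, sigma, size=(n_samples, horizon, action_dim))
        # b) roll each candidate forward through the (latent) world model
        costs = np.empty(n_samples)
        for i in range(n_samples):
            z = z0
            for t in range(horizon):
                z = world_model_step(z, actions[i, t])
            costs[i] = cost_fn(z)  # e.g., distance of final latent to goal
        # c) refit the Gaussian to the elite (lowest-cost) sequences
        elites = actions[np.argsort(costs)[:n_elites]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # mean of the final elite distribution

# Toy check with a stand-in linear latent dynamics model.
goal = np.array([1.0, -1.0])
step = lambda z, a: z + 0.1 * a           # hypothetical dynamics
cost = lambda z: np.sum((z - goal) ** 2)  # goal-reaching cost
plan = plan_cem(step, cost, z0=np.zeros(2), action_dim=2)
print(plan.shape)  # (5, 2)
```

In the paper's setting the rollout would happen in the frozen DINOv2 feature space rather than on raw states, with the cost measured between predicted and goal latents.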
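The per-module learning rates reported in the Experiment Setup row (decoder 3e-4, predictor 5e-5, action encoder 5e-4, all AdamW) map naturally onto PyTorch parameter groups. A minimal sketch, assuming stand-in `nn.Linear` modules in place of the real architecture; only the learning rates and the action embedding dimension (10) come from Table 13:

```python
import torch
import torch.nn as nn

# Stand-ins for the real decoder / predictor / action encoder modules.
decoder = nn.Linear(384, 768)
predictor = nn.Linear(384, 384)
action_encoder = nn.Linear(2, 10)  # action embedding dim 10 (Table 13)

# One AdamW optimizer with a separate learning rate per module,
# matching the Table 13 values.
optimizer = torch.optim.AdamW([
    {"params": decoder.parameters(), "lr": 3e-4},
    {"params": predictor.parameters(), "lr": 5e-5},
    {"params": action_encoder.parameters(), "lr": 5e-4},
])
```

Parameter groups keep a single optimizer state while letting each module train at its own rate, which is the usual way such per-component learning rates are implemented.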