Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SOMBRL: Scalable and Optimistic Model-Based RL

Authors: Bhavya Sukhija, Lenart Treven, Carmelo Sferrazza, Florian Dörfler, Pieter Abbeel, Andreas Krause

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SOMBRL on state-based and visual-control environments, where it displays strong performance across all tasks and baselines. We also evaluate SOMBRL on a dynamic RC car hardware and show SOMBRL outperforms the state-of-the-art, illustrating the benefits of principled exploration for MBRL.
Researcher Affiliation | Academia | Bhavya Sukhija (Department of Computer Science, ETH Zurich); Lenart Treven (Department of Computer Science, ETH Zurich); Carmelo Sferrazza (Berkeley AI Research, UC Berkeley); Florian Dörfler (Department of Electrical Engineering, ETH Zurich); Pieter Abbeel (Berkeley AI Research, UC Berkeley); Andreas Krause (Department of Computer Science, ETH Zurich)
Pseudocode | No | The paper describes the methodology in prose and mathematical equations (e.g., Section 4 and B.4), but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps.
Open Source Code | No | We base our algorithm on top of open-source repositories and provide all details and hyperparameters in Appendix D. Our experiments are based on different repositories depending on the base algorithm. We will provide our final code (which merges all dependencies and repositories) with the camera-ready version.
Open Datasets | Yes | We consider the DeepMind Control (DMC) benchmark (Tassa et al., 2018) for the state-based and visual control tasks and test on environments with varying dimensionality. We also evaluate on several environments from the Atari benchmark (Bellemare et al., 2013) for the visual control tasks.
Dataset Splits | No | The paper uses standard RL benchmarks (DMC, Atari) but does not specify traditional train/test/validation splits as percentages or sample counts. It describes evaluation based on 'episodic returns using the median over 5 seeds' and distinguishes between 'episodic' and 'nonepisodic' settings, which relate to how data is collected during interaction with the environment rather than to predefined dataset splits.
Hardware Specification | Yes | Table 3: Computation cost comparison for SOMBRL with different base algorithms. MBPO-MEAN: 9.6 ± 0.2 min (time per 100k steps, 1 ensemble, GPU: NVIDIA GeForce RTX 2080 Ti); MBPO-OPTIMISTIC: 13.7 ± 0.35 min (time per 100k steps, 5 ensembles, GPU: NVIDIA GeForce RTX 2080 Ti); DREAMER: 42.24 ± 0.95 min (time per 100k steps, GPU: NVIDIA GeForce RTX 4090); DREAMER-OPTIMISTIC: 46.32 ± 0.34 min (time per 100k steps, 5 ensembles, GPU: NVIDIA GeForce RTX 4090).
Software Dependencies | No | Appendix D mentions using 'DREAMERV3 As the base model' and refers to its 'official DREAMERV3 implementation (https://github.com/danijar/dreamerv3/tree/main)' and 'Sukhija et al. (2024b) official implementation: https://github.com/lasgroup/opax'. However, specific version numbers for general software components such as Python, PyTorch, TensorFlow, or CUDA are not provided.
Experiment Setup | Yes | D.1 MBPO-OPTIMISTIC: 'For all tasks we use a (256, 256) neural network architecture with 5 ensembles, except for the humanoid and quadruped tasks where we use (512, 512).' D.2 DREAMER-OPTIMISTIC: 'We initialize λ with 2 and pick α = 0.001. For the rest, we use the same hyperparameters as DREAMER. We use the 12 million size model and the official DREAMERV3 implementation.' D.3 SIMFSVGD-OPTIMISTIC: 'For λn we found that a linearly decaying schedule worked the best. Therefore, we linearly interpolated from λ0 = 0.5 and λ10 = 0.' D.4 GP experiments: 'For all the experiments, we use λn = 10 and for planning the iCEM optimizer (Pinneri et al., 2021).'
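
The linearly decaying schedule quoted for SIMFSVGD-OPTIMISTIC can be illustrated with a minimal sketch. It is not the authors' code: the function name `linear_lambda`, the reading of the subscript as an episode index n, and the choice to hold λ at its final value for n > 10 are all assumptions made for illustration. Only the endpoints (0.5 at n = 0, 0 at n = 10) come from the quoted text.

```python
def linear_lambda(n: int, lam_start: float = 0.5, lam_end: float = 0.0,
                  n_end: int = 10) -> float:
    """Linearly interpolate the exploration weight between two endpoints.

    Assumptions (not stated in the source): n is an episode index, and the
    weight stays at lam_end once n exceeds n_end.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    # Fraction of the decay completed, clamped to 1 after n_end.
    frac = min(n, n_end) / n_end
    return lam_start + frac * (lam_end - lam_start)


if __name__ == "__main__":
    # Print the weight for the first few episodes.
    for n in range(12):
        print(n, linear_lambda(n))
```

With the default endpoints, the weight falls by 0.05 per step until it reaches zero, so the exploration term is fully switched off from n = 10 onward under the clamping assumption.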