Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion
Authors: Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, Honglak Lee
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach outperforms model-free baselines on challenging continuous control benchmarks with an order-of-magnitude increase in sample efficiency, and in contrast to previous model-based approaches, performance does not degrade in complex environments. We systematically evaluate STEVE on several challenging continuous control benchmarks and demonstrate that STEVE significantly outperforms model-free baselines with an order-of-magnitude increase in sample efficiency. |
| Researcher Affiliation | Industry | Google Brain, Mountain View, CA, USA |
| Pseudocode | No | The paper includes a section titled 'Algorithm' but it describes the algorithm in text rather than providing a structured pseudocode block. |
| Open Source Code | Yes | Our code is available open-source at: https://github.com/tensorflow/models/tree/master/research/steve |
| Open Datasets | Yes | We evaluated STEVE on a variety of continuous control tasks [3, 19]; we plot learning curves in Figure 3. ... [3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016. [19] O. Klimov and J. Schulman. Roboschool. https://github.com/openai/roboschool. |
| Dataset Splits | No | The paper does not provide specific train/validation/test dataset split percentages or counts. Reinforcement learning environments are used where data is collected through interaction. |
| Hardware Specification | Yes | Both were trained on a P100 GPU and had 8 CPUs collecting data; STEVE-DDPG additionally used a second P100 to learn a model in parallel. |
| Software Dependencies | No | The paper states 'All algorithms were implemented in Tensorflow [1]' but does not provide a specific version number for TensorFlow or any other software dependencies. |
| Experiment Setup | No | For hyperparameters and additional implementation details, please see Appendix C. |