Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning
Authors: Sebastian Curi, Felix Berkenkamp, Andreas Krause
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that optimistic exploration significantly speeds up learning when there are penalties on actions, a setting that is notoriously difficult for existing model-based reinforcement learning algorithms. |
| Researcher Affiliation | Collaboration | Sebastian Curi, Department of Computer Science, ETH Zurich (scuri@inf.ethz.ch); Felix Berkenkamp, Bosch Center for Artificial Intelligence (felix.berkenkamp@de.bosch.com); Andreas Krause, Department of Computer Science, ETH Zurich (krausea@ethz.ch) |
| Pseudocode | Yes | Algorithm 1 Model-based Reinforcement Learning; Algorithm 2 H-UCRL combining Optimistic Policy Search and Planning |
| Open Source Code | Yes | We provide an open-source implementation of our method, which is available at http://github.com/sebascuri/hucrl. |
| Open Datasets | No | The paper uses MuJoCo environments for experiments but does not provide access information (link, DOI, formal citation) to any pre-existing public dataset, as the training data is generated online during experimental rollouts. |
| Dataset Splits | No | The paper describes an episodic learning setting in which data is collected from environment rollouts in each episode and used to update the model. It does not specify explicit train/validation/test dataset splits (e.g., percentages or sample counts) for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions PyTorch only implicitly, via the GitHub link to 'rl-lib', a PyTorch-based library, and does not specify exact version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | Throughout the experiments, we consider reward functions of the form r(s, a) = r_state(s) − ρ · c_action(a), where r_state(s) is the reward for being in a good state, and ρ ∈ [0, ∞) is a parameter that scales the action costs c_action(a). ... As modeling choice, we use 5-head probabilistic ensembles as in Chua et al. (2018). ... For more experimental details and learning curves, see Appendix B. (Hedged sketches of this reward structure and of the optimistic planning idea appear below the table.) |
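
For context on the experiment-setup row, the following minimal Python sketch illustrates the r(s, a) = r_state(s) − ρ · c_action(a) reward form used throughout the paper's experiments. The specific state reward and quadratic action cost below are illustrative assumptions, not the paper's exact task definitions.

```python
import numpy as np

def make_reward(state_reward, action_cost, rho=0.1):
    """Compose a reward of the form r(s, a) = r_state(s) - rho * c_action(a)."""
    def reward(state, action):
        return state_reward(state) - rho * action_cost(action)
    return reward

# Hypothetical example with a quadratic action cost, common in MuJoCo-style tasks.
state_reward = lambda s: float(s[0])            # assumed: reward forward position
action_cost = lambda a: float(np.sum(a ** 2))   # assumed: quadratic control penalty
reward_fn = make_reward(state_reward, action_cost, rho=0.1)

s, a = np.array([1.0, 0.0]), np.array([0.5, -0.2])
print(reward_fn(s, a))  # 1.0 - 0.1 * 0.29 = 0.971
```

Larger values of ρ make exploration harder, which is the regime where the paper reports the biggest gains from optimistic exploration.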
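The H-UCRL planner referenced in the 'Pseudocode' row optimizes the policy jointly with a hallucinated control that selects the most favorable next state inside the learned model's confidence bounds. The sketch below is an assumption-laden illustration of that idea, not the repository's API; the names `mean_fn`, `std_fn`, `hallucination`, and `beta` are hypothetical placeholders, and in the paper both the policy and the hallucination are subsequently optimized with a model-based planner or policy-search method.

```python
import numpy as np

def hallucinated_step(mean_fn, std_fn, state, action, eta, beta=1.0):
    """One optimistic ('hallucinated') transition: the next state is picked
    inside the model's confidence interval,
    s' = mu(s, a) + beta * sigma(s, a) * eta, with eta in [-1, 1]^d."""
    eta = np.clip(eta, -1.0, 1.0)
    return mean_fn(state, action) + beta * std_fn(state, action) * eta

def optimistic_return(policy, hallucination, mean_fn, std_fn, reward_fn,
                      init_state, horizon=10, beta=1.0):
    """Roll out the policy under the most optimistic dynamics that the
    model's epistemic uncertainty still allows, and sum the rewards."""
    state, total = init_state, 0.0
    for _ in range(horizon):
        action = policy(state)
        total += reward_fn(state, action)
        state = hallucinated_step(mean_fn, std_fn, state, action,
                                  hallucination(state), beta)
    return total
```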