Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning

Authors: Sebastian Curi, Felix Berkenkamp, Andreas Krause

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that optimistic exploration significantly speeds-up learning when there are penalties on actions, a setting that is notoriously difficult for existing model-based reinforcement learning algorithms.
Researcher Affiliation | Collaboration | Sebastian Curi, Department of Computer Science, ETH Zurich (scuri@inf.ethz.ch); Felix Berkenkamp, Bosch Center for Artificial Intelligence (felix.berkenkamp@de.bosch.com); Andreas Krause, Department of Computer Science, ETH Zurich (krausea@ethz.ch)
Pseudocode | Yes | Algorithm 1: Model-based Reinforcement Learning; Algorithm 2: H-UCRL, combining optimistic policy search and planning (see the sketch after this table)
Open Source Code | Yes | We provide an open-source implementation of our method, which is available at http://github.com/sebascuri/hucrl.
Open Datasets | No | The paper uses MuJoCo environments for its experiments but does not provide access information (link, DOI, formal citation) to any pre-existing public dataset used for training, as the data is generated during experimental rollouts.
Dataset Splits | No | The paper describes an episodic learning setting where data is collected and used. It does not specify explicit train/validation/test dataset splits (e.g., percentages or sample counts) for reproducibility.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions PyTorch only implicitly, via the GitHub link to 'rl-lib, a PyTorch-based library', and does not specify exact version numbers for PyTorch or any other software dependency.
Experiment Setup | Yes | Throughout the experiments, we consider reward functions of the form r(s, a) = r_state(s) - ρ c_action(a), where r_state(s) is the reward for being in a good state, and ρ ∈ [0, ∞) is a parameter that scales the action costs c_action(a). ... As modeling choice, we use 5-head probabilistic ensembles as in Chua et al. (2018). ... For more experimental details and learning curves, see Appendix B. (A sketch of the penalised reward and the optimistic rollout follows below.)
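
The two technical items above (the H-UCRL pseudocode row and the experiment-setup row) can be illustrated with a minimal sketch. This is not the authors' implementation from the hucrl repository; it only shows, under stated assumptions, the action-penalised reward r(s, a) = r_state(s) - ρ c_action(a) and one H-UCRL-style optimistic rollout in which a hallucinated control η ∈ [-1, 1]^d steers the model's epistemic uncertainty (next state = mean + β · std · η). All names here (reward, hallucinated_step, optimistic_return, mean_fn, std_fn, beta, ...) are illustrative placeholders, not the paper's API.

```python
# Minimal sketch (assumed structure, not the authors' code) of an
# action-penalised reward and an optimistic H-UCRL-style rollout.
import numpy as np

def reward(state, action, r_state, c_action, rho=0.1):
    """Reward with an action penalty scaled by rho, matching the form
    r(s, a) = r_state(s) - rho * c_action(a) quoted in the setup row."""
    return r_state(state) - rho * c_action(action)

def hallucinated_step(state, action, eta, mean_fn, std_fn, beta=1.0):
    """One optimistic transition: next state = mean + beta * std * eta,
    where eta in [-1, 1]^state_dim acts as a hallucinated control on the
    model's epistemic uncertainty."""
    eta = np.clip(eta, -1.0, 1.0)
    return mean_fn(state, action) + beta * std_fn(state, action) * eta

def optimistic_return(policy, eta_policy, s0, horizon,
                      mean_fn, std_fn, r_state, c_action,
                      rho=0.1, beta=1.0):
    """Roll the policy through the hallucinated (optimistic) model for a
    finite horizon and accumulate the penalised reward."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        total += reward(s, a, r_state, c_action, rho)
        s = hallucinated_step(s, a, eta_policy(s), mean_fn, std_fn, beta)
    return total
```

In this sketch, maximizing optimistic_return jointly over the real policy and the hallucinated eta_policy is the optimistic policy-search step; with ρ > 0 the action penalty is exactly the setting the paper highlights as difficult for existing model-based algorithms and where optimistic exploration speeds up learning.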