Information Directed Reward Learning for Reinforcement Learning

Authors: David Lindner, Matteo Turchetta, Sebastian Tschiatschek, Kamil Ciosek, Andreas Krause

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We support our findings with extensive evaluations in multiple environments and with different query types."
Researcher Affiliation | Collaboration | David Lindner (Department of Computer Science, ETH Zurich; david.lindner@inf.ethz.ch); Matteo Turchetta (Department of Computer Science, ETH Zurich; matteo.turchetta@inf.ethz.ch); Sebastian Tschiatschek (Department of Computer Science, University of Vienna; sebastian.tschiatschek@univie.ac.at); Kamil Ciosek (Spotify; kamilc@spotify.com); Andreas Krause (Department of Computer Science, ETH Zurich; krausea@ethz.ch)
Pseudocode | Yes | Algorithm 1: Information Directed Reward Learning (IDRL). The algorithm requires a set of candidate queries Qc, a Bayesian model of the reward function, and an RL algorithm that returns a policy given a reward function. Ĝ(π) is the belief about the expected return of policy π, induced by the reward model P(r̂ | D), and r̂ is the belief about the reward function. (A toy sketch of this query-selection loop follows the table.)
Open Source Code | Yes | "Appendices D and E describe the experimental setup in more detail, and we provide code to reproduce all experiments." [Footnote 4: https://github.com/david-lindner/idrl]
Open Datasets | No | The paper uses simulated environments (Gridworlds, Driver, MuJoCo tasks) where data is generated through interaction, rather than relying on pre-existing, publicly available datasets with explicitly defined training sets.
Dataset Splits | No | The paper uses simulated environments and does not specify training, validation, or test dataset splits in terms of percentages or sample counts, as it generates data dynamically through interaction.
Hardware Specification | No | The paper mentions experiments running on a 'single CPU' or 'single GPU' but does not provide specific CPU/GPU models, types, or other detailed hardware specifications.
Software Dependencies | No | The paper mentions 'augmented random search', the 'Soft Actor-Critic (SAC)' algorithm, and 'OpenAI Gym' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3e-4... The batch size is 256... The policy is trained for 10^7 timesteps..." (A hedged configuration sketch follows the table.)
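
To make the Algorithm 1 summary in the Pseudocode row concrete, here is a minimal toy sketch of information directed query selection under a linear-Gaussian reward belief. Everything in it (the linear reward model, the feature-evaluation queries, the restriction to the return difference of the two most promising candidate policies) is a simplifying assumption chosen for illustration; it is not the authors' implementation, which is available in the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy problem (all modelling choices here are illustrative assumptions) ---
d = 5                                   # number of reward features
true_r = rng.normal(size=d)             # unknown "true" reward parameters
policies = rng.normal(size=(4, d))      # feature expectations of 4 candidate policies
sigma = 0.1                             # observation noise std of a reward query

# Bayesian linear-Gaussian reward belief: prior r ~ N(0, I)
mu = np.zeros(d)
cov = np.eye(d)

def posterior_update(mu, cov, x, y, sigma):
    """Condition the Gaussian belief on one noisy linear observation y = x @ r + eps."""
    s = x @ cov @ x + sigma**2
    k = cov @ x / s
    return mu + k * (y - x @ mu), cov - np.outer(k, cov @ x)

def info_gain(cov, x, v, sigma):
    """Mutual information between the observation x @ r + eps and the scalar v @ r
    (here: the return difference between the two most promising policies)."""
    var_before = v @ cov @ v
    s = x @ cov @ x + sigma**2
    cov_after = cov - np.outer(cov @ x, cov @ x) / s   # belief after a hypothetical query
    var_after = v @ cov_after @ v
    return 0.5 * np.log(var_before / max(var_after, 1e-12))

# Candidate queries: ask for a noisy evaluation of one reward feature at a time.
queries = [np.eye(d)[i] for i in range(d)]

for t in range(10):
    # Most promising policies under the current belief: the two with the
    # highest posterior mean return G_hat(pi) = phi_pi @ mu.
    returns = policies @ mu
    second_best, best_policy = np.argsort(returns)[-2:]
    v = policies[best_policy] - policies[second_best]  # return-difference direction

    # Information directed selection: most informative query about that return difference.
    query = max(queries, key=lambda x: info_gain(cov, x, v, sigma))

    # "Oracle" answers the query; update the Bayesian posterior P(r_hat | D).
    y = query @ true_r + rng.normal(scale=sigma)
    mu, cov = posterior_update(mu, cov, query, y, sigma)

print("estimated best policy:", int(np.argmax(policies @ mu)))
print("true best policy:     ", int(np.argmax(policies @ true_r)))
```

The point mirrored here is that queries are scored by how much they reduce uncertainty about which candidate policy has the higher expected return under the reward belief, rather than by how much they reduce uncertainty about the reward function itself.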
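
The Experiment Setup row quotes an Adam learning rate of 3e-4, a batch size of 256, and 10^7 training timesteps with SAC. The snippet below shows one possible way to pin those numbers down; the choice of Stable-Baselines3, the environment id, and all other settings are assumptions, since the quoted text does not state which SAC implementation or MuJoCo task they correspond to.

```python
# Hypothetical reproduction of the quoted hyperparameters using Stable-Baselines3's SAC.
# The library, the environment id, and all unspecified settings are assumptions.
import gym
from stable_baselines3 import SAC

env = gym.make("HalfCheetah-v3")         # assumed MuJoCo task; the paper evaluates several

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,                  # Adam learning rate quoted in the paper
    batch_size=256,                      # batch size quoted in the paper
    verbose=1,
)
model.learn(total_timesteps=10_000_000)  # 10^7 timesteps quoted in the paper
```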