Information Directed Reward Learning for Reinforcement Learning
Authors: David Lindner, Matteo Turchetta, Sebastian Tschiatschek, Kamil Ciosek, Andreas Krause
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We support our findings with extensive evaluations in multiple environments and with different query types. |
| Researcher Affiliation | Collaboration | David Lindner (Department of Computer Science, ETH Zurich, david.lindner@inf.ethz.ch); Matteo Turchetta (Department of Computer Science, ETH Zurich, matteo.turchetta@inf.ethz.ch); Sebastian Tschiatschek (Department of Computer Science, University of Vienna, sebastian.tschiatschek@univie.ac.at); Kamil Ciosek (Spotify, kamilc@spotify.com); Andreas Krause (Department of Computer Science, ETH Zurich, krausea@ethz.ch) |
| Pseudocode | Yes | Algorithm 1: Information Directed Reward Learning (IDRL). The algorithm requires a set of candidate queries Q_c, a Bayesian model of the reward function, and an RL algorithm that returns a policy given a reward function. Ĝ(π) is the belief about the expected return of policy π induced by the reward model P(r̂ \| D), and r̂ is the belief about the reward function. (A hedged sketch of this loop is given after the table.) |
| Open Source Code | Yes | Appendices D and E describe the experimental setup in more detail, and we provide code to reproduce all experiments. [Footnote 4: https://github.com/david-lindner/idrl] |
| Open Datasets | No | The paper uses simulated environments (Gridworlds, Driver, MuJoCo tasks) where data is generated through interaction, rather than relying on pre-existing, publicly available datasets with explicitly defined training sets. |
| Dataset Splits | No | The paper uses simulated environments and does not specify training, validation, or test dataset splits in terms of percentages or sample counts, as it generates data dynamically through interactions. |
| Hardware Specification | No | The paper mentions experiments running on a 'single CPU' or 'single GPU' but does not provide specific CPU/GPU models, types, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions 'augmented random search', the 'Soft Actor-Critic algorithm (SAC)', and 'OpenAI Gym' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3e-4... The batch size is 256... The policy is trained for 10^7 timesteps... (An illustrative configuration matching these values is sketched after the table.) |
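
As a reading aid for the Pseudocode row, the following is a minimal sketch of the Algorithm 1 loop. It assumes a Bayesian *linear* reward model r(s, a) = wᵀφ(s, a) with a Gaussian prior and Gaussian query noise (a special case of the GP models used in the paper), so that the return beliefs Ĝ(π) and the information-gain scores have closed forms. Candidate policies are summarised by expected feature counts, the toy data and all helper names are illustrative rather than taken from the authors' repository, and the paper's interleaved policy training and additional query types (e.g. preference comparisons) are omitted.

```python
# Hedged sketch of the IDRL query-selection loop (Algorithm 1), under a
# Bayesian linear-Gaussian reward model. Not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)

d = 5                       # feature dimension of the linear reward model
sigma2 = 0.1 ** 2           # observation-noise variance of query responses
w_true = rng.normal(size=d)                    # unknown "true" reward parameters
policy_features = rng.normal(size=(8, d))      # expected feature counts of candidate policies
candidate_queries = rng.normal(size=(50, d))   # feature vectors of the candidate queries Q_c

# Gaussian prior over the reward parameters: w ~ N(mu, Sigma).
mu, Sigma = np.zeros(d), np.eye(d)

for step in range(20):
    # 1. Belief about the return of each candidate policy, G(pi) = phi_pi^T w.
    mean = policy_features @ mu
    std = np.sqrt(np.einsum("id,de,ie->i", policy_features, Sigma, policy_features))

    # 2. Plausibly optimal policies: UCB above the best LCB (beta is an arbitrary choice here).
    beta = 2.0
    plausible = np.flatnonzero(mean + beta * std >= np.max(mean - beta * std))
    if len(plausible) < 2:                     # fall back to the two best posterior means
        plausible = np.argsort(mean)[-2:]

    # 3. Pair of plausibly optimal policies whose return *difference* is most uncertain.
    pairs = [(i, j) for i in plausible for j in plausible if i < j]
    diffs = np.array([policy_features[i] - policy_features[j] for i, j in pairs])
    v = diffs[np.argmax(np.einsum("pd,de,pe->p", diffs, Sigma, diffs))]

    # 4. Query with maximal information gain about that return difference; for a
    #    linear-Gaussian model this is the query that most shrinks Var(v^T w).
    Sx = candidate_queries @ Sigma
    gain = (Sx @ v) ** 2 / (np.einsum("id,id->i", Sx, candidate_queries) + sigma2)
    q = candidate_queries[np.argmax(gain)]

    # 5. Observe a noisy reward for the selected query and update the posterior
    #    (standard Bayesian linear-regression update).
    y = q @ w_true + rng.normal(scale=np.sqrt(sigma2))
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma = np.linalg.inv(Sigma_inv + np.outer(q, q) / sigma2)
    mu = Sigma @ (Sigma_inv @ mu + q * y / sigma2)

# Recommend the candidate policy with the highest posterior mean return.
print("recommended candidate policy:", int(np.argmax(policy_features @ mu)))
```

The query score in step 4 is the closed-form criterion for a Gaussian posterior: maximizing (vᵀΣq)² / (qᵀΣq + σ²) is equivalent to maximizing the mutual information between the query response and the return difference vᵀw, which is the information-directed selection rule the Pseudocode row describes.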
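
For the Experiment Setup row, the quoted hyperparameters (Adam, learning rate 3e-4, batch size 256, 10^7 training timesteps) map onto a standard SAC configuration. The snippet below uses stable-baselines3 and the HalfCheetah-v3 Gym environment purely for illustration; the paper does not state which SAC implementation, library versions, or exact environment IDs were used.

```python
# Illustrative SAC configuration matching the hyperparameters quoted in the
# "Experiment Setup" row. stable-baselines3 and HalfCheetah-v3 are assumptions
# made for the sake of a runnable example, not the authors' actual setup.
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "HalfCheetah-v3",        # placeholder MuJoCo task
    learning_rate=3e-4,      # Adam learning rate quoted in the paper
    batch_size=256,          # batch size quoted in the paper
    verbose=1,
)
model.learn(total_timesteps=10_000_000)   # 10^7 timesteps of policy training
```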