Information Directed Reward Learning for Reinforcement Learning

Authors: David Lindner, Matteo Turchetta, Sebastian Tschiatschek, Kamil Ciosek, Andreas Krause

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We support our findings with extensive evaluations in multiple environments and with different query types."
Researcher Affiliation | Collaboration | David Lindner (Department of Computer Science, ETH Zurich; david.lindner@inf.ethz.ch); Matteo Turchetta (Department of Computer Science, ETH Zurich; matteo.turchetta@inf.ethz.ch); Sebastian Tschiatschek (Department of Computer Science, University of Vienna; sebastian.tschiatschek@univie.ac.at); Kamil Ciosek (Spotify; kamilc@spotify.com); Andreas Krause (Department of Computer Science, ETH Zurich; krausea@ethz.ch)
Pseudocode | Yes | Algorithm 1: Information Directed Reward Learning (IDRL). The algorithm requires a set of candidate queries Qc, a Bayesian model of the reward function, and an RL algorithm that returns a policy given a reward function. Ĝ(π) is the belief about the expected return of policy π, induced by the reward model P(r̂ | D), and r̂ is the belief about the reward function. (A toy sketch of this query-selection loop follows the table.)
Open Source Code | Yes | "Appendices D and E describe the experimental setup in more detail, and we provide code to reproduce all experiments." [Footnote 4: https://github.com/david-lindner/idrl]
Open Datasets | No | The paper uses simulated environments (Gridworlds, Driver, MuJoCo tasks) where data is generated through interaction, rather than relying on pre-existing, publicly available datasets with explicitly defined training sets.
Dataset Splits | No | The paper uses simulated environments and does not specify training, validation, or test dataset splits in terms of percentages or sample counts, as it generates data dynamically through interaction.
Hardware Specification | No | The paper mentions experiments running on a 'single CPU' or 'single GPU' but does not provide specific CPU/GPU models, types, or other detailed hardware specifications.
Software Dependencies | No | The paper mentions 'augmented random search', the 'Soft Actor-Critic (SAC)' algorithm, and 'OpenAI Gym' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3e-4... The batch size is 256... The policy is trained for 10^7 timesteps..." (A hedged configuration sketch follows the table.)
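
To make the Algorithm 1 summary in the Pseudocode row concrete, here is a minimal toy sketch of information directed query selection under a linear-Gaussian reward belief. Everything in it (the linear reward model, the feature-evaluation queries, the restriction to the return difference of the two most promising candidate policies) is a simplifying assumption chosen for illustration; it is not the authors' implementation, which is available in the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy problem (all modelling choices here are illustrative assumptions) ---
d = 5                                   # number of reward features
true_r = rng.normal(size=d)             # unknown "true" reward parameters
policies = rng.normal(size=(4, d))      # feature expectations of 4 candidate policies
sigma = 0.1                             # observation noise std of a reward query

# Bayesian linear-Gaussian reward belief: prior r ~ N(0, I)
mu = np.zeros(d)
cov = np.eye(d)

def posterior_update(mu, cov, x, y, sigma):
    """Condition the Gaussian belief on one noisy linear observation y = x @ r + eps."""
    s = x @ cov @ x + sigma**2
    k = cov @ x / s
    return mu + k * (y - x @ mu), cov - np.outer(k, cov @ x)

def info_gain(cov, x, v, sigma):
    """Mutual information between the observation x @ r + eps and the scalar v @ r
    (here: the return difference between the two most promising policies)."""
    var_before = v @ cov @ v
    s = x @ cov @ x + sigma**2
    cov_after = cov - np.outer(cov @ x, cov @ x) / s   # belief after a hypothetical query
    var_after = v @ cov_after @ v
    return 0.5 * np.log(var_before / max(var_after, 1e-12))

# Candidate queries: ask for a noisy evaluation of one reward feature at a time.
queries = [np.eye(d)[i] for i in range(d)]

for t in range(10):
    # Most promising policies under the current belief: the two with the
    # highest posterior mean return G_hat(pi) = phi_pi @ mu.
    returns = policies @ mu
    second_best, best_policy = np.argsort(returns)[-2:]
    v = policies[best_policy] - policies[second_best]  # return-difference direction

    # Information directed selection: most informative query about that return difference.
    query = max(queries, key=lambda x: info_gain(cov, x, v, sigma))

    # "Oracle" answers the query; update the Bayesian posterior P(r_hat | D).
    y = query @ true_r + rng.normal(scale=sigma)
    mu, cov = posterior_update(mu, cov, query, y, sigma)

print("estimated best policy:", int(np.argmax(policies @ mu)))
print("true best policy:     ", int(np.argmax(policies @ true_r)))
```

The point mirrored here is that queries are scored by how much they reduce uncertainty about which candidate policy has the higher expected return under the reward belief, rather than by how much they reduce uncertainty about the reward function itself.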
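
The Experiment Setup row quotes an Adam learning rate of 3e-4, a batch size of 256, and 10^7 training timesteps with SAC. The snippet below shows one possible way to pin those numbers down; the choice of Stable-Baselines3, the environment id, and all other settings are assumptions, since the quoted text does not state which SAC implementation or MuJoCo task they correspond to.

```python
# Hypothetical reproduction of the quoted hyperparameters using Stable-Baselines3's SAC.
# The library, the environment id, and all unspecified settings are assumptions.
import gym
from stable_baselines3 import SAC

env = gym.make("HalfCheetah-v3")         # assumed MuJoCo task; the paper evaluates several

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,                  # Adam learning rate quoted in the paper
    batch_size=256,                      # batch size quoted in the paper
    verbose=1,
)
model.learn(total_timesteps=10_000_000)  # 10^7 timesteps quoted in the paper
```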