Towards the Transferability of Rewards Recovered via Regularized Inverse Reinforcement Learning

Authors: Andreas Schlaginhaufen, Maryam Kamgarpour

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally validate our results in a gridworld environment (Section 5).
Researcher Affiliation | Academia | Andreas Schlaginhaufen, SYCAMORE, EPFL (andreas.schlaginhaufen@epfl.ch); Maryam Kamgarpour, SYCAMORE, EPFL (maryam.kamgarpour@epfl.ch)
Pseudocode | Yes | Algorithm 1: Multi-expert IRL
Open Source Code | Yes | The code is openly accessible at https://github.com/andrschl/transfer_irl.
Open Datasets | Yes | To validate our results experimentally, we adopt a stochastic variant of the Windy Gridworld environment [Sutton and Barto, 2018].
Dataset Splits | No | The paper does not explicitly mention a dedicated validation dataset split, only "expert data sets" for learning.
Hardware Specification | Yes | All our experiments were carried out within a day on a MacBook Pro with an Apple M1 Pro chip and 32 GB of RAM.
Software Dependencies | No | The paper mentions general software components like "soft policy iteration" but does not specify version numbers for any libraries or dependencies.
Experiment Setup | Yes | Using Shannon entropy regularization with τ = 0.3, we then use soft policy iteration to get expert policies for each combination of expert reward and wind strength β. For each of these expert policies, we then generate expert data sets with N ∈ {10³, 10⁴, 10⁵, 10⁶} trajectories of length H = 100. Next, we run Algorithm 1, with soft policy iteration as a subroutine, for 30 000 iterations, where rewards are initialized by sampling from a standard normal distribution. As a reward class, we choose the 1-ball with radius 10³ (essentially unbounded), and use a stepsize α = 0.05 for the first 15 000 iterations and α = 0.005 for the second half. Moreover, we sample N = 100 new trajectories of horizon H = 100 at each gradient step. (See the configuration sketch below.)
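
The Experiment Setup row describes soft policy iteration nested inside a projected-gradient IRL loop. The following is a minimal sketch of that pipeline under simplifying assumptions: a small random MDP stands in for the stochastic Windy Gridworld, occupancy measures are computed exactly rather than estimated from N = 100 sampled trajectories, the reward ball is projected in the Euclidean norm, and a discounted objective replaces the finite horizon H = 100. The function names (soft_policy, occupancy, irl) are illustrative, not the authors' API; the actual implementation is in the linked repository.

```python
# Minimal, self-contained sketch (not the authors' code): entropy-regularized
# ("soft") policy iteration as a subroutine inside a projected-gradient IRL
# loop, mirroring the Experiment Setup row above. The tiny random MDP, the
# Euclidean projection, and the exact occupancy computation (instead of
# sampling N = 100 trajectories per step) are simplifying assumptions.
import numpy as np

TAU = 0.3       # Shannon entropy regularization strength (as in the paper)
GAMMA = 0.95    # discount factor (assumed; the paper uses horizon H = 100)
RADIUS = 1e3    # radius of the reward-class ball ("essentially unbounded")


def soft_policy(P, r, tau=TAU, gamma=GAMMA, iters=200):
    """Soft value iteration; returns the entropy-regularized optimal policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * P @ V                       # shape (S, A)
        m = Q.max(axis=1)                           # stabilized soft-max
        V = m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))
    return np.exp((Q - V[:, None]) / tau)           # pi(a|s), rows sum to 1


def occupancy(P, pi, mu0, gamma=GAMMA, iters=500):
    """Discounted state-action occupancy measure of policy pi."""
    d_state, occ = mu0.copy(), np.zeros_like(pi)
    for t in range(iters):
        occ += (gamma ** t) * d_state[:, None] * pi
        d_state = np.einsum("sa,sap->p", d_state[:, None] * pi, P)
    return occ


def irl(P, mu0, occ_expert, iters=30_000):
    """Projected gradient ascent on the reward: push the soft-optimal
    policy's occupancy measure toward the expert's."""
    rng = np.random.default_rng(0)
    r = rng.standard_normal(P.shape[:2])            # standard-normal init
    for k in range(iters):
        alpha = 0.05 if k < iters // 2 else 0.005   # two-phase stepsize
        occ_learner = occupancy(P, soft_policy(P, r), mu0)
        r = r + alpha * (occ_expert - occ_learner)  # likelihood gradient
        norm = np.linalg.norm(r)
        if norm > RADIUS:                           # project onto the ball
            r *= RADIUS / norm
    return r


# Toy usage on a random 5-state, 3-action MDP standing in for the gridworld:
S, A = 5, 3
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))          # transition kernel
mu0 = np.ones(S) / S                                # uniform initial states
occ_expert = occupancy(P, soft_policy(P, rng.standard_normal((S, A))), mu0)
r_hat = irl(P, mu0, occ_expert, iters=200)          # short run for the demo
```

The two-phase stepsize (α = 0.05 then 0.005) and the standard-normal reward initialization follow the quoted setup; the demo run at the bottom uses far fewer than the paper's 30 000 iterations purely to keep the example quick.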