Towards the Transferability of Rewards Recovered via Regularized Inverse Reinforcement Learning
Authors: Andreas Schlaginhaufen, Maryam Kamgarpour
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally validate our results in a gridworld environment (Section 5). |
| Researcher Affiliation | Academia | Andreas Schlaginhaufen SYCAMORE, EPFL andreas.schlaginhaufen@epfl.ch Maryam Kamgarpour SYCAMORE, EPFL maryam.kamgarpour@epfl.ch |
| Pseudocode | Yes | Algorithm 1: Multi-expert IRL (see the sketch after the table) |
| Open Source Code | Yes | 1The code is openly accessible at https://github.com/andrschl/transfer_irl. |
| Open Datasets | Yes | To validate our results experimentally, we adopt a stochastic variant of the Windy Gridworld environment [Sutton and Barto, 2018]. |
| Dataset Splits | No | The paper does not explicitly mention a dedicated validation dataset split, only "expert data sets" for learning. |
| Hardware Specification | Yes | All our experiments were carried out within a day on a MacBook Pro with an Apple M1 Pro chip and 32 GB of RAM. |
| Software Dependencies | No | The paper mentions general software components like "soft policy iteration" but does not specify version numbers for any libraries or dependencies. |
| Experiment Setup | Yes | Using Shannon entropy regularization with τ = 0.3, we then use soft policy iteration to get expert policies for each combination of expert reward and wind strength β. For each of these expert policies, we then generate expert data sets with N_E ∈ {10^3, 10^4, 10^5, 10^6} trajectories of length H = 100. Next, we run Algorithm 1, with soft policy iteration as a subroutine, for 30 000 iterations, where rewards are initialized by sampling from a standard normal distribution. As a reward class, we choose the ℓ1-ball with radius 10^3 (essentially unbounded), and as a stepsize, α = 0.05 for the first 15 000 iterations and α = 0.005 for the second half. Moreover, we sample N = 100 new trajectories of horizon H = 100 at each gradient step. (A minimal sketch of this setup follows the table.) |
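
The Experiment Setup row quotes entropy-regularized soft policy iteration as the subroutine used both to compute the expert policies and inside Algorithm 1. Below is a minimal sketch of that subroutine for a tabular MDP with Shannon entropy regularization (τ = 0.3), written as a soft Bellman (value) iteration, which converges to the same soft-optimal policy. The transition tensor `P`, the state-action reward `r`, and all function names are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def soft_policy_iteration(P, r, gamma=0.99, tau=0.3, n_iters=200):
    """Tabular soft Bellman iteration for an entropy-regularized MDP (sketch).

    P: (S, A, S) transition probabilities, r: (S, A) rewards, tau: entropy temperature.
    Returns the induced softmax (Boltzmann) policy and the soft Q-values.
    """
    S, A, _ = P.shape
    q = np.zeros((S, A))
    for _ in range(n_iters):
        # Soft state value v(s) = tau * log sum_a exp(q(s, a) / tau), computed stably.
        q_max = q.max(axis=1)
        v = q_max + tau * np.log(np.exp((q - q_max[:, None]) / tau).sum(axis=1))
        # Soft Bellman backup under the entropy-regularized objective.
        q = r + gamma * P @ v
    # Policy proportional to exp(q / tau).
    pi = np.exp((q - q.max(axis=1, keepdims=True)) / tau)
    return pi / pi.sum(axis=1, keepdims=True), q
```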
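
The Pseudocode and Experiment Setup rows describe Algorithm 1 (multi-expert IRL) as a 30 000-step loop that resolves the learner policy under the current reward, samples N = 100 fresh trajectories of horizon H = 100, and updates the reward within the chosen reward class. The sketch below, which reuses `soft_policy_iteration` from above, shows one plausible reading of that loop as projected gradient ascent toward the experts' empirical state-action visitations. The occupancy estimator, the hypothetical `sample_trajs` callable, and the rescaling used in place of the exact ℓ1-ball projection are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def empirical_occupancy(trajectories, S, A, gamma=0.99):
    """Discounted empirical state-action visitation from a list of [(s, a), ...] trajectories."""
    mu = np.zeros((S, A))
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            mu[s, a] += gamma ** t
    return mu / max(len(trajectories), 1)

def multi_expert_irl(Ps, expert_trajs, sample_trajs, S, A,
                     n_iters=30_000, radius=1e3, tau=0.3, N=100, H=100, seed=0):
    """Projected-gradient reward recovery from several experts, one environment per expert (sketch).

    Ps:           list of (S, A, S) transition tensors (e.g. one per wind strength beta).
    expert_trajs: list of expert trajectory datasets, aligned with Ps.
    sample_trajs: hypothetical callable (P, pi, n, horizon) -> list of trajectories.
    """
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((S, A))                  # reward initialized from a standard normal
    mu_experts = [empirical_occupancy(trajs, S, A) for trajs in expert_trajs]
    for k in range(n_iters):
        alpha = 0.05 if k < n_iters // 2 else 0.005  # stepsize schedule quoted in the setup row
        grad = np.zeros_like(r)
        for P, mu_e in zip(Ps, mu_experts):
            pi, _ = soft_policy_iteration(P, r, tau=tau)   # subroutine under the current reward
            learner = sample_trajs(P, pi, n=N, horizon=H)  # N fresh rollouts of horizon H
            grad += (mu_e - empirical_occupancy(learner, S, A)) / len(Ps)
        r = r + alpha * grad
        l1 = np.abs(r).sum()
        if l1 > radius:            # crude rescaling as a stand-in for the exact l1-ball
            r *= radius / l1       # projection; with radius 10^3 it is rarely active
    return r
```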