Environment Design for Inverse Reinforcement Learning
Authors: Thomas Kleine Buening, Victor Villin, Christos Dimitrakakis
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We tackle these challenges through adaptive environment design. In our framework, the learner repeatedly interacts with the expert, with the former selecting environments to identify the reward function as quickly as possible from the expert's demonstrations in said environments. This results in improvements in both sample-efficiency and robustness, as we show experimentally, for both exact and approximate inference. ... We conduct extensive experiments to evaluate our approaches (Section 6). (A minimal sketch of this interaction loop appears below the table.) |
| Researcher Affiliation | Academia | 1The Alan Turing Institute, London, UK; 2Université de Neuchâtel, Neuchâtel, Switzerland. |
| Pseudocode | Yes | Algorithm 2 ED-BIRL: Environment Design for BIRL ... Algorithm 3 ED-AIRL: Environment Design for AIRL ... Algorithm 4 Extended Value Iteration for Structured Environments ... Algorithm 5 Environment Design with Arbitrary Environments ... Algorithm 6 AIRL-ME (AIRL with Multiple Environments) |
| Open Source Code | Yes | Implementation. The code used for all of our experiments is available at github.com/Ojig/Environment-Design-for-IRL. |
| Open Datasets | No | The paper uses well-known environments like 'Minigrid' and 'MuJoCo' to generate experimental data, and also randomly generates MDPs. However, it does not provide access to a specific, static, publicly available dataset used for training. |
| Dataset Splits | No | The paper mentions 'demo environments' and 'test environments' which are disjoint, and a 'budget of m expert trajectories'. It does not explicitly specify traditional train/validation/test dataset splits with percentages or counts. |
| Hardware Specification | Yes | Compute. Three AMD EPYC 7302P machines were used. |
| Software Dependencies | No | The paper mentions that 'All of our policies were optimised with Proximal Policy Optimisation (Schulman et al., 2017)', which refers to an algorithm, but no specific software libraries or tools with version numbers are listed. |
| Experiment Setup | Yes | For all AIRL-based algorithms, we used a two-layer ReLU network with 32 units for the state-only reward approximator and shaping functions. ... All of our policies were optimised with Proximal Policy Optimisation (Schulman et al., 2017). ... Experts for base, demo and test environments for a given task were trained with identical hyperparameters and for an equal amount of timesteps. (A hedged sketch of the reward network appears below the table.) |
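
The framework quoted in the Research Type row alternates between choosing an informative demonstration environment and running IRL on the pooled demonstrations. The following is a minimal, illustrative sketch of that loop, not the authors' code: the one-step environments, the finite reward hypothesis set, the Boltzmann expert model, and the disagreement score used for environment selection are all simplifying assumptions made for this example.

```python
# Minimal illustrative sketch of the adaptive environment-design loop quoted above.
# NOT the authors' implementation: every quantity below is a toy stand-in.
import numpy as np

rng = np.random.default_rng(0)

# Three candidate reward functions over three actions; the expert's true reward
# (row 1) is unknown to the learner.
reward_hypotheses = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
true_reward = reward_hypotheses[1]

# "Environments" here merely restrict which actions are available; the last one
# is deliberately uninformative (a single action), so the design step avoids it.
environments = [np.array([0, 1]), np.array([1, 2]), np.array([2])]


def expert_action(avail, reward, beta=5.0):
    """Boltzmann-rational expert choosing among the available actions."""
    prefs = np.exp(beta * reward[avail])
    return int(rng.choice(avail, p=prefs / prefs.sum()))


def likelihood(action, avail, reward, beta=5.0):
    """Probability the Boltzmann expert picks `action` under a hypothesised reward."""
    prefs = np.exp(beta * reward[avail])
    return prefs[list(avail).index(action)] / prefs.sum()


posterior = np.full(len(reward_hypotheses), 1.0 / len(reward_hypotheses))

for _ in range(5):
    # Environment design step: prefer environments in which the still-plausible
    # reward hypotheses disagree about the optimal action (a crude stand-in for
    # the paper's environment-selection objective).
    plausible = [r for r, p in zip(reward_hypotheses, posterior) if p > 1e-3]
    scores = [len({int(a[np.argmax(r[a])]) for r in plausible}) for a in environments]
    avail = environments[int(np.argmax(scores))]

    # Query the expert in the chosen environment, then do a Bayesian-IRL-style update.
    action = expert_action(avail, true_reward)
    posterior = posterior * np.array([likelihood(action, avail, r) for r in reward_hypotheses])
    posterior /= posterior.sum()

print("posterior over reward hypotheses:", np.round(posterior, 3))
```

Running the sketch, the posterior concentrates on the true reward hypothesis within a few rounds while the single-action environment is never selected, which is the qualitative behaviour the quoted abstract describes.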
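
The Experiment Setup row describes a two-layer ReLU network with 32 units for the state-only reward approximator and shaping functions. Below is a hedged PyTorch sketch of such a reward network; reading "two-layer ReLU network with 32 units" as two hidden layers of width 32 is an assumption, and the class name, observation dimension, and batch size are hypothetical.

```python
# Hedged PyTorch sketch of the reward approximator from the "Experiment Setup" row.
# Assumption: two hidden ReLU layers of width 32; the authors' exact architecture
# and framework may differ.
import torch
import torch.nn as nn


class StateOnlyReward(nn.Module):
    """State-only reward approximator r_theta(s) -> scalar; the AIRL shaping
    function h_phi(s) described in the paper could take the same form."""

    def __init__(self, obs_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


reward_net = StateOnlyReward(obs_dim=8)
print(reward_net(torch.zeros(4, 8)).shape)  # -> torch.Size([4])
```

The paper states that all policies were optimised with Proximal Policy Optimisation (Schulman et al., 2017); the quoted text does not name a specific library, so any particular off-the-shelf PPO implementation would be the reader's substitution rather than the authors' stated choice.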