Environment Design for Inverse Reinforcement Learning

Authors: Thomas Kleine Buening, Victor Villin, Christos Dimitrakakis

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We tackle these challenges through adaptive environment design. In our framework, the learner repeatedly interacts with the expert, with the former selecting environments to identify the reward function as quickly as possible from the expert's demonstrations in said environments. This results in improvements in both sample-efficiency and robustness, as we show experimentally, for both exact and approximate inference. ... We conduct extensive experiments to evaluate our approaches (Section 6)."
Researcher Affiliation | Academia | "The Alan Turing Institute, London, UK; Université de Neuchâtel, Neuchâtel, Switzerland."
Pseudocode | Yes | "Algorithm 2 ED-BIRL: Environment Design for BIRL ... Algorithm 3 ED-AIRL: Environment Design for AIRL ... Algorithm 4 Extended Value Iteration for Structured Environments ... Algorithm 5 Environment Design with Arbitrary Environments ... Algorithm 6 AIRL-ME (AIRL with Multiple Environments)" (a schematic sketch of the outer environment-design loop follows the table)
Open Source Code | Yes | "Implementation. The code used for all of our experiments is available at github.com/Ojig/Environment-Design-for-IRL."
Open Datasets | No | The paper uses well-known environments like 'Minigrid' and 'MuJoCo' to generate experimental data, and also randomly generates MDPs. However, it does not provide access to a specific, static, publicly available dataset used for training.
Dataset Splits | No | The paper mentions disjoint 'demo environments' and 'test environments' and a 'budget of m expert trajectories', but it does not explicitly specify traditional train/validation/test dataset splits with percentages or counts.
Hardware Specification | Yes | "Compute. Three AMD EPYC 7302P machines were used."
Software Dependencies | No | The paper states that 'All of our policies were optimised with Proximal Policy Optimisation (Schulman et al., 2017)', which names an algorithm, but no specific software libraries or tools with version numbers are listed.
Experiment Setup | Yes | "For all AIRL-based algorithms, we used a two-layer ReLU network with 32 units for the state-only reward approximator and shaping functions. ... All of our policies were optimised with Proximal Policy Optimisation (Schulman et al., 2017). ... Experts for base, demo and test environments for a given task were trained with identical hyperparameters and for an equal amount of timesteps." (a sketch of this reward network also follows the table)
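
The Research Type and Pseudocode rows describe an adaptive environment-design loop in which the learner alternates between choosing a demonstration environment and running an IRL step on the expert demonstrations collected there (Algorithms 2 and 3 in the paper). The outline below is a hedged, schematic rendering of that loop, not a reproduction of ED-BIRL or ED-AIRL: the environment-selection criterion (select_informative_env), the IRL subroutine (irl_update), and the expert interface (query_expert) are placeholder callables, not names from the authors' code.

```python
from typing import Callable, List, Sequence

def environment_design_irl(
    candidate_envs: Sequence,          # pool of candidate demo environments
    query_expert: Callable,            # (env, n) -> list of expert trajectories
    irl_update: Callable,              # (reward_est, env, demos) -> reward_est
    select_informative_env: Callable,  # (reward_est, envs) -> env
    reward_est,                        # current reward estimate / posterior
    n_rounds: int = 10,
    demos_per_round: int = 1,
):
    """Schematic outer loop for environment design in IRL.

    Each round: pick the environment expected to be most informative
    about the unknown reward, ask the expert to demonstrate in it, and
    refine the reward estimate from those demonstrations. The concrete
    selection rule and IRL step correspond to Algorithms 2-3 of the
    paper and are treated as black boxes here.
    """
    for _ in range(n_rounds):
        env = select_informative_env(reward_est, candidate_envs)
        demos: List = query_expert(env, demos_per_round)
        reward_est = irl_update(reward_est, env, demos)
    return reward_est
```

Instantiating reward_est as a Bayesian posterior with an exact-inference irl_update gives the BIRL variant, while a parametric reward with an adversarial irl_update gives the AIRL variant, mirroring the two algorithm families listed in the Pseudocode row.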
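
The Experiment Setup excerpt describes a two-layer ReLU network with 32 units used as the state-only reward approximator (and shaping function) in the AIRL-based algorithms. Below is a minimal sketch of such a network, assuming PyTorch, a flat state vector, and the interpretation "two hidden layers of 32 units"; the class and argument names (StateOnlyReward, state_dim, hidden) are illustrative, not taken from the released code.

```python
import torch
import torch.nn as nn

class StateOnlyReward(nn.Module):
    """Two-layer ReLU MLP mapping a state vector to a scalar reward.

    Mirrors the architecture quoted in the Experiment Setup row
    (two layers, 32 units, state-only input); naming and the exact
    layer layout are assumptions, not the authors' implementation.
    """

    def __init__(self, state_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Returns r_theta(s); AIRL additionally uses a shaping term
        # h_phi(s) of the same architecture (not shown here).
        return self.net(state).squeeze(-1)
```

For a 4-dimensional state, StateOnlyReward(4)(torch.randn(8, 4)) returns a batch of 8 scalar reward estimates; in an AIRL-style setup this network would be trained adversarially against the PPO-optimised policy mentioned in the same excerpt.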