Environment Design for Inverse Reinforcement Learning
Authors: Thomas Kleine Buening, Victor Villin, Christos Dimitrakakis
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We tackle these challenges through adaptive environment design. In our framework, the learner repeatedly interacts with the expert, with the former selecting environments to identify the reward function as quickly as possible from the expert's demonstrations in said environments. This results in improvements in both sample-efficiency and robustness, as we show experimentally, for both exact and approximate inference. ... We conduct extensive experiments to evaluate our approaches (Section 6). (A minimal sketch of this interaction loop appears below the table.) |
| Researcher Affiliation | Academia | 1The Alan Turing Institute, London, UK; 2Université de Neuchâtel, Neuchâtel, Switzerland. |
| Pseudocode | Yes | Algorithm 2 ED-BIRL: Environment Design for BIRL ... Algorithm 3 ED-AIRL: Environment Design for AIRL ... Algorithm 4 Extended Value Iteration for Structured Environments ... Algorithm 5 Environment Design with Arbitrary Environments ... Algorithm 6 AIRL-ME (AIRL with Multiple Environments) |
| Open Source Code | Yes | Implementation. The code used for all of our experiments is available at github.com/Ojig/Environment-Design-for-IRL. |
| Open Datasets | No | The paper uses well-known environments like 'Minigrid' and 'MuJoCo' to generate experimental data, and also randomly generates MDPs. However, it does not provide access to a specific, static, publicly available dataset used for training. |
| Dataset Splits | No | The paper mentions 'demo environments' and 'test environments' which are disjoint, and a 'budget of m expert trajectories'. It does not explicitly specify traditional train/validation/test dataset splits with percentages or counts. |
| Hardware Specification | Yes | Compute. Three AMD EPYC 7302P machines were used. |
| Software Dependencies | No | The paper mentions that 'All of our policies were optimised with Proximal Policy Optimisation (Schulman et al., 2017)', which refers to an algorithm, but no specific software libraries or tools with version numbers are listed. |
| Experiment Setup | Yes | For all AIRL-based algorithms, we used a two-layer ReLU network with 32 units for the state-only reward approximator and shaping functions. ... All of our policies were optimised with Proximal Policy Optimisation (Schulman et al., 2017). ... Experts for base, demo and test environments for a given task were trained with identical hyperparameters and for an equal amount of timesteps. (A hedged sketch of the reward network appears below the table.) |
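
The framework quoted in the Research Type row alternates between choosing an informative demonstration environment and running IRL on the pooled demonstrations. The following is a minimal, illustrative sketch of that loop, not the authors' code: the one-step environments, the finite reward hypothesis set, the Boltzmann expert model, and the disagreement score used for environment selection are all simplifying assumptions made for this example.

```python
# Minimal illustrative sketch of the adaptive environment-design loop quoted above.
# NOT the authors' implementation: every quantity below is a toy stand-in.
import numpy as np

rng = np.random.default_rng(0)

# Three candidate reward functions over three actions; the expert's true reward
# (row 1) is unknown to the learner.
reward_hypotheses = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
true_reward = reward_hypotheses[1]

# "Environments" here merely restrict which actions are available; the last one
# is deliberately uninformative (a single action), so the design step avoids it.
environments = [np.array([0, 1]), np.array([1, 2]), np.array([2])]


def expert_action(avail, reward, beta=5.0):
    """Boltzmann-rational expert choosing among the available actions."""
    prefs = np.exp(beta * reward[avail])
    return int(rng.choice(avail, p=prefs / prefs.sum()))


def likelihood(action, avail, reward, beta=5.0):
    """Probability the Boltzmann expert picks `action` under a hypothesised reward."""
    prefs = np.exp(beta * reward[avail])
    return prefs[list(avail).index(action)] / prefs.sum()


posterior = np.full(len(reward_hypotheses), 1.0 / len(reward_hypotheses))

for _ in range(5):
    # Environment design step: prefer environments in which the still-plausible
    # reward hypotheses disagree about the optimal action (a crude stand-in for
    # the paper's environment-selection objective).
    plausible = [r for r, p in zip(reward_hypotheses, posterior) if p > 1e-3]
    scores = [len({int(a[np.argmax(r[a])]) for r in plausible}) for a in environments]
    avail = environments[int(np.argmax(scores))]

    # Query the expert in the chosen environment, then do a Bayesian-IRL-style update.
    action = expert_action(avail, true_reward)
    posterior = posterior * np.array([likelihood(action, avail, r) for r in reward_hypotheses])
    posterior /= posterior.sum()

print("posterior over reward hypotheses:", np.round(posterior, 3))
```

Running the sketch, the posterior concentrates on the true reward hypothesis within a few rounds while the single-action environment is never selected, which is the qualitative behaviour the quoted abstract describes.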
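
The Experiment Setup row describes a two-layer ReLU network with 32 units for the state-only reward approximator and shaping functions. Below is a hedged PyTorch sketch of such a reward network; reading "two-layer ReLU network with 32 units" as two hidden layers of width 32 is an assumption, and the class name, observation dimension, and batch size are hypothetical.

```python
# Hedged PyTorch sketch of the reward approximator from the "Experiment Setup" row.
# Assumption: two hidden ReLU layers of width 32; the authors' exact architecture
# and framework may differ.
import torch
import torch.nn as nn


class StateOnlyReward(nn.Module):
    """State-only reward approximator r_theta(s) -> scalar; the AIRL shaping
    function h_phi(s) described in the paper could take the same form."""

    def __init__(self, obs_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


reward_net = StateOnlyReward(obs_dim=8)
print(reward_net(torch.zeros(4, 8)).shape)  # -> torch.Size([4])
```

The paper states that all policies were optimised with Proximal Policy Optimisation (Schulman et al., 2017); the quoted text does not name a specific library, so any particular off-the-shelf PPO implementation would be the reader's substitution rather than the authors' stated choice.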