Interactive Inverse Reinforcement Learning for Cooperative Games

Authors: Thomas Kleine Büning, Anne-Marie George, Christos Dimitrakakis

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments support our theoretical results and show that the interactive nature of our setting allows the learning agent to obtain a much better estimate of the reward function (compared to the standard IRL setting). We thus achieve better cooperation by intelligently probing the human's responses.
Researcher Affiliation | Academia | (1) Department of Informatics, University of Oslo, Oslo, Norway; (2) Department of Computer Science, University of Neuchâtel, Neuchâtel, Switzerland; (3) Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden.
Pseudocode | Yes | Algorithm 1: Interactive IRL via Linear Programming (an illustrative LP sketch follows the table).
Open Source Code | Yes | The code is available at https://github.com/Interactive IRL/src.
Open Datasets | No | The paper describes two custom environments, "Maze-Maker" and "Random MDPs", which were generated for the experiments. It does not provide access information (link, DOI, citation) for a publicly available or open dataset.
Dataset Splits | No | The paper does not provide specific details on training, validation, or test splits. It mentions "repeatedly generating responses" and averaging results over multiple runs.
Hardware Specification | Yes | The experiments were carried out on a virtual machine with 32 CPUs, 60 GB RAM, and the CentOS Linux 8 operating system.
Software Dependencies | Yes | The experiments were implemented in Python 3.7 and used the libraries matplotlib 3.2.1, numpy 1.20.1, and scipy 1.6.2 (for the linear program).
Experiment Setup | Yes | For the case of suboptimal responses and partial information, we assume that A2 responds with Boltzmann-rational policies with inverse temperature β = 10 in both environments. ... We let an episode end with probability 1 − γ = 0.1 each time step... We impose a minimal trajectory length of 2 time steps to prevent vacuous episodes. ... We assume that any attempted move of the cart succeeds with probability 0.8 and that with probability 0.2 the cart moves to a random neighbouring cell. (A response-simulation sketch follows the table.)
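To make the "Interactive IRL via Linear Programming" row concrete, below is a minimal, hedged sketch of a margin-maximizing linear program over linear reward weights, solved with scipy.optimize.linprog (the scipy dependency the authors list for the linear program). The constraint structure, the helper name estimate_reward_weights, and the toy feature differences are illustrative assumptions in the spirit of LP-based IRL, not the exact LP of the paper's Algorithm 1.

```python
# Hedged sketch: a margin-maximizing LP over reward weights w.
# Each observed response is encoded as a feature difference
# d = phi(s, a_observed) - phi(s, a_alternative), and we require w . d >= t
# while maximizing the margin t. This is a generic LP-based IRL formulation,
# not necessarily the paper's Algorithm 1.
import numpy as np
from scipy.optimize import linprog  # scipy 1.6.2 is listed as a dependency

def estimate_reward_weights(feature_diffs, weight_bound=1.0):
    """Solve: max t  s.t.  w . d >= t for every row d, |w_i| <= weight_bound.
    Decision variables are (w_1, ..., w_k, t); linprog minimizes, so use -t."""
    diffs = np.asarray(feature_diffs, dtype=float)
    m, k = diffs.shape
    c = np.zeros(k + 1)
    c[-1] = -1.0                                  # minimize -t == maximize t
    A_ub = np.hstack([-diffs, np.ones((m, 1))])   # -w.d + t <= 0  per row
    b_ub = np.zeros(m)
    bounds = [(-weight_bound, weight_bound)] * k + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(f"LP failed: {res.message}")
    return res.x[:k], res.x[-1]                   # weight estimate, margin

# Toy usage: two observed preferences over three reward features.
w_hat, margin = estimate_reward_weights([[1.0, -0.5, 0.0], [0.2, 0.3, -0.1]])
```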
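Similarly, the experiment-setup row mentions Boltzmann-rational responses with β = 10, episode termination with probability 1 − γ = 0.1 per step, and a minimal trajectory length of 2. The sketch below shows how such responses could be simulated under those settings; boltzmann_policy, simulate_episode, the toy Q-table, and the transition callback are hypothetical names for illustration and are not taken from the paper's code.

```python
# Hedged sketch: simulating Boltzmann-rational responses (inverse temperature
# beta = 10) with geometric episode termination (1 - gamma = 0.1) and a
# minimal trajectory length of 2, matching the reported setup. Q-values and
# transitions are placeholders, not the paper's Maze-Maker / Random MDPs.
import numpy as np

def boltzmann_policy(q_values, beta=10.0):
    """Action probabilities proportional to exp(beta * Q(s, a))."""
    q = np.asarray(q_values, dtype=float)
    logits = beta * (q - q.max())        # subtract the max for stability
    probs = np.exp(logits)
    return probs / probs.sum()

def simulate_episode(q_table, start_state, transition, rng, gamma=0.9, min_len=2):
    """Roll out a trajectory that ends w.p. 1 - gamma per step, with a
    minimal length of `min_len`. `transition(s, a, rng)` is an assumed
    environment callback returning the next state."""
    trajectory, state, t = [], start_state, 0
    while True:
        probs = boltzmann_policy(q_table[state])
        action = rng.choice(len(probs), p=probs)
        trajectory.append((state, action))
        state = transition(state, action, rng)
        t += 1
        if t >= min_len and rng.random() > gamma:   # terminate w.p. 1 - gamma
            return trajectory

# Toy usage: a 2-state, 2-action Q-table with a trivial random transition.
rng = np.random.default_rng(0)
toy_q = np.array([[1.0, 0.0], [0.2, 0.8]])
toy_transition = lambda s, a, r: int(r.integers(2))
traj = simulate_episode(toy_q, start_state=0, transition=toy_transition, rng=rng)
```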