Interactive Inverse Reinforcement Learning for Cooperative Games
Authors: Thomas Kleine Büning, Anne-Marie George, Christos Dimitrakakis
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments support our theoretical results and show that the interactive nature of our setting allows the learning agent to obtain a much better estimate of the reward function (compared to the standard IRL setting). We thus achieve better cooperation by intelligently probing the human's responses. |
| Researcher Affiliation | Academia | (1) Department of Informatics, University of Oslo, Oslo, Norway; (2) Department of Computer Science, University of Neuchatel, Neuchatel, Switzerland; (3) Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden. |
| Pseudocode | Yes | Algorithm 1: Interactive IRL via Linear Programming (a hedged LP sketch follows the table). |
| Open Source Code | Yes | The code is available at https://github.com/Interactive IRL/src. |
| Open Datasets | No | The paper describes two custom environments, "Maze-Maker" and "Random MDPs", which were generated for the experiments. It does not provide access information (link, DOI, citation) for a publicly available or open dataset. (An illustrative random-MDP generator is sketched after the table.) |
| Dataset Splits | No | The paper does not provide specific details on training, validation, or test splits. It mentions using "repeatedly generating responses" and averaging results over multiple runs. |
| Hardware Specification | Yes | The experiments were carried out on a virtual machine with 32 CPUs, 60 GB RAM, and the CentOS Linux 8 operating system. |
| Software Dependencies | Yes | The experiments were implemented in Python 3.7 and the libraries matplotlib 3.2.1, numpy 1.20.1, and scipy 1.6.2 (for the linear program) were used. |
| Experiment Setup | Yes | For the case of suboptimal responses and partial information, we assume that A2 responds with Boltzmann-rational policies with inverse temperature β = 10 in both environments. ... We let an episode end with probability 1 − γ = 0.1 each time step... We impose a minimal trajectory length of 2 time steps to prevent vacuous episodes. ... We assume that any attempted move of the cart succeeds with probability 0.8 and that with probability 0.2 the cart moves to a random neighbouring cell. (These mechanics are sketched after the table.) |
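
The Pseudocode row refers to Algorithm 1, Interactive IRL via Linear Programming. The paper's exact LP is not reproduced here; the snippet below is a minimal sketch, under assumed reward features and margin constraints, of how reward weights consistent with a partner's observed responses could be recovered with scipy.optimize.linprog (the solver named in the Software Dependencies row). The constraint construction and all variable names are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code): estimate reward weights w consistent
# with observed preferred responses by maximising a margin t such that
#   (features of preferred action - features of alternative) @ w >= t.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d = 5                                   # assumed reward-feature dimension
true_w = rng.uniform(-1.0, 1.0, d)      # ground-truth weights, for this demo only

# Synthetic observations: feature differences oriented towards the preferred action.
diffs = rng.normal(size=(40, d))
diffs *= np.sign(diffs @ true_w)[:, None]

# LP variables x = (w_1, ..., w_d, t); minimise -t subject to -diffs @ w + t <= 0.
c = np.zeros(d + 1)
c[-1] = -1.0
A_ub = np.hstack([-diffs, np.ones((len(diffs), 1))])
b_ub = np.zeros(len(diffs))
bounds = [(-1.0, 1.0)] * d + [(0.0, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
w_hat = res.x[:d]
print("cosine(w_hat, true_w) =",
      w_hat @ true_w / (np.linalg.norm(w_hat) * np.linalg.norm(true_w) + 1e-12))
```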
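
The Open Datasets row notes that the "Random MDPs" environment was generated for the experiments rather than drawn from a public dataset. A common way to generate such instances, shown below purely as an assumption about what a random MDP could look like here, is to sample Dirichlet transition kernels and a reward that is linear in random state features; the paper's actual generator may differ.

```python
# Illustrative random-MDP generator (an assumption, not the paper's generator).
import numpy as np

def random_mdp(n_states=10, n_actions=4, n_features=5, seed=0):
    rng = np.random.default_rng(seed)
    # P[s, a] is a categorical distribution over next states.
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    # Reward linear in random state features: r(s) = phi(s) @ w.
    phi = rng.normal(size=(n_states, n_features))
    w = rng.uniform(-1.0, 1.0, n_features)
    return P, phi, phi @ w

P, phi, r = random_mdp()
print(P.shape, phi.shape, r.shape)   # (10, 4, 10) (10, 5) (10,)
```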
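
The Experiment Setup row quotes three mechanics: Boltzmann-rational responses with inverse temperature β = 10, episode termination with probability 1 − γ = 0.1 per time step subject to a minimum length of 2, and cart moves that succeed with probability 0.8 and otherwise land in a random neighbouring cell. The snippet below sketches how these could be simulated for a tabular Q-function; it is a hedged reconstruction from the quoted text, not the released code.

```python
import numpy as np

rng = np.random.default_rng(1)

def boltzmann_response(q_values, beta=10.0):
    """Sample an action from a Boltzmann-rational policy with inverse temperature beta."""
    logits = beta * (np.asarray(q_values) - np.max(q_values))  # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(probs), p=probs)

def sample_episode_length(gamma=0.9, min_len=2):
    """Each step continues with probability gamma (ends with 1 - gamma), for at least min_len steps."""
    length = min_len
    while rng.random() < gamma:
        length += 1
    return length

def move_cart(intended_cell, neighbour_cells, success_prob=0.8):
    """The attempted move succeeds with probability 0.8; otherwise the cart slips to a random neighbour."""
    if rng.random() < success_prob:
        return intended_cell
    return neighbour_cells[rng.integers(len(neighbour_cells))]

# Example usage with toy inputs.
print(boltzmann_response([0.1, 0.5, 0.45]))
print(sample_episode_length())
print(move_cart((1, 2), [(0, 2), (2, 2), (1, 1), (1, 3)]))
```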