Toward Policy Explanations for Multi-Agent Reinforcement Learning
Authors: Kayla Boggess, Sarit Kraus, Lu Feng
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on three MARL domains demonstrate the scalability of our methods. A user study shows that the generated explanations significantly improve user performance and increase subjective ratings on metrics such as user satisfaction. |
| Researcher Affiliation | Academia | ¹University of Virginia, ²Bar-Ilan University; {kjb5we, lu.feng}@virginia.edu, sarit@cs.biu.ac.il |
| Pseudocode | Yes | Algorithm 1 shows the proposed method, which takes as input a policy abstraction M and a set of predicates F_c representing the completion of tasks (subgoals) in a given MARL domain. Algorithm 2 presents both the baseline and proposed methods for answering 'When do agents G_q do actions A_q?' (an illustrative sketch of this query follows the table). |
| Open Source Code | No | The paper builds on a third-party implementation ('Our implementation used the Shared Experience Actor-Critic [Christianos et al., 2020]') and refers to an appendix in a related paper for pseudocode, but it neither links to the authors' own implementation of the described methods nor states that this code is open source. |
| Open Datasets | Yes | The second and third domains are benchmarks taken from [Papoudakis et al., 2021]. Multi-robot warehouse (RWARE) considers multiple robotic agents cooperatively delivering requested items. Level-based foraging (LBF) considers a mixed cooperative-competitive game where agents must navigate a grid world to collect randomly scattered food. (An environment-loading sketch follows the table.) |
| Dataset Splits | No | The paper states 'All models were trained and evaluated to 10,000 steps, or until converging to the expected reward, whichever occurred first,' but does not specify the use of a distinct validation dataset split for model tuning. |
| Hardware Specification | Yes | The experiments were run on a laptop with a 1.4 GHz Quad-Core Intel i5 processor and 8 GB RAM. |
| Software Dependencies | No | The paper states 'Our implementation used the Shared Experience Actor-Critic [Christianos et al., 2020]' but does not provide specific version numbers for software dependencies like programming languages or libraries. |
| Experiment Setup | Yes | All models were trained and evaluated to 10,000 steps, or until converging to the expected reward, whichever occurred first. (A sketch of this stopping rule follows the table.) |
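
The Pseudocode row above only quotes the algorithms' inputs and the query they answer. Purely as an illustration, and not the paper's Algorithm 2, the sketch below shows one way a 'When do agents G_q do actions A_q?' query could be evaluated over a policy abstraction whose abstract states record which task-completion predicates hold and which action each agent takes; the `AbstractState` layout and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

@dataclass(frozen=True)
class AbstractState:
    """Hypothetical node of a policy abstraction: which task-completion
    predicates hold here, and which action each agent takes under the policy."""
    predicates: FrozenSet[str]   # e.g. {"item_requested", "shelf_loaded"}
    actions: Dict[str, str]      # agent id -> action taken in this abstract state

def when_do_agents_act(abstraction: List[AbstractState],
                       query_agents: FrozenSet[str],
                       query_actions: FrozenSet[str]) -> List[FrozenSet[str]]:
    """Collect the predicate valuations of abstract states in which every
    queried agent takes one of the queried actions (a naive scan, not the
    paper's Algorithm 2)."""
    conditions = []
    for state in abstraction:
        if all(state.actions.get(agent) in query_actions for agent in query_agents):
            conditions.append(state.predicates)
    return conditions
```

A real explanation generator would also need to simplify and aggregate the returned conditions into readable statements, which this naive scan does not attempt.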
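
The Open Datasets row refers to the multi-robot warehouse (RWARE) and level-based foraging (LBF) benchmarks from [Papoudakis et al., 2021], which are distributed as the `rware` and `lbforaging` Python packages and register Gym environments on import. The sketch below shows typical usage; the exact environment IDs, version suffixes, and the older Gym reset/step signatures are assumptions to verify against the installed packages.

```python
import gym
import rware        # registers multi-robot warehouse (RWARE) environments on import
import lbforaging   # registers level-based foraging (LBF) environments on import

# Environment IDs below are assumptions; check what the installed packages register.
warehouse = gym.make("rware-tiny-2ag-v1")       # assumed: tiny layout, 2 cooperative agents
foraging = gym.make("Foraging-8x8-2p-1f-v2")    # assumed: 8x8 grid, 2 players, 1 food item

obs = warehouse.reset()                          # older Gym API assumed (observation only)
joint_action = warehouse.action_space.sample()   # one discrete action per agent
obs, rewards, dones, info = warehouse.step(joint_action)
```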
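
The Experiment Setup row gives a single stopping rule: train and evaluate up to 10,000 steps, or stop earlier once the policy reaches the expected reward. A minimal sketch of that rule follows; the `train_step` and `evaluate` hooks, the evaluation interval, and the convergence tolerance are placeholders, since the paper does not report them.

```python
MAX_STEPS = 10_000      # stated in the paper's experiment setup
TOLERANCE = 0.05        # assumed convergence margin (not reported in the paper)

def run_training(train_step, evaluate, expected_reward, eval_every=100):
    """Train until MAX_STEPS or until the evaluated reward reaches the
    expected reward, whichever occurs first.

    train_step, evaluate, and expected_reward are placeholder hooks for the
    learner update, the evaluation routine, and the domain-specific target."""
    for step in range(1, MAX_STEPS + 1):
        train_step()
        if step % eval_every == 0 and evaluate() >= expected_reward - TOLERANCE:
            return step            # converged to the expected reward early
    return MAX_STEPS               # hit the step budget without early convergence
```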