Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning
Authors: Jakob Foerster, Francis Song, Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson, Matthew Botvinick, Michael Bowling
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally validate an exact version of BAD on a simple two-step matrix game, showing that it outperforms policy gradient methods; we then evaluate BAD on the challenging, cooperative partial-information card game Hanabi, where, in the two-player setting, it surpasses all previously published learning and hand-coded approaches, establishing a new state of the art. |
| Researcher Affiliation | Collaboration | ¹University of Oxford, UK; ²Work done at DeepMind (JF has since moved to Facebook AI Research, Menlo Park, USA); ³DeepMind, London, UK. Correspondence to: Jakob Foerster <jnf@fb.com>, Francis Song <songf@google.com>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for the matrix game with a proof-of-principle implementation of BAD is available at https://bit.ly/2P3YOyd. |
| Open Datasets | Yes | We then apply an approximate version to Hanabi, where BAD achieves an average score of 24.174 points in the two-player setting... BAD thus establishes a current state of the art on the Hanabi Learning Environment (Bard et al., 2019) for the two-player self-play setting. |
| Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits, as it concerns a reinforcement learning environment where data is generated through self-play. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for its experiments. |
| Software Dependencies | No | The paper mentions software architectures like Importance-Weighted Actor-Learner Architecture but does not provide specific software names with version numbers (e.g., PyTorch 1.9) needed to replicate the experiment. |
| Experiment Setup | Yes | Advantage actor-critic agents were trained using the Importance-Weighted Actor-Learner Architecture (Espeholt et al., 2018), in particular the multi-agent implementation described in Jaderberg et al. (2018). In this framework, actors continually generate trajectories of experience (sequences of states, actions, and rewards) by having agents (self-)play the game; learners then use these trajectories to perform batched gradient updates (batch size 32 for all agents). Because the policy used to generate a trajectory can be several gradient updates behind the policy at the time of the update, V-trace was applied to correct for the off-policy trajectories. The length of the trajectories, or rollouts, was 65, the maximum length of a winning game. For the BAD agent the number of sampled hands was also increased: V1 mix-in α = 0.01, 20K sampled hands, and inverse softmax temperature 100.0. A hedged configuration/V-trace sketch follows the table. |
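
The "Experiment Setup" row names the key training ingredients: IMPALA-style actor-learner training with V-trace off-policy correction, batch size 32, rollout length 65, and the BAD-specific hyperparameters. As a non-authoritative aid to reproduction, the sketch below collects the stated values into a config dict and gives a minimal NumPy implementation of the V-trace target from Espeholt et al. (2018). The names (`TRAIN_CONFIG`, `vtrace_targets`), the discount `gamma`, the clipping thresholds `rho_bar`/`c_bar`, and the omission of terminal-state handling are illustrative assumptions, not details reported in the paper.

```python
import numpy as np

# Hyperparameters stated in the "Experiment Setup" row above. Fields not
# reported there (learning rate, optimizer, network sizes, ...) are omitted.
TRAIN_CONFIG = {
    "batch_size": 32,                          # batched gradient updates
    "rollout_length": 65,                      # max length of a winning Hanabi game
    "bad_v1_mixin_alpha": 0.01,                # V1 mix-in
    "bad_sampled_hands": 20_000,               # sampled hands for the BAD agent
    "bad_inverse_softmax_temperature": 100.0,
}


def vtrace_targets(rewards, values, bootstrap_value,
                   behaviour_logp, target_logp,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets (Espeholt et al., 2018) for a single rollout.

    `rewards`, `values`, `behaviour_logp`, `target_logp` are length-T arrays;
    `bootstrap_value` is a scalar value estimate for the state after the
    rollout. Episode termination inside the rollout is ignored for brevity.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)

    # Truncated importance weights between the target (learner) policy and the
    # behaviour (actor) policy that generated the trajectory.
    rhos = np.exp(np.asarray(target_logp) - np.asarray(behaviour_logp))
    clipped_rhos = np.minimum(rho_bar, rhos)
    clipped_cs = np.minimum(c_bar, rhos)

    values_tp1 = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)

    # Backward recursion: vs_t - V_t = delta_t + gamma * c_t * (vs_{t+1} - V_{t+1}).
    vs_minus_v = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * clipped_cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v
```

In an IMPALA-style learner, the value function would be regressed toward these targets, and `clipped_rhos * (rewards + gamma * values_tp1 - values)` would serve as the policy-gradient advantage; see Espeholt et al. (2018) for the full algorithm.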