A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning
Authors: Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, Thore Graepel
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we first observe that policies learned using InRL can overfit to the other agents' policies during training, failing to sufficiently generalize during execution. We introduce a new metric, joint-policy correlation, to quantify this effect. We describe an algorithm for general MARL, based on approximate best responses to mixtures of policies generated using deep reinforcement learning, and empirical game-theoretic analysis to compute meta-strategies for policy selection. The algorithm generalizes previous ones such as InRL, iterated best response, double oracle, and fictitious play. Then, we present a scalable implementation which reduces the memory requirement using decoupled meta-solvers. Finally, we demonstrate the generality of the resulting policies in two partially observable settings: gridworld coordination games and poker. |
| Researcher Affiliation | Industry | Marc Lanctot, DeepMind, lanctot@; Vinicius Zambaldi, DeepMind, vzambaldi@; Audrunas Gruslys, DeepMind, audrunas@; Angeliki Lazaridou, DeepMind, angeliki@; Karl Tuyls, DeepMind, karltuyls@; Julien Pérolat, DeepMind, perolat@; David Silver, DeepMind, davidsilver@; Thore Graepel, DeepMind, thore@ (...@google.com) |
| Pseudocode | Yes | Algorithm 1: Policy-Space Response Oracles Algorithm 2: Deep Cognitive Hierarchies |
| Open Source Code | No | The paper refers to 'Appendix C' for implementation details and to reference [55] (an arXiv paper) for a longer technical report version, but does not provide a direct link to a code repository or explicitly state that the source code for their method is being released. |
| Open Datasets | No | The paper describes custom environments ('First-Person Gridworld Games' and 'Leduc Poker') but does not provide concrete access information (link, DOI, citation) to any publicly available training datasets or pre-existing datasets used for training. Leduc Poker is a game setup, not a dataset. |
| Dataset Splits | No | The paper describes running experiments for a certain number of episodes (e.g., '100 episodes') but does not specify formal training/validation/test dataset splits or their percentages, which is typical for fixed datasets rather than continuous environment interaction. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. It only mentions the affiliation 'Deep Mind'. |
| Software Dependencies | No | The paper mentions 'Reactor' and 'Retrace(λ)' for learning, and refers to 'Adam' as an optimizer, but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | Each agent has a local field-of-view (making the world partially observable) and sees 17 spaces in front, 10 to either side, and 2 spaces behind. Consequently, observations are encoded as 21x20x3 RGB tensors with values 0-255. Each agent has a choice of turning left or right, moving forward or backward, stepping left or right, not moving, or casting an endless light beam in its current direction. In addition, the agent has two composed actions of moving forward and turning. Actions are executed simultaneously, and the order of resolution is randomized. Agents start on a random spawn point at the beginning of each episode. If an agent is touched ('tagged') by another agent's light beam twice, then the target agent is immediately teleported to a spawn point. In laser tag, the source agent then receives 1 point of reward for the tag. In another variant, gathering, there is no tagging but agents can collect apples, for 1 point per apple, which refresh at a fixed rate. In pathfind, there is neither tagging nor apples, and both agents get 1 point of reward when both reach their destinations, ending the episode. In every variant, an episode consists of 1000 steps of simulation. Other details, such as specific maps, can be found in Appendix D. |
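
The abstract quoted in the Research Type row introduces joint-policy correlation (JPC) to measure how much independently trained policies overfit to the co-players they were trained with. A minimal sketch of one way to aggregate a JPC matrix into a single number is below; the function name and the diagonal-vs-off-diagonal aggregation are assumptions based on the description in the table, not a verbatim reproduction of the paper's formula.

```python
import numpy as np

def jpc_proportional_loss(returns):
    """Average proportional loss from a joint-policy correlation (JPC) matrix.

    `returns` is an N x N array where entry (i, j) is the mean joint return
    when player 1 uses the policy from training instance i and player 2 uses
    the policy from training instance j. Diagonal entries pair policies that
    were trained together; off-diagonal entries pair independently trained
    policies. The aggregation below (diagonal mean vs. off-diagonal mean) is
    an assumed reading of the metric named in the abstract.
    """
    returns = np.asarray(returns, dtype=float)
    n = returns.shape[0]
    diag_mean = np.trace(returns) / n
    off_mean = (returns.sum() - np.trace(returns)) / (n * (n - 1))
    return (diag_mean - off_mean) / diag_mean
```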
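
The same row, together with the Pseudocode row (Algorithm 1, Policy-Space Response Oracles), describes an outer loop that alternates between training approximate best responses to mixtures of existing policies and re-solving the empirical meta-game for new meta-strategies. A minimal sketch of that loop is shown below, assuming the caller supplies the deep-RL oracle, the payoff estimator, and the meta-solver; all three callables are hypothetical placeholders, not the paper's code.

```python
import numpy as np

def psro(train_best_response, estimate_payoffs, meta_solver,
         init_policies, iterations=5):
    """Sketch of a PSRO-style loop for a two-player game.

    Hypothetical callables supplied by the caller:
      train_best_response(player, opponent_policies, opponent_mixture)
          -> an approximate best response (e.g. trained with deep RL),
      estimate_payoffs(policies_p0, policies_p1)
          -> empirical payoff matrix from simulated episodes,
      meta_solver(payoffs)
          -> a meta-strategy (mixture over policies) for each player.
    """
    policies = [list(init_policies[0]), list(init_policies[1])]
    # Start with a uniform meta-strategy over each player's initial policies.
    meta = [np.ones(len(p)) / len(p) for p in policies]

    for _ in range(iterations):
        # Oracle step: each player adds an approximate best response to
        # the opponent's current mixture over its policy set.
        for player in (0, 1):
            opp = 1 - player
            policies[player].append(
                train_best_response(player, policies[opp], meta[opp]))

        # Empirical game-theoretic analysis: simulate all pairings to fill
        # the meta-game payoff matrix, then re-solve for new mixtures.
        payoffs = estimate_payoffs(policies[0], policies[1])
        meta = meta_solver(payoffs)

    return policies, meta
```

Roughly, with a single fixed opponent policy and one iteration this collapses to independent RL, and with an exact best-response oracle it behaves like the double oracle method, which appears to be the sense in which the abstract calls the algorithm a generalization of those procedures.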
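
The Experiment Setup row fixes a concrete agent interface for the gridworld games: a 21x20x3 RGB observation, ten discrete actions, and 1000-step episodes. The sketch below records those numbers as constants with a small preprocessing helper; the identifiers and the tensor-axis orientation are assumptions for illustration, since the environments themselves are not released.

```python
import numpy as np

# Hypothetical constants mirroring the interface described in the
# Experiment Setup row; the paper's environment code is not public.
OBS_SHAPE = (21, 20, 3)   # 21 ~ 10 left + 10 right + own column,
                          # 20 ~ 17 in front + 2 behind + own row (axis
                          # orientation is an assumption), 3 RGB channels
EPISODE_LENGTH = 1000     # simulation steps per episode

ACTIONS = [
    "turn_left", "turn_right",
    "move_forward", "move_backward",
    "step_left", "step_right",
    "no_op", "cast_beam",
    # two composed actions: move forward while turning
    "forward_turn_left", "forward_turn_right",
]

def preprocess_observation(rgb_frame):
    """Check the observation shape and rescale 0-255 RGB values to [0, 1]."""
    obs = np.asarray(rgb_frame, dtype=np.uint8)
    if obs.shape != OBS_SHAPE:
        raise ValueError(f"expected {OBS_SHAPE}, got {obs.shape}")
    return obs.astype(np.float32) / 255.0
```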