A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning
Authors: Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, Thore Graepel
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we first observe that policies learned using InRL can overfit to the other agents' policies during training, failing to sufficiently generalize during execution. We introduce a new metric, joint-policy correlation, to quantify this effect. We describe an algorithm for general MARL, based on approximate best responses to mixtures of policies generated using deep reinforcement learning, and empirical game-theoretic analysis to compute meta-strategies for policy selection. The algorithm generalizes previous ones such as InRL, iterated best response, double oracle, and fictitious play. Then, we present a scalable implementation which reduces the memory requirement using decoupled meta-solvers. Finally, we demonstrate the generality of the resulting policies in two partially observable settings: gridworld coordination games and poker. |
| Researcher Affiliation | Industry | Marc Lanctot, DeepMind, lanctot@; Vinicius Zambaldi, DeepMind, vzambaldi@; Audrunas Gruslys, DeepMind, audrunas@; Angeliki Lazaridou, DeepMind, angeliki@; Karl Tuyls, DeepMind, karltuyls@; Julien Pérolat, DeepMind, perolat@; David Silver, DeepMind, davidsilver@; Thore Graepel, DeepMind, thore@ (...@google.com) |
| Pseudocode | Yes | Algorithm 1: Policy-Space Response Oracles Algorithm 2: Deep Cognitive Hierarchies |
| Open Source Code | No | The paper refers to 'Appendix C' for implementation details and to reference [55] (an arXiv paper) for a longer technical report version, but does not provide a direct link to a code repository or explicitly state that the source code for their method is being released. |
| Open Datasets | No | The paper describes custom environments ('First-Person Gridworld Games' and 'Leduc Poker') but does not provide concrete access information (link, DOI, citation) to any publicly available training datasets or pre-existing datasets used for training. Leduc Poker is a game setup, not a dataset. |
| Dataset Splits | No | The paper describes running experiments for a certain number of episodes (e.g., '100 episodes') but does not specify formal training/validation/test dataset splits or their percentages, which is typical for fixed datasets rather than continuous environment interaction. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. It only mentions the affiliation 'Deep Mind'. |
| Software Dependencies | No | The paper mentions 'Reactor' and 'Retrace(λ)' for learning, and refers to 'Adam' as an optimizer, but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | Each agent has a local field-of-view (making the world partially observable) and sees 17 spaces in front, 10 to either side, and 2 spaces behind. Consequently, observations are encoded as 21x20x3 RGB tensors with values 0-255. Each agent has a choice of turning left or right, moving forward or backward, stepping left or right, not moving, or casting an endless light beam in its current direction. In addition, the agent has two composed actions of moving forward and turning. Actions are executed simultaneously, and the order of resolution is randomized. Agents start on a random spawn point at the beginning of each episode. If an agent is touched ('tagged') by another agent's light beam twice, then the target agent is immediately teleported to a spawn point. In laser tag, the source agent then receives 1 point of reward for the tag. In another variant, gathering, there is no tagging but agents can collect apples, for 1 point per apple, which refresh at a fixed rate. In pathfind, there is neither tagging nor apples, and both agents get 1 point of reward when both reach their destinations, ending the episode. In every variant, an episode consists of 1000 steps of simulation. Other details, such as specific maps, can be found in Appendix D. |
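
The abstract quoted in the Research Type row introduces joint-policy correlation (JPC) to measure how much independently trained policies overfit to the co-players they were trained with. A minimal sketch of one way to aggregate a JPC matrix into a single number is below; the function name and the diagonal-vs-off-diagonal aggregation are assumptions based on the description in the table, not a verbatim reproduction of the paper's formula.

```python
import numpy as np

def jpc_proportional_loss(returns):
    """Average proportional loss from a joint-policy correlation (JPC) matrix.

    `returns` is an N x N array where entry (i, j) is the mean joint return
    when player 1 uses the policy from training instance i and player 2 uses
    the policy from training instance j. Diagonal entries pair policies that
    were trained together; off-diagonal entries pair independently trained
    policies. The aggregation below (diagonal mean vs. off-diagonal mean) is
    an assumed reading of the metric named in the abstract.
    """
    returns = np.asarray(returns, dtype=float)
    n = returns.shape[0]
    diag_mean = np.trace(returns) / n
    off_mean = (returns.sum() - np.trace(returns)) / (n * (n - 1))
    return (diag_mean - off_mean) / diag_mean
```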
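
The same row, together with the Pseudocode row (Algorithm 1, Policy-Space Response Oracles), describes an outer loop that alternates between training approximate best responses to mixtures of existing policies and re-solving the empirical meta-game for new meta-strategies. A minimal sketch of that loop is shown below, assuming the caller supplies the deep-RL oracle, the payoff estimator, and the meta-solver; all three callables are hypothetical placeholders, not the paper's code.

```python
import numpy as np

def psro(train_best_response, estimate_payoffs, meta_solver,
         init_policies, iterations=5):
    """Sketch of a PSRO-style loop for a two-player game.

    Hypothetical callables supplied by the caller:
      train_best_response(player, opponent_policies, opponent_mixture)
          -> an approximate best response (e.g. trained with deep RL),
      estimate_payoffs(policies_p0, policies_p1)
          -> empirical payoff matrix from simulated episodes,
      meta_solver(payoffs)
          -> a meta-strategy (mixture over policies) for each player.
    """
    policies = [list(init_policies[0]), list(init_policies[1])]
    # Start with a uniform meta-strategy over each player's initial policies.
    meta = [np.ones(len(p)) / len(p) for p in policies]

    for _ in range(iterations):
        # Oracle step: each player adds an approximate best response to
        # the opponent's current mixture over its policy set.
        for player in (0, 1):
            opp = 1 - player
            policies[player].append(
                train_best_response(player, policies[opp], meta[opp]))

        # Empirical game-theoretic analysis: simulate all pairings to fill
        # the meta-game payoff matrix, then re-solve for new mixtures.
        payoffs = estimate_payoffs(policies[0], policies[1])
        meta = meta_solver(payoffs)

    return policies, meta
```

Roughly, with a single fixed opponent policy and one iteration this collapses to independent RL, and with an exact best-response oracle it behaves like the double oracle method, which appears to be the sense in which the abstract calls the algorithm a generalization of those procedures.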
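
The Experiment Setup row fixes a concrete agent interface for the gridworld games: a 21x20x3 RGB observation, ten discrete actions, and 1000-step episodes. The sketch below records those numbers as constants with a small preprocessing helper; the identifiers and the tensor-axis orientation are assumptions for illustration, since the environments themselves are not released.

```python
import numpy as np

# Hypothetical constants mirroring the interface described in the
# Experiment Setup row; the paper's environment code is not public.
OBS_SHAPE = (21, 20, 3)   # 21 ~ 10 left + 10 right + own column,
                          # 20 ~ 17 in front + 2 behind + own row (axis
                          # orientation is an assumption), 3 RGB channels
EPISODE_LENGTH = 1000     # simulation steps per episode

ACTIONS = [
    "turn_left", "turn_right",
    "move_forward", "move_backward",
    "step_left", "step_right",
    "no_op", "cast_beam",
    # two composed actions: move forward while turning
    "forward_turn_left", "forward_turn_right",
]

def preprocess_observation(rgb_frame):
    """Check the observation shape and rescale 0-255 RGB values to [0, 1]."""
    obs = np.asarray(rgb_frame, dtype=np.uint8)
    if obs.shape != OBS_SHAPE:
        raise ValueError(f"expected {OBS_SHAPE}, got {obs.shape}")
    return obs.astype(np.float32) / 255.0
```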