Actor-Critic Policy Optimization in Partially Observable Multiagent Environments

Authors: Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Pérolat, Karl Tuyls, Rémi Munos, Michael Bowling

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.
Researcher Affiliation | Industry | Sriram Srinivasan (srsrinivasan@), Marc Lanctot (lanctot@), Vinicius Zambaldi (vzambaldi@), Julien Pérolat (perolat@), Karl Tuyls (karltuyls@), Rémi Munos (munos@), Michael Bowling (bowlingm@); all emails ...@google.com; all authors affiliated with DeepMind.
Pseudocode | Yes | The pseudo-code is given in Algorithm 2 in Appendix C.
Open Source Code | No | The paper does not provide an explicit statement about making the source code available or a link to a code repository.
Open Datasets | Yes | We evaluate the actor-critic algorithms on two n-player games: Kuhn poker, and Leduc poker. ... To remain consistent with other baselines, we use the form of Leduc described in [50] which does not restrict the action space, adding reward penalties if/when illegal moves are chosen.
Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits with percentages or sample counts. In reinforcement learning, data is typically generated through interaction with an environment rather than drawn from pre-defined static splits.
Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU or GPU models, memory) used for the experiments; it only mentions general concepts such as "neural networks".
Software Dependencies | No | The paper does not provide specific software dependencies or version numbers (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup | Yes | These updates were done using separate SGD optimizers with their respective learning rates of fixed 0.001 for policy evaluation, and annealed from a starting learning rate to 0 over 20M steps for policy improvement. ... The temperature is annealed from 1 to 0 over 1M steps to ensure adequate state space coverage. An additional entropy cost hyper-parameter is added as is standard practice with Deep RL policy gradient methods such as A3C [59, 77].
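
The Experiment Setup row quotes the schedules but not how they fit together. Below is a minimal sketch, not the authors' code, of the quoted configuration: a fixed 0.001 learning rate for policy evaluation, an actor learning rate annealed linearly to 0 over 20M steps, a softmax temperature annealed from 1 to 0 over 1M steps, and an entropy bonus on the policy-gradient loss. The starting actor learning rate and the entropy-cost weight are illustrative placeholders, since the quoted text does not fix them.

```python
# Minimal sketch of the quoted schedules; not the authors' code.
# ACTOR_LR_START and ENTROPY_COST are illustrative placeholders.

def linear_anneal(step, start, end, horizon):
    """Linearly interpolate from `start` to `end` over `horizon` steps."""
    frac = min(step, horizon) / horizon
    return start + frac * (end - start)

CRITIC_LR = 1e-3               # fixed, for policy evaluation (quoted)
ACTOR_LR_START = 1e-4          # illustrative starting value
ACTOR_LR_HORIZON = 20_000_000  # annealed to 0 over 20M steps (quoted)
TEMP_HORIZON = 1_000_000       # temperature annealed 1 -> 0 over 1M steps (quoted)
ENTROPY_COST = 0.01            # illustrative entropy-bonus weight

def schedules(step):
    """Return (critic_lr, actor_lr, temperature) for the current step."""
    actor_lr = linear_anneal(step, ACTOR_LR_START, 0.0, ACTOR_LR_HORIZON)
    temperature = linear_anneal(step, 1.0, 0.0, TEMP_HORIZON)
    return CRITIC_LR, actor_lr, temperature

def actor_loss(log_prob_taken, advantage, entropy):
    """A3C-style policy-gradient loss with an entropy bonus."""
    return -(log_prob_taken * advantage) - ENTROPY_COST * entropy
```

As the quoted text describes, the two learning rates would feed two separate SGD optimizers, one for the critic (policy evaluation) and one for the actor (policy improvement).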
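
The Open Datasets row names Kuhn poker and Leduc poker as the evaluation games. Both are available today in DeepMind's OpenSpiel library (released after this paper), so the sketch below shows one way to instantiate them and run a random rollout; it is an illustration, not the authors' harness. Note that OpenSpiel's default Leduc implementation restricts play to legal actions, whereas the quoted text says the authors use a variant that instead penalizes illegal moves; the player counts chosen here are also just examples.

```python
# Illustrative only: instantiate the two benchmark games with OpenSpiel
# (pip install open_spiel) and play one uniformly random episode of each.
import random

import pyspiel

rng = random.Random(0)

def random_rollout(game):
    """Play one episode with uniform-random actions and return the payoffs."""
    state = game.new_initial_state()
    while not state.is_terminal():
        if state.is_chance_node():
            actions, probs = zip(*state.chance_outcomes())
            state.apply_action(rng.choices(actions, weights=probs, k=1)[0])
        else:
            state.apply_action(rng.choice(state.legal_actions()))
    return state.returns()

kuhn = pyspiel.load_game("kuhn_poker", {"players": 3})    # player count is an example
leduc = pyspiel.load_game("leduc_poker", {"players": 2})  # default rules; see note above
print(random_rollout(kuhn))
print(random_rollout(leduc))
```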