Actor-Critic Policy Optimization in Partially Observable Multiagent Environments

Authors: Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Pérolat, Karl Tuyls, Rémi Munos, Michael Bowling

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.
Researcher Affiliation | Industry | Sriram Srinivasan (srsrinivasan@), Marc Lanctot (lanctot@), Vinicius Zambaldi (vzambaldi@), Julien Pérolat (perolat@), Karl Tuyls (karltuyls@), Rémi Munos (munos@), Michael Bowling (bowlingm@); all emails ...@google.com; all authors affiliated with DeepMind.
Pseudocode | Yes | The pseudo-code is given in Algorithm 2 in Appendix C.
Open Source Code | No | The paper does not provide an explicit statement about making the source code available or a link to a code repository.
Open Datasets | Yes | We evaluate the actor-critic algorithms on two n-player games: Kuhn poker, and Leduc poker. ... To remain consistent with other baselines, we use the form of Leduc described in [50] which does not restrict the action space, adding reward penalties if/when illegal moves are chosen.
Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits with percentages or sample counts. In reinforcement learning, data is typically generated through interaction with an environment rather than drawn from pre-defined static splits.
Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU or GPU models, memory) used for the experiments; it only mentions general concepts such as "neural networks".
Software Dependencies | No | The paper does not provide specific software dependencies or version numbers (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup | Yes | These updates were done using separate SGD optimizers with their respective learning rates of fixed 0.001 for policy evaluation, and annealed from a starting learning rate to 0 over 20M steps for policy improvement. ... The temperature is annealed from 1 to 0 over 1M steps to ensure adequate state space coverage. An additional entropy cost hyper-parameter is added as is standard practice with Deep RL policy gradient methods such as A3C [59, 77].
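
The Experiment Setup row quotes the schedules but not how they fit together. Below is a minimal sketch, not the authors' code, of the quoted configuration: a fixed 0.001 learning rate for policy evaluation, an actor learning rate annealed linearly to 0 over 20M steps, a softmax temperature annealed from 1 to 0 over 1M steps, and an entropy bonus on the policy-gradient loss. The starting actor learning rate and the entropy-cost weight are illustrative placeholders, since the quoted text does not fix them.

```python
# Minimal sketch of the quoted schedules; not the authors' code.
# ACTOR_LR_START and ENTROPY_COST are illustrative placeholders.

def linear_anneal(step, start, end, horizon):
    """Linearly interpolate from `start` to `end` over `horizon` steps."""
    frac = min(step, horizon) / horizon
    return start + frac * (end - start)

CRITIC_LR = 1e-3               # fixed, for policy evaluation (quoted)
ACTOR_LR_START = 1e-4          # illustrative starting value
ACTOR_LR_HORIZON = 20_000_000  # annealed to 0 over 20M steps (quoted)
TEMP_HORIZON = 1_000_000       # temperature annealed 1 -> 0 over 1M steps (quoted)
ENTROPY_COST = 0.01            # illustrative entropy-bonus weight

def schedules(step):
    """Return (critic_lr, actor_lr, temperature) for the current step."""
    actor_lr = linear_anneal(step, ACTOR_LR_START, 0.0, ACTOR_LR_HORIZON)
    temperature = linear_anneal(step, 1.0, 0.0, TEMP_HORIZON)
    return CRITIC_LR, actor_lr, temperature

def actor_loss(log_prob_taken, advantage, entropy):
    """A3C-style policy-gradient loss with an entropy bonus."""
    return -(log_prob_taken * advantage) - ENTROPY_COST * entropy
```

As the quoted text describes, the two learning rates would feed two separate SGD optimizers, one for the critic (policy evaluation) and one for the actor (policy improvement).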
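
The Open Datasets row names Kuhn poker and Leduc poker as the evaluation games. Both are available today in DeepMind's OpenSpiel library (released after this paper), so the sketch below shows one way to instantiate them and run a random rollout; it is an illustration, not the authors' harness. Note that OpenSpiel's default Leduc implementation restricts play to legal actions, whereas the quoted text says the authors use a variant that instead penalizes illegal moves; the player counts chosen here are also just examples.

```python
# Illustrative only: instantiate the two benchmark games with OpenSpiel
# (pip install open_spiel) and play one uniformly random episode of each.
import random

import pyspiel

rng = random.Random(0)

def random_rollout(game):
    """Play one episode with uniform-random actions and return the payoffs."""
    state = game.new_initial_state()
    while not state.is_terminal():
        if state.is_chance_node():
            actions, probs = zip(*state.chance_outcomes())
            state.apply_action(rng.choices(actions, weights=probs, k=1)[0])
        else:
            state.apply_action(rng.choice(state.legal_actions()))
    return state.returns()

kuhn = pyspiel.load_game("kuhn_poker", {"players": 3})    # player count is an example
leduc = pyspiel.load_game("leduc_poker", {"players": 2})  # default rules; see note above
print(random_rollout(kuhn))
print(random_rollout(leduc))
```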