Actor-Critic Policy Optimization in Partially Observable Multiagent Environments
Authors: Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Pérolat, Karl Tuyls, Rémi Munos, Michael Bowling
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions. |
| Researcher Affiliation | Industry | Sriram Srinivasan¹ (srsrinivasan@), Marc Lanctot¹ (lanctot@), Vinicius Zambaldi¹ (vzambaldi@), Julien Pérolat¹ (perolat@), Karl Tuyls¹ (karltuyls@), Rémi Munos¹ (munos@), Michael Bowling¹ (bowlingm@); all addresses ...@google.com. ¹DeepMind. |
| Pseudocode | Yes | The pseudo-code is given in Algorithm 2 in Appendix C. |
| Open Source Code | No | The paper does not provide an explicit statement about making the source code available or a link to a code repository. |
| Open Datasets | Yes | We evaluate the actor-critic algorithms on two n-player games: Kuhn poker, and Leduc poker. ... To remain consistent with other baselines, we use the form of Leduc described in [50] which does not restrict the action space, adding reward penalties if/when illegal moves are chosen. (A hedged environment-loading sketch for these games follows the table.) |
| Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits with percentages or sample counts. In the context of reinforcement learning, data is typically generated through interaction with an environment rather than pre-defined static splits. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU, GPU models, memory) used for running the experiments. It only mentions general concepts like "neural networks". |
| Software Dependencies | No | The paper does not provide specific software dependencies or version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | These updates were done using separate SGD optimizers with their respective learning rates of fixed 0.001 for policy evaluation, and annealed from a starting learning rate to 0 over 20M steps for policy improvement. ... The temperature is annealed from 1 to 0 over 1M steps to ensure adequate state space coverage. An additional entropy cost hyper-parameter is added as is standard practice with Deep RL policy gradient methods such as A3C [59, 77]. (A sketch of these annealing schedules follows the table.) |
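The Open Datasets row refers to Kuhn poker and Leduc poker, which are standard research games rather than static datasets: episodes are generated by interacting with the game. The sketch below shows one way to instantiate both games and roll out a random episode. The OpenSpiel library is an assumption here (the paper predates it and does not name an environment implementation), and `random_rollout` is a hypothetical helper for illustration only.

```python
# Minimal sketch, assuming OpenSpiel (pip install open_spiel) as the game provider.
# The paper does not specify a framework; this is illustrative, not the authors' setup.
import random

import pyspiel


def random_rollout(game_name: str, seed: int = 0) -> list:
    """Play one episode of `game_name` with uniform-random actions; return per-player returns."""
    rng = random.Random(seed)
    game = pyspiel.load_game(game_name)
    state = game.new_initial_state()
    while not state.is_terminal():
        if state.is_chance_node():
            # Sample a chance outcome (e.g., a card deal) from its distribution.
            outcomes, probs = zip(*state.chance_outcomes())
            state.apply_action(rng.choices(outcomes, probs)[0])
        else:
            state.apply_action(rng.choice(state.legal_actions()))
    return state.returns()


if __name__ == "__main__":
    # Kuhn and Leduc poker, the two benchmark domains named in the paper.
    for name in ("kuhn_poker", "leduc_poker"):
        print(name, random_rollout(name))
```

Since the quote mentions n-player variants, note that OpenSpiel's loaders accept a `players` parameter (e.g., `pyspiel.load_game("kuhn_poker", {"players": 3})`); the penalty-on-illegal-moves variant of Leduc described in [50] would have to be handled by an agent-side wrapper rather than the game loader.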
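The Experiment Setup row quotes three schedules: a fixed critic (policy evaluation) learning rate of 0.001, an actor (policy improvement) learning rate annealed to 0 over 20M steps, and a sampling temperature annealed from 1 to 0 over 1M steps. The sketch below expresses these as simple linear ramps; it is a reconstruction for illustration, not the authors' code, and `ACTOR_LR_START` and the linear shape of the ramp are assumptions, since the paper only states the endpoints and step counts.

```python
def linear_anneal(start: float, end: float, step: int, total_steps: int) -> float:
    """Linearly interpolate from `start` to `end` over `total_steps` steps, then hold `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


CRITIC_LR = 1e-3            # fixed learning rate for policy evaluation (from the paper)
ACTOR_LR_START = 1e-3       # hypothetical starting value; the paper does not fix it here
ACTOR_LR_STEPS = 20_000_000  # actor learning rate annealed to 0 over 20M steps
TEMP_STEPS = 1_000_000       # temperature annealed from 1 to 0 over 1M steps


def hyperparams_at(step: int) -> dict:
    """Return the learning rates and temperature in effect at a given training step."""
    return {
        "critic_lr": CRITIC_LR,
        "actor_lr": linear_anneal(ACTOR_LR_START, 0.0, step, ACTOR_LR_STEPS),
        "temperature": linear_anneal(1.0, 0.0, step, TEMP_STEPS),
    }


# Example: inspect the schedules at a few points during training.
for s in (0, 500_000, 1_000_000, 10_000_000, 20_000_000):
    print(s, hyperparams_at(s))
```

The separate SGD optimizers and the additional entropy-cost term mentioned in the quote would sit on top of these schedules in the training loop; their exact values are reported in the paper's appendix rather than reconstructed here.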