Stackelberg Actor-Critic: Game-Theoretic Reinforcement Learning Algorithms
Authors: Liyuan Zheng, Tanner Fiez, Zane Alumbaugh, Benjamin Chasnov, Lillian J. Ratliff
AAAI 2022, pp. 9217-9224 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | From an empirical standpoint, we demonstrate via simple examples that the learning dynamics we study mitigate cycling and accelerate convergence compared to the usual gradient dynamics given cost structures induced by actor-critic formulations. Finally, experiments on Open AI gym environments show that Stackelberg actor-critic algorithms always perform at least as well and often significantly outperform the standard actor-critic algorithm counterparts. |
| Researcher Affiliation | Academia | 1University of Washington 2University of California, Santa Cruz {liyuanz8,fiezt,bchasnov,ratliffl}@uw.edu, zanedma@gmail.com |
| Pseudocode | Yes | Algorithm 1: Stackelberg Actor-Critic Framework |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for the methodology described. |
| Open Datasets | No | We run experiments on the Open AI gym platform (Brockman et al. 2016) with the Mujoco Physics simulator (Todorov, Erez, and Tassa 2012). The paper mentions environments, not a specific dataset with access information. |
| Dataset Splits | No | The paper discusses experiments run on Open AI gym environments but does not specify any explicit training/test/validation dataset splits or cross-validation setup. |
| Hardware Specification | No | The paper states that experiments were run on the Open AI gym platform with Mujoco, but it does not provide specific hardware details such as CPU or GPU models used for these experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the implementation. |
| Experiment Setup | Yes | We use a learning rate of 10^-4 for the actor and 3 × 10^-4 for the critic in all experiments. For DDPG and SAC, the actor and critic target networks are updated by Polyak averaging with update rate 0.005. The batch size is 256 for all algorithms. For STAC, STDDPG, and STSAC, the regularization parameter λ = 0.001. All networks are fully connected neural networks with two hidden layers of size 256 for both actor and critic networks. |
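
For reference, the hyperparameters reported in the Experiment Setup row can be collected into a short configuration sketch. The snippet below is a minimal illustration in PyTorch, assuming a standard actor-critic setup; the function names, variable names, and example observation/action dimensions are assumptions (the paper does not release code), and only the numeric values come from the table above.

```python
# Hedged sketch of the reported experiment configuration (not the authors' code).
import torch
import torch.nn as nn

# Hyperparameters quoted in the Experiment Setup row.
ACTOR_LR = 1e-4             # actor learning rate
CRITIC_LR = 3e-4            # critic learning rate
POLYAK_TAU = 0.005          # target-network update rate for DDPG/SAC variants
BATCH_SIZE = 256            # minibatch size for all algorithms
STACKELBERG_LAMBDA = 1e-3   # regularization parameter for STAC / STDDPG / STSAC
HIDDEN_SIZE = 256           # two fully connected hidden layers of this width


def mlp(in_dim: int, out_dim: int) -> nn.Sequential:
    """Two-hidden-layer fully connected network, as described in the paper."""
    return nn.Sequential(
        nn.Linear(in_dim, HIDDEN_SIZE), nn.ReLU(),
        nn.Linear(HIDDEN_SIZE, HIDDEN_SIZE), nn.ReLU(),
        nn.Linear(HIDDEN_SIZE, out_dim),
    )


if __name__ == "__main__":
    obs_dim, act_dim = 17, 6  # example dimensions for a MuJoCo-style task (assumed)
    actor = mlp(obs_dim, act_dim)
    critic = mlp(obs_dim + act_dim, 1)  # Q(s, a) critic
    actor_opt = torch.optim.Adam(actor.parameters(), lr=ACTOR_LR)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=CRITIC_LR)

    # Polyak averaging of target-network parameters, used by the DDPG/SAC variants.
    critic_target = mlp(obs_dim + act_dim, 1)
    critic_target.load_state_dict(critic.state_dict())
    with torch.no_grad():
        for p, p_targ in zip(critic.parameters(), critic_target.parameters()):
            p_targ.mul_(1 - POLYAK_TAU).add_(POLYAK_TAU * p)
```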