Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games
Authors: Shicong Cen, Yuejie Chi, Simon Shaolei Du, Lin Xiao
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method and the value is updated on a slower timescale. We show that, in the full-information tabular setting, the proposed method achieves a finite-time last-iterate linear convergence to the quantal response equilibrium of the regularized problem, which translates to a sublinear last-iterate convergence to the Nash equilibrium by controlling the amount of regularization. |
| Researcher Affiliation | Collaboration | Shicong Cen (Carnegie Mellon University, shicongc@andrew.cmu.edu); Yuejie Chi (Carnegie Mellon University, yuejiechi@cmu.edu); Simon S. Du (University of Washington, ssdu@cs.washington.edu); Lin Xiao (Meta AI Research, linx@fb.com) |
| Pseudocode | Yes | Algorithm 1: Entropy-regularized OMWU for Discounted Two-player Zero-sum Markov Game. Algorithm 2: Entropy-regularized OMWU for Episodic Two-player Zero-sum Markov Game. (An illustrative sketch of the core update appears below the table.) |
| Open Source Code | No | The paper does not provide any statement or link indicating that the authors have open-sourced their code. The discussion section mentions designing "sample-efficient implementations" as future work, implying that an implementation for release is not yet available. |
| Open Datasets | No | The paper is theoretical and does not describe any experiments that would involve training on a specific dataset. Therefore, no information about publicly available datasets is provided. |
| Dataset Splits | No | The paper is theoretical and does not conduct empirical studies that would require specifying training, validation, or test dataset splits. |
| Hardware Specification | No | The paper is purely theoretical and does not describe any computational experiments or the specific hardware (e.g., GPU models, CPU types, memory) used to run them. |
| Software Dependencies | No | The paper is theoretical and does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks) used for experiments. |
| Experiment Setup | No | The paper is theoretical and does not present any empirical experimental setup details, such as hyperparameter values, training schedules, or system-level configurations. The learning-rate bounds it provides are analytical conditions for the convergence guarantees, not empirical settings. |
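For intuition, the following is a minimal NumPy sketch of the entropy-regularized OMWU update that drives the paper's algorithms, specialized to a single-state (matrix) zero-sum game. This is an illustrative reconstruction, not the authors' code: Algorithm 1 additionally maintains value estimates updated on a slower timescale across all states of the Markov game, which this sketch omits, and the function names and parameter values here are hypothetical.

```python
import numpy as np

def entropy_omwu(A, tau=0.1, eta=0.05, iters=2000):
    """Entropy-regularized OMWU sketch for the zero-sum matrix game
    max_mu min_nu mu^T A nu (single-state special case; illustrative only)."""
    m, n = A.shape
    mu = np.full(m, 1.0 / m)   # max player's mixed strategy
    nu = np.full(n, 1.0 / n)   # min player's mixed strategy
    mu_bar, nu_bar = mu.copy(), nu.copy()  # optimistic ("predicted") iterates

    def mwu_step(p, grad):
        # Multiplicative-weights step with entropy regularization:
        #   p_new(a) ∝ p(a)^(1 - eta * tau) * exp(eta * grad(a))
        logits = (1.0 - eta * tau) * np.log(p) + eta * grad
        w = np.exp(logits - logits.max())  # subtract max for numerical stability
        return w / w.sum()

    for _ in range(iters):
        # Symmetric prediction step for both agents against the previous midpoints...
        mu_bar_next = mwu_step(mu, A @ nu_bar)
        nu_bar_next = mwu_step(nu, -(A.T @ mu_bar))
        # ...then the main iterates move against the new midpoints.
        mu = mwu_step(mu, A @ nu_bar_next)
        nu = mwu_step(nu, -(A.T @ mu_bar_next))
        mu_bar, nu_bar = mu_bar_next, nu_bar_next
    return mu, nu

# Matching pennies: by symmetry the quantal response equilibrium is uniform,
# and it approaches the Nash equilibrium as tau -> 0.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
mu, nu = entropy_omwu(A)
print(mu, nu)  # both should be close to [0.5, 0.5]
```

For suitably small step sizes (the paper derives explicit bounds), iterates of this form converge at a linear rate to the quantal response equilibrium of the tau-regularized game, and driving tau toward zero recovers the sublinear last-iterate convergence to the Nash equilibrium described in the abstract.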