Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games

Authors: Shicong Cen, Yuejie Chi, Simon Shaolei Du, Lin Xiao

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method and the value is updated on a slower timescale. We show that, in the full-information tabular setting, the proposed method achieves a finite-time last-iterate linear convergence to the quantal response equilibrium of the regularized problem, which translates to a sublinear last-iterate convergence to the Nash equilibrium by controlling the amount of regularization.
Researcher Affiliation | Collaboration | Shicong Cen (Carnegie Mellon University, shicongc@andrew.cmu.edu); Yuejie Chi (Carnegie Mellon University, yuejiechi@cmu.edu); Simon S. Du (University of Washington, ssdu@cs.washington.edu); Lin Xiao (Meta AI Research, linx@fb.com)
Pseudocode | Yes | Algorithm 1: Entropy-regularized OMWU for Discounted Two-player Zero-sum Markov Game; Algorithm 2: Entropy-regularized OMWU for Episodic Two-player Zero-sum Markov Game.
Open Source Code | No | The paper provides no statement or link indicating that the authors have open-sourced their code. The discussion section lists designing "sample-efficient implementations" as future work, implying that an implementation suitable for release does not yet exist.
Open Datasets | No | The paper is theoretical and describes no experiments that would involve training on a dataset, so no publicly available datasets are referenced.
Dataset Splits | No | The paper is theoretical and conducts no empirical studies that would require specifying training, validation, or test splits.
Hardware Specification | No | The paper is purely theoretical and describes no computational experiments or the hardware (e.g., GPU models, CPU types, memory) used to run them.
Software Dependencies | No | The paper is theoretical and specifies no software dependencies with version numbers (e.g., programming languages, libraries, frameworks).
Experiment Setup | No | The paper is theoretical and presents no empirical setup details such as hyperparameter values, training schedules, or system-level configurations. It does give theoretical bounds on admissible learning rates, but these are analysis conditions, not an empirical configuration.
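Although no reference implementation accompanies the paper, the entropy-regularized OMWU update it describes has a compact numerical form. The sketch below, a non-authoritative illustration written for a single-state (matrix) zero-sum game, shows the optimistic two-step structure: midpoint iterates use gradients at the previous midpoints, and main iterates use gradients at the new midpoints, with the exponent `1 - eta*tau` implementing the entropy regularization. The function name, the simplification to a matrix game, and the omission of the slower-timescale value update from Algorithm 1 are all assumptions of this sketch, not the paper's exact method.

```python
import numpy as np

def omwu_entropy_step(A, x, y, x_mid, y_mid, eta, tau):
    """One entropy-regularized optimistic MWU step for min_x max_y x^T A y.

    Illustrative sketch only: the paper's Algorithm 1 handles Markov games
    and couples this update with a slower value update, omitted here.
    A:      payoff matrix (n_x, n_y)
    x, y:   current mixed strategies (probability vectors)
    x_mid, y_mid: previous midpoint (optimistic) iterates
    eta:    learning rate; tau: entropy-regularization strength
    """
    # Midpoint iterates: multiplicative-weights step with gradients
    # evaluated at the previous midpoints; x**(1 - eta*tau) is the
    # entropy-regularization term.
    x_mid_new = x ** (1 - eta * tau) * np.exp(-eta * (A @ y_mid))
    x_mid_new /= x_mid_new.sum()
    y_mid_new = y ** (1 - eta * tau) * np.exp(eta * (A.T @ x_mid))
    y_mid_new /= y_mid_new.sum()
    # Main iterates: same step, but with gradients at the new midpoints
    # (the "optimistic" correction).
    x_new = x ** (1 - eta * tau) * np.exp(-eta * (A @ y_mid_new))
    x_new /= x_new.sum()
    y_new = y ** (1 - eta * tau) * np.exp(eta * (A.T @ x_mid_new))
    y_new /= y_new.sum()
    return x_new, y_new, x_mid_new, y_mid_new
```

On a symmetric game such as rock-paper-scissors, whose quantal response equilibrium is uniform by symmetry, iterating this step from a non-uniform start drives both players' last iterates toward the uniform strategy, consistent with the last-iterate convergence the paper proves.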