Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games
Authors: Shicong Cen, Yuejie Chi, Simon Shaolei Du, Lin Xiao
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method and the value is updated on a slower timescale. We show that, in the full-information tabular setting, the proposed method achieves a finite-time last-iterate linear convergence to the quantal response equilibrium of the regularized problem, which translates to a sublinear last-iterate convergence to the Nash equilibrium by controlling the amount of regularization. |
| Researcher Affiliation | Collaboration | Shicong Cen (Carnegie Mellon University); Yuejie Chi (Carnegie Mellon University); Simon S. Du (University of Washington); Lin Xiao (Meta AI Research) |
| Pseudocode | Yes | Algorithm 1: Entropy-regularized OMWU for Discounted Two-player Zero-sum Markov Game. Algorithm 2: Entropy-regularized OMWU for Episodic Two-player Zero-sum Markov Game. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the authors have open-sourced their code. The discussion section lists designing "sample-efficient implementations" as future work, which suggests that no released implementation exists. |
| Open Datasets | No | The paper is theoretical and does not describe any experiments that would involve training on a specific dataset. Therefore, no information about publicly available datasets is provided. |
| Dataset Splits | No | The paper is theoretical and does not conduct empirical studies that would require specifying training, validation, or test dataset splits. |
| Hardware Specification | No | The paper is purely theoretical and does not describe any computational experiments or the specific hardware (e.g., GPU models, CPU types, memory) used to run them. |
| Software Dependencies | No | The paper is theoretical and does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks) used for experiments. |
| Experiment Setup | No | The paper is theoretical and does not present any empirical experimental setup details, such as hyperparameter values, training schedules, or system-level configurations. It provides theoretical bounds on the learning rate, but these are not tied to an empirical setup. |
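
The Research Type entry summarizes the method as entropy-regularized OMWU with symmetric updates from both agents, converging to the quantal response equilibrium (QRE) of the regularized problem. Because the paper releases no code, the following is only a minimal sketch of the single-state special case, a zero-sum matrix game max_x min_y x^T A y + τH(x) − τH(y). It is not the authors' Algorithm 1 or 2: it omits the slower-timescale value update used for Markov games. The helper names (`reg_omwu`, `qre_residual`, `nash_gap`), the payoff matrix, τ = 0.2, the step size η = 0.1, and the iteration count are all illustrative assumptions, not the paper's learning-rate bounds.

```python
import math

# Hypothetical sketch: single-state (matrix-game) case of entropy-regularized OMWU,
#   max_x min_y  x^T A y + tau*H(x) - tau*H(y).
# The Markov-game value update on a slower timescale is intentionally omitted.

def normalize_log(logits):
    """Turn unnormalized log-weights into a probability vector (stable softmax)."""
    m = max(logits)
    w = [math.exp(v - m) for v in logits]
    s = sum(w)
    return [v / s for v in w]

def a_times(A, y):
    """(A y)_i: expected payoff of each row action against mixed strategy y."""
    return [sum(a * b for a, b in zip(row, y)) for row in A]

def at_times(A, x):
    """(A^T x)_j: expected payoff of each column action against mixed strategy x."""
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]

def mwu(p, grad, eta, tau, sign):
    """Entropy-regularized MWU: p_new proportional to p^(1 - eta*tau) * exp(sign*eta*grad)."""
    return normalize_log([(1 - eta * tau) * math.log(pi) + sign * eta * g
                          for pi, g in zip(p, grad)])

def reg_omwu(A, tau, eta, iters):
    m, n = len(A), len(A[0])
    x, y = [1.0 / m] * m, [1.0 / n] * n   # uniform initialization
    x_bar, y_bar = x[:], y[:]             # optimistic "midpoint" iterates
    for _ in range(iters):
        # Midpoint step: each player responds to the opponent's previous midpoint.
        # (The right-hand side is evaluated before either name is rebound.)
        x_bar, y_bar = (mwu(x, a_times(A, y_bar), eta, tau, +1),
                        mwu(y, at_times(A, x_bar), eta, tau, -1))
        # Main step: same base point (x, y), but against the fresh midpoints.
        x, y = (mwu(x, a_times(A, y_bar), eta, tau, +1),
                mwu(y, at_times(A, x_bar), eta, tau, -1))
    return x, y

def qre_residual(A, x, y, tau):
    """Max deviation from the QRE fixed point x ~ exp(Ay/tau), y ~ exp(-A^T x/tau)."""
    bx = normalize_log([g / tau for g in a_times(A, y)])
    by = normalize_log([-g / tau for g in at_times(A, x)])
    return max(abs(a - b) for a, b in zip(x + y, bx + by))

def nash_gap(A, x, y):
    """Unregularized duality gap max_i (Ay)_i - min_j (A^T x)_j; zero at a Nash equilibrium."""
    return max(a_times(A, y)) - min(at_times(A, x))

# Illustrative 3x3 payoff matrix (an assumption, not from the paper).
A = [[0.0, -1.0, 2.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]]
TAU, ETA = 0.2, 0.1
x, y = reg_omwu(A, TAU, ETA, iters=5000)
print("QRE residual:", qre_residual(A, x, y, TAU))
print("Nash gap:", nash_gap(A, x, y), "bound:", TAU * (math.log(3) + math.log(3)))
```

At the exact QRE, the unregularized duality gap is at most τ(log m + log n), because each player's entropy bonus is bounded by the log of its number of actions. Shrinking τ therefore tightens the gap to a Nash equilibrium. In this style of analysis, however, the linear contraction factor depends on ητ, so a smaller τ also slows convergence. This is the trade-off behind the sublinear last-iterate rate to the Nash equilibrium summarized in the Research Type row.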