Offline Learning in Markov Games with General Function Approximation

Authors: Yuheng Zhang, Yu Bai, Nan Jiang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We study offline multi-agent reinforcement learning (RL) in Markov games, where the goal is to learn an approximate equilibrium, such as a Nash equilibrium or a (Coarse) Correlated Equilibrium, from an offline dataset pre-collected from the game. Existing works consider relatively restricted tabular or linear models and handle each equilibrium separately. In this work, we provide the first framework for sample-efficient offline learning in Markov games under general function approximation, handling all three equilibria in a unified manner. By using Bellman-consistent pessimism, we obtain interval estimation of policy returns, and use both the upper and the lower bounds to obtain a relaxation on the gap of a candidate policy, which becomes our optimization objective. Our results generalize prior works and provide several additional insights. Importantly, we require a data coverage condition that improves over the recently proposed unilateral concentrability. Our condition allows selective coverage of deviation policies that optimally trade off between their greediness (as approximate best responses) and coverage, and we show scenarios where this leads to significantly better guarantees. As a new connection, we also show how our algorithmic framework can subsume seemingly different solution concepts designed for the special case of two-player zero-sum games. (See the illustrative sketch after this table.)
Researcher Affiliation | Collaboration | 1 University of Illinois at Urbana-Champaign, 2 Salesforce AI Research. Correspondence to: Nan Jiang <nanjiang@illinois.edu>.
Pseudocode | Yes | Algorithm 1: Bellman-Consistent Equilibrium Learning (BCEL) from an Offline Dataset
Open Source Code | No | The paper does not provide any statements about open-sourcing code or links to a code repository.
Open Datasets | No | The paper is theoretical and does not use or make available any specific public datasets for training. It discusses a 'pre-collected historical dataset' and a 'data distribution d ∈ Δ(S × A)' as abstract concepts for its theoretical framework.
Dataset Splits | No | The paper is theoretical and does not specify training/test/validation dataset splits, as it does not conduct experiments on real datasets.
Hardware Specification | No | The paper focuses on theoretical analysis and does not describe any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not mention specific software or libraries with version numbers, as it does not report on empirical implementations or experiments.
Experiment Setup | No | The paper is theoretical and does not describe specific experimental setup details such as hyperparameters or training configurations for empirical evaluation.
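
The abstract above describes the paper's core optimization objective: pessimistic lower bounds on a candidate policy's returns, optimistic upper bounds on the returns of each player's deviation, and selection of the candidate with the smallest relaxed gap. The Python sketch below is a minimal, hypothetical illustration of that objective only; the interval oracle, the finite candidate set, and all names are assumptions for illustration and are not the paper's actual Algorithm 1 (BCEL) or its function-approximation machinery.

```python
# Hypothetical sketch: choose the candidate joint policy whose relaxed equilibrium
# gap (optimistic deviation value minus pessimistic policy value) is smallest.
# The interval oracle is assumed to be built from the offline dataset elsewhere.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Policy = str  # stand-in for a joint policy; real use would involve function approximators


@dataclass
class IntervalOracle:
    """Assumed confidence-interval oracle derived from the offline data.

    lower(pi, i): pessimistic lower bound on player i's return under joint policy pi.
    upper_dev(pi, i): optimistic upper bound on player i's return when player i
        deviates (approximately best-responds) while the other players follow pi.
    """
    lower: Callable[[Policy, int], float]
    upper_dev: Callable[[Policy, int], float]


def relaxed_gap(pi: Policy, n_players: int, oracle: IntervalOracle) -> float:
    """Upper bound on the equilibrium gap of pi, formed from the interval estimates."""
    return max(oracle.upper_dev(pi, i) - oracle.lower(pi, i) for i in range(n_players))


def select_policy(candidates: List[Policy], n_players: int,
                  oracle: IntervalOracle) -> Tuple[Policy, float]:
    """Return the candidate with the smallest relaxed gap (the optimization objective)."""
    gaps = {pi: relaxed_gap(pi, n_players, oracle) for pi in candidates}
    best = min(gaps, key=gaps.get)
    return best, gaps[best]


if __name__ == "__main__":
    # Toy two-player numbers, purely for illustration.
    lower_tbl: Dict[Tuple[Policy, int], float] = {
        ("pi_A", 0): 0.40, ("pi_A", 1): 0.35,
        ("pi_B", 0): 0.30, ("pi_B", 1): 0.45,
    }
    upper_dev_tbl: Dict[Tuple[Policy, int], float] = {
        ("pi_A", 0): 0.55, ("pi_A", 1): 0.50,
        ("pi_B", 0): 0.70, ("pi_B", 1): 0.52,
    }
    oracle = IntervalOracle(
        lower=lambda pi, i: lower_tbl[(pi, i)],
        upper_dev=lambda pi, i: upper_dev_tbl[(pi, i)],
    )
    pi_hat, gap = select_policy(["pi_A", "pi_B"], n_players=2, oracle=oracle)
    print(f"selected {pi_hat} with relaxed gap {gap:.2f}")
```

In this toy example the sketch selects pi_A, whose worst-case interval gap (0.15) is smaller than pi_B's (0.40), mirroring how using both upper and lower bounds turns the unknown equilibrium gap into a data-driven quantity that can be minimized over candidates.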