Sample-Efficient Multi-Agent RL: An Optimization Perspective

Authors: Nuoya Xiong, Zhihan Liu, Zhaoran Wang, Zhuoran Yang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We study multi-agent reinforcement learning (MARL) for general-sum Markov Games (MGs) under general function approximation. To identify the minimal assumptions for sample-efficient learning, we introduce a novel complexity measure called the Multi-Agent Decoupling Coefficient (MADC) for general-sum MGs. Using this measure, we propose the first unified algorithmic framework that ensures sample-efficient learning of Nash Equilibrium, Coarse Correlated Equilibrium, and Correlated Equilibrium for both model-based and model-free MARL problems with low MADC. We also show that our algorithm achieves sublinear regret comparable to existing works. Moreover, our algorithm requires only an equilibrium-solving oracle and an oracle that solves regularized supervised learning, and thus avoids solving constrained optimization problems with data-dependent constraints (Jin et al., 2020a; Wang et al., 2023) or executing sampling procedures with complex multi-objective optimization problems (Foster et al., 2023). Finally, the model-free version of our algorithm is the first provably efficient model-free algorithm for learning Nash equilibria of general-sum MGs.
Researcher Affiliation | Academia | Nuoya Xiong, IIIS, Tsinghua University (xiongny20@mails.tsinghua.edu.cn); Zhihan Liu, Northwestern University (zhihanliu2027@u.northwestern.edu); Zhaoran Wang, Northwestern University (zhaoranwang@gmail.com); Zhuoran Yang, Yale University (zhuoran.yang@yale.edu)
Pseudocode | Yes | Algorithm 1: Multi-Agent Maximize-to-EXplore (MAMEX)
Open Source Code | No | The paper provides no statements or links indicating that open-source code for the described methodology is available.
Open Datasets | No | The paper is theoretical; it mentions 'data collected via online interactions' but does not identify or provide access to any public or open dataset used for training.
Dataset Splits | No | The paper is theoretical and performs no empirical experiments, so no dataset splits (training, validation, test) are specified for reproducibility.
Hardware Specification | No | The paper is theoretical and does not mention any hardware used to run experiments.
Software Dependencies | No | The paper is theoretical and does not list any software dependencies with version numbers required for reproducibility.
Experiment Setup | No | The paper is theoretical; it discusses algorithmic parameters such as 'η' but provides no details of an experimental setup, such as hyperparameters or system-level training settings, for empirical validation.
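For context on what an "equilibrium-solving oracle" computes, the following is a minimal illustrative sketch, not the paper's MAMEX subroutine: it solves the simplest instance of the problem, a two-player zero-sum matrix game, via the standard linear-programming formulation (the paper's oracle is more general, covering NE/CCE/CE in general-sum MGs). The function name `solve_zero_sum` and the use of SciPy are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Row player's Nash strategy and game value for a zero-sum matrix game.

    Solves:  maximize v  s.t.  A^T x >= v * 1,  sum(x) = 1,  x >= 0,
    where A[i, j] is the row player's payoff for actions (i, j).
    """
    m, n = A.shape
    # Decision variables: x_1..x_m (mixed strategy) and v (game value).
    c = np.zeros(m + 1)
    c[-1] = -1.0  # linprog minimizes, so minimize -v to maximize v.
    # One inequality per column j:  v - (A^T x)_j <= 0.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]  # v is unbounded.
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Matching pennies: the unique equilibrium is uniform play with value 0.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x, v = solve_zero_sum(A)
```

For general-sum games, NE computation is PPAD-hard in general, which is why treating the equilibrium solver as an oracle (as the paper does) is the standard abstraction.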