Learning to Play General-Sum Games against Multiple Boundedly Rational Agents

Authors: Eric Zhao, Alexander R. Trott, Caiming Xiong, Stephan Zheng

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate our framework learns robust mechanisms in both matrix games and complex spatiotemporal games. In particular, we learn a dynamic tax policy that improves the welfare of a simulated trade-and-barter economy by 15%, even when facing previously unseen boundedly rational RL taxpayers.
Researcher Affiliation | Collaboration | 1) Salesforce Research, Palo Alto, California, USA; 2) University of California, Berkeley, Berkeley, California, USA; 3) Mosaic ML, San Francisco, California, USA
Pseudocode | Yes | Algorithm 1: Decoupled sampling of pessimistic equilibria.
Open Source Code | Yes | Source code for these experiments is released at https://github.com/salesforce/strategically-robust-ai.
Open Datasets | No | The paper describes using simulated game environments ('Sequential Bimatrix Game', 'AI Economist') rather than publicly available datasets with concrete access information (URL, DOI, or repository).
Dataset Splits | No | The paper mentions selecting the top 10 seeds 'in a validation environment' but does not provide specific details on how this validation set is created or split from the overall data/simulation for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or any other computer specifications used for running experiments.
Software Dependencies | No | The paper mentions using a 'common multi-agent implementation of the PPO algorithm' but does not specify any software libraries or dependencies with version numbers.
Experiment Setup | Yes | Algorithm 1 (decoupled sampling of pessimistic equilibria), quoted from the paper:
    Input: number of training steps M_tr and self-play steps M_s, reward slack ϵ, multiplier learning rate α_λ, uncoupled self-play algorithm B, regret estimators R_i : P(A) → R for each agent i.
    Output: approximate lower bound on L(ϵ) (Eq. 5).
    Initialize mixed strategy x_1.
    for j = 1, ..., M_tr do
        for i = 1, ..., N do
            Estimate regret r_i as r̂_i ← R_i(x_j), where r_i := max_{x̃_i ∈ P(A_i)} u_i(x̃_i, x_{-i}) − u_i(x).
            Update multiplier λ_i ← λ_i + α_λ (r̂_i − ϵ).
        end for
        Using B, run M_s rounds of self-play with utilities û_i(a) := (λ_i u_i(a) − u_0(a)) / (1 + λ_i).
        Set x_{j+1} to the resulting empirical play distribution.
    end for
    Return (1/M_tr) Σ_{t=1}^{M_tr} u_0(x_t).
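To make the loop above concrete, here is a minimal runnable sketch of Algorithm 1 for a two-player bimatrix game. It is not the authors' released implementation: fictitious play stands in for the uncoupled self-play subroutine B, regrets are computed exactly from the payoff matrices rather than estimated, and the function names, payoff matrices, and hyperparameter values (eps, alpha, m_tr, m_s) are illustrative assumptions.

```python
# Sketch of Algorithm 1 (decoupled sampling of pessimistic equilibria)
# for a two-player bimatrix game. Assumptions not taken from the paper:
# fictitious play as the self-play subroutine B, exact regret computation,
# and placeholder payoffs/hyperparameters.
import numpy as np


def expected_payoff(M, x_row, x_col):
    """Expected payoff of matrix M under mixed strategies (x_row, x_col)."""
    return x_row @ M @ x_col


def exact_regret(i, payoffs, x):
    """Regret of agent i: best pure deviation minus current expected payoff."""
    M = payoffs[i]
    if i == 0:
        best_deviation = (M @ x[1]).max()      # best row against x[1]
    else:
        best_deviation = (x[0] @ M).max()      # best column against x[0]
    return best_deviation - expected_payoff(M, x[0], x[1])


def fictitious_play(payoffs, steps):
    """Uncoupled self-play stand-in for B; returns empirical play distributions."""
    counts = [np.ones(payoffs[0].shape[0]), np.ones(payoffs[0].shape[1])]
    for _ in range(steps):
        x = [c / c.sum() for c in counts]
        counts[0][np.argmax(payoffs[0] @ x[1])] += 1   # agent 0 best-responds
        counts[1][np.argmax(x[0] @ payoffs[1])] += 1   # agent 1 best-responds
    return [c / c.sum() for c in counts]


def pessimistic_lower_bound(u, u0, eps=0.05, alpha=0.5, m_tr=200, m_s=200):
    """Approximate lower bound on the designer's payoff u0 over eps-equilibria."""
    n_rows, n_cols = u0.shape
    x = [np.ones(n_rows) / n_rows, np.ones(n_cols) / n_cols]   # initial strategies
    lam = np.ones(2)                            # Lagrange multipliers, one per agent
    designer_payoffs = []
    for _ in range(m_tr):
        # Dual step: raise lambda_i when agent i's regret exceeds the slack eps.
        for i in range(2):
            r_hat = exact_regret(i, u, x)
            lam[i] = max(0.0, lam[i] + alpha * (r_hat - eps))
        # Modified utilities trade off each agent's payoff against the designer's.
        u_hat = [(lam[i] * u[i] - u0) / (1.0 + lam[i]) for i in range(2)]
        # Run m_s rounds of self-play on the modified game; keep empirical play.
        x = fictitious_play(u_hat, m_s)
        designer_payoffs.append(expected_payoff(u0, x[0], x[1]))
    return float(np.mean(designer_payoffs))


if __name__ == "__main__":
    # Tiny illustrative game: prisoner's-dilemma-like agents and a designer
    # payoff u0 that prefers mutual cooperation.
    u1 = np.array([[3.0, 0.0], [5.0, 1.0]])
    u2 = np.array([[3.0, 5.0], [0.0, 1.0]])
    u0 = np.array([[4.0, 1.0], [1.0, 0.0]])
    print(pessimistic_lower_bound([u1, u2], u0))
```

In the paper's larger spatiotemporal experiments, the self-play step is reportedly carried out with a common multi-agent PPO implementation rather than fictitious play; the outer dual loop over the multipliers λ_i has the same structure as in this sketch.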