Learning to Play General-Sum Games against Multiple Boundedly Rational Agents
Authors: Eric Zhao, Alexander R. Trott, Caiming Xiong, Stephan Zheng
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate our framework learns robust mechanisms in both matrix games and complex spatiotemporal games. In particular, we learn a dynamic tax policy that improves the welfare of a simulated trade-and-barter economy by 15%, even when facing previously unseen boundedly rational RL taxpayers. |
| Researcher Affiliation | Collaboration | (1) Salesforce Research, Palo Alto, California, USA; (2) University of California, Berkeley, Berkeley, California, USA; (3) Mosaic ML, San Francisco, California, USA |
| Pseudocode | Yes | Algorithm 1: Decoupled sampling of pessimistic equilibria. |
| Open Source Code | Yes | Source code for these experiments is released at https://github.com/salesforce/strategically-robust-ai. |
| Open Datasets | No | The paper describes using simulated game environments ('Sequential Bimatrix Game', 'AI Economist') rather than publicly available datasets with concrete access information (URL, DOI, or repository). |
| Dataset Splits | No | The paper mentions selecting top 10 seeds 'in a validation environment' but does not provide specific details on how this validation set is created or split from the overall data/simulation for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or any other computer specifications used for running experiments. |
| Software Dependencies | No | The paper mentions using a 'common multi-agent implementation of the PPO algorithm' but does not specify any software libraries or dependencies with version numbers. |
| Experiment Setup | Yes | Output: approximate lower bound on L(ε) (Eq. 5). Input: number of training steps M_tr and self-play steps M_s, reward slack ε, multiplier learning rate α_λ, an uncoupled self-play algorithm B, and regret estimators R_i : P(A) → ℝ for each agent i. Initialize mixed strategy x_1. For j = 1, …, M_tr: for each agent i = 1, …, N, estimate the regret r_i := max_{x'_i ∈ P(A_i)} u_i(x'_i, x_{−i}) − u_i(x) as r̂_i ← R_i(x_j), and update the multiplier λ_i ← λ_i + α_λ (r̂_i − ε); then, using B, run M_s rounds of self-play with utilities û_i(a) := (λ_i u_i(a) − u_0(a)) / (1 + λ_i), and set x_{j+1} to the resulting empirical play distribution. Return (1/M_tr) Σ_{t=1}^{M_tr} u_0(x_t). A code sketch of this loop appears below the table. |
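
The Experiment Setup cell above compresses the paper's Algorithm 1 into prose. As a reading aid, the following is a minimal Python sketch of that outer Lagrangian loop, assuming callable utility functions, regret estimators, and a generic self-play routine. The function name `pessimistic_lower_bound`, the argument conventions, and the non-negativity clipping of the multipliers are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Hedged sketch of Algorithm 1 (decoupled sampling of pessimistic equilibria).
# All callables below are assumptions standing in for the paper's components.
import numpy as np

def pessimistic_lower_bound(u0, u, regret_estimators, self_play, x0,
                            M_tr=100, M_s=50, eps=0.05, alpha_lam=0.1):
    """Estimate a lower bound on the principal's pessimistic value L(eps).

    u0(x)                    -> principal utility under mixed-strategy profile x
    u[i](x)                  -> agent i's utility under x
    regret_estimators[i](x)  -> estimated regret of agent i at x
    self_play(utils, M_s, x) -> empirical play distribution after M_s rounds
                                of an uncoupled self-play algorithm B
    x0                       -> initial mixed-strategy profile x_1
    """
    N = len(u)
    lam = np.zeros(N)          # one Lagrange multiplier per agent
    x = x0
    principal_values = []

    for _ in range(M_tr):
        # Dual step: raise lambda_i when agent i's regret exceeds the slack eps,
        # pushing that agent back toward an approximate best response.
        # (The clipping at zero is an assumption for keeping multipliers valid.)
        for i in range(N):
            r_hat = regret_estimators[i](x)
            lam[i] = max(0.0, lam[i] + alpha_lam * (r_hat - eps))

        # Primal step: agents play a modified game trading off their own
        # utility against harming the principal, weighted by lambda.
        def modified_utility(i, a):
            return (lam[i] * u[i](a) - u0(a)) / (1.0 + lam[i])

        x = self_play(modified_utility, M_s, x)
        principal_values.append(u0(x))

    # Average principal utility over the iterates approximates the lower bound.
    return float(np.mean(principal_values))
```

Per the Software Dependencies row, the paper instantiates self-play with a common multi-agent PPO implementation; in this sketch the self-play routine and regret estimators are left as generic callables.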