Learning to Incentivize Other Learning Agents
Authors: Jiachen Yang, Ang Li, Mehrdad Farajtabar, Peter Sunehag, Edward Hughes, Hongyuan Zha
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate in experiments that such agents significantly outperform standard RL and opponent-shaping agents in challenging general-sum Markov games, often by finding a near-optimal division of labor. |
| Researcher Affiliation | Collaboration | Georgia Institute of Technology; DeepMind; AIRS and Chinese University of Hong Kong, Shenzhen |
| Pseudocode | Yes | Algorithm 1 Learning to Incentivize Others |
| Open Source Code | Yes | Code for all experiments is available at https://github.com/011235813/lio |
| Open Datasets | Yes | Iterated Prisoner's Dilemma (IPD). We test LIO on the memory-1 IPD as defined in [12]... N-Player Escape Room (ER). We experiment on the N-player Escape Room game shown in Figure 1 (Section 1)... Cleanup. Furthermore, we conduct experiments on the Cleanup game (Figure 3) [18, 42]. |
| Dataset Splits | No | The paper describes experiments in the Iterated Prisoner's Dilemma, N-Player Escape Room, and Cleanup game environments. These are typically interactive simulations rather than static datasets with explicit train/validation/test splits. The paper does not specify any dataset splits for these environments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. It only mentions affiliations with DeepMind and Google, which implies access to significant computational resources, but no specific hardware specifications are given. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies (e.g., deep learning frameworks, programming languages, or libraries) used in the experiments. |
| Experiment Setup | Yes | We chose Rmax = [3, 2, 2] for [IPD, ER, Cleanup], respectively... We use on-policy learning with policy gradient for each agent in IPD and ER, and actor-critic for Cleanup. To ensure that all agents' policies perform sufficient exploration for the effect of incentives to be discovered, we include an exploration lower bound ε such that π(a|s) = (1 − ε)π(a|s) + ε/|A|, with linearly decreasing ε. |
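
The exploration lower bound quoted in the Experiment Setup row is easy to illustrate. Below is a minimal sketch, assuming a standard ε-uniform mixing of the learned policy, π(a|s) = (1 − ε)π(a|s) + ε/|A|, together with a linear decay of ε. The schedule constants (`eps_start`, `eps_end`, `decay_steps`) are illustrative placeholders, not values reported in the paper.

```python
import numpy as np

def smoothed_policy(pi_hat: np.ndarray, eps: float) -> np.ndarray:
    """Mix the learned policy with a uniform distribution over |A| actions."""
    num_actions = pi_hat.shape[-1]
    return (1.0 - eps) * pi_hat + eps / num_actions

def linear_eps(step: int, eps_start: float = 0.5, eps_end: float = 0.05,
               decay_steps: int = 10_000) -> float:
    """Linearly decrease epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# Example: a 3-action policy smoothed at the start of training (eps = 0.5).
pi_hat = np.array([0.9, 0.05, 0.05])
print(smoothed_policy(pi_hat, linear_eps(step=0)))  # -> approx. [0.617, 0.192, 0.192]
```

The mixed policy keeps every action probability at or above ε/|A|, so incentive rewards given by other agents can still influence actions that the current policy would otherwise never take.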