Learning to Incentivize Other Learning Agents
Authors: Jiachen Yang, Ang Li, Mehrdad Farajtabar, Peter Sunehag, Edward Hughes, Hongyuan Zha
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate in experiments that such agents significantly outperform standard RL and opponent-shaping agents in challenging general-sum Markov games, often by finding a near-optimal division of labor. |
| Researcher Affiliation | Collaboration | Georgia Institute of Technology; DeepMind; AIRS and Chinese University of Hong Kong, Shenzhen |
| Pseudocode | Yes | Algorithm 1 Learning to Incentivize Others |
| Open Source Code | Yes | Code for all experiments is available at https://github.com/011235813/lio |
| Open Datasets | Yes | Iterated Prisoner's Dilemma (IPD). We test LIO on the memory-1 IPD as defined in [12]... N-Player Escape Room (ER). We experiment on the N-player Escape Room game shown in Figure 1 (Section 1)... Cleanup. Furthermore, we conduct experiments on the Cleanup game (Figure 3) [18, 42]. |
| Dataset Splits | No | The paper describes experiments in the Iterated Prisoner's Dilemma, N-Player Escape Room, and Cleanup game environments. These are typically interactive simulations rather than static datasets with explicit train/validation/test splits. The paper does not specify any dataset splits for these environments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. It only mentions affiliations with DeepMind and Google, which implies access to significant computational resources, but no specific hardware specifications are given. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies (e.g., deep learning frameworks, programming languages, or libraries) used in the experiments. |
| Experiment Setup | Yes | We chose Rmax = [3, 2, 2] for [IPD, ER, Cleanup], respectively... We use on-policy learning with policy gradient for each agent in IPD and ER, and actor-critic for Cleanup. To ensure that all agents' policies perform sufficient exploration for the effect of incentives to be discovered, we include an exploration lower bound ε such that π(a|s) = (1 − ε)π(a|s) + ε/|A|, with linearly decreasing ε. |
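
The exploration lower bound quoted in the Experiment Setup row is easy to illustrate. Below is a minimal sketch, assuming a standard ε-uniform mixing of the learned policy, π(a|s) = (1 − ε)π(a|s) + ε/|A|, together with a linear decay of ε. The schedule constants (`eps_start`, `eps_end`, `decay_steps`) are illustrative placeholders, not values reported in the paper.

```python
import numpy as np

def smoothed_policy(pi_hat: np.ndarray, eps: float) -> np.ndarray:
    """Mix the learned policy with a uniform distribution over |A| actions."""
    num_actions = pi_hat.shape[-1]
    return (1.0 - eps) * pi_hat + eps / num_actions

def linear_eps(step: int, eps_start: float = 0.5, eps_end: float = 0.05,
               decay_steps: int = 10_000) -> float:
    """Linearly decrease epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# Example: a 3-action policy smoothed at the start of training (eps = 0.5).
pi_hat = np.array([0.9, 0.05, 0.05])
print(smoothed_policy(pi_hat, linear_eps(step=0)))  # -> approx. [0.617, 0.192, 0.192]
```

The mixed policy keeps every action probability at or above ε/|A|, so incentive rewards given by other agents can still influence actions that the current policy would otherwise never take.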