Reciprocal Reward Influence Encourages Cooperation From Self-Interested Agents

Authors: John L. Zhou, Weizhe Hong, Jonathan C. Kao

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments using two commonly used SSDs (sequential social dilemmas) of varied complexity to demonstrate the shaping abilities of Reciprocators against other types of learning agents.
Researcher Affiliation | Academia | John L. Zhou, Weizhe Hong, Jonathan C. Kao; University of California, Los Angeles; john.ly.zhou@gmail.com
Pseudocode | Yes | Algorithm 1: Training with Reciprocal Reward Influence vs. Agent i
Open Source Code | Yes | Our code is available at https://github.com/johnlyzhou/reciprocator/.
Open Datasets | Yes | Iterated Prisoner's Dilemma (IPD): The iterated prisoner's dilemma is a temporally extended version of the classical thought experiment, in which two prisoners are each given a choice to either stay silent/cooperate (C) or confess/defect (D), with rewards given in Table 1a. Coins: Coins is a temporally extended variant of the IPD introduced by Lerer & Peysakhovich (2018). (A toy payoff sketch for the IPD appears after this table.)
Dataset Splits | No | The paper describes training and testing procedures but does not explicitly describe a validation split or the use of a held-out validation set for hyperparameter tuning or early stopping. It mentions experience replay and target-network updates, but no distinct validation data.
Hardware Specification | Yes | All experiments were run on Nvidia 3070 GPUs with 8 GB of VRAM.
Software Dependencies | No | The paper mentions using "proximal policy optimization using a clipped surrogate objective (Schulman et al., 2017, PPO-Clip)" and the "Adam" optimizer, and adapting code from Lu et al. (2022), but it does not specify version numbers for any software libraries, frameworks (such as PyTorch or TensorFlow), or specific environments.
Experiment Setup | Yes | For rollout-based experiments, we implement all policy gradient-based agents using actor-critic architectures trained with proximal policy optimization using a clipped surrogate objective (Schulman et al., 2017, PPO-Clip). ... Additional hyperparameter values and network architecture details can be found in Appendix A. Table 2: general PPO parameters; Table 3: Reciprocator-specific parameters. (A minimal sketch of the clipped surrogate objective appears after this table.)
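
To make the IPD environment concrete, below is a minimal, self-contained sketch of one round of the game. The payoff values follow a convention common in the opponent-shaping literature; the exact entries of the paper's Table 1a may differ, and the constant and function names here are illustrative rather than taken from the authors' code.

```python
import numpy as np

# Illustrative IPD payoffs (a common convention in the opponent-shaping
# literature); the exact values in the paper's Table 1a may differ.
C, D = 0, 1  # cooperate, defect
PAYOFFS = {
    (C, C): (-1.0, -1.0),
    (C, D): (-3.0,  0.0),
    (D, C): ( 0.0, -3.0),
    (D, D): (-2.0, -2.0),
}

def ipd_step(action_1: int, action_2: int) -> tuple[float, float]:
    """Play one round of the iterated prisoner's dilemma and return both rewards."""
    return PAYOFFS[(action_1, action_2)]

# Example: mutual defection for five rounds yields the worst joint return.
returns = np.sum([ipd_step(D, D) for _ in range(5)], axis=0)
print(returns)  # [-10. -10.]
```

Coins extends this same tension between individual and joint return to a temporally extended grid world, as described in the Open Datasets entry above.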
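
The rollout-based agents are trained with PPO-Clip; as a reference point, here is a minimal sketch of the clipped surrogate loss from Schulman et al. (2017). This is not the authors' implementation: the function name, tensor layout, and the default clip_eps are illustrative assumptions, and the hyperparameters actually used are those listed in the paper's Tables 2 and 3.

```python
import torch

def ppo_clip_loss(new_logp: torch.Tensor,
                  old_logp: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate policy loss (Schulman et al., 2017).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current policy and the policy that generated the rollout.
    advantages: advantage estimates for those actions.
    """
    ratio = torch.exp(new_logp - old_logp)  # pi_theta(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the pessimistic (elementwise minimum) objective;
    # negate it to obtain a loss suitable for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```

In an actor-critic setup this policy loss is typically combined with a value-function loss (and often an entropy bonus); the paper's Appendix A specifies the exact architectures and coefficients used.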