Reciprocal Reward Influence Encourages Cooperation From Self-Interested Agents
Authors: John L. Zhou, Weizhe Hong, Jonathan C. Kao
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments using two commonly used SSDs of varied complexity to demonstrate the shaping abilities of Reciprocators against other types of learning agents. |
| Researcher Affiliation | Academia | John L. Zhou Weizhe Hong Jonathan C. Kao University of California, Los Angeles john.ly.zhou@gmail.com |
| Pseudocode | Yes | Algorithm 1 Training with Reciprocal Reward Influence vs. Agent i |
| Open Source Code | Yes | Our code is available at https://github.com/johnlyzhou/reciprocator/. |
| Open Datasets | Yes | Iterated Prisoner's Dilemma (IPD): The iterated prisoner's dilemma (IPD) is a temporally extended version of the classical thought experiment, in which two prisoners are given a choice to either stay silent/cooperate (C) or confess/defect (D), with rewards given in Table 1a. Coins: Coins is a temporally extended variant of the IPD introduced by Lerer & Peysakhovich (2018). |
| Dataset Splits | No | The paper describes training and testing procedures but does not explicitly specify a validation split or how a separate validation set would be used during training (e.g., for hyperparameter tuning or early stopping). It mentions experience replay and target-network updates, but not a distinct validation split. |
| Hardware Specification | Yes | All experiments were run on Nvidia 3070 GPUs with 8 GB of VRAM. |
| Software Dependencies | No | The paper mentions using "proximal policy optimization using a clipped surrogate objective (Schulman et al., 2017, PPO-Clip)" and the "Adam" optimizer, and adapting code from "Lu et al. (2022)", but it does not specify version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or specific environments. |
| Experiment Setup | Yes | For rollout-based experiments, we implement all policy gradient-based agents using actor-critic architectures trained with proximal policy optimization using a clipped surrogate objective (Schulman et al., 2017, PPO-Clip). ... Additional hyperparameter values and network architecture details can be found in Appendix A. Table 2: General PPO parameters. Table 3: Reciprocator-specific parameters. |
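
For readers unfamiliar with the environments quoted in the "Open Datasets" row above, the following is a minimal sketch of an iterated prisoner's dilemma rollout. Only the cooperate/defect (C, D) action structure is taken from the excerpt; the payoff values and episode length are illustrative placeholders, since the paper's Table 1a and rollout settings are not reproduced in this summary.

```python
import numpy as np

# Conventional IPD payoff matrix (row player, column player).
# Placeholder values -- the paper's actual rewards are in its Table 1a.
PAYOFFS = {
    ("C", "C"): (-1, -1),  # mutual cooperation
    ("C", "D"): (-3, 0),   # player 1 exploited
    ("D", "C"): (0, -3),   # player 2 exploited
    ("D", "D"): (-2, -2),  # mutual defection
}

def play_ipd(policy_1, policy_2, num_steps=10, rng=None):
    """Roll out one IPD episode; each policy maps the last joint action to C/D."""
    rng = rng or np.random.default_rng()
    last_actions = ("C", "C")  # arbitrary starting context
    returns = np.zeros(2)
    for _ in range(num_steps):
        a1, a2 = policy_1(last_actions, rng), policy_2(last_actions, rng)
        r1, r2 = PAYOFFS[(a1, a2)]
        returns += (r1, r2)
        last_actions = (a1, a2)
    return returns

# Example: tit-for-tat (copies the opponent's previous move) vs. always-defect.
tit_for_tat = lambda last, rng: last[1]
always_defect = lambda last, rng: "D"
print(play_ipd(tit_for_tat, always_defect))
```

The "Experiment Setup" row quotes the use of PPO with a clipped surrogate objective (PPO-Clip, Schulman et al., 2017). As a reference point, here is a minimal sketch of that objective. The clipping coefficient and the use of PyTorch are assumptions: the paper's exact hyperparameters live in its Appendix A / Tables 2-3, and it does not name a framework version.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from Schulman et al. (2017).

    clip_eps = 0.2 is a common default, not necessarily the paper's value.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate because optimizers minimize; the surrogate itself is maximized.
    return -torch.min(unclipped, clipped).mean()
```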
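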