Reciprocal Reward Influence Encourages Cooperation From Self-Interested Agents
Authors: John L. Zhou, Weizhe Hong, Jonathan C. Kao
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments using two commonly used SSDs of varied complexity to demonstrate the shaping abilities of Reciprocators against other types of learning agents. |
| Researcher Affiliation | Academia | John L. Zhou Weizhe Hong Jonathan C. Kao University of California, Los Angeles john.ly.zhou@gmail.com |
| Pseudocode | Yes | Algorithm 1 Training with Reciprocal Reward Influence vs. Agent i |
| Open Source Code | Yes | Our code is available at https://github.com/johnlyzhou/reciprocator/. |
| Open Datasets | Yes | Iterated Prisoner's Dilemma (IPD): The iterated prisoner's dilemma (IPD) is a temporally extended version of the classical thought experiment, in which two prisoners are given a choice to either stay silent/cooperate (C) or confess/defect (D), with rewards given in Table 1a. Coins: Coins is a temporally extended variant of the IPD introduced by Lerer & Peysakhovich (2018). |
| Dataset Splits | No | The paper describes training and testing procedures but does not explicitly specify a validation split or how a separate validation set would be used during training (e.g., for hyperparameter tuning or early stopping). It mentions experience replay and target-network updates, but not a distinct validation split. |
| Hardware Specification | Yes | All experiments were run on Nvidia 3070 GPUs with 8 GB of VRAM. |
| Software Dependencies | No | The paper mentions using "proximal policy optimization using a clipped surrogate objective (Schulman et al., 2017, PPO-Clip)" and the "Adam" optimizer, and adapting code from "Lu et al. (2022)", but it does not specify version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or specific environments. |
| Experiment Setup | Yes | For rollout-based experiments, we implement all policy gradient-based agents using actor-critic architectures trained with proximal policy optimization using a clipped surrogate objective (Schulman et al., 2017, PPO-Clip). ... Additional hyperparameter values and network architecture details can be found in Appendix A. Table 2: General PPO parameters. Table 3: Reciprocator-specific parameters. |
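
For readers unfamiliar with the environments quoted in the "Open Datasets" row above, the following is a minimal sketch of an iterated prisoner's dilemma rollout. Only the cooperate/defect (C, D) action structure is taken from the excerpt; the payoff values and episode length are illustrative placeholders, since the paper's Table 1a and rollout settings are not reproduced in this summary.

```python
import numpy as np

# Conventional IPD payoff matrix (row player, column player).
# Placeholder values -- the paper's actual rewards are in its Table 1a.
PAYOFFS = {
    ("C", "C"): (-1, -1),  # mutual cooperation
    ("C", "D"): (-3, 0),   # player 1 exploited
    ("D", "C"): (0, -3),   # player 2 exploited
    ("D", "D"): (-2, -2),  # mutual defection
}

def play_ipd(policy_1, policy_2, num_steps=10, rng=None):
    """Roll out one IPD episode; each policy maps the last joint action to C/D."""
    rng = rng or np.random.default_rng()
    last_actions = ("C", "C")  # arbitrary starting context
    returns = np.zeros(2)
    for _ in range(num_steps):
        a1, a2 = policy_1(last_actions, rng), policy_2(last_actions, rng)
        r1, r2 = PAYOFFS[(a1, a2)]
        returns += (r1, r2)
        last_actions = (a1, a2)
    return returns

# Example: tit-for-tat (copies the opponent's previous move) vs. always-defect.
tit_for_tat = lambda last, rng: last[1]
always_defect = lambda last, rng: "D"
print(play_ipd(tit_for_tat, always_defect))
```

The "Experiment Setup" row quotes the use of PPO with a clipped surrogate objective (PPO-Clip, Schulman et al., 2017). As a reference point, here is a minimal sketch of that objective. The clipping coefficient and the use of PyTorch are assumptions: the paper's exact hyperparameters live in its Appendix A / Tables 2-3, and it does not name a framework version.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from Schulman et al. (2017).

    clip_eps = 0.2 is a common default, not necessarily the paper's value.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate because optimizers minimize; the surrogate itself is maximized.
    return -torch.min(unclipped, clipped).mean()
```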
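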