Emergent Reciprocity and Team Formation from Randomized Uncertain Social Preferences

Authors: Bowen Baker

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we show evidence of emergent direct reciprocity, indirect reciprocity and reputation, and team formation when training agents with randomized uncertain social preferences (RUSP), a novel environment augmentation that expands the distribution of environments agents play in. For all experiments, agent policies are recurrent entity-invariant neural networks similar to Baker et al. [4] trained with proximal policy optimization (PPO) [43], an on-policy reinforcement learning algorithm; see Appendix C for more details on the policy architecture and policy optimization. For all plots, one training iteration comprises 60 steps of stochastic gradient descent on the PPO objective. Figure 3 shows the effect of training in IPD with randomized social preferences and varying levels of uncertainty. (An illustrative RUSP reward sketch follows the table.)
Researcher Affiliation | Industry | Bowen Baker, OpenAI, bowen@openai.com
Pseudocode | No | The paper describes its methods in prose and through diagrams but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We open-source our environments for further research into social dilemmas. Environment code will be available at github.com/openai/multi-agent-emergence-environments
Open Datasets | No | The paper open-sources its custom environments but does not provide concrete access information (link, DOI, or formal citation) for any pre-existing publicly available dataset used for training, validation, or testing.
Dataset Splits | No | The paper describes experiments conducted in simulated environments where data is generated on the fly; it does not report training/validation/test splits (percentages or sample counts) over a fixed dataset.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or cloud computing instance types, used for running the experiments.
Software Dependencies | No | The paper mentions using proximal policy optimization (PPO) and MuJoCo but does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | Episode lengths are sampled from a geometric distribution with stopping probability 0.1, which is equivalent to an infinite-horizon game with discount factor γ = 0.9 and a mean horizon of 10. During each episode agents are partitioned onto randomized soft teams, meaning that they share rewards but may prioritize teammates more or less than themselves, rather than hard teams with completely shared rewards. For all experiments, agent policies are recurrent entity-invariant neural networks similar to Baker et al. [4] trained with proximal policy optimization (PPO) [43], an on-policy reinforcement learning algorithm; see Appendix C for more details on the policy architecture and policy optimization. For all plots, one training iteration comprises 60 steps of stochastic gradient descent on the PPO objective. (A horizon sanity check follows the table.)
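
The Research Type row quotes the paper's description of RUSP, in which each agent's reward is mixed with other agents' rewards according to randomly sampled soft-team social preferences that agents observe only noisily. Below is a minimal, hypothetical NumPy sketch of that idea; the Dirichlet sampling of preference rows, the Gaussian observation noise, and the function name rusp_rewards are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rusp_rewards(raw_rewards, rng, noise_scale=1.0):
    """Minimal sketch of a RUSP-style reward transform (illustrative only).

    Each agent receives a weighted mix of all agents' raw rewards, with the
    weights drawn from a randomly sampled social preference matrix. Agents
    also get only a noisy observation of that matrix, which is the source of
    the "uncertain" part of RUSP.
    """
    raw_rewards = np.asarray(raw_rewards, dtype=float)
    n = raw_rewards.shape[0]

    # Row i weights how much agent i values each agent's reward.
    # Dirichlet sampling is an assumed choice, not the paper's scheme.
    prefs = rng.dirichlet(alpha=np.ones(n), size=n)

    # Transformed (shared) rewards under the sampled soft teams.
    shared = prefs @ raw_rewards

    # Each agent observes the preference matrix only through Gaussian noise.
    noisy_prefs = prefs + noise_scale * rng.normal(size=(n, n))
    return shared, noisy_prefs

# Example with three agents and arbitrary raw per-agent rewards.
rng = np.random.default_rng(0)
shared_rewards, observed_prefs = rusp_rewards([1.0, -1.0, 0.5], rng)
```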
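
The Experiment Setup row states that episode lengths are drawn from a geometric distribution with stopping probability 0.1, which corresponds to a discount factor of γ = 0.9 and a mean horizon of 10. The snippet below is a quick sanity check of that arithmetic; the seed and sample size are arbitrary choices for illustration.

```python
import numpy as np

# Sanity check of the quoted setup: a geometric episode length with stopping
# probability p = 0.1 has mean 1 / p = 10, and the per-step continuation
# probability 1 - p = 0.9 plays the role of the discount factor gamma.
rng = np.random.default_rng(0)      # arbitrary seed
p_stop = 0.1
lengths = rng.geometric(p_stop, size=100_000)

print("empirical mean horizon:", lengths.mean())  # ~10
print("analytic mean horizon:", 1 / p_stop)       # 10.0
print("equivalent discount factor:", 1 - p_stop)  # 0.9
```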