Emergent Reciprocity and Team Formation from Randomized Uncertain Social Preferences
Authors: Bowen Baker
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we show evidence of emergent direct reciprocity, indirect reciprocity and reputation, and team formation when training agents with randomized uncertain social preferences (RUSP), a novel environment augmentation that expands the distribution of environments agents play in. For all experiments, agent policies are recurrent entity-invariant neural networks similar to Baker et al. [4] trained with proximal policy optimization (PPO) [43], an on-policy reinforcement learning algorithm; see Appendix C for more details on the policy architecture and policy optimization. For all plots, one training iteration comprises 60 steps of stochastic gradient descent on the PPO objective. Figure 3 shows the effect of training in IPD with randomized social preferences and varying levels of uncertainty. |
| Researcher Affiliation | Industry | Bowen Baker, OpenAI, bowen@openai.com |
| Pseudocode | No | The paper describes its methods in prose and through diagrams but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source our environments for further research into social dilemmas. Environment code will be available at github.com/openai/multi-agent-emergence-environments |
| Open Datasets | No | The paper open-sources its custom environments but does not provide concrete access information (link, DOI, or formal citation) for any pre-existing publicly available dataset used for training, validation, or testing. |
| Dataset Splits | No | The paper describes experiments conducted within simulated environments where data is generated on-the-fly. It does not provide specific training/validation/test dataset splits (percentages or sample counts) as it pertains to a fixed dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or cloud computing instance types, used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'proximal policy optimization (PPO)' and 'MuJoCo' but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Episode lengths are sampled from a geometric distribution with stopping probability 0.1, which is equivalent to an infinite-horizon game with discount factor γ = 0.9 and a mean horizon of 10. During each episode agents are partitioned onto randomized soft teams, meaning that they share rewards but may prioritize teammates more or less than themselves, rather than hard teams with completely shared rewards. For all experiments, agent policies are recurrent entity-invariant neural networks similar to Baker et al. [4] trained with proximal policy optimization (PPO) [43], an on-policy reinforcement learning algorithm; see Appendix C for more details on the policy architecture and policy optimization. For all plots, one training iteration comprises 60 steps of stochastic gradient descent on the PPO objective. |
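To make the Experiment Setup row above concrete, the sketch below illustrates the episode construction it describes: episode lengths drawn from a geometric distribution with stopping probability 0.1 (mean horizon 10, equivalent to γ = 0.9), and a randomized row-stochastic "soft team" social preference matrix that mixes agents' rewards and is observed by each agent with independent noise. This is a minimal sketch, not the paper's implementation: the Dirichlet parameterization, the Gaussian observation noise, the noise scale, and the agent count are illustrative assumptions.

```python
import numpy as np

# Illustrative parameters; only STOP_PROB is taken from the quoted setup.
N_AGENTS = 4           # number of agents per episode (assumed for illustration)
STOP_PROB = 0.1        # geometric stopping probability -> mean horizon 10, gamma = 0.9
OBS_NOISE_SCALE = 1.0  # per-agent uncertainty over the preference matrix (assumed)

rng = np.random.default_rng(0)


def sample_episode_length(stop_prob=STOP_PROB):
    """Sample an episode length from a geometric distribution.

    Equivalent to an infinite-horizon game with discount factor
    gamma = 1 - stop_prob (here 0.9) and mean horizon 1 / stop_prob (here 10).
    """
    return rng.geometric(stop_prob)


def sample_social_preference_matrix(n_agents=N_AGENTS):
    """Sample a random row-stochastic soft-team matrix T.

    Row i gives the weights agent i places on each agent's reward, so agents
    share rewards but may weight teammates more or less than themselves.
    The Dirichlet parameterization is an assumption for illustration.
    """
    return rng.dirichlet(np.ones(n_agents), size=n_agents)


def noisy_observations(T, noise_scale=OBS_NOISE_SCALE):
    """Give each agent an independently noise-corrupted view of T,
    modeling the 'uncertain' part of the social preferences."""
    return [T + rng.normal(0.0, noise_scale, size=T.shape) for _ in range(T.shape[0])]


def transform_rewards(T, env_rewards):
    """Replace each agent's reward with its preference-weighted combination
    of all agents' environment rewards: r_hat_i = sum_j T[i, j] * r_j."""
    return T @ np.asarray(env_rewards)


# Example episode setup.
horizon = sample_episode_length()
T = sample_social_preference_matrix()
per_agent_views = noisy_observations(T)
r_hat = transform_rewards(T, env_rewards=[1.0, -1.0, 0.5, 0.0])
print(horizon, r_hat)
```

Under this transform each agent optimizes a weighted sum of all agents' environment rewards, which is the soft-team reward sharing described above; the per-agent noisy copies of T stand in for the uncertain observations of social preferences, while the underlying policies and PPO training loop are omitted.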