Cooperative Open-ended Learning Framework for Zero-Shot Coordination
Authors: Yang Li, Shao Zhang, Jichen Sun, Yali Du, Ying Wen, Xinbing Wang, Wei Pan
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results in the Overcooked game environment demonstrate that our method outperforms current state-of-the-art methods when coordinating with different-level partners. |
| Researcher Affiliation | Academia | 1 The University of Manchester, 2 Shanghai Jiao Tong University, 3 King's College London. |
| Pseudocode | Yes | Algorithm 1: COLE_SV Algorithm; Algorithm 2: Graphic Shapley Value Solver Algorithm |
| Open Source Code | No | The paper provides a link to a 'demo' page (https://sites.google.com/view/cole-2023/) but does not explicitly state that the source code for the methodology itself is available at this link or elsewhere. |
| Open Datasets | Yes | In this paper, we conduct a series of experiments in the Overcooked environment (Carroll et al., 2019; Charakorn et al., 2020; Knott et al., 2021). |
| Dataset Splits | No | The paper evaluates coordination with different-level partners (middle-level and expert) but does not report train/validation/test splits or describe how any validation data was partitioned or used for model selection and hyperparameter tuning. |
| Hardware Specification | Yes | 1) 1-GPU node with NVIDIA GeForce 3090Ti 24G as GPU and AMD EPYC 7H12 64-Core Processor as CPU; 2) 2-GPU node with GeForce RTX 3090 24G as GPU and AMD Ryzen Threadripper 3970X 32-Core Processor as CPU. |
| Software Dependencies | No | The paper states that Proximal Policy Optimization (PPO) is used as the RL algorithm, but it does not list version numbers for any key software components, libraries, or programming languages. |
| Experiment Setup | Yes | The learning rates for the five layouts are 2e-3, 1e-3, 6e-4, 8e-4, and 8e-4. The gamma is 0.99. The lambda is 0.98. The PPO clipping factor is 0.05. The VF coefficient is 0.5. The maximum gradient norm is 0.1. The total training time steps for each PPO update are 48,000, divided into 10 mini-batches. The total numbers of generations for the five layouts are 80, 60, 75, 70, and 70, respectively. For each generation, we update 10 times to approximate the best-preferred strategy. The is 1. |
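
For readability, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration sketch. This is not the authors' code: the layout names and the order in which the per-layout learning rates and generation counts are assigned are assumptions, since the section lists five values without naming the corresponding Overcooked layouts.

```python
# Hedged sketch of the reported PPO / training hyperparameters.
# Layout names and the layout-to-value ordering below are assumptions
# (placeholders using the standard five Overcooked layouts), not taken
# from the quoted text, which lists values without layout names.

LAYOUTS = [
    "cramped_room",
    "asymmetric_advantages",
    "coordination_ring",
    "forced_coordination",
    "counter_circuit",
]

# Per-layout values as listed: learning rates and number of generations.
PER_LAYOUT = {
    layout: {"learning_rate": lr, "generations": gens}
    for layout, lr, gens in zip(
        LAYOUTS,
        [2e-3, 1e-3, 6e-4, 8e-4, 8e-4],
        [80, 60, 75, 70, 70],
    )
}

# Values shared across layouts, as quoted in the table row.
SHARED_PPO = {
    "gamma": 0.99,                  # discount factor
    "gae_lambda": 0.98,             # lambda
    "clip_range": 0.05,             # PPO clipping factor
    "vf_coef": 0.5,                 # value-function loss coefficient
    "max_grad_norm": 0.1,           # gradient clipping norm
    "timesteps_per_update": 48_000, # split into 10 mini-batches
    "num_minibatches": 10,
    "updates_per_generation": 10,   # to approximate the best-preferred strategy
}
```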