Cooperative Open-ended Learning Framework for Zero-Shot Coordination
Authors: Yang Li, Shao Zhang, Jichen Sun, Yali Du, Ying Wen, Xinbing Wang, Wei Pan
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results in the Overcooked game environment demonstrate that our method outperforms current state-of-the-art methods when coordinating with different-level partners. |
| Researcher Affiliation | Academia | 1 The University of Manchester, 2 Shanghai Jiao Tong University, 3 King's College London. |
| Pseudocode | Yes | Algorithm 1: COLE_SV Algorithm; Algorithm 2: Graphic Shapley Value Solver Algorithm |
| Open Source Code | No | The paper provides a link to a 'demo' page (https://sites.google.com/view/cole-2023/) but does not explicitly state that the source code for the methodology itself is available at this link or elsewhere. |
| Open Datasets | Yes | In this paper, we conduct a series of experiments in the Overcooked environment (Carroll et al., 2019; Charakorn et al., 2020; Knott et al., 2021). |
| Dataset Splits | No | The paper evaluates coordination with different-level partners (middle-level and expert) but does not report train/validation/test splits or describe how any validation data was partitioned or used for model selection and hyperparameter tuning. |
| Hardware Specification | Yes | 1) 1-GPU node with NVIDIA GeForce 3090Ti 24G as GPU and AMD EPYC 7H12 64-Core Processor as CPU; 2) 2-GPU node with GeForce RTX 3090 24G as GPU and AMD Ryzen Threadripper 3970X 32-Core Processor as CPU. |
| Software Dependencies | No | The paper states that Proximal Policy Optimization (PPO) is used as the RL algorithm, but it does not list version numbers for any key software components, libraries, or programming languages. |
| Experiment Setup | Yes | The learning rates for the five layouts are 2e-3, 1e-3, 6e-4, 8e-4, and 8e-4. The gamma is 0.99. The lambda is 0.98. The PPO clipping factor is 0.05. The VF coefficient is 0.5. The maximum gradient norm is 0.1. The total training time steps for each PPO update are 48,000, divided into 10 mini-batches. The total numbers of generations for the five layouts are 80, 60, 75, 70, and 70, respectively. For each generation, we update 10 times to approximate the best-preferred strategy. The is 1. |
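
For readability, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration sketch. This is not the authors' code: the layout names and the order in which the per-layout learning rates and generation counts are assigned are assumptions, since the section lists five values without naming the corresponding Overcooked layouts.

```python
# Hedged sketch of the reported PPO / training hyperparameters.
# Layout names and the layout-to-value ordering below are assumptions
# (placeholders using the standard five Overcooked layouts), not taken
# from the quoted text, which lists values without layout names.

LAYOUTS = [
    "cramped_room",
    "asymmetric_advantages",
    "coordination_ring",
    "forced_coordination",
    "counter_circuit",
]

# Per-layout values as listed: learning rates and number of generations.
PER_LAYOUT = {
    layout: {"learning_rate": lr, "generations": gens}
    for layout, lr, gens in zip(
        LAYOUTS,
        [2e-3, 1e-3, 6e-4, 8e-4, 8e-4],
        [80, 60, 75, 70, 70],
    )
}

# Values shared across layouts, as quoted in the table row.
SHARED_PPO = {
    "gamma": 0.99,                  # discount factor
    "gae_lambda": 0.98,             # lambda
    "clip_range": 0.05,             # PPO clipping factor
    "vf_coef": 0.5,                 # value-function loss coefficient
    "max_grad_norm": 0.1,           # gradient clipping norm
    "timesteps_per_update": 48_000, # split into 10 mini-batches
    "num_minibatches": 10,
    "updates_per_generation": 10,   # to approximate the best-preferred strategy
}
```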