Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Tackling Cooperative Incompatibility for Zero-Shot Human-AI Coordination

Authors: Yang Li, Shao Zhang, Jichen Sun, Wenhao Zhang, Yali Du, Ying Wen, Xinbing Wang, Wei Pan

JAIR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Utilizing the COLE platform, we enlist 130 participants for human experiments. Our findings reveal a preference for our approach over state-of-the-art methods using a variety of subjective metrics. Moreover, objective experimental outcomes in the Overcooked game environment indicate that our method surpasses existing ones when coordinating with previously unencountered AI agents and the human proxy model."
Researcher Affiliation | Academia | Yang Li (EMAIL), The University of Manchester; Shao Zhang (EMAIL), Jichen Sun (EMAIL), Wenhao Zhang (EMAIL), Shanghai Jiao Tong University; Yali Du (EMAIL), King's College London; Ying Wen (EMAIL, corresponding author), Xinbing Wang (EMAIL), Shanghai Jiao Tong University; Wei Pan (EMAIL, corresponding author), The University of Manchester
Pseudocode | Yes |

Algorithm 1: Practical Algorithms
1: Input: population N_0; the sample times a, b of J_i, J_c; hyperparameters α, k; solver flag FLAG
2: for t = 1, 2, ... do
3:   /* Step 1: Completing the payoff matrix */
4:   M_n ← Simulator(N_t)
5:   /* Step 2: Solving the cooperative incompatibility distribution */
6:   if FLAG is SV then
7:     /* Selecting the Graphic Shapley Value solver */
8:     φ = GraphicShapleyValue(N_t) by Algorithm 2

Algorithm 2: Graphic Shapley Value Solver
1: Input: population N; the number of Monte Carlo permutation samples k; the size of the negative population
2: Initialize φ = 0^|N|
3: for 1, 2, ..., k do
4:   π ← uniformly sample from Π_C, where Π_C is the permutation set
5:   for i ∈ N do
6:     /* Obtain predecessors of player i in sampled permutation π */
7:     S_π(i) ← {j ∈ N | π(j) < π(i)}
8:     /* Update incompatibility weights */
9:     φ_i ← φ_i + (1/k) · (v(S_π(i) ∪ {i}) − v(S_π(i)))
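Algorithm 2's Monte Carlo Shapley estimator can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name and the toy value function are my own, whereas the paper's `v` is defined over sub-populations of strategies.

```python
import random

def mc_shapley(players, v, k, seed=0):
    """Monte Carlo Shapley estimate: average each player's marginal
    contribution v(S ∪ {i}) - v(S) over k random permutations."""
    rng = random.Random(seed)
    phi = {i: 0.0 for i in players}
    for _ in range(k):
        perm = players[:]
        rng.shuffle(perm)            # uniformly sampled permutation pi
        pred = set()                 # predecessors S_pi(i) of player i
        for i in perm:
            # marginal contribution of i given its predecessors
            phi[i] += (v(pred | {i}) - v(pred)) / k
            pred.add(i)
    return phi

# Toy (hypothetical) value function: coalition value = squared size.
phi = mc_shapley([0, 1, 2], lambda S: len(S) ** 2, k=2000)
```

Because each sampled permutation's marginal contributions telescope, the estimates always sum exactly to v(N) − v(∅); with this symmetric toy value function each player converges to the same share.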
Open Source Code | Yes | "Our code and demo are publicly available at https://sites.google.com/view/cole-2023. The code of the platform can be found at https://github.com/liyang619/COLE-Platform."
Open Datasets | Yes | "Our paper implements the platform in the Overcooked environment (Carroll et al., 2020; Charakorn et al., 2020; Knott et al., 2021), a simulation environment for reinforcement learning derived from the Overcooked! 2 video game (Carroll et al., 2020). We use the human proxy model H_proxy proposed in (Carroll et al., 2020) as human proxy partners and the models trained with baselines and COLE_SV as expert partners."
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It describes an experimental setup in a simulated environment (Overcooked) and human experiments, but no predefined splits for a static dataset. For instance, it mentions: "Each participant played a sequence of 5 pairs of games, totaling 10 rounds, with 2 agents...", which describes the experimental design rather than a dataset split for model training and evaluation.
Hardware Specification | Yes | "We run and evaluate all our experiments on Linux servers, which include two types of nodes: 1) 1-GPU node with NVIDIA GeForce 3090 Ti 24G as GPU and AMD EPYC 7H12 64-Core Processor as CPU, 2) 2-GPU node with GeForce RTX 3090 24G as GPU and AMD Ryzen Threadripper 3970X 32-Core Processor as CPU."
Software Dependencies | No | "This paper utilizes Proximal Policy Optimization (PPO) (Schulman et al., 2017) as the oracle algorithm for our set of strategies N, which consists of convolutional neural network parameterized strategies. Each network is composed of 3 convolution layers with 25 filters and 3 fully-connected layers with 64 hidden neurons. ... We train and evaluate self-play and PBT based on the Human-Aware Reinforcement Learning repository (Carroll et al., 2020) and used Proximal Policy Optimization (PPO) (Schulman et al., 2017) as the RL algorithm."
Experiment Setup | Yes | "Each network is composed of 3 convolution layers with 25 filters and 3 fully-connected layers with 64 hidden neurons. To manage computational resources, we maintain a population size of 50 strategies. ... The learning rate for each layout is 2e-3, 1e-3, 6e-4, 8e-4, and 8e-4. The gamma γ is 0.99. The lambda λ is 0.98. The PPO clipping factor is 0.05. The VF coefficient is 0.5. The maximum gradient norm is 0.1. The total training time steps for each PPO update is 48000, divided into 10 mini-batches. The total numbers of generations for each layout are 80, 60, 75, 70, and 70, respectively. For each generation, we update 10 times to approximate the best-preferred strategy. The α is 1."