Learning Zero-Shot Cooperation with Humans, Assuming Humans Are Biased

Authors: Chao Yu, Jiaxuan Gao, Weilin Liu, Botian Xu, Hao Tang, Jiaqi Yang, Yu Wang, Yi Wu

ICLR 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate HSP on the Overcooked benchmark. Empirical results show that our HSP method produces higher rewards than baselines when cooperating with learned human models, manually scripted policies, and real humans. |
| Researcher Affiliation | Academia | 1 Tsinghua University, 2 UC Berkeley, 3 Shanghai Qi Zhi Institute |
| Pseudocode | Yes | Algorithm 1: Greedy Policy Selection |
| Open Source Code | No | We would suggest visiting https://sites.google.com/view/hsp-iclr for more information. The paper does not explicitly state that source code for the methodology is provided, nor does it link directly to a source code repository. |
| Open Datasets | Yes | Overcooked Game: Overcooked (Carroll et al., 2019), which is a fully observable two-player cooperative game. |
| Dataset Splits | No | The paper does not specify explicit training/validation/test dataset splits with percentages or counts, as it operates within a reinforcement learning environment rather than a static dataset. |
| Hardware Specification | No | The paper mentions inference being performed on 'CPUs' and 'a GPU' but does not specify any particular hardware models, types, or quantities (e.g., NVIDIA A100, Intel Xeon). |
| Software Dependencies | No | The paper mentions using 'MAPPO' as the RL algorithm, but does not provide specific version numbers for any software dependencies like programming languages (e.g., Python 3.x) or deep learning frameworks (e.g., PyTorch 1.x). |
| Experiment Setup | Yes | Common hyperparameters for all methods in 5 layouts are listed in Table 8 and Table 9. Specifically, for MEP, we use the suggested hyperparameters from the original paper (Zhao et al., 2021). Detailed hyperparameters of MEP are shown in Table 10, where population entropy coef. adjusts the importance of the population entropy term. Detailed hyperparameters of TrajDiv are shown in Table 11, where traj. gamma is the discounting factor used in local action kernel and diversity coef. adjusts the importance of the diversity term. |
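
The Experiment Setup entry says that population entropy coef. adjusts the importance of the population entropy term. As a rough illustration only, the sketch below assumes the term is the Shannon entropy of the population's mean policy at a single state and that it is added to the task return with that coefficient as a weight. The function names, the list-based policy representation, and the additive form are all hypothetical; none of them is taken from the paper or from any released code.

```python
import math


def mean_policy(policies):
    """Average the action distributions of all policies at one state.

    `policies` is a list of probability vectors, one per population member
    (an assumed representation, chosen only for illustration).
    """
    n = len(policies)
    num_actions = len(policies[0])
    return [sum(p[a] for p in policies) / n for a in range(num_actions)]


def population_entropy(policies):
    """Shannon entropy (in nats) of the population's mean policy."""
    return -sum(q * math.log(q) for q in mean_policy(policies) if q > 0.0)


def weighted_objective(task_return, policies, entropy_coef):
    """Task return plus the entropy term scaled by the coefficient (assumed form)."""
    return task_return + entropy_coef * population_entropy(policies)


if __name__ == "__main__":
    diverse = [[1.0, 0.0], [0.0, 1.0]]    # two policies that never agree
    identical = [[1.0, 0.0], [1.0, 0.0]]  # two copies of the same policy
    print(population_entropy(diverse))
    print(population_entropy(identical))
    print(weighted_objective(1.0, diverse, entropy_coef=0.5))
```

Under these assumptions, a larger coefficient rewards populations whose members choose different actions, while a coefficient of zero reduces the objective to the task return alone.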