A Hierarchical Approach to Population Training for Human-AI Collaboration

Authors: Yi Loo, Chen Gong, Malika Meghjani

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our method is able to dynamically adapt to novel partners of different play styles and skill levels in the 2-player collaborative Overcooked game environment. We also conducted a human study in the same environment to test the effectiveness of our method when partnering with real human subjects.
Researcher Affiliation | Academia | Yi Loo, Chen Gong and Malika Meghjani, Singapore University of Technology and Design (SUTD), {yi_loo, chen_gong}@mymail.sutd.edu.sg, malika_meghjani@sutd.edu.sg
Pseudocode | Yes | Algorithm 1 HiPT Rollout
Open Source Code | Yes | Code is available at https://gitlab.com/marvl-hipt/hipt.
Open Datasets | Yes | We focus the evaluation of our method on the open-sourced two-player Overcooked environment by Carroll et al. [2019], based on the collaborative game Overcooked [Ghost Town Games, 2016]. (A minimal environment-loading sketch is given after the table.)
Dataset Splits | No | The paper describes training agents and evaluating them on novel partners and human subjects within the Overcooked environment. However, it does not provide specific train/validation/test dataset splits (e.g., percentages or counts) for the Overcooked environment itself, which is a game simulation rather than a traditional static dataset.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper describes the algorithms used (e.g., PPO) and training parameters, but does not provide specific software dependencies (e.g., library names with version numbers like PyTorch 1.9 or TensorFlow 2.x) needed to replicate the experiment.
Experiment Setup | Yes | Training and Implementation Details. We train HiPT on all five layouts for 10^8 environment steps. We set the upper and lower bounds of the execution length of the low-level policy, (p_min, p_max), to [20, 40] steps. For the high-level policy reward, we set the influence reward coefficient γ from (3) to 1000 and linearly anneal it to 1 over the entire training process, while the environment reward coefficient α is set to 1.
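
The quoted Experiment Setup passage maps onto a small set of hyperparameters. The Python sketch below restates them together with the linear annealing of the influence coefficient γ. Only the numeric values come from the quote; the variable names, the helper functions, and the assumed weighted-sum reward form α·r_env + γ·r_influence (inferred from how the two coefficients are described) are illustrative assumptions, not the authors' code.

```python
# Hypothetical restatement of the quoted HiPT training hyperparameters.
# Only the numeric values come from the paper; every name here is an assumption.

TOTAL_ENV_STEPS = 10 ** 8             # "all five layouts for 10^8 environment steps"
P_MIN, P_MAX = 20, 40                 # execution-length bounds of the low-level policy
ALPHA = 1.0                           # environment reward coefficient (fixed)
GAMMA_START, GAMMA_END = 1000.0, 1.0  # influence reward coefficient, annealed


def influence_coefficient(step: int) -> float:
    """Linearly anneal gamma from 1000 to 1 over the whole training run."""
    frac = min(step / TOTAL_ENV_STEPS, 1.0)
    return GAMMA_START + frac * (GAMMA_END - GAMMA_START)


def high_level_reward(env_reward: float, influence_reward: float, step: int) -> float:
    """Assumed weighted-sum form of the high-level reward:
    alpha * r_env + gamma(step) * r_influence."""
    return ALPHA * env_reward + influence_coefficient(step) * influence_reward
```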
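
The Pseudocode row refers to the paper's Algorithm 1 (HiPT Rollout), which is not reproduced here. As a rough illustration of a hierarchical rollout consistent with the quoted execution-length bounds, the sketch below has a high-level policy pick a low-level policy (play style) that then acts for between p_min and p_max steps before control returns to the high-level policy. All names, the uniform sampling of the execution length, and the single-agent-style environment interface are simplifying assumptions; consult Algorithm 1 in the paper for the actual procedure.

```python
import random


def hierarchical_rollout(env, high_level_policy, low_level_policies,
                         horizon=400, p_min=20, p_max=40):
    """Generic hierarchical rollout sketch (not the paper's Algorithm 1)."""
    obs = env.reset()
    trajectory = []
    t = 0
    while t < horizon:
        # High-level decision: which low-level policy (play style) to run next.
        option = high_level_policy.select(obs)
        low_level = low_level_policies[option]
        # Execution length sampled within the quoted bounds [p_min, p_max];
        # the actual mechanism in HiPT may differ.
        p = random.randint(p_min, p_max)
        for _ in range(p):
            action = low_level.act(obs)
            obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward))
            t += 1
            if done or t >= horizon:
                return trajectory
    return trajectory
```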
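
The Open Datasets row points to the open-source Overcooked environment of Carroll et al. [2019]. A minimal loading sketch, assuming the overcooked_ai_py package and one of its standard layouts ("cramped_room"), is shown below; the exact class and constructor names can differ between package versions, and the horizon value is illustrative.

```python
# Assumes the overcooked_ai_py package from the Carroll et al. [2019] release;
# class and constructor names may differ between package versions.
from overcooked_ai_py.mdp.overcooked_mdp import OvercookedGridworld
from overcooked_ai_py.mdp.overcooked_env import OvercookedEnv

mdp = OvercookedGridworld.from_layout_name("cramped_room")  # one standard layout
env = OvercookedEnv.from_mdp(mdp, horizon=400)              # illustrative horizon
env.reset()
start_state = env.state  # two-player OvercookedState after reset
```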