A Hierarchical Approach to Population Training for Human-AI Collaboration
Authors: Yi Loo, Chen Gong, Malika Meghjani
IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our method is able to dynamically adapt to novel partners of different play styles and skill levels in the 2-player collaborative Overcooked game environment. We also conducted a human study in the same environment to test the effectiveness of our method when partnering with real human subjects. |
| Researcher Affiliation | Academia | Yi Loo, Chen Gong and Malika Meghjani, Singapore University of Technology and Design (SUTD), {yi_loo, chen_gong}@mymail.sutd.edu.sg, malika_meghjani@sutd.edu.sg |
| Pseudocode | Yes | Algorithm 1 HiPT Rollout |
| Open Source Code | Yes | Code is available at https://gitlab.com/marvl-hipt/hipt. |
| Open Datasets | Yes | We focus the evaluation of our method on the open-sourced two-player Overcooked environment by Carroll et al. [2019] based on the collaborative game Overcooked [Ghost Town Games, 2016]. |
| Dataset Splits | No | The paper describes training agents and evaluating them on novel partners and human subjects within the Overcooked environment. However, it does not provide specific train/validation/test dataset splits (e.g., percentages or counts) for the Overcooked environment itself, which is a game simulation rather than a traditional static dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper describes the algorithms used (e.g., PPO) and training parameters, but does not provide specific software dependencies (e.g., library names with version numbers like PyTorch 1.9 or TensorFlow 2.x) needed to replicate the experiment. |
| Experiment Setup | Yes | Training and Implementation Details. We train HiPT on all five layouts for 10^8 environment steps. We set the upper and lower bounds of the execution length of the low-level policy, (p_min, p_max), to [20, 40] steps. For the high-level policy reward, we set the influence reward coefficient γ from (3) to 1000 and linearly anneal it to 1 over the entire training process, while the environment reward coefficient α is set to 1. |
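
The experiment-setup excerpt above can be summarized in a short sketch of the reported reward weighting: an influence coefficient annealed linearly from 1000 to 1 over 10^8 environment steps, combined with a fixed environment reward coefficient of 1. The snippet below is a minimal illustration assuming a combined reward of the form α·r_env + γ(t)·r_influence; all names and function signatures are illustrative assumptions, not identifiers from the authors' released code.

```python
# Minimal sketch of the reward weighting reported in the setup above.
# Names (TOTAL_ENV_STEPS, influence_coefficient, high_level_reward) are
# illustrative assumptions, not taken from the HiPT repository.

TOTAL_ENV_STEPS = 10**8               # total environment steps used for training
GAMMA_START, GAMMA_END = 1000.0, 1.0  # influence reward coefficient, annealed linearly
ALPHA = 1.0                           # environment reward coefficient (kept fixed)
P_MIN, P_MAX = 20, 40                 # execution-length bounds for the low-level policy


def influence_coefficient(step: int) -> float:
    """Linearly anneal gamma from GAMMA_START to GAMMA_END over training."""
    frac = min(step / TOTAL_ENV_STEPS, 1.0)
    return GAMMA_START + frac * (GAMMA_END - GAMMA_START)


def high_level_reward(env_reward: float, influence_reward: float, step: int) -> float:
    """Combine rewards as alpha * r_env + gamma(step) * r_influence (assumed form)."""
    return ALPHA * env_reward + influence_coefficient(step) * influence_reward
```

For example, at the start of training `high_level_reward(1.0, 0.1, 0)` weights the influence term by 1000, while near the end of training the same call is dominated by the environment reward, matching the annealing schedule described in the excerpt.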