A Hierarchical Approach to Population Training for Human-AI Collaboration

Authors: Yi Loo, Chen Gong, Malika Meghjani

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our method is able to dynamically adapt to novel partners of different play styles and skill levels in the 2-player collaborative Overcooked game environment. We also conducted a human study in the same environment to test the effectiveness of our method when partnering with real human subjects.
Researcher Affiliation | Academia | Yi Loo, Chen Gong and Malika Meghjani, Singapore University of Technology and Design (SUTD), {yi_loo, chen_gong}@mymail.sutd.edu.sg, malika_meghjani@sutd.edu.sg
Pseudocode | Yes | Algorithm 1 HiPT Rollout
Open Source Code | Yes | Code is available at https://gitlab.com/marvl-hipt/hipt.
Open Datasets | Yes | We focus the evaluation of our method on the open-sourced two-player Overcooked environment by Carroll et al. [2019], based on the collaborative game Overcooked [Ghost Town Games, 2016]. (A minimal environment-loading sketch is given after the table.)
Dataset Splits | No | The paper describes training agents and evaluating them on novel partners and human subjects within the Overcooked environment. However, it does not provide specific train/validation/test dataset splits (e.g., percentages or counts) for the Overcooked environment itself, which is a game simulation rather than a traditional static dataset.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper describes the algorithms used (e.g., PPO) and training parameters, but does not provide specific software dependencies (e.g., library names with version numbers like PyTorch 1.9 or TensorFlow 2.x) needed to replicate the experiment.
Experiment Setup | Yes | Training and Implementation Details. We train HiPT on all five layouts for 10^8 environment steps. We set the upper and lower bounds of the execution length of the low-level policy, (p_min, p_max), to [20, 40] steps. For the high-level policy reward, we set the influence reward coefficient γ from (3) to 1000 and linearly anneal it to 1 over the entire training process, while the environment reward coefficient α is set to 1.
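
The quoted Experiment Setup passage maps onto a small set of hyperparameters. The Python sketch below restates them together with the linear annealing of the influence coefficient γ. Only the numeric values come from the quote; the variable names, the helper functions, and the assumed weighted-sum reward form α·r_env + γ·r_influence (inferred from how the two coefficients are described) are illustrative assumptions, not the authors' code.

```python
# Hypothetical restatement of the quoted HiPT training hyperparameters.
# Only the numeric values come from the paper; every name here is an assumption.

TOTAL_ENV_STEPS = 10 ** 8             # "all five layouts for 10^8 environment steps"
P_MIN, P_MAX = 20, 40                 # execution-length bounds of the low-level policy
ALPHA = 1.0                           # environment reward coefficient (fixed)
GAMMA_START, GAMMA_END = 1000.0, 1.0  # influence reward coefficient, annealed


def influence_coefficient(step: int) -> float:
    """Linearly anneal gamma from 1000 to 1 over the whole training run."""
    frac = min(step / TOTAL_ENV_STEPS, 1.0)
    return GAMMA_START + frac * (GAMMA_END - GAMMA_START)


def high_level_reward(env_reward: float, influence_reward: float, step: int) -> float:
    """Assumed weighted-sum form of the high-level reward:
    alpha * r_env + gamma(step) * r_influence."""
    return ALPHA * env_reward + influence_coefficient(step) * influence_reward
```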
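
The Pseudocode row refers to the paper's Algorithm 1 (HiPT Rollout), which is not reproduced here. As a rough illustration of a hierarchical rollout consistent with the quoted execution-length bounds, the sketch below has a high-level policy pick a low-level policy (play style) that then acts for between p_min and p_max steps before control returns to the high-level policy. All names, the uniform sampling of the execution length, and the single-agent-style environment interface are simplifying assumptions; consult Algorithm 1 in the paper for the actual procedure.

```python
import random


def hierarchical_rollout(env, high_level_policy, low_level_policies,
                         horizon=400, p_min=20, p_max=40):
    """Generic hierarchical rollout sketch (not the paper's Algorithm 1)."""
    obs = env.reset()
    trajectory = []
    t = 0
    while t < horizon:
        # High-level decision: which low-level policy (play style) to run next.
        option = high_level_policy.select(obs)
        low_level = low_level_policies[option]
        # Execution length sampled within the quoted bounds [p_min, p_max];
        # the actual mechanism in HiPT may differ.
        p = random.randint(p_min, p_max)
        for _ in range(p):
            action = low_level.act(obs)
            obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward))
            t += 1
            if done or t >= horizon:
                return trajectory
    return trajectory
```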
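
The Open Datasets row points to the open-source Overcooked environment of Carroll et al. [2019]. A minimal loading sketch, assuming the overcooked_ai_py package and one of its standard layouts ("cramped_room"), is shown below; the exact class and constructor names can differ between package versions, and the horizon value is illustrative.

```python
# Assumes the overcooked_ai_py package from the Carroll et al. [2019] release;
# class and constructor names may differ between package versions.
from overcooked_ai_py.mdp.overcooked_mdp import OvercookedGridworld
from overcooked_ai_py.mdp.overcooked_env import OvercookedEnv

mdp = OvercookedGridworld.from_layout_name("cramped_room")  # one standard layout
env = OvercookedEnv.from_mdp(mdp, horizon=400)              # illustrative horizon
env.reset()
start_state = env.state  # two-player OvercookedState after reset
```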