Maximum Entropy Population-Based Training for Zero-Shot Human-AI Coordination

Authors: Rui Zhao, Jinming Song, Yufeng Yuan, Haifeng Hu, Yang Gao, Yi Wu, Zhongqian Sun, Wei Yang

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our method MEP, with comparison to Self-Play PPO (SP), Population-Based Training (PBT), Trajectory Diversity (TrajeDi), and Fictitious Co-Play (FCP) in both matrix game and Overcooked game environments, with partners being human proxy models and real humans.
Researcher Affiliation | Collaboration | Rui Zhao (1), Jinming Song (1), Yufeng Yuan (1), Haifeng Hu (1), Yang Gao (2), Yi Wu (2), Zhongqian Sun (1), Wei Yang (1); (1) Tencent AI Lab, (2) Tsinghua University
Pseudocode | Yes | Algorithm 1: Maximum Entropy Population
Open Source Code | Yes | Our code is available at https://github.com/ruizhaogit/maximum_entropy_population_based_training.
Open Datasets | Yes | Environment: We use the Overcooked environment (Carroll et al. 2019) as the human-AI coordination testbed, see Figure 2.
Dataset Splits | No | The paper describes training and evaluation, but does not explicitly specify a separate validation split.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9), needed to replicate the experiment.
Experiment Setup | Yes | First, we train the population using the PE bonus and investigate the effect of the entropy weight α. Secondly, we use the learned maximum entropy population to train the AI agent with the learning progress-based prioritized sampling and report the performance. ... We use the cumulative rewards over a horizon of 400 timesteps as the proxy for coordination ability... For all the results, we report the average reward per episode and the standard error across five different random seeds. ... By default, for TrajeDi, MPD, and MEP, we use a population size of 5. However, we use a population size of 10 for FCP in Figure 4a. ... Table 1: Population entropy with different α: In this table, α denotes the weight of the PE reward in Equation (9).
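
The Experiment Setup row above quotes two mechanisms from the paper: the population entropy (PE) bonus weighted by α (Equation (9)) and learning progress-based prioritized sampling of partners when training the AI agent. The short Python sketch below illustrates both ideas. It is not the authors' released implementation; the function names (pe_bonus, partner_sampling_probs), the temperature beta, and all numeric values are illustrative assumptions.

import numpy as np

def pe_bonus(action_probs_per_agent, alpha):
    """Assumed form of the PE bonus: alpha times the entropy of the mean
    action distribution of the population at one state."""
    mean_probs = np.mean(action_probs_per_agent, axis=0)  # average policy across the population
    return -alpha * np.sum(mean_probs * np.log(mean_probs + 1e-8))

def partner_sampling_probs(ego_returns, beta=1.0):
    """Assumed prioritization: softmax over negative normalized returns,
    so partners the agent coordinates with poorly are sampled more often."""
    r = np.asarray(ego_returns, dtype=float)
    scores = -beta * (r - r.min()) / (np.ptp(r) + 1e-8)
    weights = np.exp(scores)
    return weights / weights.sum()

if __name__ == "__main__":
    # Action distributions of a population of 5 policies over 6 actions at one
    # state (random placeholders, not data from the paper).
    probs = np.random.dirichlet(np.ones(6), size=5)
    print("PE bonus:", pe_bonus(probs, alpha=0.01))
    # Placeholder average episode returns of the AI agent with each partner.
    print("Partner sampling probs:", partner_sampling_probs([180.0, 120.0, 200.0, 90.0, 150.0]))

In the two-stage procedure quoted above, a bonus like pe_bonus would be added to each population member's reward while training the population, and a distribution like partner_sampling_probs would select training partners for the AI agent in the second stage.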