Maximum Entropy Population-Based Training for Zero-Shot Human-AI Coordination

Authors: Rui Zhao, Jinming Song, Yufeng Yuan, Haifeng Hu, Yang Gao, Yi Wu, Zhongqian Sun, Wei Yang

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our method MEP, with comparison to Self-Play PPO (SP), Population-Based Training (PBT), Trajectory Diversity (TrajeDi), and Fictitious Co-Play (FCP) in both matrix game and Overcooked game environments, with partners being human proxy models and real humans.
Researcher Affiliation | Collaboration | Rui Zhao (1), Jinming Song (1), Yufeng Yuan (1), Haifeng Hu (1), Yang Gao (2), Yi Wu (2), Zhongqian Sun (1), Wei Yang (1); (1) Tencent AI Lab, (2) Tsinghua University
Pseudocode | Yes | Algorithm 1: Maximum Entropy Population
Open Source Code | Yes | Our code is available at https://github.com/ruizhaogit/maximum_entropy_population_based_training.
Open Datasets | Yes | Environment: We use the Overcooked environment (Carroll et al. 2019) as the human-AI coordination testbed, see Figure 2.
Dataset Splits | No | The paper describes training and evaluation, but does not explicitly specify a separate validation split.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9), needed to replicate the experiment.
Experiment Setup | Yes | First, we train the population using the PE bonus and investigate the effect of the entropy weight α. Secondly, we use the learned maximum entropy population to train the AI agent with the learning progress-based prioritized sampling and report the performance. ... We use the cumulative rewards over a horizon of 400 timesteps as the proxy for coordination ability... For all the results, we report the average reward per episode and the standard error across five different random seeds. ... By default, for TrajeDi, MPD, and MEP, we use a population size of 5. However, we use a population size of 10 for FCP in Figure 4a. ... Table 1: Population entropy with different α: In this table, α denotes the weight of the PE reward in Equation (9).
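
The Experiment Setup row above quotes two mechanisms from the paper: the population entropy (PE) bonus weighted by α (Equation (9)) and learning progress-based prioritized sampling of partners when training the AI agent. The short Python sketch below illustrates both ideas. It is not the authors' released implementation; the function names (pe_bonus, partner_sampling_probs), the temperature beta, and all numeric values are illustrative assumptions.

import numpy as np

def pe_bonus(action_probs_per_agent, alpha):
    """Assumed form of the PE bonus: alpha times the entropy of the mean
    action distribution of the population at one state."""
    mean_probs = np.mean(action_probs_per_agent, axis=0)  # average policy across the population
    return -alpha * np.sum(mean_probs * np.log(mean_probs + 1e-8))

def partner_sampling_probs(ego_returns, beta=1.0):
    """Assumed prioritization: softmax over negative normalized returns,
    so partners the agent coordinates with poorly are sampled more often."""
    r = np.asarray(ego_returns, dtype=float)
    scores = -beta * (r - r.min()) / (np.ptp(r) + 1e-8)
    weights = np.exp(scores)
    return weights / weights.sum()

if __name__ == "__main__":
    # Action distributions of a population of 5 policies over 6 actions at one
    # state (random placeholders, not data from the paper).
    probs = np.random.dirichlet(np.ones(6), size=5)
    print("PE bonus:", pe_bonus(probs, alpha=0.01))
    # Placeholder average episode returns of the AI agent with each partner.
    print("Partner sampling probs:", partner_sampling_probs([180.0, 120.0, 200.0, 90.0, 150.0]))

In the two-stage procedure quoted above, a bonus like pe_bonus would be added to each population member's reward while training the population, and a distribution like partner_sampling_probs would select training partners for the AI agent in the second stage.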