Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Maximum Entropy Population-Based Training for Zero-Shot Human-AI Coordination
Authors: Rui Zhao, Jinming Song, Yufeng Yuan, Haifeng Hu, Yang Gao, Yi Wu, Zhongqian Sun, Wei Yang
AAAI 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our method MEP, with comparison to Self-Play PPO (SP), Population-Based Training (PBT), Trajectory Diversity (TrajeDi), and Fictitious Co-Play (FCP) in both matrix game and Overcooked game environments, with partners being human proxy models and real humans. |
| Researcher Affiliation | Collaboration | Rui Zhao¹, Jinming Song¹, Yufeng Yuan¹, Haifeng Hu¹, Yang Gao², Yi Wu², Zhongqian Sun¹, Wei Yang¹ (¹Tencent AI Lab, ²Tsinghua University) |
| Pseudocode | Yes | Algorithm 1: Maximum Entropy Population |
| Open Source Code | Yes | Our code is available at https://github.com/ruizhaogit/maximum_entropy_population_based_training. |
| Open Datasets | Yes | Environment: We use the Overcooked environment (Carroll et al. 2019) as the human-AI coordination testbed, see Figure 2. |
| Dataset Splits | No | The paper describes training and evaluation on datasets, but does not explicitly specify a separate validation split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9), needed to replicate the experiment. |
| Experiment Setup | Yes | First, we train the population using the PE bonus and investigate the effect of the entropy weight α. Secondly, we use the learned maximum entropy population to train the AI agent with the learning progress-based prioritized sampling and report the performance. ... We use the cumulative rewards over a horizon of 400 timesteps as the proxy for coordination ability... For all the results, we report the average reward per episode and the standard error across five different random seeds. ... By default, for TrajeDi, MPD, and MEP, we use a population size of 5. However, we use a population size of 10 for FCP in Figure 4a. ... Table 1: Population entropy with different α: In this table, α denotes the weight of the PE reward in Equation (9). |
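
The Experiment Setup row references the PE (population entropy) reward of Equation (9) and its weight α. Below is a minimal sketch of how such a bonus can be computed, assuming the shaped reward takes the form r + α·H(π̄(·|s)) with π̄ the mean policy of the population; the function and variable names are illustrative and are not taken from the authors' released code.

```python
# Minimal sketch of a population-entropy (PE) reward bonus, assuming the
# shaped reward r + alpha * H(mean policy); names are illustrative only.
import numpy as np

def population_entropy_bonus(action_probs: np.ndarray, alpha: float = 0.01) -> float:
    """Weighted entropy of the population's mean policy at one state.

    action_probs: (population_size, num_actions) array; each row is one
                  policy's action distribution at the current state.
    alpha:        weight of the PE reward (the alpha swept in Table 1).
    """
    mean_policy = action_probs.mean(axis=0)              # average the population's policies
    entropy = -np.sum(mean_policy * np.log(mean_policy + 1e-8))
    return alpha * entropy

# Usage: a population of 5 policies (the paper's default size) over 6 actions.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(6), size=5)                # (5, 6) action distributions
shaped_reward = 1.0 + population_entropy_bonus(probs)    # task reward plus PE bonus
```

Under this reading, each population member would receive the shaped reward during population training, pushing the population's average behavior toward high entropy, i.e., toward the diverse partner pool the AI agent is then trained against with prioritized sampling.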