Coach-Player Multi-agent Reinforcement Learning for Dynamic Team Composition

Authors: Bo Liu, Qiang Liu, Peter Stone, Animesh Garg, Yuke Zhu, Anima Anandkumar

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We design three benchmark environments with dynamic team compositions for evaluating the zero-shot generalization of our method against baselines. They include a resource collection task, a rescue game, and a set of customized StarCraft micromanagement tasks. We conduct focused ablation studies examining the design choices of COPA on the resource collection task and the rescue games, and further show that COPA applies to more challenging tasks like StarCraft. Results show comparable or even better performance against methods where players have full observation but no coach.
Researcher Affiliation | Collaboration | (1) Department of Computer Science, The University of Texas at Austin, Austin, USA; (2) University of Toronto, Toronto, Canada; (3) NVIDIA; (4) California Institute of Technology, Pasadena, USA.
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper does not provide any explicit statement or link indicating the availability of the source code for the described methodology.
Open Datasets | Yes | We design three benchmark environments with dynamic team compositions for evaluating the zero-shot generalization of our method against baselines. They include a resource collection task built on the multi-agent particle environment (Lowe et al., 2017), a multi-agent rescue game, and customized micromanagement tasks in StarCraft. We apply COPA on the more challenging StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al., 2019). (A minimal SMAC usage sketch appears after this table.)
Dataset Splits | No | The paper describes training and testing scenarios, including ranges for agent numbers and characteristics for data generation, but does not explicitly specify a distinct validation dataset split with percentages or counts.
Hardware Specification | No | The paper does not provide specific details about the hardware specifications (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) that are needed to reproduce the experiments.
Experiment Setup | Yes | We design three benchmark environments with dynamic team compositions for evaluating the zero-shot generalization of our method against baselines. ... To investigate how performance varies with T, we train with different T chosen from [2, 4, 8, 12, 16, 20, 24] in Figure 4(b). ... $\mathcal{L}_{\mathrm{var}}(\phi, \xi) = \lambda_1\,\mathbb{E}_{s_t, z^a_t, \zeta^a_t}\big[\log q_\xi(z^a_t \mid \zeta^a_t, s_t)\big] + \lambda_2\, H(z^a_t \mid s_t)$, where λ1 and λ2 are tunable coefficients. ... we propose an intuitive method that decides whether to distribute new strategies based on the ℓ2 distance of the old strategy to the new one. ... Here β is a manually specified threshold. (A sketch of this regularizer and the broadcast rule appears below the table.)
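
The variational regularizer and the distance-based broadcast rule quoted in the Experiment Setup row can be sketched in code. This is a minimal illustration only: it assumes a PyTorch implementation, a diagonal-Gaussian parameterization for both q_ξ and the coach's strategy distribution, and the sign convention that the regularizer is maximized (it has the form of a variational lower bound on mutual information). The paper's exact architecture and parameterization are not reproduced here, and all function and variable names below are hypothetical.

```python
import math
import torch

def gaussian_log_prob(z, mu, log_std):
    # log N(z; mu, diag(exp(log_std))^2), summed over the strategy-embedding dimension.
    var = torch.exp(2.0 * log_std)
    return (-0.5 * (z - mu) ** 2 / var - log_std - 0.5 * math.log(2.0 * math.pi)).sum(-1)

def gaussian_entropy(log_std):
    # Entropy of a diagonal Gaussian, summed over the strategy-embedding dimension.
    return (0.5 + 0.5 * math.log(2.0 * math.pi) + log_std).sum(-1)

def variational_strategy_loss(z, q_mu, q_log_std, coach_log_std, lam1, lam2):
    """Negated regularizer lam1 * E[log q_xi(z | zeta, s)] + lam2 * H(z | s):
    minimizing this loss maximizes the quantity quoted above.
    q_mu / q_log_std parameterize the variational decoder q_xi(z_t^a | zeta_t^a, s_t);
    coach_log_std parameterizes the coach's strategy distribution given s_t."""
    log_q = gaussian_log_prob(z, q_mu, q_log_std)   # variational reconstruction term
    ent = gaussian_entropy(coach_log_std)           # H(z_t^a | s_t) term
    return -(lam1 * log_q + lam2 * ent).mean()

def should_broadcast(z_old, z_new, beta):
    """Distribute a new strategy to a player only if it has drifted more than
    beta (in L2 distance) from the strategy the player last received."""
    return torch.linalg.norm(z_new - z_old, dim=-1) > beta
```

Here lam1, lam2, and beta correspond to the tunable coefficients λ1, λ2 and the manually specified threshold β mentioned in the paper.

The StarCraft experiments build on the openly available SMAC benchmark (Samvelyan et al., 2019). The snippet below shows a minimal interaction loop with an off-the-shelf SMAC map using a random policy as a stand-in for a learned one; the map name "3m" is a placeholder, since the paper's customized micromanagement scenarios are not specified in this summary.

```python
import numpy as np
from smac.env import StarCraft2Env  # https://github.com/oxwhirl/smac

# Placeholder map; the paper uses its own customized micromanagement scenarios.
env = StarCraft2Env(map_name="3m")
n_agents = env.get_env_info()["n_agents"]

env.reset()
terminated = False
while not terminated:
    # Pick a random available action per agent, standing in for a learned COPA policy.
    actions = []
    for agent_id in range(n_agents):
        avail = np.nonzero(env.get_avail_agent_actions(agent_id))[0]
        actions.append(np.random.choice(avail))
    reward, terminated, _info = env.step(actions)
env.close()
```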