LIIR: Learning Individual Intrinsic Reward in Multi-Agent Reinforcement Learning

Authors: Yali Du, Lei Han, Meng Fang, Ji Liu, Tianhong Dai, Dacheng Tao

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare LIIR with a number of state-of-the-art MARL methods on battle games in StarCraft II. The results demonstrate the effectiveness of LIIR, and we show LIIR can assign each individual agent an insightful intrinsic reward per time step. In this section, we first evaluate LIIR on a simple 1D pursuit game specifically designed for the considered settings to see whether LIIR can learn reasonable distinct intrinsic rewards. Then, we comprehensively study LIIR in several challenging micromanagement games in StarCraft II, and compare LIIR with a number of state-of-the-art MARL methods.
Researcher Affiliation | Collaboration | Yali Du (University College London, London, UK, yali.du@ucl.ac.uk); Lei Han (Tencent AI Lab, Shenzhen, Guangdong, China, leihan.cs@gmail.com); Meng Fang (Tencent Robotics X, Shenzhen, Guangdong, China, mfang@tencent.com); Tianhong Dai (Imperial College London, London, UK, tianhong.dai15@imperial.ac.uk); Ji Liu (Kwai Inc., Seattle, U.S.A., ji.liu.uwisc@gmail.com); Dacheng Tao (UBTECH Sydney AI Centre, The University of Sydney, NSW, Australia, dacheng.tao@sydney.edu.au)
Pseudocode | Yes | Algorithm 1 The optimization algorithm for LIIR. Input: policy learning rate α and intrinsic reward learning rate β. (A hedged sketch of this bi-level update appears after the table.)
Open Source Code | Yes | The source codes of LIIR are available through https://github.com/yalidu/liir.
Open Datasets | Yes | In this subsection, we comprehensively evaluate the proposed LIIR method in the game of StarCraft II based on the learning environment SC2LE [34] and mini-game settings in SMAC [35]. (A minimal environment-loading sketch appears after the table.)
Dataset Splits | No | The paper describes training and testing procedures but does not explicitly detail training, validation, or test dataset splits in terms of percentages, counts, or specific pre-defined partitions for data used within the environments.
Hardware Specification | Yes | We use 32 actors to generate the trajectories in parallel, and use one NVIDIA Tesla M40 GPU for training.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments.
Experiment Setup | Yes | All methods are trained for 3 million steps on 3M and 8M, and for 10 million steps on 2S3Z and 3S5Z. The hyper-parameter λ in (2) is set to 0.01 throughout the experiments (we tried different choices of λ while we found that the results did not differ much). We use a fixed learning rate of 5e-4 and use batches of 32 episodes for all the methods. (These values are collected into the configuration sketch after the table.)
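
The Pseudocode row refers to Algorithm 1, which alternates a policy update (learning rate α) with a meta-gradient update of the intrinsic-reward parameters (learning rate β), where each agent's proxy return mixes the extrinsic team reward with its learned intrinsic reward weighted by λ. Below is a minimal sketch of that bi-level update in PyTorch; the toy linear policy, the fake batch of transitions, and the single inner gradient step are illustrative assumptions, not the authors' implementation.

```python
import torch

torch.manual_seed(0)

obs_dim, n_actions = 4, 3
alpha, beta, lam = 0.5, 1e-3, 0.01   # alpha, beta from Algorithm 1; lam = lambda in Eq. (2)

# Functional linear policy so it can be re-evaluated with hypothetically
# updated parameters when taking the meta-gradient.
theta = torch.zeros(obs_dim, n_actions, requires_grad=True)  # policy parameters
eta = torch.zeros(obs_dim, 1, requires_grad=True)            # intrinsic-reward parameters

def log_prob(params, obs, actions):
    logits = obs @ params
    return torch.log_softmax(logits, dim=-1).gather(1, actions[:, None]).squeeze(1)

# Fake batch standing in for sampled trajectories of one agent.
obs = torch.randn(32, obs_dim)
actions = torch.randint(0, n_actions, (32,))
ret_ex = torch.randn(32)                       # extrinsic (team) returns

# 1) Proxy return mixes the extrinsic return with the learned intrinsic reward.
r_in = (obs @ eta).squeeze(1)
ret_proxy = ret_ex + lam * r_in

# 2) One policy-gradient step on the proxy objective, keeping the graph so the
#    step itself stays differentiable with respect to eta.
logp = log_prob(theta, obs, actions)
proxy_loss = -(ret_proxy * logp).mean()
(grad_theta,) = torch.autograd.grad(proxy_loss, theta, create_graph=True)
theta_new = theta - alpha * grad_theta         # hypothetically updated policy

# 3) Meta objective: extrinsic return under the updated policy; its gradient
#    with respect to eta flows back through the policy update above.
meta_loss = -(ret_ex * log_prob(theta_new, obs, actions)).mean()
(grad_eta,) = torch.autograd.grad(meta_loss, eta)

# 4) Apply both updates; Algorithm 1 alternates these steps every iteration.
with torch.no_grad():
    theta -= alpha * grad_theta
    eta -= beta * grad_eta
```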
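The Open Datasets row points to SC2LE and the SMAC micromanagement maps; the paper's 3M, 8M, 2S3Z and 3S5Z scenarios correspond to the SMAC map names 3m, 8m, 2s3z and 3s5z. The sketch below loads one such map with the smac package and runs a random-action episode as a placeholder for the actual MARL training loop.

```python
import random

from smac.env import StarCraft2Env  # pip install smac; requires a StarCraft II install

# "3m" is the SMAC name for the 3 Marines map referred to as 3M in the paper.
env = StarCraft2Env(map_name="3m")
n_agents = env.get_env_info()["n_agents"]

env.reset()
terminated = False
episode_return = 0.0
while not terminated:
    # Pick a random available action for each agent (placeholder policy).
    actions = []
    for agent_id in range(n_agents):
        avail = env.get_avail_agent_actions(agent_id)
        actions.append(random.choice([a for a, ok in enumerate(avail) if ok]))
    reward, terminated, info = env.step(actions)
    episode_return += reward

print("episode return:", episode_return)
env.close()
```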
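The Experiment Setup and Hardware Specification rows can be collected into a single hedged configuration. The key names below are illustrative assumptions; the values are the ones quoted from the paper.

```python
# Hedged training configuration assembled from the quoted setup; key names are
# illustrative, values are the ones reported in the paper.
LIIR_CONFIG = {
    "lambda_intrinsic": 0.01,        # weight lambda in Eq. (2)
    "learning_rate": 5e-4,           # fixed learning rate for all methods
    "batch_size_episodes": 32,       # batches of 32 episodes
    "n_parallel_actors": 32,         # actors generating trajectories in parallel
    "training_steps": {              # environment steps per map
        "3m": 3_000_000,
        "8m": 3_000_000,
        "2s3z": 10_000_000,
        "3s5z": 10_000_000,
    },
}
```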