LIIR: Learning Individual Intrinsic Reward in Multi-Agent Reinforcement Learning

Authors: Yali Du, Lei Han, Meng Fang, Ji Liu, Tianhong Dai, Dacheng Tao

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare LIIR with a number of state-of-the-art MARL methods on battle games in StarCraft II. The results demonstrate the effectiveness of LIIR, and we show LIIR can assign each individual agent an insightful intrinsic reward per time step. In this section, we first evaluate LIIR on a simple 1D pursuit game specifically designed for the considered settings to see whether LIIR can learn reasonable distinct intrinsic rewards. Then, we comprehensively study LIIR in several challenging micromanagement games in StarCraft II, and compare LIIR with a number of state-of-the-art MARL methods.
Researcher Affiliation | Collaboration | Yali Du (University College London, London, UK, yali.du@ucl.ac.uk); Lei Han (Tencent AI Lab, Shenzhen, Guangdong, China, leihan.cs@gmail.com); Meng Fang (Tencent Robotics X, Shenzhen, Guangdong, China, mfang@tencent.com); Tianhong Dai (Imperial College London, London, UK, tianhong.dai15@imperial.ac.uk); Ji Liu (Kwai Inc., Seattle, U.S.A., ji.liu.uwisc@gmail.com); Dacheng Tao (UBTECH Sydney AI Centre, The University of Sydney, NSW, Australia, dacheng.tao@sydney.edu.au)
Pseudocode | Yes | Algorithm 1 The optimization algorithm for LIIR. Input: policy learning rate α and intrinsic reward learning rate β. (A hedged sketch of this bi-level update appears after the table.)
Open Source Code | Yes | The source codes of LIIR are available through https://github.com/yalidu/liir.
Open Datasets | Yes | In this subsection, we comprehensively evaluate the proposed LIIR method in the game of StarCraft II based on the learning environment SC2LE [34] and mini-game settings in SMAC [35]. (A minimal environment-loading sketch appears after the table.)
Dataset Splits | No | The paper describes training and testing procedures but does not explicitly detail training, validation, or test dataset splits in terms of percentages, counts, or specific pre-defined partitions for data used within the environments.
Hardware Specification | Yes | We use 32 actors to generate the trajectories in parallel, and use one NVIDIA Tesla M40 GPU for training.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments.
Experiment Setup | Yes | All methods are trained for 3 million steps on 3M and 8M, and for 10 million steps on 2S3Z and 3S5Z. The hyper-parameter λ in (2) is set to 0.01 throughout the experiments (we tried different choices of λ while we found that the results did not differ much). We use a fixed learning rate of 5e-4 and use batches of 32 episodes for all the methods. (These values are collected into the configuration sketch after the table.)
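
The Pseudocode row refers to Algorithm 1, which alternates a policy update (learning rate α) with a meta-gradient update of the intrinsic-reward parameters (learning rate β), where each agent's proxy return mixes the extrinsic team reward with its learned intrinsic reward weighted by λ. Below is a minimal sketch of that bi-level update in PyTorch; the toy linear policy, the fake batch of transitions, and the single inner gradient step are illustrative assumptions, not the authors' implementation.

```python
import torch

torch.manual_seed(0)

obs_dim, n_actions = 4, 3
alpha, beta, lam = 0.5, 1e-3, 0.01   # alpha, beta from Algorithm 1; lam = lambda in Eq. (2)

# Functional linear policy so it can be re-evaluated with hypothetically
# updated parameters when taking the meta-gradient.
theta = torch.zeros(obs_dim, n_actions, requires_grad=True)  # policy parameters
eta = torch.zeros(obs_dim, 1, requires_grad=True)            # intrinsic-reward parameters

def log_prob(params, obs, actions):
    logits = obs @ params
    return torch.log_softmax(logits, dim=-1).gather(1, actions[:, None]).squeeze(1)

# Fake batch standing in for sampled trajectories of one agent.
obs = torch.randn(32, obs_dim)
actions = torch.randint(0, n_actions, (32,))
ret_ex = torch.randn(32)                       # extrinsic (team) returns

# 1) Proxy return mixes the extrinsic return with the learned intrinsic reward.
r_in = (obs @ eta).squeeze(1)
ret_proxy = ret_ex + lam * r_in

# 2) One policy-gradient step on the proxy objective, keeping the graph so the
#    step itself stays differentiable with respect to eta.
logp = log_prob(theta, obs, actions)
proxy_loss = -(ret_proxy * logp).mean()
(grad_theta,) = torch.autograd.grad(proxy_loss, theta, create_graph=True)
theta_new = theta - alpha * grad_theta         # hypothetically updated policy

# 3) Meta objective: extrinsic return under the updated policy; its gradient
#    with respect to eta flows back through the policy update above.
meta_loss = -(ret_ex * log_prob(theta_new, obs, actions)).mean()
(grad_eta,) = torch.autograd.grad(meta_loss, eta)

# 4) Apply both updates; Algorithm 1 alternates these steps every iteration.
with torch.no_grad():
    theta -= alpha * grad_theta
    eta -= beta * grad_eta
```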
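The Open Datasets row points to SC2LE and the SMAC micromanagement maps; the paper's 3M, 8M, 2S3Z and 3S5Z scenarios correspond to the SMAC map names 3m, 8m, 2s3z and 3s5z. The sketch below loads one such map with the smac package and runs a random-action episode as a placeholder for the actual MARL training loop.

```python
import random

from smac.env import StarCraft2Env  # pip install smac; requires a StarCraft II install

# "3m" is the SMAC name for the 3 Marines map referred to as 3M in the paper.
env = StarCraft2Env(map_name="3m")
n_agents = env.get_env_info()["n_agents"]

env.reset()
terminated = False
episode_return = 0.0
while not terminated:
    # Pick a random available action for each agent (placeholder policy).
    actions = []
    for agent_id in range(n_agents):
        avail = env.get_avail_agent_actions(agent_id)
        actions.append(random.choice([a for a, ok in enumerate(avail) if ok]))
    reward, terminated, info = env.step(actions)
    episode_return += reward

print("episode return:", episode_return)
env.close()
```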
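The Experiment Setup and Hardware Specification rows can be collected into a single hedged configuration. The key names below are illustrative assumptions; the values are the ones quoted from the paper.

```python
# Hedged training configuration assembled from the quoted setup; key names are
# illustrative, values are the ones reported in the paper.
LIIR_CONFIG = {
    "lambda_intrinsic": 0.01,        # weight lambda in Eq. (2)
    "learning_rate": 5e-4,           # fixed learning rate for all methods
    "batch_size_episodes": 32,       # batches of 32 episodes
    "n_parallel_actors": 32,         # actors generating trajectories in parallel
    "training_steps": {              # environment steps per map
        "3m": 3_000_000,
        "8m": 3_000_000,
        "2s3z": 10_000_000,
        "3s5z": 10_000_000,
    },
}
```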