Explicitly Coordinated Policy Iteration

Authors: Yujing Hu, Yingfeng Chen, Changjie Fan, Jianye Hao

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments in matrix games (from 2-agent 2-action games to 5-agent 20-action games) and stochastic games (from 2-agent to 5-agent games) show that EXCEL outperforms the state-of-the-art algorithms, e.g., faster convergence and better coordination.
Researcher Affiliation | Collaboration | Yujing Hu, Yingfeng Chen, and Changjie Fan (Fuxi AI Lab in Netease); Jianye Hao (Tianjin University). Emails: {huyujing, chenyingfeng1, fanchangjie}@corp.netease.com; jianye.hao@tju.edu.cn
Pseudocode | Yes | Algorithm 1: Explicitly Coordinated Policy Iteration
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository.
Open Datasets | No | The paper describes the matrix games and grid-world games used for the experiments, including their setup and characteristics (e.g., map sizes, number of agents), but it does not point to an existing, publicly available dataset with concrete access information (link, DOI, or formal citation).
Dataset Splits | No | The paper describes training periods and evaluation game plays, but it does not specify explicit training, validation, and test splits with percentages or sample counts for reproducibility.
Hardware Specification | No | The paper does not describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | All algorithms except HYQ use a learning rate of 0.2; the positive and negative learning rates of HYQ are 0.05 and 0.02, respectively; the frequency learning rate of RFMQ is 0.01; the complementary factor of EXCEL increases from 0 to 1 in increments of 0.001. EXCEL, HYQ, and RFMQ adopt ϵ-greedy exploration with ϵ decaying exponentially from 1.0 to 0.1 by a factor of 0.99977; LMRL2 uses Boltzmann exploration with the temperature of each action decaying from 50 to 0.1 by a factor of 0.9, and its moderation factors for action selection and lenience are both 1.0. Network parameters are optimized by Adam with a learning rate of 10⁻³ and a batch size of 1024; the threshold δmin of the NT method goes from 20 to 0 over 10,000 episodes; the weight coefficient β of the WNTD method and the sampling ratio η of the NS method both decay from 1.0 to 10⁻⁴ over 100,000 episodes; the coefficient for negative updates of HDQN is 0.4; LDQN uses a retroactive temperature decay schedule with hyperparameters ρ = 0.01, d = 0.95, µ = 0.9995, and v = 1. For exploration, all algorithms use an ϵ-greedy strategy with ϵ decaying from 1 to 0.2 over 60,000 episodes.
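
For readers reimplementing these settings, the sketch below collects the reported hyperparameters into a plain Python configuration and illustrates the two exploration schedules. All names (MATRIX_GAME_CONFIG, GRID_WORLD_CONFIG, exponential_decay, linear_anneal) are hypothetical, and the grid-world ϵ schedule is assumed to be linear because the paper only states the start value, end value, and episode count; this is a sketch, not the authors' code.

```python
# Hypothetical configuration sketch assembled from the hyperparameters quoted
# above. Names and structure are illustrative, not taken from any released code.

MATRIX_GAME_CONFIG = {
    "learning_rate": 0.2,             # all algorithms except HYQ
    "hyq_lr_positive": 0.05,
    "hyq_lr_negative": 0.02,
    "rfmq_frequency_lr": 0.01,
    "excel_complementary_factor": {"start": 0.0, "end": 1.0, "increment": 0.001},
    # EXCEL, HYQ, RFMQ: exponential epsilon decay
    "epsilon": {"start": 1.0, "end": 0.1, "decay_factor": 0.99977},
    # LMRL2: Boltzmann exploration, per-action temperature decay
    "lmrl2_temperature": {"start": 50.0, "end": 0.1, "decay_factor": 0.9},
}

GRID_WORLD_CONFIG = {
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "batch_size": 1024,
    "hdqn_negative_update_coeff": 0.4,
    "ldqn": {"rho": 0.01, "d": 0.95, "mu": 0.9995, "v": 1},
    # Paper states 1.0 -> 0.2 over 60,000 episodes; the linear shape is an assumption.
    "epsilon": {"start": 1.0, "end": 0.2, "anneal_episodes": 60_000},
}


def exponential_decay(start: float, end: float, factor: float, step: int) -> float:
    """Exponential schedule: value = max(end, start * factor ** step)."""
    return max(end, start * factor ** step)


def linear_anneal(start: float, end: float, total_steps: int, step: int) -> float:
    """Linear schedule (assumed) for the grid-world epsilon."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)


if __name__ == "__main__":
    eps = exponential_decay(1.0, 0.1, 0.99977, step=5_000)
    print(f"matrix-game epsilon after 5,000 steps: {eps:.3f}")
    print(f"grid-world epsilon after 30,000 episodes: "
          f"{linear_anneal(1.0, 0.2, 60_000, 30_000):.2f}")
```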