RiskQ: Risk-sensitive Multi-Agent Reinforcement Learning Value Factorization

Authors: Siqi Shen, Chennan Ma, Chao Li, Weiquan Liu, Yongquan Fu, Songzhu Mei, Xinwang Liu, Cheng Wang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that RiskQ can obtain promising performance through extensive experiments. The source code of RiskQ is available at https://github.com/xmu-rl-3dv/RiskQ.
Researcher Affiliation | Academia | (a) Fujian Key Laboratory of Sensing and Computing for Smart Cities, School of Informatics, Xiamen University (XMU), China; (b) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, XMU, China; (c) School of Computer, National University of Defense Technology, China
Pseudocode | Yes | Algorithm 1: The RiskQ Algorithm
Open Source Code | Yes | The source code of RiskQ is available at https://github.com/xmu-rl-3dv/RiskQ.
Open Datasets | Yes | We study the performance of RiskQ on risk-sensitive games (multi-agent cliff and car-following games) and the StarCraft II MARL tasks [16].
Dataset Splits | No | The paper mentions comparing performance against baselines, but does not specify explicit training/validation/test dataset splits with percentages or sample counts.
Hardware Specification | Yes | Experiments are carried out on a cluster consisting of multiple NVIDIA GeForce RTX 3090 GPUs.
Software Dependencies | No | The paper mentions software components such as the PyMARL framework, the RMSProp optimizer, QR-DQN, and TD(λ) learning, but does not provide specific version numbers for these software dependencies (e.g., PyMARL version X.Y, TensorFlow version A.B).
Experiment Setup | Yes | For RiskQ, unless otherwise specified, the following default configuration is adopted: Wang0.75 is used as the risk measure. QR-DQN is used to model each agent's stochastic utility, and the quantile number is set to 32. The RMSProp optimizer is employed with a learning rate of 0.001. Batch size and buffer size are set to 32 and 5000, respectively. RiskQ uses TD(λ) learning with λ = 0.6. The ϵ used in ϵ-greedy is annealed from 1 to 0.05 within 100K time steps.
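
The reported defaults map naturally onto a small experiment configuration. The Python sketch below is a minimal, hypothetical rendering of those settings, together with the linear ϵ-greedy schedule implied by "annealed from 1 to 0.05 within 100K time steps". The key names (risk_measure, epsilon_anneal_steps, etc.) and the helper epsilon_at are our own illustrative assumptions, not the actual RiskQ or PyMARL configuration keys.

    # Sketch of the reported RiskQ default configuration (key names are assumptions).
    DEFAULT_CONFIG = {
        "risk_measure": "Wang",       # Wang risk measure ...
        "risk_param": 0.75,           # ... with distortion parameter 0.75 (Wang0.75)
        "utility_model": "QR-DQN",    # models each agent's stochastic utility
        "n_quantiles": 32,
        "optimizer": "RMSProp",
        "learning_rate": 1e-3,
        "batch_size": 32,
        "buffer_size": 5000,
        "td_lambda": 0.6,             # TD(lambda) learning
        "epsilon_start": 1.0,
        "epsilon_finish": 0.05,
        "epsilon_anneal_steps": 100_000,
    }

    def epsilon_at(step: int, cfg: dict = DEFAULT_CONFIG) -> float:
        """Linear epsilon-greedy schedule: 1.0 -> 0.05 over the first 100K steps."""
        frac = min(step / cfg["epsilon_anneal_steps"], 1.0)
        return cfg["epsilon_start"] + frac * (cfg["epsilon_finish"] - cfg["epsilon_start"])

    if __name__ == "__main__":
        # e.g. epsilon_at(0) == 1.0, epsilon_at(50_000) == 0.525, epsilon_at(100_000) == 0.05
        print(epsilon_at(0), epsilon_at(50_000), epsilon_at(100_000))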