Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning
Authors: Hao-Lun Hsu, Weixin Wang, Miroslav Pajic, Pan Xu
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposed method on multiple parallel RL environments, including a deep exploration problem (i.e., N-chain), a video game, and a real-world problem in energy systems. Our experimental results support that our framework can achieve better performance, even under conditions of misspecified transition models. |
| Researcher Affiliation | Academia | Hao-Lun Hsu, Weixin Wang, Miroslav Pajic, Pan Xu, Duke University, {hao-lun.hsu,weixin.wang,miroslav.pajic,pan.xu}@duke.edu |
| Pseudocode | Yes | A unified algorithm framework is presented in Algorithm 1, where each agent executes Least-Squares Value Iteration (LSVI) in parallel and makes decisions based on collective data obtained from communication between each agent and the server. (A hedged code sketch of this parallel loop is given after the table.) |
| Open Source Code | Yes | The implementation of this work can be found at https://github.com/panxulab/MARL-CoopTS |
| Open Datasets | Yes | We evaluate our proposed method on multiple parallel RL environments, including a deep exploration problem (i.e., N-chain), a video game, and a real-world problem in energy systems. |
| Dataset Splits | No | The paper does not explicitly specify training, validation, and test splits for the datasets. It describes episodic reinforcement learning settings but not data partitioning for supervised learning. |
| Hardware Specification | Yes | Note that we run all our experiments on Nvidia RTX A5000 with 24GB RAM. |
| Software Dependencies | No | The paper mentions software components like "deep Q-networks (DQNs)", "Adam SGLD", "PyTorch", and "ReLU" but does not specify their version numbers. |
| Experiment Setup | Yes | We list the details of all swept hyper-parameters in N-chain for PHE and LMC in Table 2 and Table 3, respectively. Specifically, PHE is trained with reward noise ϵ_h^{k,l,n} = 10^{-2} and regularizer noise ξ_h^{k,n} = 10^{-3} in (3.5), and LMC is trained with β_{m,k} = 10^2 in (3.7) and optimized by Adam SGLD [33] with α_1 = 0.9, α_2 = 0.999, and bias factor α = 0.1. The final hyper-parameters used in N-chain are presented in Table 4. ... The detailed hyper-parameters for the Super Mario Bros task are presented in Table 5. ... The hyper-parameters we used are in Table 6. (An illustrative Adam SGLD update step is sketched after the table.) |
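
The Pseudocode row summarizes Algorithm 1: agents run least-squares value iteration in parallel while a central server pools the data they communicate. The snippet below is a minimal, illustrative sketch of that loop on a toy N-chain-like task, using perturbed regression targets as a loose stand-in for the paper's randomized-exploration mechanisms; the environment, feature map, noise scale, and names (`SimpleChainEnv`, `run_cooperative_lsvi`) are assumptions made for illustration, not the authors' released implementation.

```python
# Hedged sketch of a cooperative LSVI loop: M agents act in parallel,
# a server pools their transitions per step h, and exploration comes
# from perturbing the least-squares regression targets (loosely PHE-style).
import numpy as np

class SimpleChainEnv:
    """Toy N-chain-like episodic environment with sparse terminal reward."""
    def __init__(self, n_states=5, horizon=5, seed=0):
        self.n_states, self.horizon = n_states, horizon
        self.rng = np.random.default_rng(seed)
    def reset(self):
        self.s, self.h = 0, 0
        return self.s
    def step(self, action):
        # action 1 moves right toward the rewarding end state, action 0 resets.
        self.s = min(self.s + 1, self.n_states - 1) if action == 1 else 0
        self.h += 1
        reward = 1.0 if self.s == self.n_states - 1 else 0.0
        return self.s, reward, self.h >= self.horizon

def features(state, action, n_states, n_actions=2):
    # One-hot state-action features for the linear LSVI regression.
    phi = np.zeros(n_states * n_actions)
    phi[state * n_actions + action] = 1.0
    return phi

def run_cooperative_lsvi(n_agents=3, episodes=50, noise_scale=0.1, lam=1.0):
    envs = [SimpleChainEnv(seed=i) for i in range(n_agents)]
    n_states, horizon, n_actions = envs[0].n_states, envs[0].horizon, 2
    d = n_states * n_actions
    # Server-side pooled data: one buffer per step h, shared by all agents.
    buffers = [[] for _ in range(horizon)]
    rng = np.random.default_rng(123)

    for _ in range(episodes):
        # --- Server: backward LSVI pass over the pooled transitions ---
        weights = [np.zeros(d) for _ in range(horizon + 1)]
        for h in reversed(range(horizon)):
            A = lam * np.eye(d)
            b = np.zeros(d)
            for (s, a, r, s_next) in buffers[h]:
                phi = features(s, a, n_states)
                v_next = max(weights[h + 1] @ features(s_next, ap, n_states)
                             for ap in range(n_actions))
                # Randomized exploration: perturb the regression target.
                target = r + v_next + noise_scale * rng.standard_normal()
                A += np.outer(phi, phi)
                b += phi * target
            weights[h] = np.linalg.solve(A, b)
        # --- Agents: act greedily w.r.t. the shared perturbed Q-estimates,
        #     then send the new transitions back to the server. ---
        for m in range(n_agents):
            s = envs[m].reset()
            for h in range(horizon):
                q = [weights[h] @ features(s, a, n_states) for a in range(n_actions)]
                a = int(np.argmax(q))
                s_next, r, done = envs[m].step(a)
                buffers[h].append((s, a, r, s_next))
                s = s_next
                if done:
                    break
    return weights

if __name__ == "__main__":
    w = run_cooperative_lsvi()
    print("learned first-step weights:", np.round(w[0], 2))
```

The cooperative aspect here is only that all agents' transitions land in the same per-step buffers before the server's regression, so each agent benefits from the others' exploration.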
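The Experiment Setup row reports that the LMC-based agent is optimized with Adam SGLD [33] using α_1 = 0.9, α_2 = 0.999, bias factor 0.1, and inverse temperature β_{m,k} = 10^2. The sketch below shows one common form of such an adaptively biased SGLD update, where Adam-style moment estimates enter as a bias added to the gradient and Langevin noise is scaled by 1/β; the exact variant in [33] may differ, and the toy quadratic loss, step size, and name `adam_sgld_step` are illustrative assumptions.

```python
# Hedged sketch of an adaptively biased SGLD (Adam SGLD-style) update.
import numpy as np

def adam_sgld_step(theta, grad, state, lr=1e-3, alpha1=0.9, alpha2=0.999,
                   bias_factor=0.1, beta=1e2, eps=1e-8, rng=None):
    """One noisy parameter update; `state` carries the running moments."""
    rng = rng or np.random.default_rng()
    # Adam-style exponential moving averages of the gradient and its square.
    m = alpha1 * state["m"] + (1.0 - alpha1) * grad
    v = alpha2 * state["v"] + (1.0 - alpha2) * grad ** 2
    # Biased drift: raw gradient plus a scaled, preconditioned momentum term.
    drift = grad + bias_factor * m / (np.sqrt(v) + eps)
    # Langevin noise with variance 2 * lr / beta (beta = inverse temperature).
    noise = np.sqrt(2.0 * lr / beta) * rng.standard_normal(theta.shape)
    state["m"], state["v"] = m, v
    return theta - lr * drift + noise

if __name__ == "__main__":
    # Toy usage: sample around the minimum of the loss 0.5 * ||theta||^2.
    rng = np.random.default_rng(0)
    theta = np.ones(4)
    state = {"m": np.zeros(4), "v": np.zeros(4)}
    for _ in range(2000):
        grad = theta  # gradient of 0.5 * ||theta||^2
        theta = adam_sgld_step(theta, grad, state, lr=1e-2, beta=1e2, rng=rng)
    print("theta after 2000 noisy steps (should hover near 0):", np.round(theta, 3))
```

With β large the injected noise is small and the update behaves like a biased gradient step; smaller β increases the noise and hence the amount of Langevin-style exploration.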