Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning

Authors: Hao-Lun Hsu, Weixin Wang, Miroslav Pajic, Pan Xu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our proposed method on multiple parallel RL environments, including a deep exploration problem (i.e., N-chain), a video game, and a real-world problem in energy systems. Our experimental results support that our framework can achieve better performance, even under conditions of misspecified transition models.
Researcher Affiliation | Academia | Hao-Lun Hsu, Weixin Wang, Miroslav Pajic, Pan Xu, Duke University, {hao-lun.hsu,weixin.wang,miroslav.pajic,pan.xu}@duke.edu
Pseudocode | Yes | A unified algorithmic framework is presented in Algorithm 1, where each agent executes Least-Squares Value Iteration (LSVI) in parallel and makes decisions based on the collective data obtained from communication between each agent and the server. (A minimal sketch of this cooperative LSVI loop follows the table.)
Open Source Code | Yes | The implementation of this work can be found at https://github.com/panxulab/MARL-CoopTS
Open Datasets | Yes | We evaluate our proposed method on multiple parallel RL environments, including a deep exploration problem (i.e., N-chain), a video game, and a real-world problem in energy systems. (An illustrative N-chain sketch follows the table.)
Dataset Splits | No | The paper does not explicitly specify training, validation, and test splits for the datasets. It describes episodic reinforcement learning settings rather than data partitioning for supervised learning.
Hardware Specification | Yes | Note that we run all our experiments on Nvidia RTX A5000 with 24GB RAM.
Software Dependencies | No | The paper mentions software components such as deep Q-networks (DQNs), Adam SGLD, PyTorch, and ReLU, but does not specify their version numbers.
Experiment Setup | Yes | We list the details of all swept hyper-parameters in N-chain for PHE and LMC in Table 2 and Table 3, respectively. Specifically, PHE is trained with reward noise ϵ_h^{k,l,n} = 10^2 and regularizer noise ξ_h^{k,n} = 10^3 in (3.5), and LMC is trained with β_{m,k} = 10^2 in (3.7), optimized by Adam SGLD [33] with α_1 = 0.9, α_2 = 0.999 and bias factor α = 0.1. The final hyper-parameters used in N-chain are presented in Table 4. ... The detailed hyper-parameters for the Super Mario Bros task are presented in Table 5. ... The hyper-parameters we used are in Table 6. (An illustrative Adam SGLD step is sketched after the table.)
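
For readers who want a concrete picture of the framework summarized in the Pseudocode row, below is a minimal sketch, assuming a tabular setting: each agent appends its transitions to a buffer pooled by the server, and least-squares value iteration with perturbed rewards (a PHE-style randomization) is run on the shared data. The names `lsvi_phe`, `pooled`, and `noise_std` are illustrative and not taken from the released code, and the regularizer perturbation used by PHE in the paper is omitted for brevity.

```python
# Hypothetical sketch of cooperative LSVI with perturbed-history exploration (PHE):
# all agents' transitions are pooled by a server, and each agent plans on the pooled data.
import numpy as np

def lsvi_phe(pooled, num_states, num_actions, horizon, noise_std=0.1, reg=1.0):
    """One planning round on pooled transitions (step, s, a, r, s_next) from all agents."""
    Q = np.zeros((horizon + 1, num_states, num_actions))
    for h in reversed(range(horizon)):                       # backward value iteration
        counts = np.full((num_states, num_actions), reg)     # ridge-style regularization
        targets = np.zeros((num_states, num_actions))
        for step, s, a, r, s_next in pooled:
            if step != h:
                continue
            noisy_r = r + np.random.normal(0.0, noise_std)   # randomization via reward noise
            targets[s, a] += noisy_r + Q[h + 1, s_next].max()
            counts[s, a] += 1.0
        Q[h] = targets / counts                               # regularized least-squares solution
    return Q

# Toy usage: two agents each contributed one transition to the shared buffer.
pooled = [(0, 0, 1, 1.0, 2), (0, 0, 0, 0.0, 1)]
Q = lsvi_phe(pooled, num_states=3, num_actions=2, horizon=2)
greedy_action = Q[0, 0].argmax()
```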
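
The N-chain task named in the Research Type and Open Datasets rows is a standard deep-exploration benchmark. A minimal version is sketched below; the chain length, rewards, and horizon are assumptions for illustration and may not match the paper's configuration.

```python
# Minimal N-chain environment sketch (illustrative; parameters are not the paper's).
class NChain:
    def __init__(self, n=25, horizon=None):
        self.n = n
        self.horizon = horizon if horizon is not None else n + 9
        self.reset()

    def reset(self):
        self.state, self.t = 1, 0
        return self.state

    def step(self, action):
        # action 1 moves right toward the distant large reward; action 0 moves left.
        if action == 1:
            self.state = min(self.state + 1, self.n - 1)
        else:
            self.state = max(self.state - 1, 0)
        self.t += 1
        # Small reward at the leftmost state, large reward at the rightmost state:
        # greedy agents latch onto the small reward, so deep exploration is required.
        reward = 1.0 if self.state == self.n - 1 else (0.001 if self.state == 0 else 0.0)
        done = self.t >= self.horizon
        return self.state, reward, done
```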
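
The Experiment Setup row states that LMC training is optimized by Adam SGLD [33] with α_1 = 0.9, α_2 = 0.999, and bias factor 0.1. The snippet below is a rough sketch of one Adam-style SGLD step under those values; the exact update rule of [33] and the authors' implementation may differ, and the role assigned to the bias factor here is an assumption.

```python
# Rough sketch of an Adam-style SGLD update (not the exact rule of [33]); call it under
# torch.no_grad() if `param` tracks gradients.
import torch

def adam_sgld_step(param, grad, m, v, lr=1e-3, alpha1=0.9, alpha2=0.999,
                   bias_factor=0.1, beta=1e2, eps=1e-8):
    m.mul_(alpha1).add_(grad, alpha=1 - alpha1)               # first-moment estimate
    v.mul_(alpha2).addcmul_(grad, grad, value=1 - alpha2)     # second-moment estimate
    precond = 1.0 / (v.sqrt() + eps)                          # Adam-style preconditioner
    drift = grad + bias_factor * m                            # assumed use of the bias factor
    noise = torch.randn_like(param) * torch.sqrt(2 * lr * precond / beta)
    param.add_(-lr * precond * drift + noise)                 # Langevin step with injected noise
    return param, m, v
```

Here `beta` plays the role of the inverse temperature that scales the injected Gaussian noise, analogous to β_{m,k} in (3.7).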