On the Hardness of Constrained Cooperative Multi-Agent Reinforcement Learning
Authors: Ziyi Chen, Yi Zhou, Heng Huang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Appendix A (Numeric Examples and Experiments): In this section, we implement the primal-dual algorithm (Algorithm 1) and the primal algorithm (Algorithm 2) on the following two numeric examples to verify Theorems 4 and 5. ... Appendix K (Experiment on Constrained Grid-World): We slightly adapt the constrained grid-world task (Diddigi et al., 2019) where two agents explore the 4 × 4 grid-world in Figure 2. ... We implement these algorithms for 100 iterations. ... We plot the learning curves of $V_0(\pi_t)$ and $V_1(\pi_t)$ in Figure 3. ... Figure 1: Results of the primal algorithm on Examples 1 (the first 5 figures) and 2 (the last figure). Figure 3: Results on the constrained grid task with constraint $V_1(\pi_t) \le 1$. Figure 4: Results on the constrained grid task with constraint $V_1(\pi_t) \le 1.1$. |
| Researcher Affiliation | Academia | Ziyi Chen, Heng Huang Department of Computer Science University of Maryland College Park, MD 20742, USA {zc286,heng}@umd.edu Yi Zhou Department of Electrical and Computer Engineering University of Utah Salt Lake City, UT 84112, USA yi.zhou@utah.edu |
| Pseudocode | Yes | Algorithm 1 Primal-Dual Algorithm ... Algorithm 2 Decentralized Primal Algorithm |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | No | The paper uses 'numeric examples' (Example 1, Example 2) and a 'constrained grid-world task' which are described within the paper itself. It does not provide concrete access information (link, DOI, repository, or formal citation) for a publicly available or open dataset. |
| Dataset Splits | No | The paper describes numerical examples and a grid-world setup, but it does not specify any train/validation/test dataset splits (percentages or counts) or reference predefined splits with citations for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, cloud resources) used to run the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or specific solvers with versions) used for the experiments. |
| Experiment Setup | Yes | Specifically, in Algorithm 1, we use 50 value iterations to obtain the greedy policy $\pi_t$, exactly evaluate $\widehat{V}_k(\pi_t) = V_k(\pi_t) = \frac{1}{1-\gamma} \sum_{s,a} r_k(s,a)\,\nu_{\pi_t}(s,a)$, where the occupation measure $\nu_{\pi_t}(s,a)$ is known to be the stationary distribution of the mixed transition kernel $P_\rho(\cdot \mid s,a) := \gamma P(\cdot \mid s,a) + (1-\gamma)\rho(\cdot)$, and update the multipliers with stepsize $\beta = 1$ and threshold $\lambda_{1,\max} = 10$. In Algorithm 2, we also exactly evaluate $\widehat{V}_k(\pi_t) = V_k(\pi_t)$ and $\widehat{Q}^{(m)}_{k_t}(\pi_t; s, a^{(m)}) = Q^{(m)}_{k_t}(\pi_t; s, a^{(m)})$, and select stepsize $\alpha = 1$ and tolerance $\eta = 10^{-3}$. The CNAC algorithm essentially follows the primal-dual framework (Algorithm 1) except that the policy $\pi_t$ is updated with one projected stochastic policy gradient ascent step. ... Here, the projection set is the product policy space, and we use the exact policy gradient $\widehat{\nabla}_\pi L(\pi, \lambda_t) = \nabla_\pi L(\pi, \lambda_t)$ and select stepsize $\alpha = 0.2$. The update rule of the multipliers for the CNAC algorithm is the same as that for our primal-dual algorithm. |
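The exact evaluation described in the setup row — computing the occupation measure as the stationary distribution of the mixed kernel $P_\rho(\cdot \mid s,a) = \gamma P(\cdot \mid s,a) + (1-\gamma)\rho(\cdot)$, then reading off $V_k(\pi) = \frac{1}{1-\gamma}\sum_{s,a} r_k(s,a)\,\nu_\pi(s,a)$ and performing a projected multiplier ascent step — can be sketched for a small tabular MDP as below. This is an illustrative reconstruction under assumed array shapes and function names, not the authors' code.

```python
import numpy as np

def occupation_measure(P, policy, rho, gamma):
    """Occupation measure nu_pi(s,a): the stationary distribution of the
    state-action chain induced by the mixed kernel
    P_rho(.|s,a) = gamma * P(.|s,a) + (1 - gamma) * rho(.).

    P:      (n_s, n_a, n_s) transition kernel P(s'|s,a)
    policy: (n_s, n_a) stochastic policy pi(a|s)
    rho:    (n_s,) initial state distribution
    """
    n_s, n_a = policy.shape
    # Transition matrix over state-action pairs:
    # T[(s,a), (s',a')] = P_rho(s'|s,a) * pi(a'|s')
    T = np.zeros((n_s * n_a, n_s * n_a))
    for s in range(n_s):
        for a in range(n_a):
            p_next = gamma * P[s, a] + (1 - gamma) * rho  # mixed kernel
            for s2 in range(n_s):
                for a2 in range(n_a):
                    T[s * n_a + a, s2 * n_a + a2] = p_next[s2] * policy[s2, a2]
    # Stationary distribution: left eigenvector of T for eigenvalue 1.
    evals, evecs = np.linalg.eig(T.T)
    nu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    nu = nu / nu.sum()  # normalize (also fixes the eigenvector's sign)
    return nu.reshape(n_s, n_a)

def value(r, nu, gamma):
    """V_k(pi) = (1/(1-gamma)) * sum_{s,a} r_k(s,a) * nu_pi(s,a)."""
    return float(nu.ravel() @ r.ravel()) / (1.0 - gamma)

def dual_update(lam, V1, threshold=1.0, beta=1.0, lam_max=10.0):
    """Projected multiplier ascent with stepsize beta and cap lam_max,
    matching the beta = 1, lambda_{1,max} = 10 choices in the setup row."""
    return float(np.clip(lam + beta * (V1 - threshold), 0.0, lam_max))
```

For a single-state MDP the occupation measure reduces to the policy itself, so `value` returns $r/(1-\gamma)$, which gives a quick sanity check of the normalization.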