On the Hardness of Constrained Cooperative Multi-Agent Reinforcement Learning
Authors: Ziyi Chen, Yi Zhou, Heng Huang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Appendix A (Numeric Examples and Experiments): In this section, we implement the primal-dual algorithm (Algorithm 1) and the primal algorithm (Algorithm 2) on the following two numeric examples to verify Theorems 4 and 5. ... Appendix K (Experiment on Constrained Grid-World): We slightly adapt the constrained grid-world task (Diddigi et al., 2019) where two agents explore the 4 × 4 grid-world in Figure 2. ... We implement these algorithms for 100 iterations. ... We plot the learning curves of $V_0(\pi_t)$ and $V_1(\pi_t)$ in Figure 3. ... Figure 1: Results of the primal algorithm on Examples 1 (the first 5 figures) and 2 (the last figure). Figure 3: Results on the constrained grid task with constraint $V_1(\pi_t) \le 1$. Figure 4: Results on the constrained grid task with constraint $V_1(\pi_t) \le 1.1$. |
| Researcher Affiliation | Academia | Ziyi Chen, Heng Huang Department of Computer Science University of Maryland College Park, MD 20742, USA {zc286,heng}@umd.edu Yi Zhou Department of Electrical and Computer Engineering University of Utah Salt Lake City, UT 84112, USA yi.zhou@utah.edu |
| Pseudocode | Yes | Algorithm 1 Primal-Dual Algorithm ... Algorithm 2 Decentralized Primal Algorithm |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | No | The paper uses 'numeric examples' (Example 1, Example 2) and a 'constrained grid-world task' which are described within the paper itself. It does not provide concrete access information (link, DOI, repository, or formal citation) for a publicly available or open dataset. |
| Dataset Splits | No | The paper describes numerical examples and a grid-world setup, but it does not specify any train/validation/test dataset splits (percentages or counts) or reference predefined splits with citations for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, cloud resources) used to run the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or specific solvers with versions) used for the experiments. |
| Experiment Setup | Yes | Specifically, in Algorithm 1, we use 50 value iterations to obtain the greedy policy $\pi_t$, exactly evaluate $\widehat{V}_k(\pi_t) = V_k(\pi_t) = \frac{1}{1-\gamma} \sum_{s,a} r_k(s,a)\,\nu_{\pi_t}(s,a)$, where the occupation measure $\nu_{\pi_t}(s,a)$ is known to be the stationary distribution of the mixed transition kernel $P_\rho(\cdot \mid s,a) := \gamma P(\cdot \mid s,a) + (1-\gamma)\rho(\cdot)$, and update the multipliers with stepsize $\beta = 1$ and threshold $\lambda_{1,\max} = 10$. In Algorithm 2, we also exactly evaluate $\widehat{V}_k(\pi_t) = V_k(\pi_t)$ and $\widehat{Q}^{(m)}_{k_t}(\pi_t; s, a^{(m)}) = Q^{(m)}_{k_t}(\pi_t; s, a^{(m)})$, and select stepsize $\alpha = 1$ and tolerance $\eta = 10^{-3}$. The CNAC algorithm essentially follows the primal-dual framework (Algorithm 1) except that the policy $\pi_t$ is updated with one projected stochastic policy gradient ascent step. ... Here, the projection set is the product policy space, and we use the exact policy gradient $\widehat{\nabla}_\pi L(\pi, \lambda_t) = \nabla_\pi L(\pi, \lambda_t)$ and select stepsize $\alpha = 0.2$. The update rule of the multipliers for the CNAC algorithm is the same as that for our primal-dual algorithm. |
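The exact evaluation described in the setup row — computing the occupation measure as the stationary distribution of the mixed kernel $P_\rho(\cdot \mid s,a) = \gamma P(\cdot \mid s,a) + (1-\gamma)\rho(\cdot)$, then reading off $V_k(\pi) = \frac{1}{1-\gamma}\sum_{s,a} r_k(s,a)\,\nu_\pi(s,a)$ and performing a projected multiplier ascent step — can be sketched for a small tabular MDP as below. This is an illustrative reconstruction under assumed array shapes and function names, not the authors' code.

```python
import numpy as np

def occupation_measure(P, policy, rho, gamma):
    """Occupation measure nu_pi(s,a): the stationary distribution of the
    state-action chain induced by the mixed kernel
    P_rho(.|s,a) = gamma * P(.|s,a) + (1 - gamma) * rho(.).

    P:      (n_s, n_a, n_s) transition kernel P(s'|s,a)
    policy: (n_s, n_a) stochastic policy pi(a|s)
    rho:    (n_s,) initial state distribution
    """
    n_s, n_a = policy.shape
    # Transition matrix over state-action pairs:
    # T[(s,a), (s',a')] = P_rho(s'|s,a) * pi(a'|s')
    T = np.zeros((n_s * n_a, n_s * n_a))
    for s in range(n_s):
        for a in range(n_a):
            p_next = gamma * P[s, a] + (1 - gamma) * rho  # mixed kernel
            for s2 in range(n_s):
                for a2 in range(n_a):
                    T[s * n_a + a, s2 * n_a + a2] = p_next[s2] * policy[s2, a2]
    # Stationary distribution: left eigenvector of T for eigenvalue 1.
    evals, evecs = np.linalg.eig(T.T)
    nu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    nu = nu / nu.sum()  # normalize (also fixes the eigenvector's sign)
    return nu.reshape(n_s, n_a)

def value(r, nu, gamma):
    """V_k(pi) = (1/(1-gamma)) * sum_{s,a} r_k(s,a) * nu_pi(s,a)."""
    return float(nu.ravel() @ r.ravel()) / (1.0 - gamma)

def dual_update(lam, V1, threshold=1.0, beta=1.0, lam_max=10.0):
    """Projected multiplier ascent with stepsize beta and cap lam_max,
    matching the beta = 1, lambda_{1,max} = 10 choices in the setup row."""
    return float(np.clip(lam + beta * (V1 - threshold), 0.0, lam_max))
```

For a single-state MDP the occupation measure reduces to the policy itself, so `value` returns $r/(1-\gamma)$, which gives a quick sanity check of the normalization.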