Solving Zero-Sum Markov Games with Continuous State via Spectral Dynamic Embedding

Authors: Chenhao Zhou, Zebang Shen, Chao Zhang, Hanbin Zhao, Hui Qian

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present two experiments to evaluate our methods. The first experiment focuses on a simple zero-sum Markov game featuring a continuous state space and a finite action space, aiming to validate the convergence of SDEPO. The second experiment adapts a multi-agent scenario inspired by the simple push [Lowe et al., 2017], where both the state and action spaces are continuous, to assess the effectiveness of SDEPO-NN.
Researcher Affiliation | Academia | Chenhao Zhou¹, Zebang Shen², Chao Zhang¹, Hanbin Zhao¹, Hui Qian¹,³. Affiliations: ¹College of Computer Science and Technology, Zhejiang University; ²Department of Computer Science, ETH Zurich; ³State Key Lab of CAD&CG, Zhejiang University. Emails: {zhouchenhao,zczju,zhaohanbin,qianhui}@zju.edu.cn; zebang.shen@inf.ethz.ch
Pseudocode | Yes | Algorithm 1: Random Features Generation; Algorithm 2: Nyström Features Generation; Algorithm 3: Spectral Dynamic Embedding Policy Optimization (SDEPO). (A generic sketch of random-feature generation is given after the table.)
Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We release code of our experiment and it can reproduce experimental results.
Open Datasets | Yes | Next, we conduct experiments on an adapted version of simple push [Lowe et al., 2017], wherein both the state and action spaces are continuous.
Dataset Splits | No | In the first experiment, we designed a simple zero-sum Markov game with a continuous state and finite action space (S = ℝ, |A| = 5). The state space is partitioned into 42 distinct intervals: one interval for (−∞, −10), 40 intervals evenly spaced by 0.5 units in the range [−10, 10), and one interval for (10, ∞). In the i-th interval, the transition dynamics are defined by P(s, a, b) = f(s, a, b) + ϵ, where ϵ ∼ N(0, 1), and f(s, a, b) = ϵ_{i,a,b}, with ϵ_{i,a,b} ∼ Unif(−10.5, 10.5). The reward function is r(s, a, b) = ϵ′_{i,a,b}, where ϵ′_{i,a,b} ∼ Unif(−1, 1). The initial state distribution is assumed to be uniform over [−10.5, 10.5].
Hardware Specification | No | The paper does not provide specific details on the hardware used for the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | In the first experiment, we designed a simple zero-sum Markov game with a continuous state and finite action space (S = ℝ, |A| = 5). The state space is partitioned into 42 distinct intervals... The transition dynamics are defined by P(s, a, b) = f(s, a, b) + ϵ, where ϵ ∼ N(0, 1), and f(s, a, b) = ϵ_{i,a,b}, with ϵ_{i,a,b} ∼ Unif(−10.5, 10.5). The reward function is r(s, a, b) = ϵ′_{i,a,b}, where ϵ′_{i,a,b} ∼ Unif(−1, 1). The initial state distribution is assumed to be uniform over [−10.5, 10.5]. We ran SDEPO for 120 iterations and measured the convergence of π by the metrics in Proposition 1. As shown in Figure 1, SDEPO with random features and with Nyström features both converge after 60 iterations. We discretized the state space of this environment and compared it with OFTRL [Zhang et al., 2022], a tabular method where the environment is known. We adopted the parameter settings recommended in [Zhang et al., 2022] and adjusted the environment to a 100-horizon setting. (A sketch of this environment is given after the table.)
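
The paper's Algorithms 1 and 2 are not reproduced in this report. As a point of reference for the pseudocode row above, the snippet below is a minimal sketch of random Fourier feature generation for a Gaussian kernel, the standard construction behind random-feature embeddings of this kind; the function name, feature dimension, bandwidth, and seed are illustrative assumptions rather than the paper's Algorithm 1 settings.

```python
import numpy as np

def make_random_features(dim, num_features=100, bandwidth=1.0, seed=0):
    """Sample random Fourier frequencies once and return a feature map phi(s).

    Generic sketch of random-feature generation for a Gaussian kernel;
    num_features, bandwidth, and seed are illustrative assumptions, not
    values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Frequencies drawn from the spectral density of the Gaussian kernel.
    omega = rng.normal(scale=1.0 / bandwidth, size=(num_features, dim))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=num_features)

    def phi(s):
        # Embed a state vector (or scalar state) into the random-feature space.
        s = np.atleast_1d(np.asarray(s, dtype=float))
        return np.sqrt(2.0 / num_features) * np.cos(omega @ s + phase)

    return phi
```

For the scalar-state environment above, phi = make_random_features(dim=1) yields a map that turns a state such as 0.3 into a 100-dimensional embedding, with the same frequencies reused across all calls.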
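
The synthetic game in the experiment-setup row is simple enough to sketch from the quoted description alone. The class below is a hedged reconstruction under those stated dynamics; the class and method names, the random seed, and the handling of the boundary at s = 10 are assumptions, and this is not the authors' released code.

```python
import numpy as np

class PiecewiseConstantGame:
    """Hedged sketch of the synthetic zero-sum Markov game described above.

    S = R is split into 42 intervals: (-inf, -10), forty bins of width 0.5
    covering [-10, 10), and (10, inf). Within interval i, the next state is
    f(s, a, b) + eps with eps ~ N(0, 1), where f(s, a, b) = eps_{i,a,b} is a
    fixed draw from Unif(-10.5, 10.5); the reward r(s, a, b) = eps'_{i,a,b}
    is a fixed draw from Unif(-1, 1). Names and the seed are illustrative.
    """

    def __init__(self, num_actions=5, seed=0):
        rng = np.random.default_rng(seed)
        self.num_intervals = 42
        # Interval-wise constants defining the dynamics and rewards.
        self.f = rng.uniform(-10.5, 10.5,
                             size=(self.num_intervals, num_actions, num_actions))
        self.r = rng.uniform(-1.0, 1.0,
                             size=(self.num_intervals, num_actions, num_actions))
        self.rng = rng

    def _interval(self, s):
        # Index 0 for s < -10, 1..40 for the 0.5-wide bins, 41 for s >= 10.
        if s < -10.0:
            return 0
        if s >= 10.0:
            return 41
        return 1 + int((s + 10.0) / 0.5)

    def reset(self):
        # Initial state drawn uniformly from [-10.5, 10.5].
        return self.rng.uniform(-10.5, 10.5)

    def step(self, s, a, b):
        i = self._interval(s)
        next_state = self.f[i, a, b] + self.rng.normal()
        return next_state, self.r[i, a, b]
```

A typical rollout would call env = PiecewiseConstantGame(), s = env.reset(), and then s, r = env.step(s, a, b) repeatedly with the two players' action indices a and b.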