Solving Zero-Sum Markov Games with Continuous State via Spectral Dynamic Embedding

Authors: Chenhao Zhou, Zebang Shen, Chao Zhang, Hanbin Zhao, Hui Qian

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present two experiments to evaluate our methods. The first experiment focuses on a simple zero-sum Markov game featuring a continuous state space and a finite action space, aiming to validate the convergence of SDEPO. The second experiment adapts a multi-agent scenario inspired by the simple push [Lowe et al., 2017], where both the state and action spaces are continuous, to assess the effectiveness of SDEPO-NN.
Researcher Affiliation | Academia | Chenhao Zhou¹, Zebang Shen², Chao Zhang¹, Hanbin Zhao¹, Hui Qian¹,³. Affiliations: ¹College of Computer Science and Technology, Zhejiang University; ²Department of Computer Science, ETH Zurich; ³State Key Lab of CAD&CG, Zhejiang University. Emails: {zhouchenhao,zczju,zhaohanbin,qianhui}@zju.edu.cn; zebang.shen@inf.ethz.ch
Pseudocode | Yes | Algorithm 1: Random Features Generation; Algorithm 2: Nyström Features Generation; Algorithm 3: Spectral Dynamic Embedding Policy Optimization (SDEPO). (A generic sketch of random-feature generation is given after the table.)
Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We release code of our experiment and it can reproduce experimental results.
Open Datasets | Yes | Next, we conduct experiments on an adapted version of simple push [Lowe et al., 2017], wherein both the state and action spaces are continuous.
Dataset Splits | No | In the first experiment, we designed a simple zero-sum Markov game with a continuous state and finite action space (S = ℝ, |A| = 5). The state space is partitioned into 42 distinct intervals: one interval for (−∞, −10), 40 intervals evenly spaced by 0.5 units in the range [−10, 10), and one interval for (10, ∞). In the i-th interval, the transition dynamics are defined by P(s, a, b) = f(s, a, b) + ϵ, where ϵ ∼ N(0, 1), and f(s, a, b) = ϵ_{i,a,b}, with ϵ_{i,a,b} ∼ Unif(−10.5, 10.5). The reward function is r(s, a, b) = ϵ′_{i,a,b}, where ϵ′_{i,a,b} ∼ Unif(−1, 1). The initial state distribution is assumed to be uniform over [−10.5, 10.5].
Hardware Specification | No | The paper does not provide specific details on the hardware used for the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | In the first experiment, we designed a simple zero-sum Markov game with a continuous state and finite action space (S = ℝ, |A| = 5). The state space is partitioned into 42 distinct intervals... The transition dynamics are defined by P(s, a, b) = f(s, a, b) + ϵ, where ϵ ∼ N(0, 1), and f(s, a, b) = ϵ_{i,a,b}, with ϵ_{i,a,b} ∼ Unif(−10.5, 10.5). The reward function is r(s, a, b) = ϵ′_{i,a,b}, where ϵ′_{i,a,b} ∼ Unif(−1, 1). The initial state distribution is assumed to be uniform over [−10.5, 10.5]. We ran SDEPO for 120 iterations and measured the convergence of π by the metrics in Proposition 1. As shown in Figure 1, SDEPO with random features and with Nyström features both converge after 60 iterations. We discretized the state space of this environment and compared it with OFTRL [Zhang et al., 2022], a tabular method where the environment is known. We adopted the parameter settings recommended in [Zhang et al., 2022] and adjusted the environment to a 100-horizon setting. (A sketch of this environment is given after the table.)
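
The paper's Algorithms 1 and 2 are not reproduced in this report. As a point of reference for the pseudocode row above, the snippet below is a minimal sketch of random Fourier feature generation for a Gaussian kernel, the standard construction behind random-feature embeddings of this kind; the function name, feature dimension, bandwidth, and seed are illustrative assumptions rather than the paper's Algorithm 1 settings.

```python
import numpy as np

def make_random_features(dim, num_features=100, bandwidth=1.0, seed=0):
    """Sample random Fourier frequencies once and return a feature map phi(s).

    Generic sketch of random-feature generation for a Gaussian kernel;
    num_features, bandwidth, and seed are illustrative assumptions, not
    values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Frequencies drawn from the spectral density of the Gaussian kernel.
    omega = rng.normal(scale=1.0 / bandwidth, size=(num_features, dim))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=num_features)

    def phi(s):
        # Embed a state vector (or scalar state) into the random-feature space.
        s = np.atleast_1d(np.asarray(s, dtype=float))
        return np.sqrt(2.0 / num_features) * np.cos(omega @ s + phase)

    return phi
```

For the scalar-state environment above, phi = make_random_features(dim=1) yields a map that turns a state such as 0.3 into a 100-dimensional embedding, with the same frequencies reused across all calls.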
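
The synthetic game in the experiment-setup row is simple enough to sketch from the quoted description alone. The class below is a hedged reconstruction under those stated dynamics; the class and method names, the random seed, and the handling of the boundary at s = 10 are assumptions, and this is not the authors' released code.

```python
import numpy as np

class PiecewiseConstantGame:
    """Hedged sketch of the synthetic zero-sum Markov game described above.

    S = R is split into 42 intervals: (-inf, -10), forty bins of width 0.5
    covering [-10, 10), and (10, inf). Within interval i, the next state is
    f(s, a, b) + eps with eps ~ N(0, 1), where f(s, a, b) = eps_{i,a,b} is a
    fixed draw from Unif(-10.5, 10.5); the reward r(s, a, b) = eps'_{i,a,b}
    is a fixed draw from Unif(-1, 1). Names and the seed are illustrative.
    """

    def __init__(self, num_actions=5, seed=0):
        rng = np.random.default_rng(seed)
        self.num_intervals = 42
        # Interval-wise constants defining the dynamics and rewards.
        self.f = rng.uniform(-10.5, 10.5,
                             size=(self.num_intervals, num_actions, num_actions))
        self.r = rng.uniform(-1.0, 1.0,
                             size=(self.num_intervals, num_actions, num_actions))
        self.rng = rng

    def _interval(self, s):
        # Index 0 for s < -10, 1..40 for the 0.5-wide bins, 41 for s >= 10.
        if s < -10.0:
            return 0
        if s >= 10.0:
            return 41
        return 1 + int((s + 10.0) / 0.5)

    def reset(self):
        # Initial state drawn uniformly from [-10.5, 10.5].
        return self.rng.uniform(-10.5, 10.5)

    def step(self, s, a, b):
        i = self._interval(s)
        next_state = self.f[i, a, b] + self.rng.normal()
        return next_state, self.r[i, a, b]
```

A typical rollout would call env = PiecewiseConstantGame(), s = env.reset(), and then s, r = env.step(s, a, b) repeatedly with the two players' action indices a and b.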