Solving Zero-Sum Markov Games with Continuous State via Spectral Dynamic Embedding
Authors: Chenhao Zhou, Zebang Shen, Chao Zhang, Hanbin Zhao, Hui Qian
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present two experiments to evaluate our methods. The first experiment focuses on a simple zero-sum Markov game featuring a continuous state space and a finite action space, aiming to validate the convergence of SDEPO. The second experiment adapts a multi-agent scenario inspired by the simple push [Lowe et al., 2017], where both the state and action spaces are continuous, to assess the effectiveness of SDEPO-NN. |
| Researcher Affiliation | Academia | Chenhao Zhou¹, Zebang Shen², Chao Zhang¹, Hanbin Zhao¹, Hui Qian¹,³ — ¹College of Computer Science and Technology, Zhejiang University; ²Department of Computer Science, ETH Zurich; ³State Key Lab of CAD&CG, Zhejiang University. Emails: {zhouchenhao,zczju,zhaohanbin,qianhui}@zju.edu.cn, zebang.shen@inf.ethz.ch |
| Pseudocode | Yes | Algorithm 1: Random Features Generation; Algorithm 2: Nyström Features Generation; Algorithm 3: Spectral Dynamic Embedding Policy Optimization (SDEPO) |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We release code of our experiment and it can reproduce experimental results. |
| Open Datasets | Yes | Next, we conduct experiments on an adapted version of simple push [Lowe et al., 2017], wherein both the state and action spaces are continuous. |
| Dataset Splits | No | In the first experiment, we designed a simple zero-sum Markov game with a continuous state and finite action space (S = R, |A| = 5). The state space is partitioned into 42 distinct intervals: one interval for (-∞, -10), 40 intervals evenly spaced by 0.5 units in the range [-10, 10), and one interval for (10, ∞). In the i-th interval, the transition dynamics are defined by P(s, a, b) = f(s, a, b) + ϵ, where ϵ ~ N(0, 1) and f(s, a, b) = ϵ_{i,a,b}, with ϵ_{i,a,b} ~ Unif(-10.5, 10.5). The reward function is r(s, a, b) = ϵ′_{i,a,b}, where ϵ′_{i,a,b} ~ Unif(-1, 1). The initial state distribution is assumed to be uniform over [-10.5, 10.5]. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | In the first experiment, we designed a simple zero-sum Markov game with a continuous state and finite action space (S = R, |A| = 5). The state space is partitioned into 42 distinct intervals... The transition dynamics are defined by P(s, a, b) = f(s, a, b) + ϵ, where ϵ ~ N(0, 1) and f(s, a, b) = ϵ_{i,a,b}, with ϵ_{i,a,b} ~ Unif(-10.5, 10.5). The reward function is r(s, a, b) = ϵ′_{i,a,b}, where ϵ′_{i,a,b} ~ Unif(-1, 1). The initial state distribution is assumed to be uniform over [-10.5, 10.5]. We ran SDEPO for 120 iterations and measured the convergence of π by the metrics in Proposition 1. As shown in Figure 1, SDEPO with random features and SDEPO with Nyström features both converge after 60 iterations. We discretized the state space of this environment and compared it with OFTRL [Zhang et al., 2022], a tabular method where the environment is known. We adopted the parameter settings recommended in [Zhang et al., 2022] and adjusted the environment to a 100-horizon setting. |
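
The toy environment quoted in the Dataset Splits and Experiment Setup rows is described precisely enough to sketch. Below is a minimal NumPy reconstruction based only on that description; the class name, the `reset`/`step` interface, the seeding, and the exact boundary handling of the two outer intervals are assumptions, not the authors' released code.

```python
import numpy as np

class ZeroSumIntervalGame:
    """Sketch of the continuous-state zero-sum game described above.

    The real line is split into 42 intervals: (-inf, -10), forty
    0.5-wide bins covering [-10, 10), and (10, inf). For each interval i
    and action pair (a, b), the next-state mean f(s, a, b) = eps_{i,a,b}
    is drawn once from Unif(-10.5, 10.5) and the reward r(s, a, b) =
    eps'_{i,a,b} from Unif(-1, 1); transitions add N(0, 1) noise.
    """

    def __init__(self, n_actions=5, seed=0):
        rng = np.random.default_rng(seed)
        self.n_intervals = 42
        self.n_actions = n_actions
        # One next-state mean and one reward per (interval, a, b) triple.
        self.f = rng.uniform(-10.5, 10.5, (self.n_intervals, n_actions, n_actions))
        self.r = rng.uniform(-1.0, 1.0, (self.n_intervals, n_actions, n_actions))
        self.rng = rng

    def _interval(self, s):
        # Map a real-valued state to one of the 42 intervals.
        if s < -10.0:
            return 0
        if s >= 10.0:
            return 41
        return 1 + int((s + 10.0) // 0.5)

    def reset(self):
        # Initial state distribution: uniform over [-10.5, 10.5].
        return self.rng.uniform(-10.5, 10.5)

    def step(self, s, a, b):
        # Both players act simultaneously; the second player's reward is -r.
        i = self._interval(s)
        next_s = self.f[i, a, b] + self.rng.normal(0.0, 1.0)
        return next_s, self.r[i, a, b]
```

A policy-optimization loop would repeatedly call `step` with both players' actions and estimate values from the sampled rewards, which is the setting in which the reported 120-iteration SDEPO run operates.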
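
Algorithm 1 (Random Features Generation) and Algorithm 2 (Nyström Features Generation) from the Pseudocode row are not reproduced here, but both build a finite-dimensional spectral embedding of a Gaussian kernel. The snippet below shows the standard random Fourier feature construction that such an embedding is typically based on; the feature dimension, bandwidth, and function name are illustrative assumptions rather than the paper's exact procedure (Nyström features would instead be formed from kernel evaluations against a set of landmark states).

```python
import numpy as np

def random_fourier_features(states, dim=128, bandwidth=1.0, seed=0):
    """Generic random Fourier features for a Gaussian (RBF) kernel.

    phi(s)^T phi(s') approximates exp(-||s - s'||^2 / (2 * bandwidth^2)).
    Shown only to illustrate the kind of spectral embedding the paper's
    feature-generation algorithms produce; all parameters are assumptions.
    """
    rng = np.random.default_rng(seed)
    states = np.asarray(states, dtype=float)
    if states.ndim == 1:                  # treat 1-D input as n scalar states
        states = states[:, None]
    d = states.shape[1]
    # Frequencies sampled from the kernel's spectral density N(0, bandwidth^-2 I).
    W = rng.normal(0.0, 1.0 / bandwidth, (d, dim))
    phase = rng.uniform(0.0, 2.0 * np.pi, dim)
    return np.sqrt(2.0 / dim) * np.cos(states @ W + phase)
```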