Decentralized Q-learning in Zero-sum Markov Games

Authors: Muhammed Sayin, Kaiqing Zhang, David Leslie, Tamer Basar, Asuman Ozdaglar

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also verify the convergence of the learning dynamics via numerical examples. All the simulations are executed on a desktop computer equipped with a 3.7 GHz Hexa-Core Intel Core i7-8700K processor, running Matlab R2019b. The device also has two 8GB 3000MHz DDR4 memory modules and an NVIDIA GeForce GTX 1080 8GB GDDR5X graphics card. For illustration, we consider a zero-sum Markov game with 5 states and 3 actions at each state, i.e., S = {1, 2, ..., 5} and A^i_s = {1, 2, 3}.
Researcher Affiliation | Academia | Muhammed O. Sayin (Bilkent University, sayin@ee.bilkent.edu.tr); Kaiqing Zhang (MIT, kaiqing@mit.edu); David S. Leslie (Lancaster University, d.leslie@lancaster.ac.uk); Tamer Başar (UIUC, basar1@illinois.edu); Asuman Ozdaglar (MIT, asuman@mit.edu)
Pseudocode | Yes | Table 1: Decentralized Q-learning dynamics in Markov games
Open Source Code | Yes | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See the supplementary material.
Open Datasets | No | For illustration, we consider a zero-sum Markov game with 5 states and 3 actions at each state, i.e., S = {1, 2, ..., 5} and A^i_s = {1, 2, 3}. The discount factor γ = 0.6. The reward functions are chosen randomly in a way that r^1_s(a^1, a^2) = r_{s,a^1,a^2} exp(s^2) for s ∈ S, where r_{s,a^1,a^2} is uniformly drawn from [-1, 1]. (A construction sketch follows the table.)
Dataset Splits | No | 3. If you ran experiments... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A] There is no training for the algorithm.
Hardware Specification | Yes | All the simulations are executed on a desktop computer equipped with a 3.7 GHz Hexa-Core Intel Core i7-8700K processor, running Matlab R2019b. The device also has two 8GB 3000MHz DDR4 memory modules and an NVIDIA GeForce GTX 1080 8GB GDDR5X graphics card.
Software Dependencies | Yes | Matlab R2019b
Experiment Setup | Yes | The discount factor γ = 0.6. For both cases, we choose α_c = 1/c^0.9 and β_c = 1/c with ρ_α = 0.9, ρ_β = 1, and ρ = 0.7, and set τ_c in accordance with (11) and (12), respectively. For Case 1, we choose ϵ = 2 × 10^-4 and τ = 4.5 × 10^4; for Case 2, we choose τ = 0.07. (An illustrative sketch of the dynamics with these step sizes follows the table.)
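
For concreteness, the MATLAB sketch below (matching the Matlab R2019b environment listed above) shows one way to generate a game of the kind described in the Open Datasets row: 5 states, 3 actions per player in each state, γ = 0.6, and rewards r_{s,a^1,a^2} drawn uniformly from [-1, 1] and scaled by exp(s^2). The transition kernel P and the random seed are not specified in the quoted text, so they are assumptions made purely for illustration.

```matlab
% Sketch of the randomly generated zero-sum Markov game quoted above:
% 5 states, 3 actions per player in each state, discount factor 0.6.
% The transition kernel is NOT described in the quoted text, so a
% uniformly random row-stochastic kernel P is assumed here.
rng(0);                                  % fixed seed, for illustration only
nS = 5;  nA = 3;  gamma = 0.6;

% Player 1's reward: r1(s,a1,a2) = rbar(s,a1,a2)*exp(s^2),
% with rbar drawn uniformly from [-1,1]; player 2 receives -r1.
rbar = 2*rand(nS, nA, nA) - 1;
r1   = rbar .* reshape(exp((1:nS).^2), [nS, 1, 1]);
r2   = -r1;

% Assumed transitions: P(s,a1,a2,:) is a probability vector over next states.
P = rand(nS, nA, nA, nS);
P = P ./ sum(P, 4);
```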
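
The following sketch is a simplified rendering of two-timescale decentralized learning dynamics in the spirit of Table 1, reusing the game objects from the sketch above and the Case 2 parameters quoted in the Experiment Setup row (α_c = 1/c^0.9, β_c = 1/c, fixed τ = 0.07). The exact update rules and the τ_c schedules of (11) and (12) are those in the paper; everything here beyond the quoted parameters is an assumption, not the authors' code.

```matlab
% Minimal sketch of two-timescale decentralized Q-learning dynamics in the
% spirit of Table 1, reusing nS, nA, gamma, r1, r2, P from the sketch above
% and the Case 2 parameters quoted in the table (alpha_c = 1/c^0.9,
% beta_c = 1/c, fixed temperature tau = 0.07). Illustrative only.
T   = 1e5;                       % number of stages to simulate
tau = 0.07;                      % smoothed best-response temperature
q   = zeros(nS, nA, 2);          % local q-functions, one per player
v   = zeros(nS, 2);              % local value estimates, one per player
cnt = zeros(nS, 1);              % per-state visit counters
sbr = @(z) exp(z - max(z)) / sum(exp(z - max(z)));   % logit (smoothed) response
s   = 1;                         % initial state

for t = 1:T
    cnt(s) = cnt(s) + 1;
    c      = cnt(s);
    alpha  = 1 / c^0.9;          % fast timescale: local q-function update
    beta   = 1 / c;              % slow timescale: value-estimate update
    a = zeros(1, 2);
    for i = 1:2                  % each player acts on its local q only
        a(i) = find(rand <= cumsum(sbr(q(s, :, i) / tau)), 1);
    end
    rew   = [r1(s, a(1), a(2)), r2(s, a(1), a(2))];
    snext = find(rand <= cumsum(squeeze(P(s, a(1), a(2), :))'), 1);
    for i = 1:2
        % temporal-difference update of the played action's local q-value
        q(s, a(i), i) = q(s, a(i), i) + ...
            alpha * (rew(i) + gamma * v(snext, i) - q(s, a(i), i));
        % value estimate tracks the smoothed-best-response value of local q
        pi_i    = sbr(q(s, :, i) / tau);
        v(s, i) = v(s, i) + beta * (pi_i * q(s, :, i)' - v(s, i));
    end
    s = snext;
end
```

The per-state visit counter c drives both step sizes: since α_c = 1/c^0.9 decays more slowly than β_c = 1/c, the local q-function adapts on the faster timescale while the value estimate evolves more slowly, reflecting the two-timescale structure indicated by the quoted choice of ρ_α = 0.9 and ρ_β = 1.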