Softened Approximate Policy Iteration for Markov Games
Authors: Julien Pérolat, Bilal Piot, Matthieu Geist, Bruno Scherrer, Olivier Pietquin
ICML 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper reports theoretical and empirical investigations on the use of quasi-Newton methods to minimize the Optimal Bellman Residual (OBR) of zero-sum two-player Markov Games. First, it reveals that state-of-the-art algorithms can be derived by the direct application of Newton's method to different norms of the OBR. More precisely, when applied to the norm of the OBR, Newton's method results in the Bellman Residual Minimization Policy Iteration (BRMPI) algorithm and, when applied to the norm of the Projected OBR (POBR), it results in the standard Least Squares Policy Iteration (LSPI) algorithm. Consequently, new algorithms are proposed that use quasi-Newton methods to minimize the OBR and the POBR, so as to benefit from enhanced empirical performance at low cost. Indeed, the quasi-Newton approach requires only slight modifications to the code of LSPI and BRMPI but significantly improves both the stability and the performance of those algorithms. These phenomena are illustrated in an experiment conducted on artificially constructed games called Garnets. |
| Researcher Affiliation | Collaboration | (a) Univ. Lille, CNRS, Centrale Lille, Inria, UMR 9189 - CRIStAL, F-59000 Lille, France; (b) Inria, Villers-lès-Nancy, F-54600, France; (c) Institut Universitaire de France (IUF), France; (d) UMI 2958, Georgia Tech-CNRS, CentraleSupélec, Université Paris-Saclay, Metz, France; (1) Now with Google DeepMind |
| Pseudocode | Yes | Algorithm 1 LSPI-Matrix update, Algorithm 2 BRMPI-Matrix update, Algorithm 3 quasi-Newton LSPI |
| Open Source Code | No | The paper does not explicitly state that the source code for their methodology is available or provide a link to it. The link provided in the acknowledgments is for the Grid'5000 testbed, not their code: 'Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr)'. |
| Open Datasets | Yes | To empirically illustrate the application of quasi-Newton methods to both MDPs and MGs, we ran experiments on a class of randomly generated MDPs and MGs namely Garnets (see Archibald et al. (1995) for reference on Garnets on MDPs). |
| Dataset Splits | No | The paper mentions using 'batch samples' for the experiments and generating data using 'Garnets' but does not specify fixed training, validation, and test splits with percentages or counts for reproducibility. For example, it states: 'The number of batch samples used is 0.5 N_A N_S.' and 'The number of batch samples used is 1.0 N_A N_S.' These are batch sizes, not dataset splits. |
| Hardware Specification | No | The paper states: 'Experiments presented in this paper were carried out using the Grid'5000 testbed'. While it names the testbed, it does not provide specific hardware details such as CPU/GPU models, memory, or other specifications. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | Here, c1 = 10^-4 and c2 = 0.9. We performed the line search as follows: if condition (1) was not checked we decreased the learning rate geometrically (α_k ← η α_k) and, if condition (2) was not checked, we increased it geometrically (α_k ← (1/η) α_k), with η = 0.9. This was done until both conditions were checked. The feature space Φ has d = 0.2 N_S N_A features. All Garnets have a sparsity of 0.5 and γ = 0.99. |
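The line-search rule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: conditions (1) and (2) are taken to be the standard Wolfe conditions (sufficient decrease and curvature), and the quadratic objective in the usage example is purely a stand-in.

```python
import numpy as np

def wolfe_line_search(f, grad, x, d, alpha=1.0, c1=1e-4, c2=0.9, eta=0.9,
                      max_iter=200):
    """Geometrically adjust the step size until both Wolfe conditions hold.

    Condition (1), sufficient decrease: f(x + a*d) <= f(x) + c1*a*grad(x)@d
    Condition (2), curvature:           grad(x + a*d)@d >= c2*grad(x)@d
    If (1) fails, shrink alpha <- eta*alpha; if (2) fails, grow
    alpha <- (1/eta)*alpha, as described in the paper's setup (eta = 0.9).
    """
    fx = f(x)
    gtd = grad(x) @ d  # directional derivative at x (negative for descent d)
    for _ in range(max_iter):
        if f(x + alpha * d) > fx + c1 * alpha * gtd:   # (1) violated: too big
            alpha *= eta
        elif grad(x + alpha * d) @ d < c2 * gtd:       # (2) violated: too small
            alpha /= eta
        else:
            return alpha                               # both conditions hold
    return alpha                                       # give up after max_iter

# Usage on a toy quadratic f(x) = ||x||^2, steepest-descent direction.
f = lambda x: float(x @ x)
grad = lambda x: 2.0 * x
x0 = np.array([3.0, -2.0])
alpha = wolfe_line_search(f, grad, x0, -grad(x0))
```

On this quadratic the initial step alpha = 1.0 violates the sufficient-decrease condition once, so a single geometric shrink (alpha = 0.9) satisfies both conditions.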