Softened Approximate Policy Iteration for Markov Games
Authors: Julien Pérolat, Bilal Piot, Matthieu Geist, Bruno Scherrer, Olivier Pietquin
ICML 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper reports theoretical and empirical investigations on the use of quasi-Newton methods to minimize the Optimal Bellman Residual (OBR) of zero-sum two-player Markov Games. First, it reveals that state-of-the-art algorithms can be derived by the direct application of Newton's method to different norms of the OBR. More precisely, when applied to the norm of the OBR, Newton's method results in the Bellman Residual Minimization Policy Iteration (BRMPI) algorithm and, when applied to the norm of the Projected OBR (POBR), it results in the standard Least Squares Policy Iteration (LSPI) algorithm. Consequently, new algorithms are proposed that use quasi-Newton methods to minimize the OBR and the POBR, so as to benefit from enhanced empirical performance at low cost. Indeed, the quasi-Newton approach requires only slight modifications to the code of LSPI and BRMPI but significantly improves both the stability and the performance of those algorithms. These phenomena are illustrated in an experiment conducted on artificially constructed games called Garnets. |
| Researcher Affiliation | Collaboration | (a) Univ. Lille, CNRS, Centrale Lille, Inria, UMR 9189 - CRIStAL, F-59000 Lille, France; (b) Inria, Villers-lès-Nancy, F-54600, France; (c) Institut Universitaire de France (IUF), France; (d) UMI 2958, Georgia Tech-CNRS, CentraleSupélec, Université Paris-Saclay, Metz, France; (1) Now with Google DeepMind |
| Pseudocode | Yes | Algorithm 1 LSPI-Matrix update, Algorithm 2 BRMPI-Matrix update, Algorithm 3 quasi-Newton LSPI |
| Open Source Code | No | The paper does not explicitly state that the source code for their methodology is available or provide a link to it. The link provided in the acknowledgments is for the Grid'5000 testbed, not their code: 'Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr)'. |
| Open Datasets | Yes | To empirically illustrate the application of quasi-Newton methods to both MDPs and MGs, we ran experiments on a class of randomly generated MDPs and MGs namely Garnets (see Archibald et al. (1995) for reference on Garnets on MDPs). |
| Dataset Splits | No | The paper mentions using 'batch samples' for the experiments and generating data using 'Garnets' but does not specify fixed training, validation, and test splits with percentages or counts for reproducibility. For example, it states: 'The number of batch samples used is 0.5 N_A N_S.' and 'The number of batch samples used is 1.0 N_A N_S.' These are batch sizes, not dataset splits. |
| Hardware Specification | No | The paper states: 'Experiments presented in this paper were carried out using the Grid'5000 testbed'. While it names the testbed, it does not provide specific hardware details such as CPU/GPU models, memory, or other specifications. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | Here, c1 = 10^-4 and c2 = 0.9. We performed the line search as follows: if condition (1) was not checked we decreased the learning rate geometrically (α_k ← η α_k) and, if condition (2) was not checked, we increased it geometrically (α_k ← (1/η) α_k), with η = 0.9. This was done until both conditions were checked. The feature space Φ has d = 0.2 N_S N_A features. All Garnets have a sparsity of 0.5 and γ = 0.99. |
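The line-search rule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: conditions (1) and (2) are taken to be the standard Wolfe conditions (sufficient decrease and curvature), and the quadratic objective in the usage example is purely a stand-in.

```python
import numpy as np

def wolfe_line_search(f, grad, x, d, alpha=1.0, c1=1e-4, c2=0.9, eta=0.9,
                      max_iter=200):
    """Geometrically adjust the step size until both Wolfe conditions hold.

    Condition (1), sufficient decrease: f(x + a*d) <= f(x) + c1*a*grad(x)@d
    Condition (2), curvature:           grad(x + a*d)@d >= c2*grad(x)@d
    If (1) fails, shrink alpha <- eta*alpha; if (2) fails, grow
    alpha <- (1/eta)*alpha, as described in the paper's setup (eta = 0.9).
    """
    fx = f(x)
    gtd = grad(x) @ d  # directional derivative at x (negative for descent d)
    for _ in range(max_iter):
        if f(x + alpha * d) > fx + c1 * alpha * gtd:   # (1) violated: too big
            alpha *= eta
        elif grad(x + alpha * d) @ d < c2 * gtd:       # (2) violated: too small
            alpha /= eta
        else:
            return alpha                               # both conditions hold
    return alpha                                       # give up after max_iter

# Usage on a toy quadratic f(x) = ||x||^2, steepest-descent direction.
f = lambda x: float(x @ x)
grad = lambda x: 2.0 * x
x0 = np.array([3.0, -2.0])
alpha = wolfe_line_search(f, grad, x0, -grad(x0))
```

On this quadratic the initial step alpha = 1.0 violates the sufficient-decrease condition once, so a single geometric shrink (alpha = 0.9) satisfies both conditions.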