On the Convergence of Model Free Learning in Mean Field Games
Authors: Romuald Elie, Julien Pérolat, Mathieu Laurière, Matthieu Geist, Olivier Pietquin (pp. 7143-7150)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate our theoretical results with a numerical experiment in a continuous action-space environment, where the approximate best response of the iterative fictitious play scheme is computed with a deep RL algorithm. Notably, we show for the first time convergence of model free learning algorithms towards non-stationary MFG equilibria, relying only on classical assumptions on the MFG dynamics. The main contribution of this paper is theoretical, as we provide a rigorous study of the error propagation in Approximate FP algorithms for MFGs, using an innovative line of proof in comparison to the standard two time scale approximation convergence results (Leslie and Collins 2006; Borkar 1997). Our numerical results also demonstrate the empirical convergence of the Fictitious RL scheme in a larger setting, even when the MFG is not of first order type. |
| Researcher Affiliation | Collaboration | Romuald Elie (1), Julien Pérolat (2), Mathieu Laurière (3), Matthieu Geist (4), Olivier Pietquin (4); 1: Université Paris-Est, 2: DeepMind, 3: ORFE, Princeton University, 4: Google Research, Brain Team |
| Pseudocode | Yes | Algorithm 1: Approximate Fictitious Play for MFG. Algorithm 2: Fictitious Play for continuous state and action Mean Field Games. (A minimal sketch of the Algorithm 1 loop is given after the table.) |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper describes a numerical illustration using a stylized authoritative MFG model with congestion. It generates its own trajectories and replay buffers, stating: 'We ran 30000 trajectories of DDPG with a trajectory length of 300.' and 'At each iteration of FP, we added N^FP_trajectories = 3000 trajectories of length 1000 to the replay buffer.' It does not use or provide concrete access information for a publicly available dataset. |
| Dataset Splits | No | The paper describes the generation of trajectories for training and estimation but does not specify a train/validation/test split for a dataset, nor does it refer to pre-defined splits for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory, or cluster specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'Deep Deterministic Policy Gradient (DDPG)' and 'Adam optimizers' but does not provide specific version numbers for these software components or any other libraries, which is required for reproducibility. |
| Experiment Setup | Yes | We ran 30000 trajectories of DDPG with a trajectory length of 300. The noise used for exploration is a centered normal noise with variance 0.02, and we used Adam optimizers with a 0.001 starting learning rate and τ = 0.01. At each iteration of FP, we added N^FP_trajectories = 3000 trajectories of length 1000 to the replay buffer. Finally, we estimated the density using 100 classes and 30000 steps of Adam (with 0.001 initial learning rate). (These values are collected into a configuration sketch after the table.) |
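
For readers who want the structure of Algorithm 1 (Approximate Fictitious Play for MFG) without the paper at hand, the following is a minimal sketch of the outer loop. The helpers `best_response` and `induced_flow` are hypothetical placeholders (in the paper's continuous setting the approximate best response is computed with a deep RL algorithm such as DDPG); this is an illustrative sketch, not the authors' code.

```python
def approximate_fictitious_play(mfg_env, num_iterations, best_response, induced_flow):
    """Sketch of Approximate Fictitious Play for a Mean Field Game.

    mfg_env        -- environment exposing an initial population distribution flow
    best_response  -- callable returning an (approximate) best-response policy
                      against a given averaged distribution flow
    induced_flow   -- callable returning the distribution flow induced by a policy
    """
    # Start from an arbitrary initial flow of population distributions
    # (one distribution per time step), e.g. a list of arrays.
    mu_bar = mfg_env.initial_distribution()
    policies = []

    for j in range(1, num_iterations + 1):
        # 1) Approximate best response against the current averaged flow
        #    (computed with a model-free RL algorithm in the paper).
        pi_j = best_response(mfg_env, mu_bar)
        policies.append(pi_j)

        # 2) Distribution flow induced by this best response.
        mu_j = induced_flow(mfg_env, pi_j)

        # 3) Fictitious play averaging of the population distribution:
        #    mu_bar_j = (1 - 1/j) * mu_bar_{j-1} + (1/j) * mu_j
        mu_bar = [(1.0 - 1.0 / j) * m_bar + (1.0 / j) * m
                  for m_bar, m in zip(mu_bar, mu_j)]

    # The learned behaviour is the uniform mixture over the computed best responses.
    return policies, mu_bar
```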
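The hyperparameters quoted in the Experiment Setup row can be gathered in one place. The dictionary below is only an organizational sketch with illustrative key names; the numerical values are the ones reported in the paper.

```python
# Hyperparameters reported for the numerical illustration.
# Key names are assumptions for readability; values come from the paper.
EXPERIMENT_CONFIG = {
    # DDPG best-response computation
    "ddpg_num_trajectories": 30_000,
    "ddpg_trajectory_length": 300,
    "exploration_noise": {"type": "centered_normal", "variance": 0.02},
    "optimizer": {"name": "adam", "initial_learning_rate": 1e-3},
    "target_network_tau": 0.01,

    # Fictitious Play outer loop
    "fp_trajectories_per_iteration": 3_000,   # N^FP_trajectories
    "fp_trajectory_length": 1_000,

    # Density estimation
    "density_num_classes": 100,
    "density_adam_steps": 30_000,
    "density_initial_learning_rate": 1e-3,
}
```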