Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Learning Equilibria from Data: Provably Efficient Multi-Agent Imitation Learning
Authors: Till Freihaut, Luca Viano, Volkan Cevher, Matthieu Geist, Giorgia Ramponi
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide a numerical evaluation of our proposed algorithms in the Markov Game considered in the lower bound construction (Fig. 1) as this environment allows us to control C(µE, νE) by considering different convex combinations of the two pure Nash equilibria profiles (i.e., the black and the blue path in Figure 1). This environment serves as a proof of concept to demonstrate the practical feasibility of our methods. In particular, we aim to highlight that the performance of BC depends on the concentrability coefficient C(µE, νE) even when it is bounded, and completely fails when C(µE, νE) = ∞. Note that in all considered cases, we have that β = 0 and therefore the BC bound proven in Tang et al. [2024] would always be vacuous, while Theorem 3.1 remains valid. The code used for the experiments is available at https://github.com/tfreihaut/Murmail. We evaluate Multi-Agent BC and MURMAIL (Algorithm 2) in the considered environment and measure the exploitability of the resulting policies with respect to the number of expert queries (for MURMAIL) and dataset size (for BC). The results are presented in Fig. 2. |
| Researcher Affiliation | Collaboration | Till Freihaut (University of Zurich); Luca Viano (EPFL); Volkan Cevher (EPFL); Matthieu Geist (Earth Species Project); Giorgia Ramponi (University of Zurich) |
| Pseudocode | Yes | Algorithm 1: Multi-Agent Imitation Learning with Best Response Oracle (MAIL-BRO); Algorithm 2: Maximum Uncertainty Response Multi-Agent Imitation Learning (MURMAIL); Algorithm 3: UCBVI; Algorithm 4: MURMAIL with initial dataset |
| Open Source Code | Yes | The code used for the experiments is available at https://github.com/tfreihaut/Murmail. |
| Open Datasets | No | We consider two different environments for our numerical validation, one that has C(µE, νE) < ∞, and the lower bound construction Fig. 1 with different NE experts to control C(µE, νE). In particular we have multiple with C(µE, νE) < ∞ and the same NE expert as in Theorem 3.2 to get C(µE, νE) = ∞. For the first environment, we generate a random Zero-Sum Markov Game with |S| = 10, |A| = |B| = 3 and a reward between −1 and 1. |
| Dataset Splits | No | The paper describes generating Markov Games and collecting 'trajectories' for expert demonstrations, but does not specify any explicit training, validation, or test splits for datasets. The empirical evaluation is on these generated environments. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, or cloud resources) used for running the experiments within its main text or appendices. |
| Software Dependencies | No | The paper mentions software components like 'UCBVI algorithm' for the RL inner loop, and discusses 'DQN Mnih et al. [2015] or Soft Actor Critic Haarnoja et al. [2018]' as future directions, but it does not specify version numbers for any libraries, frameworks, or solvers actually used in the experimental setup. |
| Experiment Setup | Yes | We run the experiments for each environment 1000 times over different seeds and average the results. For both environments we compute the optimal learning rate η. For simplicity, we use the UCBVI algorithm for a state-only reward as the RL inner loop of MURMAIL. Note that this can be replaced by any other no-regret algorithm. ... We set the discount factor to 0.9. |
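
The setup quoted in the table (a random zero-sum Markov game with |S| = 10, |A| = |B| = 3, rewards in [−1, 1], discount factor 0.9, results averaged over seeds) can be illustrated with a minimal sketch. This is not the code from the linked repository: the function names, the Dirichlet transition sampling, the uniform initial-state distribution, and the Nash-gap definition of exploitability used here are all assumptions made for illustration.

```python
import numpy as np


def random_zero_sum_game(n_states=10, n_a=3, n_b=3, seed=0):
    """Sample a random zero-sum Markov game (hypothetical helper).

    R[s, a, b] is the max-player's reward, drawn uniformly from [-1, 1);
    the min-player receives -R. P[s, a, b] is a distribution over next
    states, sampled from a flat Dirichlet (an assumption, not the paper's
    stated procedure).
    """
    rng = np.random.default_rng(seed)
    R = rng.uniform(-1.0, 1.0, size=(n_states, n_a, n_b))
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_a, n_b))
    return R, P


def best_response_value(R, P, opponent, gamma, maximize, iters=500):
    """Value of a best response against a fixed opponent policy.

    Fixing one player turns the game into a single-agent MDP, which is
    solved here by plain value iteration.
    """
    if maximize:  # opponent is nu with shape (S, B); optimize over A
        r = np.einsum("sab,sb->sa", R, opponent)
        p = np.einsum("sabt,sb->sat", P, opponent)
    else:  # opponent is mu with shape (S, A); optimize over B
        r = np.einsum("sab,sa->sb", R, opponent)
        p = np.einsum("sabt,sa->sbt", P, opponent)
    v = np.zeros(R.shape[0])
    for _ in range(iters):
        q = r + gamma * (p @ v)
        v = q.max(axis=1) if maximize else q.min(axis=1)
    return v


def exploitability(R, P, mu, nu, gamma=0.9):
    """Nash gap of (mu, nu), averaged over a uniform initial state.

    Assumed definition: max_mu' V(mu', nu) - min_nu' V(mu, nu').
    """
    upper = best_response_value(R, P, nu, gamma, maximize=True)
    lower = best_response_value(R, P, mu, gamma, maximize=False)
    return float(np.mean(upper - lower))


# Average the exploitability of uniform policies over a few seeds
# (the quoted setup uses 1000 seeds; 5 keeps the sketch fast).
gaps = []
for seed in range(5):
    R, P = random_zero_sum_game(seed=seed)
    mu = np.full((10, 3), 1.0 / 3.0)
    nu = np.full((10, 3), 1.0 / 3.0)
    gaps.append(exploitability(R, P, mu, nu))
mean_gap = float(np.mean(gaps))
```

In an evaluation like the one described, `mu` and `nu` would be the policies returned by Multi-Agent BC or MURMAIL rather than uniform policies, and the gap would be tracked against dataset size or number of expert queries.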