Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning

Authors: Barna Pásztor, Andreas Krause, Ilija Bogunovic

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for MFC, obtained via a novel mean-field type analysis. To learn the system's dynamics, M3-UCRL can be instantiated with various statistical models, e.g., neural networks or Gaussian Processes. Moreover, we provide a practical parametrization of the core optimization problem that facilitates gradient-based optimization techniques when combined with differentiable dynamics approximation methods such as neural networks. ... Our results show that M3-UCRL is capable of finding close-to-optimal policies within a few episodes, in contrast to model-free algorithms that require at least six orders of magnitude more samples.
Researcher Affiliation Academia Barna Pásztor (EMAIL), ETH Zürich; Ilija Bogunovic (EMAIL), University College London; Andreas Krause (EMAIL), ETH Zürich
Pseudocode Yes
Algorithm 1 Model-based RL for Mean-Field Control
Input: Calibrated dynamical model, reward function r(s_{t,h}, a_{t,h}, µ_{t,h}), horizon H, initial state s_{1,0} ∼ µ_0
for t = 1, 2, . . . do
    Use M3-UCRL to select policy profile π_t = (π_{t,0}, . . . , π_{t,H−1}) by using the current dynamics model and reward function, i.e., solve Eq. (5)
    for h = 0, . . . , H − 1 do
        a_{t,h} = π_{t,h}(s_{t,h}, µ_{t,h}),  s_{t,h+1} = f(s_{t,h}, a_{t,h}, µ_{t,h}) + ω_{t,h}
        µ_{t,h+1} = Φ(µ_{t,h}, π_{t,h}, f)
    end for
    Update the agent's statistical model with new observations {(s_{t,h}, a_{t,h}, µ_{t,h}), s_{t,h+1}}_{h=0}^{H−1}
    Reset the system to µ_{t+1,0} = µ_0 and s_{t+1,0} ∼ µ_{t+1,0}
end for
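To make the structure of Algorithm 1 concrete, the following is a minimal NumPy sketch of its inner rollout loop. It is not the paper's implementation: the planning step (solving Eq. (5) with an optimistic model) is replaced by a fixed, hypothetical clipped-linear policy; the mean-field distribution µ is approximated by a particle cloud; and `H`, `SIGMA`, `gain`, and `goal` are illustrative values, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

H = 21        # steps per episode, matching the paper's experiment setup
SIGMA = 0.1   # noise standard deviation (hypothetical value)

def true_dynamics(s, a, mu):
    # Eq. (7) in the paper: f(s, a, mu) = s + a (mu does not enter the dynamics)
    return s + a

def policy(s, mu, params):
    # Hypothetical stand-in for the learned neural policy:
    # move toward a goal point, clipped to the action space [0, 1]^2
    return np.clip(params["gain"] * (params["goal"] - s), 0.0, 1.0)

def propagate_mean_field(mu_particles, params):
    # Particle approximation of the mean-field update Phi(mu, pi, f):
    # push every particle through the policy and the (noiseless) dynamics
    mu = mu_particles.mean(axis=0)
    actions = np.array([policy(p, mu, params) for p in mu_particles])
    return mu_particles + actions

def run_episode(s0, mu_particles, params):
    """One inner loop of Algorithm 1: roll the policy out for H steps and
    collect (s, a, mu, s') transitions for the statistical-model update."""
    s = s0
    transitions = []
    for h in range(H):
        mu = mu_particles.mean(axis=0)  # crude summary of the mean-field state
        a = policy(s, mu, params)
        s_next = true_dynamics(s, a, mu) + rng.normal(0.0, SIGMA, size=2)
        transitions.append((s, a, mu, s_next))
        mu_particles = propagate_mean_field(mu_particles, params)
        s = s_next
    return transitions
```

In the full algorithm, the collected transitions would be used to refit the calibrated dynamics model before the next episode, and the system would then be reset to µ_0.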
Open Source Code No Animations of episodes with the BPTT policy (known_dynamics_animation.mp4) and the M3-UCRL policy (m3_ucrl_animation_episode_t.mp4) are included in the supplementary material. Episodes 1 to 10 illustrate the quick learning during the exploration phase, while episode 26 is the best-performing one, also used for Fig. 2b. The paper mentions animations are included in the supplementary material, but does not explicitly state that the source code for the methodology is provided or give a link to a code repository.
Open Datasets No In this section, we demonstrate the performance of the M3-UCRL algorithm on the exploration via entropy maximization problem introduced by Geist et al. (2021) and used as a benchmark problem in a recent survey on Mean-Field Games (Laurière et al., 2022). ... We formulate the problem in the continuous state and action spaces as follows. The state space of the model is a 2-dimensional space [0, 11]², split into 4 equal-sized rooms separated by unit-sized walls with one corridor connecting neighboring rooms (see Fig. 1a). ... No specific, external publicly available dataset with concrete access information (link, DOI, repository) is mentioned for the experiments. The problem setups are described internally.
Dataset Splits No M3-UCRL collects data about the unknown dynamics online (i.e., by proposing and executing a policy in the true system) and estimates the possible dynamics the system might follow. ... The paper describes environments and an online learning process where data is collected dynamically, rather than using pre-split datasets for training, validation, and testing.
Hardware Specification No The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies No We parameterize our policy with a Neural Network and use a Deep Ensemble model (Lakshminarayanan et al., 2017) for estimating the system dynamics. In particular, we use an ensemble of 10 feed-forward Neural Networks (NNs) with one hidden layer of size 32 and leaky ReLU activation functions. ... The paper mentions using Neural Networks and Deep Ensemble models but does not specify software versions for frameworks like PyTorch, TensorFlow, or other libraries.
Experiment Setup Yes Each episode t consists of 21 steps starting from h = 0, and in each time-step h the representative agent chooses its actions from the action space A = [0, 1]². The dynamics of the system in Eq. (1) are of the following form: f(s_{t,h}, a_{t,h}, µ_{t,h}) = s_{t,h} + a_{t,h} (7), and the additive noise is Gaussian with zero mean, variance σ², and independent dimensions, i.e., ω_{t,h} ∼ N(0, σ²I₂) for all t and h, where I₂ is the 2×2 identity matrix. ... We use an ensemble of 10 feed-forward Neural Networks (NNs) with one hidden layer of size 32 and leaky ReLU activation functions. We use two output layers joined to the hidden middle layer. The first one uses linear activation and returns the mean of the function, while the second returns the estimated variance using a softplus activation. We follow the optimisation procedure described in Lakshminarayanan et al. (2017), which minimizes the negative log-likelihood for each NN under the assumption of heteroscedastic Gaussian noise. We included the adversarial training procedure as well, for robustness and smoothing.
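The two-headed ensemble architecture quoted above can be sketched as follows. This is a minimal NumPy illustration, not the paper's code: weight initialization, the input dimension (here 6, a hypothetical concatenation of state, action, and a 2-dimensional mean-field summary), and the omission of training and the adversarial procedure are all assumptions. The mixture mean/variance combination follows the moment-matching rule from Lakshminarayanan et al. (2017).

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def softplus(x):
    # smooth positive transform used for the variance head
    return np.log1p(np.exp(x))

def init_member(rng, d_in, d_out, hidden=32):
    # One ensemble member: a shared hidden layer feeding two output heads
    return {
        "W1": rng.normal(0, 0.1, (d_in, hidden)), "b1": np.zeros(hidden),
        "Wm": rng.normal(0, 0.1, (hidden, d_out)), "bm": np.zeros(d_out),
        "Wv": rng.normal(0, 0.1, (hidden, d_out)), "bv": np.zeros(d_out),
    }

def member_forward(p, x):
    h = leaky_relu(x @ p["W1"] + p["b1"])
    mean = h @ p["Wm"] + p["bm"]           # linear mean head
    var = softplus(h @ p["Wv"] + p["bv"])  # positive variance head
    return mean, var

def ensemble_forward(members, x):
    means, variances = zip(*(member_forward(p, x) for p in members))
    means, variances = np.stack(means), np.stack(variances)
    mixture_mean = means.mean(axis=0)
    # Moment-matched variance of the equally weighted Gaussian mixture
    mixture_var = (variances + means ** 2).mean(axis=0) - mixture_mean ** 2
    return mixture_mean, mixture_var

def gaussian_nll(mean, var, y):
    # Heteroscedastic Gaussian negative log-likelihood: the per-member
    # training loss (constant terms dropped)
    return 0.5 * (np.log(var) + (y - mean) ** 2 / var).sum()
```

In training, each of the 10 members would minimize `gaussian_nll` on the observed transitions; at prediction time `ensemble_forward` supplies the mean and epistemic-plus-aleatoric variance that a model-based planner can use.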