Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Towards Principled Unsupervised Multi-Agent Reinforcement Learning

Authors: Riccardo Zamboni, Mirco Mutti, Marcello Restelli

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Finally, we provide numerical validations to both corroborate the theoretical findings and pave the way for unsupervised multi-agent reinforcement learning via state entropy maximization in challenging domains, showing that optimizing for a specific objective, namely mixture entropy, provides an excellent trade-off between tractability and performances.
Researcher Affiliation Academia Riccardo Zamboni Politecnico di Milano EMAIL Mirco Mutti Technion Marcello Restelli Politecnico di Milano
Pseudocode Yes Algorithm: Trust Region Pure Exploration (TRPE)
    Input: exploration horizon $T$, trajectories $N$, trust-region threshold $\delta$, learning rate $\eta$
    Initialize $\theta = (\theta^i)_{i \in [N]}$
    for epoch = 1, 2, ... until convergence do
        Collect $N$ trajectories with $\pi_\theta = (\pi^i_{\theta^i})_{i \in [N]}$
        for agent $i = 1, 2, \ldots$ concurrently do
            Set datasets $D^i = \{(s^i_n, a^i_n), \zeta^n_1\}_{n \in [N]}$
            $h = 0$, $\theta^i_h = \theta^i$
            while $D_{KL}(\pi^i_{\theta^i_h} \,\|\, \pi^i_{\theta^i_0}) \le \delta$ do
                Compute $\hat{L}^i(\theta^i_h / \theta^i_0)$ via IS as in Eq. (4)
                $\theta^i_{h+1} = \theta^i_h + \eta \nabla_{\theta^i_h} \hat{L}^i(\theta^i_h / \theta^i_0)$
                $h \leftarrow h + 1$
            end while
            $\theta^i \leftarrow \theta^i_h$
        end for
    end for
    Output: joint policy $\pi_\theta = (\pi^i_{\theta^i})_{i \in [N]}$
Open Source Code Yes The Repository is made available at the following Repository.
Open Datasets Yes The first is a notoriously difficult multi-agent exploration task called secret room [MPE, Liu et al., 2021], referred to as Env. (i). ... The second is a simpler exploration task yet over a high dimensional state-space, namely a 2-agent instantiation of Reacher [MaMuJoCo, Peng et al., 2021], referred to as Env. (ii).
Dataset Splits No The paper describes generating data through interaction with environments (trajectories, episodes) rather than using predefined splits of a static dataset. It mentions 'exploration horizon T', 'N trajectories', 'number of evaluation episodes/trials (K)', but no explicit training/validation/test splits of a dataset are provided.
Hardware Specification Yes All the experiments were performed over an Apple M2 chip (8-core CPU, 8-core GPU, 16-core Neural Engine) with 8 GB unified memory with a maximum time of execution of 24 hours.
Software Dependencies No The paper describes the use of Neural Networks for policy parameterization and Gaussian distributions but does not list specific software libraries, frameworks, or their version numbers (e.g., Python, PyTorch, TensorFlow).
Experiment Setup Yes Throughout the experiment the number of epochs $e$ were set equal to $e = 10\text{k}$, the number of trajectories $N = 10$, the KL threshold $\delta = 6$, the maximum number of off-policy iterations set to $n_{\text{off,iter}} = 20$, the learning rate was set to $\eta = 10^{-5}$ and the number of seeds set equal to 4 due to the inherent low stochasticity of the environment. ... In each epoch a dataset of $N$ trajectories is gathered for a given exploration horizon $T$ for each agent, leading to the reported number of samples. Throughout the experiment the number of epochs $e$ were set equal to $e = 100$, the number of trajectories building the batch size $N = 20$, the KL threshold $\delta = 10^{-4}$, the maximum number of off-policy iterations set to $n_{\text{off,iter}} = 20$, the discount was set to $\gamma = 0.99$.
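The control flow of the quoted TRPE pseudocode, together with the iteration cap from the experiment setup, can be illustrated with a minimal Python sketch. It is not the authors' implementation. As a loudly labeled assumption, the importance-sampled surrogate of Eq. (4), which is not reproduced on this page, is replaced by the exact entropy of a softmax distribution, so no trajectories are collected. All function names (`trust_region_step`, `train`, and the helpers) are hypothetical.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def entropy_grad(theta):
    """Exact gradient of H(softmax(theta)): dH/dtheta_k = -p_k (log p_k + H).

    ASSUMPTION: stand-in for the gradient of the IS surrogate in Eq. (4).
    """
    p = softmax(theta)
    h = entropy(p)
    return [-pk * (math.log(pk) + h) for pk in p]


def trust_region_step(theta0, delta, eta, max_iters):
    """Inner loop for one agent: ascend while KL(pi_h || pi_0) <= delta.

    max_iters mirrors the 'maximum number of off-policy iterations' cap.
    Returns the updated parameters and the number of updates performed.
    """
    p0 = softmax(theta0)
    theta = list(theta0)
    h = 0
    while h < max_iters and kl(softmax(theta), p0) <= delta:
        g = entropy_grad(theta)
        theta = [t + eta * gk for t, gk in zip(theta, g)]
        h += 1
    return theta, h


def train(thetas, epochs, delta, eta, max_iters):
    """Outer loop: each epoch updates every agent's parameters independently.

    Agents are processed sequentially here; in this toy setting the result
    matches a concurrent update because no agent reads another's parameters.
    """
    for _ in range(epochs):
        thetas = [trust_region_step(th, delta, eta, max_iters)[0] for th in thetas]
    return thetas


if __name__ == "__main__":
    agents = [[2.0, 0.0, -1.0], [0.5, 0.5, 3.0]]
    trained = train(agents, epochs=50, delta=1e-4, eta=0.5, max_iters=20)
    for before, after in zip(agents, trained):
        print(entropy(softmax(before)), entropy(softmax(after)))
```

As in the pseudocode, the KL check runs before each update, so the returned parameters may sit slightly outside the trust region after the final step. The step size and epoch count in the usage example are chosen for the toy objective and are not the values reported in the paper.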