Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Robust and Diverse Multi-Agent Learning via Rational Policy Gradient

Authors: Niklas Lauffer, Ameesh Shah, Micah Carroll, Sanjit A. Seshia, Stuart J Russell, Michael Dennis

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We empirically validate that our approach achieves strong performance in several popular cooperative and general-sum environments.
Researcher Affiliation Collaboration (1) UC Berkeley, (2) Google DeepMind
Pseudocode Yes Algorithm 1 RPG update with lookahead N
Open Source Code Yes Our project page can be found at rational-policy-gradient.github.io. Code can be found at github.com/niklaslauffer/rational-policy-gradient.
Open Datasets Yes We evaluate our approach in four primary environments: matrix games with varying zero-sum, cooperative, and mixed-motive payoffs; several standard Overcooked [Carroll et al., 2019] layouts, a modified version of STORM [Khan et al., 2023] that requires agents to collect either green or red coins, rewarding them when they collect the same color; and a simplified 2-player version of Hanabi [Bard et al., 2020] that contains 3 colors and ranks and a version that contains 4 colors and ranks (instead of the standard 5).
Dataset Splits No The paper describes using environments (matrix games, Overcooked, STORM, Hanabi) for multi-agent reinforcement learning experiments, which involve generating data through interaction rather than using pre-split static datasets. It mentions 'rollouts' and 'training curves' but does not specify any explicit training/test/validation splits for datasets.
Hardware Specification Yes Experiments were all performed on single GPUs (a mix of A4000s and A6000s) on a local SLURM cluster using 32 cpu cores and 50Gb of RAM.
Software Dependencies No While the paper mentions 'actor-critic', 'PPO', 'Loaded DiCE', 'DiCE', 'SGD', 'Adam', and 'JaxMARL' as software components and algorithms used, it does not provide specific version numbers for any of these, which is required for reproducibility.
Experiment Setup Yes Table 3 contains the RPG hyperparameters used in the experiments for each of our environments.
Table 3: RPG hyperparameters for different environments
Hyperparameter | Matrix | STORM | Overcooked | Hanabi
Architecture | 64x64 | 64x64 | 64x64 | 512x512
Optimizer | SGD | Adam | Adam | Adam
Manipulator LR | 1 × 10^-2 | 8 × 10^-3 | 1 × 10^-3 | 2.5 × 10^-3
Base LR | 1 × 10^-2 | 1 × 10^-4 | 2 × 10^-4 | 5 × 10^-3
Base lookahead LR | 1 × 10^-1 | 4 × 10^-4 | 2 × 10^-4 | 5 × 10^-3
Batch size | 128 | 256 | 512 | 128
Discount factor (γ) | 0.95 | 0.99 | 0.99 | 0.99
GAE parameter (λ) | 0.95 | 0.99 | 0.99 | 0.99
Loaded DiCE coef (λ) | 0.95 | 0.99 | 0.99 | 0.99
Value function coef. | 0.5 | 0.5 | 0.5 | 0.5
Entropy coef. | 0.0 | 0.01 | 0.01, SP: 0.05 | 0.01, SP: 0.05
Manipulator max gradient norm | 0.5 | 0.5 | 0.5 | 0.5
Partnerplay coefficient | 0 | 0.15 | 0.1 | 0.1
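
For readers who want to reuse the Table 3 values quoted above, the following is a minimal sketch that transcribes them into a Python dictionary. It is not taken from the authors' repository: the names `RPG_HYPERPARAMS`, `get_config`, `use_sp_entropy`, and all dictionary keys are hypothetical. It also assumes that the learning-rate exponents are negative (for example, "1 10 2" is read as 1e-2) and that "SP: 0.05" denotes an alternative entropy coefficient for a separate "SP" setting, which the quoted text does not define.

```python
# Hypothetical transcription of Table 3; key names are illustrative only.
# Assumption: learning-rate exponents are negative (signs lost in extraction).
_SHARED = {"value_coef": 0.5, "manipulator_max_grad_norm": 0.5}

RPG_HYPERPARAMS = {
    "matrix": {
        **_SHARED, "hidden_sizes": (64, 64), "optimizer": "sgd",
        "manipulator_lr": 1e-2, "base_lr": 1e-2, "base_lookahead_lr": 1e-1,
        "batch_size": 128, "gamma": 0.95, "gae_lambda": 0.95,
        "loaded_dice_lambda": 0.95, "entropy_coef": 0.0,
        "entropy_coef_sp": None, "partner_play_coef": 0.0,
    },
    "storm": {
        **_SHARED, "hidden_sizes": (64, 64), "optimizer": "adam",
        "manipulator_lr": 8e-3, "base_lr": 1e-4, "base_lookahead_lr": 4e-4,
        "batch_size": 256, "gamma": 0.99, "gae_lambda": 0.99,
        "loaded_dice_lambda": 0.99, "entropy_coef": 0.01,
        "entropy_coef_sp": None, "partner_play_coef": 0.15,
    },
    "overcooked": {
        **_SHARED, "hidden_sizes": (64, 64), "optimizer": "adam",
        "manipulator_lr": 1e-3, "base_lr": 2e-4, "base_lookahead_lr": 2e-4,
        "batch_size": 512, "gamma": 0.99, "gae_lambda": 0.99,
        "loaded_dice_lambda": 0.99, "entropy_coef": 0.01,
        "entropy_coef_sp": 0.05, "partner_play_coef": 0.1,
    },
    "hanabi": {
        **_SHARED, "hidden_sizes": (512, 512), "optimizer": "adam",
        "manipulator_lr": 2.5e-3, "base_lr": 5e-3, "base_lookahead_lr": 5e-3,
        "batch_size": 128, "gamma": 0.99, "gae_lambda": 0.99,
        "loaded_dice_lambda": 0.99, "entropy_coef": 0.01,
        "entropy_coef_sp": 0.05, "partner_play_coef": 0.1,
    },
}


def get_config(env: str, use_sp_entropy: bool = False) -> dict:
    """Return a copy of the settings for one environment.

    If use_sp_entropy is True and the table lists an "SP" entropy value
    for that environment, it replaces the default entropy coefficient.
    """
    cfg = dict(RPG_HYPERPARAMS[env])
    sp_value = cfg.pop("entropy_coef_sp")
    if use_sp_entropy and sp_value is not None:
        cfg["entropy_coef"] = sp_value
    return cfg
```

Calling `get_config("overcooked", use_sp_entropy=True)` returns the Overcooked column with the entropy coefficient set to the "SP" value. Environments without an "SP" entry keep their default coefficient.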