Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

HyperMARL: Adaptive Hypernetworks for Multi-Agent RL

Authors: Kale-ab Tessera, Muhammad Arrasy Rahman, Amos J. Storkey, Stefano Albrecht

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive evaluation (Sec. 5) across 22 diverse scenarios (up to 30 agents) shows HyperMARL achieves competitive returns against six strong baselines, while achieving NoPS-level behavioural diversity. We further show this decoupling is empirically linked to reduced policy gradient variance and is critical for specialisation (Sec. 5.2; Sec. 6.1).
Researcher Affiliation | Collaboration | 1 School of Informatics, University of Edinburgh, Edinburgh, UK; 2 School of Computer Science, University of Texas at Austin, Austin, TX, USA; 3 DeepFlow, London, UK
Pseudocode | Yes | We present the pseudocode in Sec. F.1, with additional scaling (F.3) and runtime (F.4) details. In Algorithm 1, we present the pseudocode for HyperMARL, with HyperMARL-specific steps highlighted in blue.
Open Source Code | Yes | The code is publicly available at https://github.com/KaleabTessera/HyperMARL.
Open Datasets | Yes | We validate HyperMARL on diverse MARL benchmarks including Dispersion and Navigation (VMAS) [5], Multi-Agent MuJoCo (MAMuJoCo) [37], SMAX [41], and Blind-Particle Spread (BPS) [11] across environments with two to thirty agents that require homogeneous, heterogeneous, or mixed behaviours.
Dataset Splits | No | For Dispersion (5.2), evaluation is performed every 100k timesteps across 32 episodes. For Navigation (5.2), following the baselines, evaluation is performed every 120k timesteps across 200 episodes. For SMAX (5.3), evaluation is performed every 500k timesteps across 32 episodes. For MaMuJoCo (5.2), following the baselines, evaluation is performed every 25 training episodes over 40 episodes. For Blind-Particle Spread (BPS), we run 5 seeds and train for 20 million timesteps, consistent with baselines.
Hardware Specification | Yes | The benchmarks were conducted using JAX on a single NVIDIA GPU (T4) with a recurrent (GRU-based) policy architecture. All experiments used fixed network sizes (64-dimensional embeddings and hidden layers) with a batch size of 128 and 64 parallel environments, allowing us to isolate the effects of varying agent count. Each measurement represents the average of 100 forward passes per configuration, with operations repeated across 10 independent trials.
Software Dependencies | No | The benchmarks were conducted using JAX on a single NVIDIA GPU (T4) with a recurrent (GRU-based) policy architecture.
Experiment Setup | Yes | Table 9: Hyperparameters, Training and Evaluation for Specialisation and Synchronisation Game. Table 10: IPPO and MAPPO Hyperparameters in Dispersion. Table 13: IPPO Hyperparameters for Navigation. Table 18: Default algorithm and model hyperparameters for the Ant-v2-4x2 environment (from [54]).