Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning

Authors: Ting Zhu, Yue Jin, Jeremie Houssineau, Giovanni Montana

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide theoretical analysis supporting MMQ's potential and present empirical evaluations across various environments susceptible to RO. Our results demonstrate that MMQ frequently outperforms existing baselines, exhibiting enhanced convergence and sample efficiency.
Researcher Affiliation | Academia | 1 Department of Statistics, University of Warwick, Coventry, UK; 2 Warwick Manufacturing Group, University of Warwick, Coventry, UK; 3 School of Physical & Mathematical Sciences, Nanyang Technological University, Singapore; 4 Alan Turing Institute, London, UK
Pseudocode | Yes | Algorithm 1: MMQ for each agent i
Open Source Code | Yes | The full source code is available at https://github.com/Tingz0/Maxmax_Q_learning.
Open Datasets | Yes | Multi-agent MuJoCo Environment: We employ the Half-Cheetah 2x3 scenario from the Multi-agent MuJoCo framework (de Witt et al., 2020).
Dataset Splits | No | The paper describes custom-designed environments and scenarios, often discussing episode lengths or numbers of samples for internal algorithmic use (e.g., Monte Carlo optimization), but does not provide conventional training/validation/test splits for a fixed dataset.
Hardware Specification | No | The paper does not specify the hardware (exact GPU/CPU models, processor types, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper does not list ancillary software with version numbers (e.g., library or solver names such as Python 3.8 or PyTorch 1.9) needed to replicate the experiments.
Experiment Setup | Yes | Our implementation incorporates two key strategies. First, a delayed update approach for the actor network relative to the critic network, where the critic is updated 10 times more frequently to maintain stability (Fujimoto et al., 2018). Second, negative reward shifting (Sun et al., 2022), which enhances our double-max-style updates (see also Appendix C.1). Our evaluations... show that MMQ outperforms other algorithms with 15 samples drawn from the quantile bounds predicted by two quantile models.
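The two training strategies quoted in the Experiment Setup row can be sketched as a schedule: the critic is updated at every step while the actor update is delayed by a 10:1 ratio, and a constant negative shift is applied to rewards before forming the TD target. This is a minimal illustrative sketch, not the paper's implementation; the function name, the toy scalar critic, the learning rate, and the shift constant are all assumptions.

```python
# Illustrative sketch of (1) delayed actor updates relative to the critic
# (critic updated 10x more frequently, as in Fujimoto et al., 2018) and
# (2) negative reward shifting applied before the TD target.
# All names and constants here are hypothetical, not from the paper.

CRITIC_UPDATES_PER_ACTOR_UPDATE = 10  # 10:1 critic-to-actor update ratio
REWARD_SHIFT = -1.0                   # assumed negative shift constant

def train_loop(num_steps, rewards, gamma=0.99):
    """Run a toy training loop and count critic/actor updates."""
    critic_updates = actor_updates = 0
    value = 0.0  # toy scalar stand-in for the critic
    for step in range(1, num_steps + 1):
        # Negative reward shifting: shift rewards before the TD update.
        shifted = [r + REWARD_SHIFT for r in rewards]
        td_target = sum(shifted) / len(shifted) + gamma * value
        value += 0.1 * (td_target - value)  # critic updated every step
        critic_updates += 1
        if step % CRITIC_UPDATES_PER_ACTOR_UPDATE == 0:
            actor_updates += 1  # delayed actor update
    return critic_updates, actor_updates
```

Over 100 steps this schedule performs 100 critic updates but only 10 actor updates, matching the 10:1 ratio described in the quote.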