Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning

Authors: Ting Zhu, Yue Jin, Jeremie Houssineau, Giovanni Montana

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide theoretical analysis supporting MMQ's potential and present empirical evaluations across various environments susceptible to RO. Our results demonstrate that MMQ frequently outperforms existing baselines, exhibiting enhanced convergence and sample efficiency.
Researcher Affiliation | Academia | 1 Department of Statistics, University of Warwick, Coventry, UK; 2 Warwick Manufacturing Group, University of Warwick, Coventry, UK; 3 School of Physical & Mathematical Sciences, Nanyang Technological University, Singapore; 4 Alan Turing Institute, London, UK
Pseudocode | Yes | Algorithm 1: MMQ for each agent i
Open Source Code | Yes | The full source code is available at https://github.com/Tingz0/Maxmax_Q_learning.
Open Datasets | Yes | Multi-agent MuJoCo Environment: We employ the Half-Cheetah 2x3 scenario from the Multi-agent MuJoCo framework (de Witt et al., 2020).
Dataset Splits | No | The paper describes custom-designed environments and scenarios, often discussing episode lengths or numbers of samples for internal algorithmic use (e.g., Monte Carlo optimization), but does not provide conventional training/validation/test splits for a fixed dataset.
Hardware Specification | No | The paper does not specify the hardware (exact GPU/CPU models, processor types, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper does not list ancillary software with version numbers (e.g., library or solver names such as Python 3.8 or PyTorch 1.9) needed to replicate the experiments.
Experiment Setup | Yes | Our implementation incorporates two key strategies. First, a delayed update approach for the actor network relative to the critic network, where the critic is updated 10 times more frequently to maintain stability (Fujimoto et al., 2018). Second, negative reward shifting (Sun et al., 2022), which enhances our double-max-style updates (see also Appendix C.1). Our evaluations... show that MMQ outperforms other algorithms with 15 samples drawn from the quantile bounds predicted by two quantile models.
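The two training strategies quoted in the Experiment Setup row can be sketched as a schedule: the critic is updated at every step while the actor update is delayed by a 10:1 ratio, and a constant negative shift is applied to rewards before forming the TD target. This is a minimal illustrative sketch, not the paper's implementation; the function name, the toy scalar critic, the learning rate, and the shift constant are all assumptions.

```python
# Illustrative sketch of (1) delayed actor updates relative to the critic
# (critic updated 10x more frequently, as in Fujimoto et al., 2018) and
# (2) negative reward shifting applied before the TD target.
# All names and constants here are hypothetical, not from the paper.

CRITIC_UPDATES_PER_ACTOR_UPDATE = 10  # 10:1 critic-to-actor update ratio
REWARD_SHIFT = -1.0                   # assumed negative shift constant

def train_loop(num_steps, rewards, gamma=0.99):
    """Run a toy training loop and count critic/actor updates."""
    critic_updates = actor_updates = 0
    value = 0.0  # toy scalar stand-in for the critic
    for step in range(1, num_steps + 1):
        # Negative reward shifting: shift rewards before the TD update.
        shifted = [r + REWARD_SHIFT for r in rewards]
        td_target = sum(shifted) / len(shifted) + gamma * value
        value += 0.1 * (td_target - value)  # critic updated every step
        critic_updates += 1
        if step % CRITIC_UPDATES_PER_ACTOR_UPDATE == 0:
            actor_updates += 1  # delayed actor update
    return critic_updates, actor_updates
```

Over 100 steps this schedule performs 100 critic updates but only 10 actor updates, matching the 10:1 ratio described in the quote.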