Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning
Authors: Ting Zhu, Yue Jin, Jeremie Houssineau, Giovanni Montana
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical analysis supporting MMQ's potential and present empirical evaluations across various environments susceptible to RO. Our results demonstrate that MMQ frequently outperforms existing baselines, exhibiting enhanced convergence and sample efficiency. |
| Researcher Affiliation | Academia | 1Department of Statistics, University of Warwick, Coventry, UK 2Warwick Manufacturing Group, University of Warwick, Coventry, UK 3School of Physical & Mathematical Sciences, Nanyang Technological University, Singapore 4Alan Turing Institute, London, UK |
| Pseudocode | Yes | Algorithm 1: MMQ for each agent i |
| Open Source Code | Yes | The full source code is available at https://github.com/Tingz0/Maxmax_Q_learning. |
| Open Datasets | Yes | Multi-Agent MuJoCo Environment: We employ the Half-Cheetah 2x3 scenario from the Multi-Agent MuJoCo framework (de Witt et al., 2020). |
| Dataset Splits | No | The paper describes custom-designed environments and scenarios, often discussing episode lengths or number of samples for internal algorithmic use (e.g., Monte Carlo optimization), but does not provide specific training/test/validation dataset splits in the conventional sense for a fixed dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9) needed to replicate the experiment. |
| Experiment Setup | Yes | Our implementation incorporates two key strategies. First, a delayed update approach for the actor network relative to the critic network, where the critic is updated 10 times more frequently to maintain stability (Fujimoto et al., 2018). Second, negative reward shifting (Sun et al., 2022), which enhances our double-max-style updates (see also Appendix C.1). Our evaluations... show that MMQ outperforms other algorithms with 15 samples drawn from the quantile bounds predicted by two quantile models. |
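The delayed-update schedule quoted in the Experiment Setup row (the critic updated 10 times for each actor update) can be sketched as below. This is an illustrative sketch only, not the authors' implementation; the update functions are hypothetical placeholders standing in for gradient steps, and the shift constant in the reward-shifting helper is an arbitrary example value.

```python
# Sketch of the two strategies quoted above (assumptions: placeholder update
# functions; the shift constant is illustrative, not taken from the paper).

CRITIC_UPDATES_PER_ACTOR_UPDATE = 10  # ratio stated in the paper


def shift_reward(reward: float, shift: float = 1.0) -> float:
    """Negative reward shifting: subtract a constant so rewards are
    (more) negative, which the paper says aids double-max-style updates."""
    return reward - shift


def train_loop(num_critic_updates: int) -> list[str]:
    """Run critic updates, performing one (delayed) actor update after
    every CRITIC_UPDATES_PER_ACTOR_UPDATE critic updates."""
    log = []
    for step in range(1, num_critic_updates + 1):
        log.append("critic")  # stand-in for one critic gradient step
        if step % CRITIC_UPDATES_PER_ACTOR_UPDATE == 0:
            log.append("actor")  # delayed actor (policy) gradient step
    return log


log = train_loop(30)
# 30 critic updates interleave exactly 3 actor updates
assert log.count("critic") == 30 and log.count("actor") == 3
assert shift_reward(0.5) == -0.5
```

The ratio mirrors the delayed-policy-update idea of TD3 (Fujimoto et al., 2018) cited in the row above: the critic is given more gradient steps so the actor always optimizes against a relatively stable value estimate.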