OPtions as REsponses: Grounding behavioural hierarchies in multi-agent reinforcement learning
Authors: Alexander Vezhnevets, Yuhuai Wu, Maria Eckstein, Rémi Leblond, Joel Z Leibo
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that the proposed hierarchical agent is capable of generalisation to unseen opponents, while conventional baselines fail to generalise whatsoever. Experimental results are presented in Section 5. |
| Researcher Affiliation | Collaboration | ¹DeepMind, London, UK; ²University of Toronto, Canada; ³University of California, Berkeley, USA. |
| Pseudocode | Yes | The pseudo-code is provided in the supplementary material section D. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | No | The paper introduces two novel grid world multi-agent games, Running With Scissors (RWS) and RPS Arena, and describes how agents were trained within these environments, but it does not provide concrete access information (link, DOI, formal citation) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits) for train/validation/test sets. It describes a self-play training setup and evaluation against held-out opponents, but not traditional dataset splits. |
| Hardware Specification | No | A few hundred CPU actors generate game trajectories, which are then batched and consumed by a GPU learner (one per unique agent). The paper mentions 'CPU actors' and 'GPU learner' but does not provide specific hardware details (e.g., model numbers, processor types). |
| Software Dependencies | No | The paper mentions using an 'IMPALA (Espeholt et al., 2018) based computational setup' but does not specify software names with version numbers for other ancillary software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The network starts with a 1D convolutional layer with 6 channels, followed by an MLP with (64, 64) neurons, then by an LSTM with 128 hidden neurons. We construct 16 policy heads (one for each η), where each is an MLP with 128 hidden neurons taking LSTM output as an input. ... The learner receives temporally truncated sequences of 100 steps of trajectories in batches of 16. ... The hyper-parameters were tuned on the RWS task in the regime of training and evaluating against competitors; we first tuned the parameters of the baseline, then tuned the extra hyper-parameters of OPRE. The appendix contains all hyper-parameters, their values, and the precise network parametrisation, such as layer sizes. |
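
Since the paper does not release code, the following is a minimal sketch of the network described in the Experiment Setup row above: a 1D convolution with 6 channels, an MLP with (64, 64) units, an LSTM with 128 hidden units, and 16 policy heads, each an MLP with 128 hidden units. The use of PyTorch, the observation shape, the convolution kernel size, and the action count are all assumptions made for illustration; the paper's appendix contains the authors' exact parametrisation.

```python
import torch
import torch.nn as nn

class OPREPolicyNetwork(nn.Module):
    """Illustrative sketch of the architecture quoted above.

    Assumptions (not from the paper): observation laid out as
    [batch, time, channels, length], kernel size 3, 8 actions.
    """

    def __init__(self, obs_channels=5, obs_length=11, num_heads=16, num_actions=8):
        super().__init__()
        # 1D convolutional layer with 6 output channels.
        self.conv = nn.Conv1d(obs_channels, 6, kernel_size=3, padding=1)
        # MLP with (64, 64) hidden units.
        self.mlp = nn.Sequential(
            nn.Linear(6 * obs_length, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        # LSTM with 128 hidden units.
        self.lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
        # 16 policy heads (one per latent option η), each an MLP with 128 hidden units.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, num_actions))
            for _ in range(num_heads)
        )

    def forward(self, obs, state=None):
        # obs: [batch, time, channels, length]; merge batch/time for the conv.
        b, t = obs.shape[:2]
        x = torch.relu(self.conv(obs.flatten(0, 1))).flatten(1)
        x = self.mlp(x).view(b, t, -1)                  # back to [batch, time, 64]
        x, state = self.lstm(x, state)                  # [batch, time, 128]
        # One set of action logits per policy head: [batch, time, heads, actions].
        logits = torch.stack([head(x) for head in self.heads], dim=2)
        return logits, state

# Example: a batch of 16 truncated trajectories of 100 steps, matching the quoted setup.
net = OPREPolicyNetwork()
logits, state = net(torch.randn(16, 100, 5, 11))
print(logits.shape)  # torch.Size([16, 100, 16, 8])
```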