Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Off-Policy Action Anticipation in Multi-Agent Reinforcement Learning

Authors: Ariyan Bighashdel, Daan de Geus, Pavol Jancura, Gijs Dubbelman

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct a large set of experiments and illustrate that our proposed HOG methods outperform the existing ones regarding efficiency and performance.
Researcher Affiliation Academia Ariyan Bighashdel EMAIL Daan de Geus EMAIL Pavol Jancura EMAIL Gijs Dubbelman EMAIL Department of Electrical Engineering Eindhoven University of Technology Eindhoven, 5612 AZ, The Netherlands
Pseudocode Yes Algorithm 1: LOLA-OffPA2 for a set of n self-interested agents (N). Algorithm 2: LA-OffPA2 for a set of n self-interested agents (N). Algorithm 3: HLA-OffPA2 for a set of m common-interested agents (M).
Open Source Code Yes The source code of our OffPA2 framework is available at a GitHub repository: https://github.com/tue-mps/OffPA2
Open Datasets Yes We evaluate the methods on the non-differentiable version of the rotational game proposed by Zhang and Lesser (2010), and we refer to it as the Iterated Rotational Game (IRG). Iterated Prisoner's Dilemma (IPD) (Foerster et al., 2018a) is a five-state, two-agent, two-action game with the reward matrices depicted in Table 4. Inspired by Vinitsky et al. (2019), we propose an Exit-Room game with three levels of complexity (see Figure 5). To demonstrate the coordination capability of HLA-OffPA2, we propose the Particle-Coordination Game (PCG) in the Particle environment (Lowe et al., 2017). Furthermore, we compare the methods in three games within the multi-agent Mujoco environment (Peng et al., 2021): 1) two-agent Half-Cheetah, 2) two-agent Walker, and 3) two-agent Reacher.
Dataset Splits Yes We created separate validation and test sets for each game that included 100 and 300 randomly generated scenarios, respectively.
Hardware Specification No This work made use of the Dutch national e-infrastructure with the support of the SURF Cooperative using grant no. EINF-6816, which is financed by the Dutch Research Council (NWO).
Software Dependencies No In practice, we don't need to rely on Taylor expansion for the update rules in LOLA-OffPA2 as we can use an automatic differentiation engine, e.g., PyTorch autograd (Paszke et al., 2019), to directly compute the gradients. The algorithms are trained for 900 (in IRG) and 50 (in IPD) episodes by running the Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate of 0.01. In order to make the state-action value functions any-order differentiable, we used the SiLU nonlinear function (Elfwing et al., 2018) in between the hidden layers. For IRG, we used the Sigmoid function in the policies to output 1-D continuous actions, and for IPD, we used the Gumbel-softmax function (Jang et al., 2017) in the policies to output two discrete actions.
Experiment Setup Yes We employed Multi-Layer Perceptron (MLP) networks with two hidden layers of dimension 64 for policies and value functions. For IRG, we used the Sigmoid function in the policies to output 1-D continuous actions, and for IPD, we used the Gumbel-softmax function (Jang et al., 2017) in the policies to output two discrete actions. The algorithms are trained for 900 (in IRG) and 50 (in IPD) episodes by running the Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate of 0.01. The (projected) prediction lengths in the OffPA2 and DiCE frameworks are tuned and set to 0.8 and 0.3, respectively. Both policy and value networks consist of two parts: encoder and decoder. The encoders are CNN networks with three convolutional layers (12×90×90 → 32×21×21 → 64×9×9 → 64×7×7) and two fully connected layers (3136 → 512 → 128), with SiLU nonlinear functions (Elfwing et al., 2018) in between. The decoders are MLP networks with two hidden layers of dimension 64 for policies and value functions. The algorithms are trained for 450 (in level one) and 4500 (in levels two and three) episodes by running the Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate of 0.01. The algorithms are trained for 100k episodes by running the Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate of 0.01. In the Mujoco environment, we used the Tanh function in the policies to output the continuous actions and trained the algorithms for 10k episodes by running the Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate of 0.001. The projected prediction lengths for HLA-OffPA2 agents are optimized between 0.001 and 0.1 in all games. The optimized projected prediction lengths are reported in Table 10.
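The quoted setup uses the Gumbel-softmax trick (Jang et al., 2017) so that policies can emit discrete actions while remaining differentiable. As a rough illustration of the sampling step only (not the paper's implementation, which uses PyTorch), a dependency-free sketch:

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Gumbel-softmax relaxation of categorical sampling.

    Adds Gumbel(0, 1) noise to the logits and applies a
    temperature-scaled softmax, yielding a differentiable,
    near-one-hot probability vector. `tau` is illustrative; the
    paper does not report its temperature value.
    """
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    z = [(l + g) / tau for l, g in zip(logits, gumbels)]
    # Numerically stable softmax
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

At low temperatures the output concentrates on a single action, approximating a hard sample while keeping gradients with respect to the logits well defined.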