Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Off-Policy Action Anticipation in Multi-Agent Reinforcement Learning

Authors: Ariyan Bighashdel, Daan de Geus, Pavol Jancura, Gijs Dubbelman

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct a large set of experiments and illustrate that our proposed HOG methods outperform the existing ones regarding efficiency and performance.
Researcher Affiliation Academia Ariyan Bighashdel EMAIL Daan de Geus EMAIL Pavol Jancura EMAIL Gijs Dubbelman EMAIL Department of Electrical Engineering Eindhoven University of Technology Eindhoven, 5612 AZ, The Netherlands
Pseudocode Yes Algorithm 1: LOLA-OffPA2 for a set of n self-interested agents (N). Algorithm 2: LA-OffPA2 for a set of n self-interested agents (N). Algorithm 3: HLA-OffPA2 for a set of m common-interested agents (M).
Open Source Code Yes The source code of our OffPA2 framework is available at a GitHub repository: https://github.com/tue-mps/OffPA2
Open Datasets Yes We evaluate the methods on the non-differentiable version of the rotational game proposed by Zhang and Lesser (2010), and we refer to it as the Iterated Rotational Game (IRG). Iterated Prisoner's Dilemma (IPD) (Foerster et al., 2018a) is a five-state, two-agent, two-action game with the reward matrices depicted in Table 4. Inspired by Vinitsky et al. (2019), we propose an Exit-Room game with three levels of complexity (see Figure 5). To demonstrate the coordination capability of HLA-OffPA2, we propose the Particle-Coordination Game (PCG) in the Particle environment (Lowe et al., 2017). Furthermore, we compare the methods in three games within the multi-agent Mujoco environment (Peng et al., 2021): 1) two-agent Half-Cheetah, 2) two-agent Walker, and 3) two-agent Reacher.
Dataset Splits Yes We created separate validation and test sets for each game that included 100 and 300 randomly generated scenarios, respectively.
Hardware Specification No This work made use of the Dutch national e-infrastructure with the support of the SURF Cooperative using grant no. EINF-6816, which is financed by the Dutch Research Council (NWO).
Software Dependencies No In practice, we don't need to rely on Taylor expansion for the update rules in LOLA-OffPA2 as we can use an automatic differentiation engine, e.g., PyTorch autograd (Paszke et al., 2019), to directly compute the gradients. The algorithms are trained for 900 (in IRG) and 50 (in IPD) episodes by running the Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate of 0.01. In order to make the state-action value functions any-order differentiable, we used the SiLU nonlinear function (Elfwing et al., 2018) in between the hidden layers. For IRG, we used the Sigmoid function in the policies to output 1-D continuous actions, and for IPD, we used the Gumbel-softmax function (Jang et al., 2017) in the policies to output two discrete actions.
Experiment Setup Yes We employed Multi-Layer Perceptron (MLP) networks with two hidden layers of dimension 64 for policies and value functions. For IRG, we used the Sigmoid function in the policies to output 1-D continuous actions, and for IPD, we used the Gumbel-softmax function (Jang et al., 2017) in the policies to output two discrete actions. The algorithms are trained for 900 (in IRG) and 50 (in IPD) episodes by running the Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate of 0.01. The (projected) prediction lengths in the OffPA2 and DiCE frameworks are tuned and set to 0.8 and 0.3, respectively. Both policy and value networks consist of two parts: encoder and decoder. The encoders are CNN networks with three convolutional layers (12×90×90 → 32×21×21 → 64×9×9 → 64×7×7) and two fully connected layers (3136 → 512 → 128), with SiLU nonlinear functions (Elfwing et al., 2018) in between. The decoders are MLP networks with two hidden layers of dimension 64 for policies and value functions. The algorithms are trained for 450 (in level one) and 4500 (in levels two and three) episodes by running the Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate of 0.01. The algorithms are trained for 100k episodes by running the Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate of 0.01. In the Mujoco environment, we used the Tanh function in the policies to output the continuous actions and trained the algorithms for 10k episodes by running the Adam optimizer (Kingma and Ba, 2015) with a fixed learning rate of 0.001. The projected prediction lengths for HLA-OffPA2 agents are optimized between 0.001 and 0.1 in all games. The optimized projected prediction lengths are reported in Table 10.
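The quoted setup uses the Gumbel-softmax trick (Jang et al., 2017) so that policies can emit discrete actions while remaining differentiable. As a rough illustration of the sampling step only (not the paper's implementation, which uses PyTorch), a dependency-free sketch:

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Gumbel-softmax relaxation of categorical sampling.

    Adds Gumbel(0, 1) noise to the logits and applies a
    temperature-scaled softmax, yielding a differentiable,
    near-one-hot probability vector. `tau` is illustrative; the
    paper does not report its temperature value.
    """
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    z = [(l + g) / tau for l, g in zip(logits, gumbels)]
    # Numerically stable softmax
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

At low temperatures the output concentrates on a single action, approximating a hard sample while keeping gradients with respect to the logits well defined.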