Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation
Authors: Pier Giuseppe Sessa, Maryam Kamgarpour, Andreas Krause
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our approach experimentally on an autonomous driving simulation benchmark. H-MARL learns successful equilibrium policies after a few interactions with the environment and can significantly improve the performance compared to non-optimistic exploration methods. |
| Researcher Affiliation | Academia | 1ETH Zürich, Rämistrasse 101, 8092 Zürich. 2EPFL Lausanne, Rte Cantonale, 1015 Lausanne. |
| Pseudocode | Yes | Algorithm 1 The H-MARL algorithm |
| Open Source Code | No | The paper mentions using open-source platforms like SMARTS and RLlib, but does not explicitly state that the code for their proposed H-MARL methodology is open-source or provide a link to it. |
| Open Datasets | No | The paper mentions using the "open-source SMARTS autonomous driving platform (Zhou et al., 2020)" as an environment for experiments, which generates data online. It does not provide access information for a pre-existing publicly available dataset. |
| Dataset Splits | No | The paper describes learning through 'sequential interactions with the environment' and doesn't specify any training/validation/test dataset splits. It doesn't use a fixed dataset for which splits would apply. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper names several software components, including SMARTS, the Bullet physics engine, SUMO, RLlib, GPyTorch, and the Adam optimizer, but it does not specify their version numbers, which are needed for reproducibility. |
| Experiment Setup | Yes | Each agent has a discrete action space {keep lane, slow down, turn right, turn left} and a policy parametrized by a deep neural network with 2 hidden layers of 256 units and tanh activations (we use default policies from Zhou et al. (2020)). The hallucinated optimistic value functions UCB^i_t(·) are approximated by the sampling approach of Eq. (4) with Z = 5 samples at each time step and β_t = 1.0. GP inference is performed on the whole set of past observed trajectories {D_τ}_{τ=1}^t using GPyTorch (Gardner et al., 2018) with the Adam (Kingma & Ba, 2014) optimizer for 50 iterations with learning rate ℓ = 0.1. |
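
The Experiment Setup row condenses several concrete choices: a two-layer, 256-unit tanh policy over a four-action discrete space, a sampled approximation of the optimistic value functions with Z = 5 and β_t = 1.0, and GP hyperparameter fitting with GPyTorch and Adam (50 iterations, learning rate 0.1). The sketch below illustrates what those pieces could look like in PyTorch/GPyTorch; it is not the authors' code, and the observation dimension, the RBF kernel, the synthetic training data, and the `optimistic_value` helper are illustrative assumptions (the paper's Eq. (4) is not reproduced here).

```python
import torch
import torch.nn as nn
import gpytorch

OBS_DIM = 8    # assumption: the true observation size is defined by the SMARTS scenario
N_ACTIONS = 4  # {keep lane, slow down, turn right, turn left}

# Policy head as described in the row above: 2 hidden layers of 256 units with
# tanh activations, producing logits over the discrete action set.
policy = nn.Sequential(
    nn.Linear(OBS_DIM, 256), nn.Tanh(),
    nn.Linear(256, 256), nn.Tanh(),
    nn.Linear(256, N_ACTIONS),
)

# Exact GP for one output dimension of the learned dynamics; the RBF kernel is
# an illustrative choice, not stated in the excerpt.
class DynamicsGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

# Placeholder data standing in for the aggregated past trajectories {D_τ}.
train_x = torch.randn(64, OBS_DIM)
train_y = torch.randn(64)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
gp = DynamicsGP(train_x, train_y, likelihood)

# GP hyperparameter fitting: Adam, 50 iterations, learning rate 0.1 (as reported).
gp.train(); likelihood.train()
optimizer = torch.optim.Adam(gp.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, gp)
for _ in range(50):
    optimizer.zero_grad()
    loss = -mll(gp(train_x), train_y)
    loss.backward()
    optimizer.step()

# Hypothetical sampling-style optimistic estimate with Z = 5 and beta = 1.0:
# perturb the GP posterior mean within its beta-scaled confidence interval and
# keep the most favorable sample. This only gestures at Eq. (4) of the paper.
def optimistic_value(mu, sigma, value_fn, beta=1.0, Z=5):
    candidates = mu + beta * sigma * (2 * torch.rand(Z, *mu.shape) - 1)
    return max(value_fn(c) for c in candidates)
```

In practice a multi-dimensional dynamics model would fit one such GP per output dimension (or a batched GP), and the per-agent policies would be trained through RLlib against the hallucinated model rather than in isolation as shown here.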