Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation

Authors: Pier Giuseppe Sessa, Maryam Kamgarpour, Andreas Krause

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate our approach experimentally on an autonomous driving simulation benchmark. H-MARL learns successful equilibrium policies after a few interactions with the environment and can significantly improve the performance compared to non-optimistic exploration methods.
Researcher Affiliation | Academia | ¹ETH Zürich, Rämistrasse 101, 8092 Zürich. ²EPFL Lausanne, Rte Cantonale, 1015 Lausanne.
Pseudocode | Yes | Algorithm 1: The H-MARL algorithm
Open Source Code | No | The paper mentions using open-source platforms such as SMARTS and RLlib, but it does not state that the code for its proposed H-MARL method is open source or provide a link to it.
Open Datasets | No | The paper mentions using the "open-source SMARTS autonomous driving platform (Zhou et al., 2020)" as an environment for experiments, which generates data online. It does not provide access information for a pre-existing publicly available dataset.
Dataset Splits | No | The paper describes learning through "sequential interactions with the environment" and does not specify any training/validation/test splits; it does not use a fixed dataset to which splits would apply.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types and speeds, memory amounts, or other machine specifications) used for running its experiments.
Software Dependencies | No | The paper names several software components, such as SMARTS, the Bullet physics engine, SUMO, RLlib, GPyTorch, and the Adam optimizer, but it does not specify their version numbers, which are required for reproducibility.
Experiment Setup | Yes | Each agent has a discrete action space: {keep lane, slow down, turn right, turn left} and a policy parametrized by a deep neural network with 2 hidden layers of 256 units and tanh activations (we use default policies from Zhou et al. (2020)). The hallucinated optimistic value functions UCB_t^i(·) are approximated by the sampling approach of Eq. (4) with Z = 5 samples at each time step and β_t = 1.0. GP inference is performed on the whole set of past observed trajectories {D_τ}_{τ=1}^t using GPyTorch (Gardner et al., 2018) with the Adam (Kingma & Ba, 2014) optimizer for 50 iterations with learning rate l = 0.1.
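
For concreteness, below is a minimal Python sketch of the setup quoted in the last row. It is an illustration, not the authors' code: the names make_policy, DynamicsGP, fit_dynamics_gp, and optimistic_ucb are hypothetical, the RBF kernel and single-output GP are assumptions (the paper only states that GP inference is done with GPyTorch and Adam), and the sampling-based optimistic value is only a schematic reading of Eq. (4) that draws Z = 5 perturbed predictions inside a β_t-scaled confidence interval and keeps the most optimistic one.

```python
import torch
import torch.nn as nn
import gpytorch


# Illustrative policy head: 2 hidden layers of 256 units with tanh activations,
# mapping an observation to logits over the 4 discrete actions
# {keep lane, slow down, turn right, turn left}.
def make_policy(obs_dim: int, n_actions: int = 4) -> nn.Module:
    return nn.Sequential(
        nn.Linear(obs_dim, 256), nn.Tanh(),
        nn.Linear(256, 256), nn.Tanh(),
        nn.Linear(256, n_actions),
    )


# Exact GP for one output dimension of the transition model.
# The RBF kernel and constant mean are assumptions, not stated in the paper.
class DynamicsGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


def fit_dynamics_gp(train_x, train_y, n_iters: int = 50, lr: float = 0.1):
    """Fit GP hyperparameters with Adam for 50 iterations at learning rate 0.1,
    matching the settings reported in the experiment setup."""
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = DynamicsGP(train_x, train_y, likelihood)
    model.train()
    likelihood.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    for _ in range(n_iters):
        optimizer.zero_grad()
        loss = -mll(model(train_x), train_y)
        loss.backward()
        optimizer.step()
    model.eval()
    likelihood.eval()
    return model, likelihood


def optimistic_ucb(model, likelihood, x, value_fn, z_samples: int = 5, beta_t: float = 1.0):
    """Schematic sampling approximation of a hallucinated optimistic value:
    draw Z plausible predictions inside the beta_t-scaled GP confidence interval
    and keep the most optimistic downstream value. This mirrors the Z = 5,
    beta_t = 1.0 setting quoted above; the exact form of Eq. (4) is in the paper."""
    with torch.no_grad(), gpytorch.settings.fast_pred_var():
        post = likelihood(model(x))
        mean, std = post.mean, post.stddev
        candidates = [
            mean + beta_t * std * (2.0 * torch.rand_like(std) - 1.0)
            for _ in range(z_samples)
        ]
        return torch.stack([value_fn(c) for c in candidates]).max(dim=0).values
```

The quoted hyperparameters (two 256-unit tanh layers, Z = 5, β_t = 1.0, 50 Adam iterations at learning rate 0.1) come directly from the row above; everything else in the sketch is an assumption.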