Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation
Authors: Pier Giuseppe Sessa, Maryam Kamgarpour, Andreas Krause
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our approach experimentally on an autonomous driving simulation benchmark. H-MARL learns successful equilibrium policies after a few interactions with the environment and can significantly improve the performance compared to non-optimistic exploration methods. |
| Researcher Affiliation | Academia | 1ETH Zürich, Rämistrasse 101, 8092 Zürich. 2EPFL Lausanne, Rte Cantonale, 1015 Lausanne. |
| Pseudocode | Yes | Algorithm 1 The H-MARL algorithm |
| Open Source Code | No | The paper mentions using open-source platforms like SMARTS and RLlib, but does not explicitly state that the code for their proposed H-MARL methodology is open-source or provide a link to it. |
| Open Datasets | No | The paper mentions using the "open-source SMARTS autonomous driving platform (Zhou et al., 2020)" as an environment for experiments, which generates data online. It does not provide access information for a pre-existing publicly available dataset. |
| Dataset Splits | No | The paper describes learning through 'sequential interactions with the environment' and doesn't specify any training/validation/test dataset splits. It doesn't use a fixed dataset for which splits would apply. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper names several software components, including SMARTS, the Bullet physics engine, SUMO, RLlib, GPyTorch, and the Adam optimizer, but it does not specify their version numbers, which are needed for reproducibility. |
| Experiment Setup | Yes | Each agent has a discrete action space {keep lane, slow down, turn right, turn left} and a policy parametrized by a deep neural network with 2 hidden layers of 256 units and tanh activations (we use default policies from Zhou et al. (2020)). The hallucinated optimistic value functions UCB^i_t(·) are approximated by the sampling approach of Eq. (4) with Z = 5 samples at each time step and β_t = 1.0. GP inference is performed on the whole set of past observed trajectories {D_τ}_{τ=1}^t using GPyTorch (Gardner et al., 2018) with the Adam (Kingma & Ba, 2014) optimizer for 50 iterations with learning rate ℓ = 0.1. |
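
The Experiment Setup row condenses several concrete choices: a two-layer, 256-unit tanh policy over a four-action discrete space, a sampled approximation of the optimistic value functions with Z = 5 and β_t = 1.0, and GP hyperparameter fitting with GPyTorch and Adam (50 iterations, learning rate 0.1). The sketch below illustrates what those pieces could look like in PyTorch/GPyTorch; it is not the authors' code, and the observation dimension, the RBF kernel, the synthetic training data, and the `optimistic_value` helper are illustrative assumptions (the paper's Eq. (4) is not reproduced here).

```python
import torch
import torch.nn as nn
import gpytorch

OBS_DIM = 8    # assumption: the true observation size is defined by the SMARTS scenario
N_ACTIONS = 4  # {keep lane, slow down, turn right, turn left}

# Policy head as described in the row above: 2 hidden layers of 256 units with
# tanh activations, producing logits over the discrete action set.
policy = nn.Sequential(
    nn.Linear(OBS_DIM, 256), nn.Tanh(),
    nn.Linear(256, 256), nn.Tanh(),
    nn.Linear(256, N_ACTIONS),
)

# Exact GP for one output dimension of the learned dynamics; the RBF kernel is
# an illustrative choice, not stated in the excerpt.
class DynamicsGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

# Placeholder data standing in for the aggregated past trajectories {D_τ}.
train_x = torch.randn(64, OBS_DIM)
train_y = torch.randn(64)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
gp = DynamicsGP(train_x, train_y, likelihood)

# GP hyperparameter fitting: Adam, 50 iterations, learning rate 0.1 (as reported).
gp.train(); likelihood.train()
optimizer = torch.optim.Adam(gp.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, gp)
for _ in range(50):
    optimizer.zero_grad()
    loss = -mll(gp(train_x), train_y)
    loss.backward()
    optimizer.step()

# Hypothetical sampling-style optimistic estimate with Z = 5 and beta = 1.0:
# perturb the GP posterior mean within its beta-scaled confidence interval and
# keep the most favorable sample. This only gestures at Eq. (4) of the paper.
def optimistic_value(mu, sigma, value_fn, beta=1.0, Z=5):
    candidates = mu + beta * sigma * (2 * torch.rand(Z, *mu.shape) - 1)
    return max(value_fn(c) for c in candidates)
```

In practice a multi-dimensional dynamics model would fit one such GP per output dimension (or a batched GP), and the per-agent policies would be trained through RLlib against the hallucinated model rather than in isolation as shown here.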