Probabilistic Recursive Reasoning for Multi-Agent Reinforcement Learning

Authors: Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, Wei Pan

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of PR2 methods on iterated matrix games, differential games, and the particle world environment. These games are designed to have non-trivial equilibria that require a certain level of intelligent reasoning between agents. We compare our algorithm with a series of baselines.
Researcher Affiliation | Academia | University College London; Delft University of Technology. {ying.wen, yaodong.yang, rui.luo, jun.wang}@cs.ucl.ac.uk; {wei.pan}@tudelft.nl
Pseudocode | Yes | "Algorithm 1: Multi-Agent Probabilistic Recursive Reasoning Actor-Critic (PR2-AC)" and "Algorithm 2: Multi-Agent Probabilistic Recursive Reasoning Q-Learning (PR2-Q)"
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | "We adopt the same differential game, the Max of Two Quadratic Game, as Panait et al. (2006); Wei et al. (2018)" and "We further test our method on the multi-state multi-player Particle World Environments (Lowe et al., 2017)". (An illustrative sketch of the differential game appears after the table.)
Dataset Splits | No | The paper specifies training iterations and steps but does not provide explicit details on train/validation/test dataset splits or mention a specific validation split.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependency details, such as library names with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For the experiment settings, all the policies and Q-functions are parameterized by MLPs with 2 hidden layers, each with 100 units and ReLU activation. The sampling network ξ for ρ^{-i}_{φ^i} in SVGD follows the standard normal distribution. In the iterated matrix game, all the methods, including the baselines, are trained for 500 iterations. In the differential game, the agents are trained for 350 iterations with 25 steps per iteration. For the actor-critic methods, the exploration noise is set to 0.1 in the first 1000 steps, and the annealing parameters for PR2-AC and MASQL are set to 0.5 to balance between exploration and acting as the best response.
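
To make the Open Datasets row concrete, below is a minimal sketch of a Max-of-Two-Quadratics-style differential game: a continuous two-agent game whose shared reward is the maximum of two quadratic surfaces, with a wide but low local optimum and a narrow but higher global optimum. The constants (`local_center`, `global_bonus`, widths, etc.) are illustrative placeholders, not the exact values used by Panait et al. (2006) or Wei et al. (2018); only the overall structure of the game is what matters here.

```python
def max_of_two_quadratics(a1, a2,
                          local_center=-5.0, local_scale=0.8, local_width=3.0,
                          global_center=5.0, global_bonus=10.0, global_width=1.0):
    """Shared reward for a two-agent continuous game: the max of two quadratics.

    The first quadratic is wide but low (an easy-to-find local optimum); the
    second is narrow but higher (the global optimum). Both agents receive the
    same reward, so reaching the global optimum requires both agents to move
    their actions near `global_center` together. All constants are illustrative.
    """
    f1 = local_scale * (-((a1 - local_center) / local_width) ** 2
                        - ((a2 - local_center) / local_width) ** 2)
    f2 = (-((a1 - global_center) / global_width) ** 2
          - ((a2 - global_center) / global_width) ** 2) + global_bonus
    return max(f1, f2)

# Joint actions near the narrow global optimum are rewarded most; unilateral
# deviation toward it is punished, which is why coordinated reasoning is needed.
print(max_of_two_quadratics(5.0, 5.0))    # ~10.0, global optimum
print(max_of_two_quadratics(-5.0, -5.0))  # ~0.0, local optimum
print(max_of_two_quadratics(5.0, -5.0))   # negative: agents miscoordinate
```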
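For the Experiment Setup row, the following is a minimal sketch of how the stated parameterization could be instantiated, assuming PyTorch (the paper does not name its framework): policies and Q-functions as 2-hidden-layer, 100-unit ReLU MLPs, a sampling network whose noise input ξ is drawn from a standard normal as in amortized SVGD, and 0.1 exploration noise for the first 1000 steps. All names, dimensions, and the sampler's input layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=100):
    """2 hidden layers of 100 ReLU units, matching the described setup."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

obs_dim, act_dim, opp_act_dim = 4, 2, 2  # illustrative dimensions

# Agent i's policy and joint Q-function Q^i(s, a^i, a^{-i}).
policy_i = mlp(obs_dim, act_dim)
q_i = mlp(obs_dim + act_dim + opp_act_dim, 1)

# Sampling network for the opponent model rho^{-i}_{phi^i}: maps (s, a^i, xi)
# to a sampled opponent action, with xi drawn from a standard normal as stated.
opp_sampler = mlp(obs_dim + act_dim + act_dim, opp_act_dim)

def sample_opponent_action(s, a_i):
    xi = torch.randn(s.shape[0], act_dim)  # standard-normal noise input
    return opp_sampler(torch.cat([s, a_i, xi], dim=-1))

def explore(a, step, noise_std=0.1, noise_steps=1000):
    """Exploration noise of 0.1 on the actor's action for the first 1000 steps."""
    return a + noise_std * torch.randn_like(a) if step < noise_steps else a
```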