Probabilistic Recursive Reasoning for Multi-Agent Reinforcement Learning

Authors: Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, Wei Pan

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of PR2 methods on iterated matrix games, differential games, and the particle world environment. These games are designed to have non-trivial equilibria that require a certain level of intelligent reasoning between agents. We compare our algorithm with a series of baselines.
Researcher Affiliation | Academia | University College London; Delft University of Technology. {ying.wen, yaodong.yang, rui.luo, jun.wang}@cs.ucl.ac.uk; {wei.pan}@tudelft.nl
Pseudocode | Yes | "Algorithm 1: Multi-Agent Probabilistic Recursive Reasoning Actor-Critic (PR2-AC)" and "Algorithm 2: Multi-Agent Probabilistic Recursive Reasoning Q-Learning (PR2-Q)"
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | "We adopt the same differential game, the Max of Two Quadratic Game, as Panait et al. (2006); Wei et al. (2018)" and "We further test our method on the multi-state multi-player Particle World Environments (Lowe et al., 2017)". (An illustrative sketch of the differential game appears after the table.)
Dataset Splits | No | The paper specifies training iterations and steps but does not provide explicit details on train/validation/test dataset splits or mention a specific validation split.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependency details, such as library names with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | For the experiment settings, all the policies and Q-functions are parameterized by MLPs with 2 hidden layers, each with 100 units and ReLU activation. The sampling network ξ for ρ^{-i}_{φ^i} in SVGD follows the standard normal distribution. In the iterated matrix game, all the methods, including the baselines, are trained for 500 iterations. In the differential game, the agents are trained for 350 iterations with 25 steps per iteration. For the actor-critic methods, the exploration noise is set to 0.1 in the first 1000 steps, and the annealing parameters for PR2-AC and MASQL are set to 0.5 to balance between exploration and acting as the best response.
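
To make the Open Datasets row concrete, below is a minimal sketch of a Max-of-Two-Quadratics-style differential game: a continuous two-agent game whose shared reward is the maximum of two quadratic surfaces, with a wide but low local optimum and a narrow but higher global optimum. The constants (`local_center`, `global_bonus`, widths, etc.) are illustrative placeholders, not the exact values used by Panait et al. (2006) or Wei et al. (2018); only the overall structure of the game is what matters here.

```python
def max_of_two_quadratics(a1, a2,
                          local_center=-5.0, local_scale=0.8, local_width=3.0,
                          global_center=5.0, global_bonus=10.0, global_width=1.0):
    """Shared reward for a two-agent continuous game: the max of two quadratics.

    The first quadratic is wide but low (an easy-to-find local optimum); the
    second is narrow but higher (the global optimum). Both agents receive the
    same reward, so reaching the global optimum requires both agents to move
    their actions near `global_center` together. All constants are illustrative.
    """
    f1 = local_scale * (-((a1 - local_center) / local_width) ** 2
                        - ((a2 - local_center) / local_width) ** 2)
    f2 = (-((a1 - global_center) / global_width) ** 2
          - ((a2 - global_center) / global_width) ** 2) + global_bonus
    return max(f1, f2)

# Joint actions near the narrow global optimum are rewarded most; unilateral
# deviation toward it is punished, which is why coordinated reasoning is needed.
print(max_of_two_quadratics(5.0, 5.0))    # ~10.0, global optimum
print(max_of_two_quadratics(-5.0, -5.0))  # ~0.0, local optimum
print(max_of_two_quadratics(5.0, -5.0))   # negative: agents miscoordinate
```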
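For the Experiment Setup row, the following is a minimal sketch of how the stated parameterization could be instantiated, assuming PyTorch (the paper does not name its framework): policies and Q-functions as 2-hidden-layer, 100-unit ReLU MLPs, a sampling network whose noise input ξ is drawn from a standard normal as in amortized SVGD, and 0.1 exploration noise for the first 1000 steps. All names, dimensions, and the sampler's input layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=100):
    """2 hidden layers of 100 ReLU units, matching the described setup."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

obs_dim, act_dim, opp_act_dim = 4, 2, 2  # illustrative dimensions

# Agent i's policy and joint Q-function Q^i(s, a^i, a^{-i}).
policy_i = mlp(obs_dim, act_dim)
q_i = mlp(obs_dim + act_dim + opp_act_dim, 1)

# Sampling network for the opponent model rho^{-i}_{phi^i}: maps (s, a^i, xi)
# to a sampled opponent action, with xi drawn from a standard normal as stated.
opp_sampler = mlp(obs_dim + act_dim + act_dim, opp_act_dim)

def sample_opponent_action(s, a_i):
    xi = torch.randn(s.shape[0], act_dim)  # standard-normal noise input
    return opp_sampler(torch.cat([s, a_i, xi], dim=-1))

def explore(a, step, noise_std=0.1, noise_steps=1000):
    """Exploration noise of 0.1 on the actor's action for the first 1000 steps."""
    return a + noise_std * torch.randn_like(a) if step < noise_steps else a
```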