Modeling Others using Oneself in Multi-Agent Reinforcement Learning

Authors: Roberta Raileanu, Emily Denton, Arthur Szlam, Rob Fergus

Venue: ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this approach on three different tasks and show that the agents are able to learn better policies using their estimate of the other players' goals, in both cooperative and competitive settings.
Researcher Affiliation | Collaboration | 1 New York University, New York City, USA; 2 Facebook AI Research, New York City, USA.
Pseudocode | Yes | Algorithm 1 represents the pseudo-code for training a SOM agent for one episode.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | All the tasks considered are created in the Mazebase gridworld environment (Sukhbaatar et al., 2015).
Dataset Splits | No | The paper mentions training models ('In all our experiments, we train the agents' policies...') but does not explicitly describe a validation dataset split or a methodology such as cross-validation for hyperparameter tuning.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or cloud computing specifications.
Software Dependencies | No | The paper mentions various algorithms and models used (e.g., 'Asynchronous Advantage Actor-Critic (A3C)', 'Adam', 'LSTM', 'ELU'), but it does not specify any software names with version numbers, such as programming language versions or library versions (e.g., 'Python 3.8', 'PyTorch 1.9').
Experiment Setup | Yes | In all our experiments, we train the agents' policies using A3C (Mnih et al., 2016) with an entropy coefficient of 0.01, a value loss coefficient of 0.5, and a discount factor of 0.99. The parameters of the agents' policies are optimized using Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999, ϵ = 1×10⁻⁸, and weight decay 0. SGD with a learning rate of 0.1 was used for inferring the other agent's goal, z_other. The hidden layer dimension of the policy network was 64 for the Coin and Recipe Games and 128 for the Door Game. We use a learning rate of 1×10⁻⁴ for all games and models. All the results shown are for 10 optimization updates of z_other at each step in the game, unless mentioned otherwise.
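The quoted experiment setup pins down the optimizers and the per-step goal-inference schedule, but the paper releases no code. The sketch below is a minimal, assumption-laden illustration of how those reported numbers could be wired together in PyTorch: Adam with the quoted betas, epsilon, and zero weight decay for the policy parameters, and 10 SGD updates at learning rate 0.1 on the inferred goal z_other at each game step. The `PolicyNet` class, its feedforward structure, and all tensor shapes are hypothetical stand-ins (the paper's policy uses an LSTM with ELU units); only the hyperparameter values come from the quoted text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Illustrative stand-in for the agent's policy: maps (observation, goal
    embedding) to action logits. Only the hidden sizes (64 or 128) are quoted
    in the paper; everything else here is assumed."""

    def __init__(self, obs_dim, goal_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim + goal_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs, z):
        # The paper mentions ELU nonlinearities; the feedforward layout is a simplification.
        h = F.elu(self.fc1(torch.cat([obs, z], dim=-1)))
        return self.fc2(h)


def make_policy_optimizer(policy_net):
    # Adam settings quoted in the experiment setup:
    # lr 1e-4, beta1 0.9, beta2 0.999, eps 1e-8, weight decay 0.
    return torch.optim.Adam(
        policy_net.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0
    )


def infer_other_goal(policy_net, obs_other, action_other, z_other,
                     num_updates=10, lr=0.1):
    """One game step of goal inference: refine the estimate z_other of the other
    agent's goal by maximizing the likelihood of its observed action under the
    agent's OWN policy network (10 SGD updates at lr 0.1, as quoted)."""
    z_other = z_other.clone().detach().requires_grad_(True)
    optimizer = torch.optim.SGD([z_other], lr=lr)
    for _ in range(num_updates):
        optimizer.zero_grad()
        logits = policy_net(obs_other, z_other)              # reuse own policy to model the other agent
        loss = F.cross_entropy(logits.unsqueeze(0), action_other.view(1))
        loss.backward()
        optimizer.step()
    policy_net.zero_grad()  # discard gradients accumulated on the policy during inference
    return z_other.detach()
```

In the paper's approach, the refined z_other estimate is then fed back into the agent's own policy when it selects its next action, while the policy parameters themselves are trained with A3C using the quoted entropy coefficient (0.01), value loss coefficient (0.5), and discount factor (0.99); those outer-loop details are not shown in this sketch.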