Modeling Others using Oneself in Multi-Agent Reinforcement Learning

Authors: Roberta Raileanu, Emily Denton, Arthur Szlam, Rob Fergus

Venue: ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this approach on three different tasks and show that the agents are able to learn better policies using their estimate of the other players' goals, in both cooperative and competitive settings.
Researcher Affiliation | Collaboration | 1 New York University, New York City, USA; 2 Facebook AI Research, New York City, USA.
Pseudocode | Yes | Algorithm 1 represents the pseudo-code for training a SOM agent for one episode.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | All the tasks considered are created in the Mazebase gridworld environment (Sukhbaatar et al., 2015).
Dataset Splits | No | The paper mentions training models ('In all our experiments, we train the agents' policies...') but does not explicitly describe a validation dataset split or a methodology such as cross-validation for hyperparameter tuning.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or cloud computing specifications.
Software Dependencies | No | The paper mentions various algorithms and models used (e.g., 'Asynchronous Advantage Actor-Critic (A3C)', 'Adam', 'LSTM', 'ELU'), but it does not specify any software names with version numbers, such as programming language versions or library versions (e.g., 'Python 3.8', 'PyTorch 1.9').
Experiment Setup | Yes | In all our experiments, we train the agents' policies using A3C (Mnih et al., 2016) with an entropy coefficient of 0.01, a value loss coefficient of 0.5, and a discount factor of 0.99. The parameters of the agents' policies are optimized using Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999, ϵ = 1×10⁻⁸, and weight decay 0. SGD with a learning rate of 0.1 was used for inferring the other agent's goal, z_other. The hidden layer dimension of the policy network was 64 for the Coin and Recipe Games and 128 for the Door Game. We use a learning rate of 1×10⁻⁴ for all games and models. All the results shown are for 10 optimization updates of z_other at each step in the game, unless mentioned otherwise.
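The quoted experiment setup pins down the optimizers and the per-step goal-inference schedule, but the paper releases no code. The sketch below is a minimal, assumption-laden illustration of how those reported numbers could be wired together in PyTorch: Adam with the quoted betas, epsilon, and zero weight decay for the policy parameters, and 10 SGD updates at learning rate 0.1 on the inferred goal z_other at each game step. The `PolicyNet` class, its feedforward structure, and all tensor shapes are hypothetical stand-ins (the paper's policy uses an LSTM with ELU units); only the hyperparameter values come from the quoted text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Illustrative stand-in for the agent's policy: maps (observation, goal
    embedding) to action logits. Only the hidden sizes (64 or 128) are quoted
    in the paper; everything else here is assumed."""

    def __init__(self, obs_dim, goal_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim + goal_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs, z):
        # The paper mentions ELU nonlinearities; the feedforward layout is a simplification.
        h = F.elu(self.fc1(torch.cat([obs, z], dim=-1)))
        return self.fc2(h)


def make_policy_optimizer(policy_net):
    # Adam settings quoted in the experiment setup:
    # lr 1e-4, beta1 0.9, beta2 0.999, eps 1e-8, weight decay 0.
    return torch.optim.Adam(
        policy_net.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0
    )


def infer_other_goal(policy_net, obs_other, action_other, z_other,
                     num_updates=10, lr=0.1):
    """One game step of goal inference: refine the estimate z_other of the other
    agent's goal by maximizing the likelihood of its observed action under the
    agent's OWN policy network (10 SGD updates at lr 0.1, as quoted)."""
    z_other = z_other.clone().detach().requires_grad_(True)
    optimizer = torch.optim.SGD([z_other], lr=lr)
    for _ in range(num_updates):
        optimizer.zero_grad()
        logits = policy_net(obs_other, z_other)              # reuse own policy to model the other agent
        loss = F.cross_entropy(logits.unsqueeze(0), action_other.view(1))
        loss.backward()
        optimizer.step()
    policy_net.zero_grad()  # discard gradients accumulated on the policy during inference
    return z_other.detach()
```

In the paper's approach, the refined z_other estimate is then fed back into the agent's own policy when it selects its next action, while the policy parameters themselves are trained with A3C using the quoted entropy coefficient (0.01), value loss coefficient (0.5), and discount factor (0.99); those outer-loop details are not shown in this sketch.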