Relational Forward Models for Multi-Agent Learning
Authors: Andrea Tacchetti, H. Francis Song, Pedro A. M. Mediano, Vinicius Zambaldi, János Kramár, Neil C. Rabinowitz, Thore Graepel, Matthew Botvinick, Peter W. Battaglia
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we show that our models can surpass previous top methods on this task (Kipf et al., 2018; Hoshen, 2017). Perhaps more importantly, they produce intermediate representations that support the social analysis of multi-agent systems: we use our models to propose a new way to characterize what drives each agent's behavior, track when agents influence each other, and identify which factors in the environment mediate the presence and valence of social interactions. Finally, we embed our models inside agents and use them to augment the host agent's observations with predictions of others' behavior. Our results show that this leads to agents that learn to coordinate with one another faster than non-augmented baselines. |
| Researcher Affiliation | Industry | DeepMind {atacchet,songf,pmediano,vzambaldi,janosk,ncr,thore,botvinick,peterbattaglia}@google.com |
| Pseudocode | No | The paper describes the model architecture and computational steps in prose and diagrams (e.g., Fig. 1), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We considered three multi-agent environments for our study: Cooperative Navigation (Lowe et al., 2017), Coin Game (Raileanu et al., 2018) and Stag Hunt (Peysakhovich & Lerer, 2017b). |
| Dataset Splits | No | The paper mentions collecting '500,000 episodes of behavioral trajectories' for training and '2,500 further episode trajectories for performance reporting and analysis' (referred to as 'held-out episodes'), but it does not specify explicit training, validation, and test splits with percentages, exact counts for each split, or a cross-validation setup. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU/CPU models, memory, or specific cloud computing instances. |
| Software Dependencies | No | The paper mentions using TensorFlow ('TensorFlow checkpoint loader') and refers to various network architectures like MLPs, GRUs (Cho et al., 2014), and LSTMs, but it does not provide specific version numbers for any software libraries or frameworks used in the experiments. |
| Experiment Setup | Yes | Training of both RFM and baseline models was conducted using gradient descent to minimize the cross-entropy loss between predicted and ground-truth actions. The training procedure was halted after one million steps, during each of which the gradient was estimated using a batch of 128 episodes. [...] Architecture details are as follows: input graphs G_t^in go through a GN encoder block, a basic GN module whose φ_v, φ_e and φ_u are three separate 64-unit MLPs with 1 hidden layer and ReLU activations, and whose ρ functions are summations. The output of the GN encoder block is used, in conjunction with a state graph G_{t-1}^hid, in a GraphGRU, where each φ function is a Gated Recurrent Unit (GRU) (Cho et al., 2014) with a hidden state size of 32 for each of vertices, edges and globals. (A minimal sketch of the GN encoder block follows this table.) |
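To make the quoted architecture details concrete, here is a minimal NumPy sketch of a single GN encoder block pass as described in the Experiment Setup row: φ_e, φ_v and φ_u are separate 64-unit, one-hidden-layer ReLU MLPs and the ρ aggregations are sums. The function names (`make_mlp`, `mlp`), toy graph sizes, and random weights are illustrative assumptions rather than the authors' code; the GraphGRU recurrence and the cross-entropy training loop are omitted.

```python
# Minimal sketch of one GN encoder block pass (random weights, no training).
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, hidden=64, out=64):
    """One-hidden-layer MLP with ReLU, matching the 64-unit phi networks."""
    return {
        "W1": rng.normal(0, 0.1, (in_dim, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(0, 0.1, (hidden, out)),    "b2": np.zeros(out),
    }

def mlp(p, x):
    h = np.maximum(x @ p["W1"] + p["b1"], 0.0)  # ReLU hidden layer
    return h @ p["W2"] + p["b2"]

# Toy input graph: V nodes, E directed edges (senders -> receivers), one global vector.
V, E = 4, 6
node_dim = edge_dim = glob_dim = 8
nodes     = rng.normal(size=(V, node_dim))
edges     = rng.normal(size=(E, edge_dim))
glob      = rng.normal(size=(glob_dim,))
senders   = rng.integers(0, V, size=E)
receivers = rng.integers(0, V, size=E)

# phi_e, phi_v, phi_u: three separate 64-unit, 1-hidden-layer ReLU MLPs.
phi_e = make_mlp(edge_dim + 2 * node_dim + glob_dim)
phi_v = make_mlp(node_dim + 64 + glob_dim)
phi_u = make_mlp(glob_dim + 64 + 64)

# Edge update: each edge sees its own features, both endpoints, and the global.
edge_in   = np.concatenate(
    [edges, nodes[senders], nodes[receivers], np.tile(glob, (E, 1))], axis=1)
edges_out = mlp(phi_e, edge_in)

# rho_{e->v}: sum incoming edge messages per receiving node.
agg_edges_per_node = np.zeros((V, 64))
np.add.at(agg_edges_per_node, receivers, edges_out)

# Node update: node features + aggregated incoming edges + global.
node_in   = np.concatenate([nodes, agg_edges_per_node, np.tile(glob, (V, 1))], axis=1)
nodes_out = mlp(phi_v, node_in)

# Global update: global features + summed edges + summed nodes (rho functions are sums).
glob_in   = np.concatenate([glob, edges_out.sum(0), nodes_out.sum(0)])
glob_out  = mlp(phi_u, glob_in)

print(nodes_out.shape, edges_out.shape, glob_out.shape)  # (4, 64) (6, 64) (64,)
```

Per the quoted setup, the paper then feeds these encoder outputs, together with the previous hidden graph G_{t-1}^hid, into a GraphGRU whose per-edge, per-vertex and per-global update functions are GRUs with hidden state size 32; that recurrent step is not reproduced in this sketch.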