Learning Policy Representations in Multiagent Systems

Authors: Aditya Grover, Maruan Al-Shedivat, Jayesh Gupta, Yuri Burda, Harrison Edwards

Venue: ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate empirically the utility of the proposed framework in (i) a challenging high-dimensional competitive environment for continuous control and (ii) a cooperative environment for communication, on supervised predictive tasks, unsupervised clustering, and policy optimization using deep reinforcement learning.
Researcher Affiliation | Collaboration | ¹Stanford University ²Carnegie Mellon University ³OpenAI. Correspondence to: Aditya Grover <adityag@cs.stanford.edu>.
Pseudocode | Yes | Algorithm 1 Learn Policy Embedding Function (fθ)
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | No | For our analysis, we train a diverse collection of 25 agents, some of which are trained via self-play and others are trained in pairs concurrently using Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017). We start with a fully connected agent-interaction graph (clique) of 25 agents. Every edge in this graph corresponds to 10 rollout episodes involving the corresponding agents. The paper describes generating its own interaction data (episodes) rather than using a pre-existing publicly available dataset, and does not provide access information for this generated data.
Dataset Splits | Yes | To evaluate weak generalization, we sample a connected subgraph for training with approximately 60% of the edges preserved for training, and remaining split equally for validation and testing. For strong generalization, we preserve 15 agents and their interactions with each other for training, and similarly, 5 agents and their within-group interactions each for validation and testing.
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments.
Software Dependencies | No | The paper mentions software like "MuJoCo (Todorov et al., 2012)", "Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017)", and "multiagent deep deterministic policy gradients (MADDPG, Lowe et al., 2017)", but does not specify version numbers for these software components or libraries.
Experiment Setup | Yes | The hyperparameter λ for Emb-Hyb is chosen by grid search over λ ∈ {0.01, 0.05, 0.1, 0.5} on a held-out set of interactions. For classification, we use an MLP with 3 hidden layers of 100 units each and the learning objective minimizes the cross entropy error. The maximum length (or horizon) of any episode is 500 time steps. The agent is trained concurrently against all 5 opponents using a distributed version of PPO algorithm, as described in Al-Shedivat et al. (2018).
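The Open Datasets and Dataset Splits rows above describe the interaction data fully: a clique over 25 trained agents with 10 rollout episodes per edge, a weak-generalization split that samples a connected training subgraph holding roughly 60% of the edges (remainder split equally between validation and test), and a strong-generalization split that partitions the agents 15/5/5 and keeps only within-group interactions. A minimal Python sketch of both splits follows; the specific procedure for sampling the connected subgraph (growing an edge set outward from a random seed agent) and the helper names `weak_split`/`strong_split` are assumptions, not details taken from the paper.

```python
import random
from itertools import combinations


def weak_split(agents, train_frac=0.6, seed=0):
    """Weak generalization: sample a connected training subgraph holding
    ~60% of the clique edges; split the remaining edges equally between
    validation and test."""
    rng = random.Random(seed)
    edges = list(combinations(agents, 2))          # clique: every agent pair
    n_train = int(train_frac * len(edges))
    remaining = set(edges)

    # Assumed procedure: grow the training edge set outward from a random
    # seed agent so the training subgraph stays connected.
    covered = {rng.choice(agents)}
    train = []
    while len(train) < n_train:
        frontier = [e for e in remaining if e[0] in covered or e[1] in covered]
        edge = rng.choice(frontier)
        remaining.remove(edge)
        train.append(edge)
        covered.update(edge)

    held_out = sorted(remaining)
    rng.shuffle(held_out)
    half = len(held_out) // 2
    return train, held_out[:half], held_out[half:]


def strong_split(agents, n_train=15, n_val=5, seed=0):
    """Strong generalization: disjoint agent groups (15 train / 5 val /
    5 test); each split keeps only interactions within its own group."""
    rng = random.Random(seed)
    shuffled = list(agents)
    rng.shuffle(shuffled)
    groups = (shuffled[:n_train],
              shuffled[n_train:n_train + n_val],
              shuffled[n_train + n_val:])
    return tuple(list(combinations(sorted(g), 2)) for g in groups)


agents = list(range(25))                           # 25 trained agents
train_edges, val_edges, test_edges = weak_split(agents)
# Each retained edge corresponds to 10 rollout episodes between that pair.
```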
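The Experiment Setup row quotes two concrete choices: a downstream MLP classifier with 3 hidden layers of 100 units trained with a cross-entropy objective, and a grid search over λ ∈ {0.01, 0.05, 0.1, 0.5} for Emb-Hyb on held-out interactions. The sketch below shows only the outline of those two pieces: the embedding dimension, number of classes, stand-in data, and the name `evaluate_lambda` are hypothetical, and the evaluation routine fits the classifier on random stand-in embeddings rather than training the actual Emb-Hyb objective.

```python
import torch
import torch.nn as nn

# Hypothetical sizes (not stated in the quoted setup): policy-embedding
# dimension, number of agent classes, and number of stand-in examples.
EMBED_DIM, N_CLASSES, N_EXAMPLES = 100, 25, 512


def make_classifier(in_dim=EMBED_DIM, n_classes=N_CLASSES, hidden=100, depth=3):
    """MLP with 3 hidden layers of 100 units each, as quoted above."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    return nn.Sequential(*layers, nn.Linear(d, n_classes))


def evaluate_lambda(lmbda, steps=200):
    """Placeholder for the held-out score of Emb-Hyb with weight lmbda.
    Here lmbda is unused and the classifier is fit on random stand-in
    embeddings, only to keep the sketch self-contained and runnable."""
    x = torch.randn(N_EXAMPLES, EMBED_DIM)           # stand-in embeddings
    y = torch.randint(0, N_CLASSES, (N_EXAMPLES,))   # stand-in agent labels
    model = make_classifier()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()                  # cross-entropy objective
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()


# Grid search for the Emb-Hyb mixing weight on held-out interactions.
best_lambda = max([0.01, 0.05, 0.1, 0.5], key=evaluate_lambda)
```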