Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play

Authors: Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, Rob Fergus

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The following experiments explore our self-play approach on a variety of tasks, both continuous and discrete, from the Mazebase (Sukhbaatar et al., 2015), RLLab (Duan et al., 2016), and StarCraft (Synnaeve et al., 2016) environments. The same protocol is used in all settings: self-play and target task episodes are mixed together and used to train the agent via discrete policy gradient.
Researcher Affiliation | Collaboration | Sainbayar Sukhbaatar (Dept. of Computer Science, New York University, sainbar@cs.nyu.edu); Zeming Lin (Facebook AI Research, New York, zlin@fb.com); Ilya Kostrikov (Dept. of Computer Science, New York University, kostrikov@cs.nyu.edu); Gabriel Synnaeve, Arthur Szlam & Rob Fergus (Facebook AI Research, New York, {gab,aszlam,robfergus}@fb.com)
Pseudocode | Yes | Algorithm 1: Pseudo code for training an agent on a self-play episode; Algorithm 2: Pseudo code for training an agent on a target task episode (a hedged sketch of this training loop appears below the table)
Open Source Code | Yes | Code for our approach can be found at (link removed for anonymity).
Open Datasets | Yes | The following experiments explore our self-play approach on a variety of tasks, both continuous and discrete, from the Mazebase (Sukhbaatar et al., 2015), RLLab (Duan et al., 2016), and StarCraft (Synnaeve et al., 2016) environments.
Dataset Splits | No | The paper describes mixing self-play and target task episodes for training (e.g., '25% comes from target task episodes, while the remaining 75% is from self-play'), but it does not specify traditional train/validation/test splits over fixed datasets, since the environments are typically generated randomly per episode.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions software and methods such as Mujoco, RMSProp, and TRPO, but it does not give version numbers for these or for other ancillary software components.
Experiment Setup | Yes | For the experiments with neural networks, all parameters are randomly initialized from N(0, 0.2). The hyperparameters of RMSProp are set to 0.97 and 1e-6. The other hyperparameter values used in the experiments are shown in Table 1. (A hedged configuration sketch appears below the table.)
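
The pseudocode row above cites Algorithm 1 (self-play episode) and Algorithm 2 (target task episode), and the dataset-splits row quotes the 75%/25% mix of self-play and target-task episodes. The sketch below shows, under stated assumptions, how those pieces fit together in the paper's "reverse" self-play mode: only the reward structure (R_B = -gamma * t_B, R_A = gamma * max(0, t_B - t_A)) and the episode mixing follow the paper; the environment interface (env.reset, env.step, env.states_equal), the STOP action, and the alice/bob policy helpers (act, update, run_target_episode) are hypothetical placeholders, not the authors' implementation.

```python
# A minimal sketch of asymmetric self-play training (reverse mode), assuming a
# hypothetical environment and policy interface; this is NOT the authors' code.
import random

GAMMA = 0.1                # reward scale for self-play episodes (assumed value)
SELF_PLAY_FRACTION = 0.75  # quoted mix: 75% self-play, 25% target task


def self_play_episode(env, alice, bob, t_max, stop_action):
    """One 'reverse' self-play episode; returns (alice_reward, bob_reward)."""
    s0 = env.reset()
    state, t_a = s0, 0
    # Alice acts until she chooses STOP or exhausts the step budget.
    while t_a < t_max:
        action = alice.act(state, s0)
        if action == stop_action:
            break
        state = env.step(action)
        t_a += 1

    # Bob must return the environment to Alice's initial state s0.
    t_b = 0
    while t_a + t_b < t_max and not env.states_equal(state, s0):
        state = env.step(bob.act(state, s0))
        t_b += 1

    bob_reward = -GAMMA * t_b                 # Bob is rewarded for finishing quickly
    alice_reward = GAMMA * max(0, t_b - t_a)  # Alice aims just past Bob's ability
    return alice_reward, bob_reward


def train(env, alice, bob, episodes, t_max, stop_action):
    """Mix self-play and target-task episodes, with policy-gradient updates."""
    for _ in range(episodes):
        if random.random() < SELF_PLAY_FRACTION:
            r_a, r_b = self_play_episode(env, alice, bob, t_max, stop_action)
            alice.update(r_a)   # hypothetical REINFORCE-style update helpers
            bob.update(r_b)
        else:
            r = bob.run_target_episode(env, t_max)  # external task reward only
            bob.update(r)
```

In the paper's "repeat" variant, Bob instead reproduces Alice's final state from a fresh reset; structurally only the target state passed to Bob's policy changes.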
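
For the experiment-setup row, here is a minimal PyTorch sketch of the quoted initialization and optimizer settings. Reading the two RMSProp constants as the smoothing coefficient (0.97) and epsilon (1e-6) is an assumption, as are the placeholder network shape and learning rate; the paper defers the remaining hyperparameter values to its Table 1.

```python
# Hedged sketch of the stated setup: parameters drawn from N(0, 0.2) and
# RMSProp with the two quoted constants; network shape and lr are placeholders.
import torch
import torch.nn as nn

policy = nn.Sequential(        # placeholder network, not the paper's model
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, 8),
)

# All parameters randomly initialized from N(0, 0.2).
with torch.no_grad():
    for p in policy.parameters():
        p.normal_(mean=0.0, std=0.2)

# RMSProp with the quoted constants; the learning rate here is illustrative.
optimizer = torch.optim.RMSprop(policy.parameters(), lr=1e-3, alpha=0.97, eps=1e-6)
```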