Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments

Authors: Maruan Al-Shedivat, Trapit Bansal, Yura Burda, Ilya Sutskever, Igor Mordatch, Pieter Abbeel

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments with a population of agents that learn and compete suggest that meta-learners are the fittest. We evaluate our meta-learning agents along with a number of baselines on a (single-agent) locomotion task with handcrafted nonstationarity and on iterated adaptation games in RoboSumo. Our results demonstrate that meta-learned strategies clearly dominate other adaptation methods in the few-shot regime in both single- and multi-agent settings.
Researcher Affiliation | Collaboration | Maruan Al-Shedivat (CMU), Trapit Bansal (UMass Amherst), Yura Burda (OpenAI), Ilya Sutskever (OpenAI), Igor Mordatch (OpenAI), Pieter Abbeel (UC Berkeley)
Pseudocode | Yes | Algorithm 1: Meta-learning at training time. Algorithm 2: Adaptation at execution time. (A minimal sketch of the adaptation step these algorithms build on follows this table.)
Open Source Code | No | The paper provides a link to demonstration videos ("Videos that demonstrate adaptation behaviors are available at https://goo.gl/tboqaN") but does not explicitly state that the source code for their methodology is publicly available.
Open Datasets | No | The paper describes custom environments ("we design RoboSumo", "we consider the problem of robotic locomotion in a changing environment") and data generation within them, but it does not mention the use of a publicly available or open dataset, nor does it provide access details for the data generated.
Dataset Splits | No | The paper describes splitting environments into training and testing sets (e.g., "12 training and 3 testing environments") but does not explicitly mention a separate validation set or split for model tuning.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions the MuJoCo physics simulator (Todorov et al., 2012) and algorithms such as PPO, but it does not provide specific version numbers for any software libraries, frameworks, or simulators used.
Experiment Setup | Yes | The PPO epoch size was set to 32,000 episodes and the batch size to 8,000. The PPO clipping hyperparameter was set to ϵ = 0.2 and the KL penalty to 0. In all experiments, the learning rate (for meta-learning, the learning rate for θ and α) was set to 0.0003. The generalized advantage estimator (GAE) (Schulman et al., 2015b) was optimized jointly with the policy (with γ = 0.995 and λ = 0.95). (An illustrative sketch of these optimization settings also follows this table.)
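
As referenced in the Pseudocode row above, the following is a minimal sketch of the gradient-based adaptation update that Algorithms 1-2 build on (an inner-loop step of the form φ = θ − α ∇θ L). It is not the authors' implementation: the toy surrogate loss, function names, and scalar step size are illustrative assumptions, and the paper's meta-training of α and its importance-weight correction for off-policy trajectories at execution time are omitted.

```python
# Minimal sketch of a gradient-based adaptation step (NOT the authors' code).
# The toy quadratic surrogate loss and scalar alpha are assumptions made for
# illustration only; the paper meta-learns both theta and per-parameter alpha.
import numpy as np

def surrogate_loss_grad(theta, trajectories):
    # Toy stand-in for the policy-gradient surrogate used in the paper.
    return 2.0 * (theta - trajectories.mean())

def adapt(theta, alpha, trajectories):
    """One inner-loop adaptation step: phi = theta - alpha * grad_theta L."""
    return theta - alpha * surrogate_loss_grad(theta, trajectories)

# Execution-time adaptation (cf. Algorithm 2): before each new episode of an
# iterated game, update the adapted parameters from the most recent rollouts.
theta = np.zeros(4)          # meta-learned initialization
alpha = 3e-4                 # adaptation step size (a learned vector in the paper)
phi = theta
for rollouts in (np.random.randn(16) for _ in range(3)):
    phi = adapt(phi, alpha, rollouts)
```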
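As a companion to the Experiment Setup row, the snippet below shows a generic formulation of the reported optimization settings (PPO clipping ϵ = 0.2, GAE with γ = 0.995 and λ = 0.95, learning rate 0.0003). It is a standard PPO/GAE sketch, not the authors' code; the function names and array shapes are assumptions.

```python
# Generic GAE and clipped-PPO-objective sketch using the hyperparameters
# reported in the paper; array layouts and function names are illustrative.
import numpy as np

GAMMA, LAM, CLIP_EPS, LR = 0.995, 0.95, 0.2, 3e-4

def gae_advantages(rewards, values, last_value):
    """Generalized advantage estimation (Schulman et al., 2015b)."""
    values = np.append(values, last_value)       # bootstrap with V(s_T)
    adv, gae = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + GAMMA * values[t + 1] - values[t]
        gae = delta + GAMMA * LAM * gae
        adv[t] = gae
    return adv

def ppo_clipped_objective(ratio, adv):
    """Clipped surrogate: mean(min(r * A, clip(r, 1-eps, 1+eps) * A))."""
    return np.mean(np.minimum(ratio * adv,
                              np.clip(ratio, 1 - CLIP_EPS, 1 + CLIP_EPS) * adv))
```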