Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables

Authors: Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, Deirdre Quillen

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experimental evaluation, we demonstrate state-of-the-art results with 20-100x improvement in meta-training sample efficiency and substantial increases in asymptotic performance on six continuous control meta-learning domains. We examine how our model conducts structured exploration to adapt rapidly to new tasks in a 2-D navigation environment with sparse rewards. Our open-source implementation of PEARL can be found at https://github.com/katerakelly/oyster. From Section 6 (Experiments): In our experiments, we assess the performance of our method and analyze its properties. We first evaluate how our approach compares to prior meta-RL methods, especially in terms of sample efficiency, on several benchmark meta-RL problems in Section 6.1. We examine how probabilistic context and posterior sampling enable rapid adaptation via structured exploration strategies in sparse reward settings in Section 6.2. Finally, in Section 6.3, we evaluate the specific design choices in our algorithm through ablations.
Researcher Affiliation | Academia | Kate Rakelly*, Aurick Zhou*, Deirdre Quillen, Chelsea Finn, Sergey Levine; EECS Department, UC Berkeley, Berkeley, CA, USA. Correspondence to: Kate Rakelly <rakelly@eecs.berkeley.edu>, Aurick Zhou <azhou@eecs.berkeley.edu>.
Pseudocode | Yes | Algorithm 1: PEARL Meta-training (a hedged sketch of this loop is given after the table).
Open Source Code | Yes | Our open-source implementation of PEARL can be found at https://github.com/katerakelly/oyster.
Open Datasets | Yes | We evaluate PEARL on six continuous control environments focused around robotic locomotion, simulated via the MuJoCo simulator (Todorov et al., 2012). These locomotion task families... These meta-RL benchmarks were previously introduced by Finn et al. (2017) and Rothfuss et al. (2018).
Dataset Splits | No | The paper describes meta-training on a distribution of tasks and meta-testing on new tasks drawn from the same distribution. While it mentions '100 random goals for training and 20 for testing' for the 2-D navigation task, it does not give explicit training/validation/test splits (e.g., percentages, counts, or split methodology) for the other experiments, nor does it specify validation sets.
Hardware Specification | No | The paper mentions using the MuJoCo simulator but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions software like the MuJoCo simulator and algorithms like Soft Actor-Critic (SAC) and PPO, but it does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | All tasks have horizon length 200. We compare to existing policy gradient meta-RL methods ProMP (Rothfuss et al., 2018) and MAML-TRPO (Finn et al., 2017) using publicly available code. We also re-implement the recurrence-based policy gradient RL2 method (Duan et al., 2016) with PPO (Schulman et al., 2017). The results of each algorithm are averaged across three random seeds. The context sampler S_c samples uniformly from the most recently collected batch of data, recollected every 1000 meta-training optimization steps. The actor and critic are trained with batches of transitions drawn uniformly from the entire replay buffer. A reward is given only when the agent is within a certain radius of the goal. We experiment with radius 0.2 and 0.8. The horizon length is 20 steps. We backprop through the RNN to 100 timesteps.
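The pseudocode and experiment-setup rows above can be tied together with a short sketch. The following is a minimal, hypothetical rendering of the PEARL meta-training loop (Algorithm 1) and the posterior-sampling adaptation described in the quoted text, not the authors' implementation (see https://github.com/katerakelly/oyster for that). The encoder, policy, and SAC updates are replaced by placeholder functions (`sample_posterior`, `collect_rollout`, `sac_update`), and the latent dimension, batch sizes, and loop counts are assumed for illustration; the horizon lengths (200, and 20 for sparse navigation), the 100 training goals, and the 1000-step recollection interval come from the quoted setup.

```python
# Hedged sketch of PEARL meta-training and meta-test adaptation, with
# placeholder components; only the reported data flow is reproduced.
import random

LATENT_DIM = 5  # assumed size of the latent context z

def sample_posterior(context):
    """Placeholder for sampling z ~ q(z | context) from the inference network."""
    # With an empty context this stands in for the prior N(0, I).
    return [random.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]

def collect_rollout(task, z, horizon=200):
    """Placeholder for rolling out the z-conditioned policy (horizon 200 per the paper)."""
    return [("s", "a", 0.0, "s_next") for _ in range(horizon)]

def sac_update(rl_batch, z):
    """Placeholder for one Soft Actor-Critic + encoder gradient step."""
    pass

# --- Meta-training (Algorithm 1, schematically) ---
train_tasks = list(range(100))            # e.g., 100 training goals (2-D navigation)
replay = {t: [] for t in train_tasks}     # per-task replay buffers
recent = {t: [] for t in train_tasks}     # most recently collected batch per task

for iteration in range(10):               # number of iterations is illustrative
    # Data collection: sample z from the current posterior (prior at first),
    # roll out the z-conditioned policy, and store the transitions.
    for t in random.sample(train_tasks, 5):
        z = sample_posterior(recent[t])
        rollout = collect_rollout(t, z)
        replay[t].extend(rollout)
        recent[t] = rollout               # source for the context sampler S_c

    # Optimization: context comes from recently collected data, recollected
    # after these 1000 steps; actor/critic batches are drawn uniformly from
    # the entire replay buffer.
    for step in range(1000):
        t = random.choice([task for task in train_tasks if replay[task]])
        context = random.sample(recent[t], k=min(64, len(recent[t])))
        z = sample_posterior(context)
        rl_batch = random.sample(replay[t], k=min(256, len(replay[t])))
        sac_update(rl_batch, z)

# --- Meta-test adaptation via posterior sampling (sparse-reward navigation) ---
context = []                              # no experience on the new task yet
for episode in range(4):                  # a few adaptation episodes (illustrative)
    z = sample_posterior(context)         # prior first, then tighter posteriors
    context += collect_rollout("new_task", z, horizon=20)  # horizon 20 per the paper
```

The split the setup describes, context drawn from recent data versus RL batches drawn from the whole buffer, is what the paper presents as letting PEARL reuse off-policy data while keeping the encoder's input close to on-policy experience; the ablations in Section 6.3 examine this and related design choices.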