SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies

Authors: Seyed Kamyar Seyed Ghasemipour, Shixiang (Shane) Gu, Richard Zemel

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We examine the efficacy of our method on a variety of high-dimensional simulated continuous control tasks and observe that SMILe significantly outperforms Meta-BC. Furthermore, we observe that SMILe performs comparably to or outperforms Meta-DAgger, while being applicable in the state-only setting and not requiring online experts. To our knowledge, our approach is the first efficient method for Meta-IRL that scales to the function approximator setting.
Researcher Affiliation | Collaboration | Seyed Kamyar Seyed Ghasemipour (University of Toronto, Vector Institute) kamyar@cs.toronto.edu; Shixiang Gu (Google Brain) shanegu@google.com; Richard Zemel (University of Toronto, Vector Institute) zemel@cs.toronto.edu
Pseudocode | Yes | The SMILe training procedure alternates between generating rollouts and updating models. In this section we present a conceptual overview of SMILe and defer exact details to Algorithm 1 in Appendix A. (A hedged toy sketch of such an alternating loop is given after the table.)
Open Source Code | Yes | For datasets and reproducing results please refer to https://github.com/KamyarGh/rl_swiss/blob/master/reproducing/smile_paper.md.
Open Datasets | Yes | For datasets and reproducing results please refer to https://github.com/KamyarGh/rl_swiss/blob/master/reproducing/smile_paper.md. The Half Cheetah Random Velocity task is a popular baseline for meta-learning in standard RL. The meta-training set consists of 32 target positions located at every integer multiple of π/16 radians on the circle. We use 50 meta-train tasks and perform evaluations on 25 meta-test tasks. (The quoted passages describe different benchmark tasks.)
Dataset Splits | Yes | The Half Cheetah Random Velocity task is a popular baseline for meta-learning in standard RL. To evaluate SMILe, we adapt this task for the Few-Shot Imitation Learning setup. Target velocities for meta-train tasks range from 0 to 3, uniformly spaced at 0.1 intervals, and meta-test tasks are defined by the range 0.05 to 2.95, uniformly spaced at 0.1 intervals. The meta-training set consists of 32 target positions located at every integer multiple of π/16 radians on the circle. The meta-testing set consists of 16 targets located at every 2nπ/32 angle on the circle. We use 50 meta-train tasks and perform evaluations on 25 meta-test tasks. (These quotes cover splits for several benchmark tasks; a small snippet enumerating the quoted parameters appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions 'simulated continuous control tasks', which does not identify any specific hardware.
Software Dependencies | No | The paper mentions 'Mujoco benchmarks' and 'Soft-Actor-Critic [16]' (an algorithm rather than a versioned software package), but does not list specific software names with version numbers needed for reproducibility.
Experiment Setup | Yes | The Half Cheetah Random Velocity task is a popular baseline for meta-learning in standard RL. To evaluate SMILe, we adapt this task for the Few-Shot Imitation Learning setup. Each task is defined by a target velocity that we wish a Half Cheetah agent to maintain over the duration of an episode; episodes are of length 1000 and start with the agent at a standstill. To obtain expert demonstrations, we train an expert policy using Soft-Actor-Critic [16], which observes the desired target velocity as part of the state. We train all models using various amounts of total expert demonstrations and evaluate on the meta-test tasks using context trajectories generated by the pre-trained expert. Results are reported when training on 4, 16, and 64 demonstrations per meta-train task (4 random seeds per model per setting). (A toy scaffold of this evaluation grid appears after the table.)
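
As a toy illustration of the alternating procedure quoted in the Pseudocode row (generate rollouts with a context-conditional policy, then update the models), here is a minimal, self-contained sketch. It is not the paper's Algorithm 1: the 1-D velocity-matching environment, the mean-velocity context encoder, the linear GAIL-style logistic discriminator, and the Gaussian policy trained with REINFORCE are all illustrative stand-ins invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_expert_demos(target_vel, n_demos=4, horizon=20):
    # Stand-in for demonstrations from a pre-trained expert: velocities hover
    # around the task's target velocity.
    return [target_vel + 0.05 * rng.standard_normal(horizon) for _ in range(n_demos)]


def encode_context(demos):
    # Toy context encoder: the mean velocity across the demonstration set.
    return float(np.mean(np.concatenate(demos)))


def rollout(policy, context, horizon=20):
    # The policy directly emits the next velocity; also return score-function
    # gradients of log pi(a | context) with respect to (w, b).
    w, b, log_std = policy
    mean = w * context + b
    actions = mean + np.exp(log_std) * rng.standard_normal(horizon)
    dmean = (actions - mean) / np.exp(2.0 * log_std)
    grads = np.stack([dmean * context, dmean], axis=1)
    return actions, grads


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -60.0, 60.0)))


def disc_features(states, context):
    # Linear discriminator features over (state, context) pairs.
    c = np.full_like(states, context)
    return np.stack([states, c, np.abs(states - c), np.ones_like(states)], axis=1)


tasks = np.round(np.arange(0.0, 3.01, 0.1), 2)   # meta-train target velocities
demos = {t: make_expert_demos(t) for t in tasks}
policy = [0.0, 0.0, np.log(0.3)]                 # Gaussian policy params (w, b, log_std)
theta = np.zeros(4)                              # discriminator weights
baseline = 0.0

for it in range(2000):
    task = tasks[rng.integers(len(tasks))]
    context = encode_context(demos[task])

    # 1) Generate a rollout with the context-conditional policy.
    states, grads = rollout(policy, context)

    # 2) Update the discriminator (the learned reward) on expert vs. policy states.
    for data, label in ((np.concatenate(demos[task]), 1.0), (states, 0.0)):
        feats = disc_features(data, context)
        theta += 0.05 * feats.T @ (label - sigmoid(feats @ theta)) / len(data)

    # 3) Update the policy with REINFORCE on the GAIL-style learned reward.
    rewards = -np.log(1.0 - sigmoid(disc_features(states, context) @ theta) + 1e-8)
    ret = rewards.sum()
    baseline = ret if it == 0 else 0.9 * baseline + 0.1 * ret
    step = 1e-3 * grads.sum(axis=0) * (ret - baseline)
    policy[0] += step[0]
    policy[1] += step[1]
```

The point is only the loop structure: step 1 produces on-policy data conditioned on the inferred task context, step 2 refits the learned reward on expert versus policy states, and step 3 improves the policy under that reward.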
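As a concrete reading of the Dataset Splits row, the snippet below enumerates the quoted task parameters. The circle's radius for the target-position task is not stated in the quotes, so the value used here is an assumption for illustration only.

```python
import numpy as np

# Half Cheetah Random Velocity: meta-train target velocities 0.0 to 3.0 at 0.1
# intervals (31 tasks) and meta-test velocities 0.05 to 2.95 at 0.1 intervals
# (30 tasks), as quoted above.
train_velocities = np.round(np.arange(0.0, 3.0 + 1e-6, 0.1), 2)
test_velocities = np.round(np.arange(0.05, 2.95 + 1e-6, 0.1), 2)

# Target-position task: 32 meta-train goals at every integer multiple of pi/16
# radians on a circle. The radius is not given in the quoted text; 1.0 is an
# assumption made only for this illustration.
radius = 1.0
train_angles = np.arange(32) * np.pi / 16.0
train_goals = radius * np.stack([np.cos(train_angles), np.sin(train_angles)], axis=1)

print(len(train_velocities), len(test_velocities), train_goals.shape)  # 31 30 (32, 2)
```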
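Finally, a scaffold of the evaluation grid quoted in the Experiment Setup row: 4, 16, and 64 demonstrations per meta-train task, 4 random seeds per setting, and meta-test evaluation conditioned on context trajectories from the pre-trained expert. The functions expert_rollout, train_model, and evaluate_on_task are trivial placeholders, not the paper's code.

```python
import numpy as np


def expert_rollout(task, rng, horizon=20):
    # Placeholder for a context trajectory from the pre-trained expert on this task.
    return task + 0.05 * rng.standard_normal(horizon)


def train_model(meta_train_tasks, demos_per_task, seed):
    # Placeholder for meta-training; returns only the configuration it was given.
    return {"seed": seed, "demos_per_task": demos_per_task}


def evaluate_on_task(model, task, context):
    # Placeholder meta-test evaluation: a toy "return" that rewards matching the
    # mean of the expert context trajectory.
    return -abs(float(np.mean(context)) - task)


meta_train_tasks = np.round(np.arange(0.0, 3.01, 0.1), 2)
meta_test_tasks = np.round(np.arange(0.05, 2.96, 0.1), 2)

results = {}
for demos_per_task in (4, 16, 64):        # expert demonstrations per meta-train task
    for seed in range(4):                 # 4 random seeds per model per setting
        rng = np.random.default_rng(seed)
        model = train_model(meta_train_tasks, demos_per_task, seed)
        # Meta-test: condition on context trajectories generated by the expert.
        returns = [evaluate_on_task(model, task, expert_rollout(task, rng))
                   for task in meta_test_tasks]
        results[(demos_per_task, seed)] = float(np.mean(returns))

print({k: round(v, 3) for k, v in results.items()})
```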