SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies
Authors: Seyed Kamyar Seyed Ghasemipour, Shixiang (Shane) Gu, Richard Zemel
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We examine the efficacy of our method on a variety of high-dimensional simulated continuous control tasks and observe that SMILe significantly outperforms Meta-BC. Furthermore, we observe that SMILe performs comparably or outperforms Meta-DAgger, while being applicable in the state-only setting and not requiring online experts. To our knowledge, our approach is the first efficient method for Meta-IRL that scales to the function approximator setting. |
| Researcher Affiliation | Collaboration | Seyed Kamyar Seyed Ghasemipour (University of Toronto, Vector Institute) kamyar@cs.toronto.edu; Shixiang Gu (Google Brain) shanegu@google.com; Richard Zemel (University of Toronto, Vector Institute) zemel@cs.toronto.edu |
| Pseudocode | Yes | The SMILe training procedure alternates between generating rollouts and updating models. In this section we present a conceptual overview of SMILe and defer exact details to Algorithm 1 in Appendix A. A schematic sketch of this alternation appears after the table. |
| Open Source Code | Yes | For datasets and reproducing results please refer to https://github.com/KamyarGh/rl_swiss/blob/master/reproducing/smile_paper.md. |
| Open Datasets | Yes | For datasets and reproducing results please refer to https://github.com/KamyarGh/rl_swiss/blob/master/reproducing/smile_paper.md. The Half Cheetah Random Velocity task is a popular baseline for meta-learning in standard RL. The meta-training set consists of 32 target positions located at every integer multiple of π/16 radians on the circle. We use 50 meta-train tasks and perform evaluations on 25 meta-test tasks. |
| Dataset Splits | Yes | The Half Cheetah Random Velocity task is a popular baseline for meta-learning in standard RL. To evaluate SMILe, we adapt this task for the Few-Shot Imitation Learning setup. Target velocities for meta-train tasks range from 0 to 3, uniformly spaced at 0.1 intervals, and meta-test tasks are defined by the range 0.05 to 2.95, uniformly spaced at 0.1 intervals. The meta-training set consists of 32 target positions located at every integer multiple of π/16 radians on the circle. The meta-testing set consists of 16 targets located at every 2nπ/32 angle on the circle. We use 50 meta-train tasks and perform evaluations on 25 meta-test tasks. The velocity grids and the meta-train target angles are reconstructed in a sketch after this table. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions 'simulated continuous control tasks' which does not imply specific hardware. |
| Software Dependencies | No | The paper mentions 'Mujoco benchmarks' and 'Soft-Actor-Critic [16]' (which refers to an algorithm, not a software dependency with a version), but does not provide specific software names with version numbers for reproducibility. |
| Experiment Setup | Yes | The Half Cheetah Random Velocity task is a popular baseline for meta-learning in standard RL. To evaluate SMILe, we adapt this task for the Few-Shot Imitation Learning setup. Each task is defined by a target velocity that we wish a Half Cheetah agent to maintain over the duration of an episode; episodes are of length 1000 and start with the agent at standstill. To obtain expert demonstrations, we train an expert policy using Soft-Actor-Critic [16] which observes the desired target velocity as part of the state. We train all models using various amounts of total expert demonstrations and evaluate on the meta-test tasks using context trajectories generated by the pre-trained expert. Results are reported for training on 4, 16, and 64 demonstrations per meta-train task (4 random seeds per model per setting). These quoted settings are collected in a short config sketch after the table. |
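
The Pseudocode row only states that training alternates between generating rollouts and updating models, with the exact procedure deferred to Algorithm 1 in the paper's Appendix A. The skeleton below is a hypothetical paraphrase of that alternation, not the paper's algorithm: every helper it accepts (`sample_task`, `sample_expert_context`, `generate_rollouts`, `update_models`) is a placeholder to be supplied by the caller.

```python
# Hypothetical skeleton of the rollout/update alternation described in the
# Pseudocode row; none of these helpers come from the paper or rl_swiss.

def smile_training_loop(
    meta_train_tasks,
    policy,
    models,
    *,
    sample_task,             # placeholder: pick a meta-train task id
    sample_expert_context,   # placeholder: draw expert context trajectories for a task
    generate_rollouts,       # placeholder: run the context-conditional policy in the env
    update_models,           # placeholder: update the learned models from collected data
    num_iterations=1000,
    rollouts_per_iter=10,
):
    """Alternate between collecting rollouts and updating the learned models."""
    replay_buffers = {task_id: [] for task_id in meta_train_tasks}

    for _ in range(num_iterations):
        # 1) Rollout phase: condition the policy on expert context for a sampled
        #    meta-train task and collect fresh trajectories.
        task_id = sample_task(meta_train_tasks)
        context = sample_expert_context(task_id)
        rollouts = generate_rollouts(policy, context, rollouts_per_iter)
        replay_buffers[task_id].extend(rollouts)

        # 2) Update phase: refit the models from the data gathered so far; the
        #    actual losses and update order are given in Algorithm 1 (Appendix A).
        update_models(models, policy, replay_buffers)

    return policy, models
```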
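The Dataset Splits row quotes concrete grids for the meta-train and meta-test tasks. The snippet below is a minimal NumPy sketch, not the paper's or rl_swiss's code, reconstructing only the unambiguous parts of those quotes: the Half Cheetah target-velocity grids and the 32 meta-train target positions at multiples of π/16 (the circle radius is an assumed placeholder).

```python
import numpy as np

# Hypothetical reconstruction of the quoted task grids (not the paper's code).

# Half Cheetah Random Velocity: meta-train velocities 0.0 to 3.0 in 0.1 steps,
# meta-test velocities 0.05 to 2.95 in 0.1 steps (offset between the train tasks).
train_velocities = np.round(np.arange(0.0, 3.0 + 1e-9, 0.1), 2)
test_velocities = np.round(np.arange(0.05, 2.95 + 1e-9, 0.1), 2)

# Target-position task: 32 meta-train targets at every integer multiple of
# pi/16 radians on a circle. The radius is not given in the quoted text.
radius = 1.0  # assumed placeholder
train_angles = np.arange(32) * np.pi / 16
train_targets = radius * np.stack([np.cos(train_angles), np.sin(train_angles)], axis=1)

if __name__ == "__main__":
    print(f"{len(train_velocities)} meta-train velocities, {len(test_velocities)} meta-test velocities")
    print(f"{train_targets.shape[0]} meta-train target positions")
```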
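Finally, the Experiment Setup row scatters several concrete settings across the quoted sentences. The dictionary below simply gathers those quoted values in one place as a hypothetical config; the key names are invented, not taken from the paper or rl_swiss.

```python
# Hypothetical summary of the quoted experimental settings; key names are invented.
HALF_CHEETAH_RAND_VEL_SETUP = {
    "episode_length": 1000,                    # episodes start with the agent at standstill
    "expert_algorithm": "Soft Actor-Critic",   # expert observes the target velocity in its state
    "demonstrations_per_meta_train_task": [4, 16, 64],
    "random_seeds_per_setting": 4,
    "meta_test_context": "trajectories generated by the pre-trained expert",
}
```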