Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies
Authors: Seyed Kamyar Seyed Ghasemipour, Shixiang (Shane) Gu, Richard Zemel
NeurIPS 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We examine the efficacy of our method on a variety of high-dimensional simulated continuous control tasks and observe that SMILe significantly outperforms Meta-BC. Furthermore, we observe that SMILe performs comparably or outperforms Meta-DAgger, while being applicable in the state-only setting and not requiring online experts. To our knowledge, our approach is the first efficient method for Meta-IRL that scales to the function approximator setting. |
| Researcher Affiliation | Collaboration | Seyed Kamyar Seyed Ghasemipour University of Toronto Vector Institute EMAIL Shixiang Gu Google Brain EMAIL Richard Zemel University of Toronto Vector Institute EMAIL |
| Pseudocode | Yes | The SMILe training procedure alternates between generating rollouts and updating models. In this section we present a conceptual overview of SMILe and defer exact details to Algorithm 1 in Appendix A. |
| Open Source Code | Yes | For datasets and reproducing results please refer to https://github.com/KamyarGh/rl_swiss/blob/master/reproducing/smile_paper.md. |
| Open Datasets | Yes | For datasets and reproducing results please refer to https://github.com/KamyarGh/rl_swiss/blob/master/reproducing/smile_paper.md. The Half Cheetah Random Velocity task is a popular baseline for meta-learning in standard RL. The meta-training set consists of 32 target positions located at every integer multiple of π/16 radians on the circle. We use 50 meta-train tasks and perform evaluations on 25 meta-test tasks. |
| Dataset Splits | Yes | The Half Cheetah Random Velocity task is a popular baseline for meta-learning in standard RL. To evaluate SMILe, we adapt this task for the Few-Shot Imitation Learning setup. Target velocities for meta-train tasks range from 0 to 3, uniformly spaced at 0.1 intervals, and meta-test tasks are defined by the range 0.05 to 2.95, uniformly spaced at 0.1 intervals. The meta-training set consists of 32 target positions located at every integer multiple of π/16 radians on the circle. The meta-testing set consists of 16 targets located at every 2nπ/32 angle on the circle. We use 50 meta-train tasks and perform evaluations on 25 meta-test tasks. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions 'simulated continuous control tasks' which does not imply specific hardware. |
| Software Dependencies | No | The paper mentions 'Mujoco benchmarks' and 'Soft-Actor-Critic [16]' (which refers to an algorithm, not a software dependency with a version), but does not provide specific software names with version numbers for reproducibility. |
| Experiment Setup | Yes | The Half Cheetah Random Velocity task is a popular baseline for meta-learning in standard RL. To evaluate SMILe, we adapt this task for the Few-Shot Imitation Learning setup. Each task is defined by a target velocity that we wish a Half Cheetah agent maintain over the duration of an episode; episodes are of length 1000 and start with the agent at standstill. To obtain expert demonstrations, we train an expert policy using Soft-Actor-Critic [16] which observes as part of the state the desired target velocity. We train all models using various amounts of total expert demonstrations and evaluate on the meta-test tasks using context trajectories generated by the pre-trained expert. Results when training on 4, 16, and 64 demonstrations per meta-train task (4 random seeds per model per setting). |
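
The Half Cheetah velocity split quoted under Dataset Splits can be illustrated with a short sketch. This is not taken from the paper's code: the helper `velocity_grid` and the variable names are hypothetical, and the sketch assumes both endpoints of each quoted range are included.

```python
import numpy as np


def velocity_grid(start, stop, step=0.1):
    """Evenly spaced target velocities from start to stop, endpoints included.

    Hypothetical helper for illustration; not part of the SMILe codebase.
    """
    # Derive the point count first so that linspace avoids the
    # floating-point drift that repeated addition of `step` would cause.
    count = int(round((stop - start) / step)) + 1
    return np.linspace(start, stop, count)


# Meta-train velocities: 0 to 3 at 0.1 intervals (as quoted above).
meta_train_velocities = velocity_grid(0.0, 3.0)

# Meta-test velocities: 0.05 to 2.95 at 0.1 intervals (as quoted above).
meta_test_velocities = velocity_grid(0.05, 2.95)

# Smallest distance between any meta-test and any meta-train velocity;
# a positive value means no test task coincides with a training task.
min_gap = np.min(
    np.abs(meta_test_velocities[:, None] - meta_train_velocities[None, :])
)
```

Under these assumptions the grids contain 31 meta-train and 30 meta-test velocities, and every meta-test velocity sits midway between two neighbouring meta-train velocities, so the two sets do not overlap.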