Random Expert Distillation: Imitation Learning via Expert Policy Support Estimation
Authors: Ruohan Wang, Carlo Ciliberto, Pierluigi Vito Amadori, Yiannis Demiris
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our reward function on both discrete and continuous domains, achieving comparable or better performance than the state of the art under different reinforcement learning algorithms. |
| Researcher Affiliation | Academia | Department of Electrical and Electronic Engineering, Imperial College London, UK. Correspondence to: Ruohan Wang <r.wang16@imperial.ac.uk>. |
| Pseudocode | Yes | Algorithm 1 RANDOM EXPERT DISTILLATION |
| Open Source Code | Yes | The code for reproducing the experiments is available online: https://github.com/RuohanW/RED |
| Open Datasets | No | The paper mentions using "4 trajectories of expert demonstration generated by an expert policy trained with RL" for Mujoco tasks and "a single demonstration provided by a human driver" for the autonomous driving task. While the Mujoco environment is well-known, no concrete access information (link, DOI, specific citation to the exact dataset of expert trajectories used) is provided for these specific expert demonstrations. |
| Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits, either by percentages, sample counts, or citations to predefined splits. It describes learning from expert trajectories and evaluating the resulting policy over multiple runs. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU or CPU models, memory, or detailed computer specifications used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., library or framework names with their exact versions) needed to replicate the experiments. |
| Experiment Setup | Yes | We evaluate the proposed reward function on five continuous control tasks from the Mujoco environment... using Trust Region Policy Optimization (TRPO) (Schulman et al., 2015)... All RL algorithms terminate within 5M environment steps. For the driving task... We sampled the expert driving actions at 20 Hz. For the environment, we use a vector of size 24 to represent state... We include the terminal reward heuristic defined in Eq. 9... For this task, we initialize all policies with BC and use stochastic value gradient method with experience replay (Heess et al., 2015) as the reinforcement learning algorithm. |
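The Pseudocode row above refers to Algorithm 1 (Random Expert Distillation), which trains a predictor network to match a fixed, randomly initialized target network on the expert's state-action pairs and then uses the prediction error as a fixed reward for a standard RL algorithm such as TRPO. The sketch below illustrates this reward construction only; the network architecture, output dimension, learning rate, training schedule, and the scale parameter `sigma` are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of the RED reward construction (Algorithm 1), assuming
# PyTorch and illustrative hyperparameters; not the authors' exact setup.
import torch
import torch.nn as nn


def make_net(in_dim, out_dim, hidden=64):
    # Small MLP; the actual architecture in the paper/repo may differ.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


def train_red_reward(expert_sa, in_dim, out_dim=32, sigma=1.0,
                     epochs=100, lr=1e-3):
    """Fit a predictor to a fixed random target on expert (state, action)
    pairs, then return r(s, a) = exp(-sigma * ||f_hat(s, a) - f(s, a)||^2)."""
    target = make_net(in_dim, out_dim)       # f: fixed, randomly initialized
    predictor = make_net(in_dim, out_dim)    # f_hat: trained on expert data
    for p in target.parameters():
        p.requires_grad_(False)

    opt = torch.optim.Adam(predictor.parameters(), lr=lr)
    for _ in range(epochs):
        loss = ((predictor(expert_sa) - target(expert_sa)) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    def reward(sa):
        # Prediction error is small on the expert's support, large elsewhere.
        with torch.no_grad():
            err = ((predictor(sa) - target(sa)) ** 2).sum(dim=-1)
        return torch.exp(-sigma * err)

    return reward
```

In the paper's pipeline, this learned reward stays fixed and replaces the environment reward during RL training (e.g., TRPO for the Mujoco tasks described in the Experiment Setup row), so state-action pairs close to the expert's support receive rewards near 1 while off-support pairs decay toward 0.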