Random Expert Distillation: Imitation Learning via Expert Policy Support Estimation
Authors: Ruohan Wang, Carlo Ciliberto, Pierluigi Vito Amadori, Yiannis Demiris
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our reward function on both discrete and continuous domains, achieving comparable or better performance than the state of the art under different reinforcement learning algorithms. |
| Researcher Affiliation | Academia | Department of Electrical and Electronic Engineering, Imperial College London, UK. Correspondence to: Ruohan Wang <r.wang16@imperial.ac.uk>. |
| Pseudocode | Yes | Algorithm 1 RANDOM EXPERT DISTILLATION |
| Open Source Code | Yes | The code for reproducing the experiments is available online: https://github.com/RuohanW/RED |
| Open Datasets | No | The paper mentions using "4 trajectories of expert demonstration generated by an expert policy trained with RL" for Mujoco tasks and "a single demonstration provided by a human driver" for the autonomous driving task. While the Mujoco environment is well-known, no concrete access information (link, DOI, specific citation to the exact dataset of expert trajectories used) is provided for these specific expert demonstrations. |
| Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits, either by percentages, sample counts, or citations to predefined splits. It describes learning from expert trajectories and evaluating the resulting policy over multiple runs. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU or CPU models, memory, or detailed computer specifications used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., library or framework names with their exact versions) needed to replicate the experiments. |
| Experiment Setup | Yes | We evaluate the proposed reward function on five continuous control tasks from the Mujoco environment... using Trust Region Policy Optimization (TRPO) (Schulman et al., 2015)... All RL algorithms terminate within 5M environment steps. For the driving task... We sampled the expert driving actions at 20 Hz. For the environment, we use a vector of size 24 to represent state... We include the terminal reward heuristic defined in Eq. 9... For this task, we initialize all policies with BC and use stochastic value gradient method with experience replay (Heess et al., 2015) as the reinforcement learning algorithm. |
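The Pseudocode row above refers to Algorithm 1 (Random Expert Distillation), which trains a predictor network to match a fixed, randomly initialized target network on the expert's state-action pairs and then uses the prediction error as a fixed reward for a standard RL algorithm such as TRPO. The sketch below illustrates this reward construction only; the network architecture, output dimension, learning rate, training schedule, and the scale parameter `sigma` are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of the RED reward construction (Algorithm 1), assuming
# PyTorch and illustrative hyperparameters; not the authors' exact setup.
import torch
import torch.nn as nn


def make_net(in_dim, out_dim, hidden=64):
    # Small MLP; the actual architecture in the paper/repo may differ.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


def train_red_reward(expert_sa, in_dim, out_dim=32, sigma=1.0,
                     epochs=100, lr=1e-3):
    """Fit a predictor to a fixed random target on expert (state, action)
    pairs, then return r(s, a) = exp(-sigma * ||f_hat(s, a) - f(s, a)||^2)."""
    target = make_net(in_dim, out_dim)       # f: fixed, randomly initialized
    predictor = make_net(in_dim, out_dim)    # f_hat: trained on expert data
    for p in target.parameters():
        p.requires_grad_(False)

    opt = torch.optim.Adam(predictor.parameters(), lr=lr)
    for _ in range(epochs):
        loss = ((predictor(expert_sa) - target(expert_sa)) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    def reward(sa):
        # Prediction error is small on the expert's support, large elsewhere.
        with torch.no_grad():
            err = ((predictor(sa) - target(sa)) ** 2).sum(dim=-1)
        return torch.exp(-sigma * err)

    return reward
```

In the paper's pipeline, this learned reward stays fixed and replaces the environment reward during RL training (e.g., TRPO for the Mujoco tasks described in the Experiment Setup row), so state-action pairs close to the expert's support receive rewards near 1 while off-support pairs decay toward 0.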