Learning Noise-Induced Reward Functions for Surpassing Demonstrations in Imitation Learning

Authors: Liangyu Huo, Zulin Wang, Mai Xu

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on continuous control and high-dimensional discrete control tasks show the superiority of our LERP method over other state-of-the-art BD methods.
Researcher Affiliation | Academia | School of Electronic and Information Engineering, Beihang University, 37 Xueyuan Road, Haidian District, Beijing, P.R. China, 100191
Pseudocode | No | The paper describes its method using prose and mathematical equations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | In the experiment, we evaluated the proposed LERP method on several continuous MuJoCo tasks (Todorov, Erez, and Tassa 2012) and discrete Atari tasks (Bellemare et al. 2013) within OpenAI Gym (Brockman et al. 2016).
Dataset Splits | No | The paper describes the generation of synthetic demonstrations for training the reward function, but it does not define traditional train/validation/test dataset splits with percentages or sample counts that would be needed for reproducibility.
Hardware Specification | No | The paper does not provide specific details on the hardware (e.g., GPU, CPU models, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions software components like OpenAI Gym and PPO, but it does not provide specific version numbers for any software dependencies required to replicate the experiments.
Experiment Setup | Yes | For MuJoCo tasks, the under-trained checkpoints were used to generate demonstrations with the length of 1000 timesteps. Then, BC was employed to learn the initial policy π0 with the early stop trick to prevent overfitting. Similar to DREX, we injected 20 levels of noise into π0, i.e., η = {0.00, 0.05, 0.10, ..., 0.95}, and collected 10 interacted trajectories for each level. ... For Atari tasks, we generated 20 initial demonstrations of each task. The noise level η was sampled from {0.05, 0.25, 0.50, 0.75, 0.95}, and we collected 20 trajectories of each level and synthesized 15000 ranked demonstrations.
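
The quoted setup follows the D-REX-style procedure of rolling out a behavior-cloned policy under increasing action noise and ranking trajectories by noise level. The sketch below is a rough illustration of that procedure only; it assumes a Gymnasium-style environment interface and a callable BC policy `pi0`, and all names, constants, and helpers are hypothetical placeholders rather than the authors' released code.

```python
import itertools
import random

# Noise schedule and rollout budget quoted for the MuJoCo setup:
# 20 noise levels {0.00, 0.05, ..., 0.95}, 10 rollouts per level, 1000 steps each.
NOISE_LEVELS = [round(0.05 * i, 2) for i in range(20)]
ROLLOUTS_PER_LEVEL = 10
MAX_STEPS = 1000


def rollout_with_noise(env, pi0, eta, max_steps=MAX_STEPS):
    """Run pi0, replacing its action with a random one with probability eta."""
    obs, _ = env.reset()
    trajectory = []
    for _ in range(max_steps):
        if random.random() < eta:
            action = env.action_space.sample()  # injected noise
        else:
            action = pi0(obs)                   # action from the BC policy
        obs, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((obs, action))
        if terminated or truncated:
            break
    return trajectory, eta


def generate_ranked_pairs(env, pi0):
    """Collect noisy rollouts and rank pairs by noise level (less noise = better)."""
    rollouts = [
        rollout_with_noise(env, pi0, eta)
        for eta in NOISE_LEVELS
        for _ in range(ROLLOUTS_PER_LEVEL)
    ]
    pairs = []
    for (traj_i, eta_i), (traj_j, eta_j) in itertools.combinations(rollouts, 2):
        if eta_i != eta_j:  # only rank across different noise levels
            better, worse = (traj_i, traj_j) if eta_i < eta_j else (traj_j, traj_i)
            pairs.append((worse, better))  # (lower-ranked, higher-ranked)
    return pairs
```

In practice the full set of cross-level pairs would be subsampled to a fixed budget, consistent with the 15000 ranked demonstrations the paper reports synthesizing for the Atari tasks.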