Learning Noise-Induced Reward Functions for Surpassing Demonstrations in Imitation Learning
Authors: Liangyu Huo, Zulin Wang, Mai Xu
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on continuous control and high-dimensional discrete control tasks show the superiority of our LERP method over other state-of-the-art BD methods. |
| Researcher Affiliation | Academia | School of Electronic and Information Engineering, Beihang University, 37 Xueyuan Road, Haidian District, Beijing, P.R. China, 100191 |
| Pseudocode | No | The paper describes its method using prose and mathematical equations, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | In the experiment, we evaluated the proposed LERP method on several continuous MuJoCo tasks (Todorov, Erez, and Tassa 2012) and discrete Atari tasks (Bellemare et al. 2013) within OpenAI Gym (Brockman et al. 2016). (A minimal environment-setup sketch follows the table.) |
| Dataset Splits | No | The paper describes the generation of synthetic demonstrations for training the reward function, but it does not define traditional train/validation/test splits with percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific details on the hardware (e.g., GPU, CPU models, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like OpenAI Gym and PPO, but it does not provide specific version numbers for any software dependencies required to replicate the experiments. |
| Experiment Setup | Yes | For MuJoCo tasks, the under-trained checkpoints were used to generate demonstrations with the length of 1000 timesteps. Then, BC was employed to learn the initial policy π0 with the early stop trick to prevent overfitting. Similar to DREX, we injected 20 levels of noise into π0, i.e., η = {0.00, 0.05, 0.10, ..., 0.95}, and collected 10 interacted trajectories for each level. ... For Atari tasks, we generated 20 initial demonstrations of each task. The noise level η was sampled from {0.05, 0.25, 0.50, 0.75, 0.95}, and we collected 20 trajectories of each level and synthesized 15000 ranked demonstrations. (A sketch of this noise-injection protocol follows the table.) |
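
The Open Datasets row cites MuJoCo and Atari tasks within OpenAI Gym. Below is a minimal sketch of how such environments are typically instantiated; the specific environment IDs (`Hopper-v2`, `BreakoutNoFrameskip-v4`) and the pre-0.26 Gym reset/step API are assumptions, since the paper names neither the individual tasks in this quote nor the library versions used.

```python
import gym

# Assumed environment IDs: the paper evaluates on MuJoCo and Atari tasks within
# OpenAI Gym, but the quoted text does not name the specific environments.
mujoco_env = gym.make("Hopper-v2")               # continuous control (requires mujoco-py)
atari_env = gym.make("BreakoutNoFrameskip-v4")   # high-dimensional discrete control

# Classic (pre-0.26) Gym rollout loop; the Gym version is not reported in the paper.
obs = mujoco_env.reset()
done = False
while not done:
    action = mujoco_env.action_space.sample()    # placeholder policy
    obs, reward, done, info = mujoco_env.step(action)
```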
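
The Experiment Setup row describes a D-REX-style protocol: roll out the BC policy π0 under increasing noise levels and rank the resulting trajectories by noise. The sketch below illustrates that protocol under stated assumptions: `env` and `bc_policy` are placeholders, and the epsilon-greedy form of the injected noise (a random action with probability η) is an assumption rather than a detail quoted from the paper.

```python
import numpy as np

# Sketch of the quoted noise-injection setup, under assumptions:
# `env` and `bc_policy` are placeholders, and noise is injected epsilon-greedily
# (a random action replaces pi_0's action with probability eta).

MUJOCO_NOISE_LEVELS = [round(0.05 * k, 2) for k in range(20)]  # eta = {0.00, 0.05, ..., 0.95}
TRAJS_PER_LEVEL = 10                                           # 10 trajectories per level (MuJoCo)
MAX_STEPS = 1000                                               # demonstration length of 1000 timesteps

def rollout_with_noise(env, bc_policy, eta, max_steps=MAX_STEPS):
    """Roll out pi_0, replacing its action with a random one with probability eta."""
    states, actions = [], []
    obs = env.reset()
    for _ in range(max_steps):
        if np.random.rand() < eta:
            action = env.action_space.sample()   # injected noise
        else:
            action = bc_policy(obs)              # BC policy pi_0
        states.append(obs)
        actions.append(action)
        obs, _, done, _ = env.step(action)
        if done:
            break
    return {"states": states, "actions": actions, "eta": eta}

def synthesize_ranked_pairs(trajectories, num_pairs, rng=np.random):
    """Pair trajectories so the one generated with less noise is ranked higher."""
    pairs = []
    while len(pairs) < num_pairs:
        i, j = rng.choice(len(trajectories), size=2, replace=False)
        if trajectories[i]["eta"] < trajectories[j]["eta"]:
            pairs.append((trajectories[i], trajectories[j]))   # i preferred over j
        elif trajectories[j]["eta"] < trajectories[i]["eta"]:
            pairs.append((trajectories[j], trajectories[i]))
    return pairs

# Usage (MuJoCo settings from the quote): 20 noise levels x 10 trajectories each.
# demos = [rollout_with_noise(env, bc_policy, eta)
#          for eta in MUJOCO_NOISE_LEVELS for _ in range(TRAJS_PER_LEVEL)]
# ranked = synthesize_ranked_pairs(demos, num_pairs=15000)  # 15000 ranked demos (Atari setting)
```

Whether LERP ranks full trajectories or sub-trajectory snippets at this step is not specified in the quoted text, so the pairing above is only illustrative.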