Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance
Authors: Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Chao Yang, Bin Fang, Huaping Liu
AAAI 2020, pp. 5109-5116 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Considerable empirical evaluations on a comprehensive collection of benchmarks indicate our method attains consistent improvement over other RLfD counterparts. |
| Researcher Affiliation | Academia | (1) Beijing National Research Center for Information Science and Technology (BNRist), State Key Lab on Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; (2) Center for Vision, Cognition, Learning and Autonomy, Department of Computer Science, UCLA, CA 90095, USA |
| Pseudocode | Yes | Algorithm 1: RLfD with a Soft Constraint (a schematic sketch of the soft-constraint idea follows the table) |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | To simulate the sparse reward conditions using existing control tasks in Gym, we first propose several reward sparsification methods... We train expert policies... for each tested task with PPO (Schulman et al. 2017) based on the exact reward... We select six groups of demonstrations with different amounts from 50 to 5000 for comparison on the Half Cheetah task. (Duan et al. 2016; Brockman et al. 2016) A hypothetical sparsification wrapper is sketched after the table. |
| Dataset Splits | No | The paper does not provide explicit training/test/validation dataset splits, such as percentages or sample counts, for reproducing the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like 'PPO' and 'Gym' but does not specify their version numbers or other ancillary software dependencies with versions. |
| Experiment Setup | Yes | To simulate the sparse reward conditions using existing control tasks in Gym, we first propose several reward sparsification methods... the policies of all the methods and tasks are parameterized by the same neural network architecture with two hidden layers (300 and 400 units) and tanh activation functions. All the algorithms are evaluated within a fixed amount of environment steps, and for every single task we run each algorithm five times with different random seeds. We design four groups of parameters for the ablation experiments on the tolerance choice in the Half Cheetah task, where the annealing mechanism is disabled by keeping ϵ fixed at zero, and the initial tolerance d0 is chosen from {10^0, 10^-1, 10^-3, 10^-6}. The described network is sketched after the table. |
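
The Pseudocode row names "Algorithm 1: RLfD with a Soft Constraint" but the table does not reproduce it. The snippet below is only a schematic sketch of the soft-constraint idea, assuming a Lagrangian relaxation of a constraint that keeps the learned policy within a tolerance d_k of the expert (e.g., via a discriminator-based divergence estimate); the function names, the dual-ascent update, and the use of a discriminator are assumptions, not the authors' exact procedure.

```python
def soft_constrained_loss(policy_loss, divergence_estimate, lagrange_multiplier, tolerance):
    """Lagrangian relaxation of:  minimize policy_loss  s.t.  D(pi, pi_E) <= tolerance.

    policy_loss          -- the usual (to-be-minimized) RL surrogate loss
    divergence_estimate  -- differentiable estimate of the divergence to the expert,
                            e.g. from a GAIL-style discriminator (an assumption here)
    lagrange_multiplier  -- lambda >= 0, updated by the dual-ascent step below
    tolerance            -- current constraint level d_k (annealed during training)
    """
    return policy_loss + lagrange_multiplier * (divergence_estimate - tolerance)


def dual_ascent_step(lagrange_multiplier, divergence_value, tolerance, step_size=1e-3):
    """Grow lambda while the constraint is violated, shrink it otherwise."""
    new_value = lagrange_multiplier + step_size * (divergence_value - tolerance)
    return max(0.0, new_value)  # the multiplier must stay non-negative
```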
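The Open Datasets row quotes the paper's reward sparsification of standard Gym control tasks without giving the exact schemes. The wrapper below is a hypothetical illustration of one plausible scheme, releasing the accumulated dense reward only every few steps, assuming the classic (pre-0.26) `gym` step API; it is not the paper's sparsification method.

```python
import gym


class DelayedRewardWrapper(gym.Wrapper):
    """Hypothetical reward sparsification: accumulate the dense reward and
    release it only every `delay` steps (and at episode end).

    Illustrates the general idea of sparsifying a Gym control task; it is
    NOT the paper's exact scheme. Assumes env.step returns (obs, reward, done, info)."""

    def __init__(self, env, delay=20):
        super().__init__(env)
        self.delay = delay
        self._accumulated = 0.0
        self._steps = 0

    def reset(self, **kwargs):
        self._accumulated, self._steps = 0.0, 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._accumulated += reward
        self._steps += 1
        if done or self._steps % self.delay == 0:
            sparse_reward, self._accumulated = self._accumulated, 0.0
        else:
            sparse_reward = 0.0
        return obs, sparse_reward, done, info
```

A typical use would wrap an existing task, e.g. `DelayedRewardWrapper(gym.make("HalfCheetah-v2"), delay=20)`; the environment name and delay value here are illustrative only.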
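The Experiment Setup row specifies the policy architecture: two hidden layers of 300 and 400 units with tanh activations. The sketch below instantiates that description, assuming PyTorch and a Gaussian output head with a state-independent log-std; only the hidden sizes and the activation come from the quote, the rest is an assumption for illustration.

```python
import torch
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """MLP policy matching the quoted description: two hidden layers
    (300 and 400 units) with tanh activations. The layer ordering and the
    Gaussian head with a state-independent log-std are assumptions."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 300), nn.Tanh(),
            nn.Linear(300, 400), nn.Tanh(),
        )
        self.mean_head = nn.Linear(400, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        hidden = self.body(obs)
        return self.mean_head(hidden), self.log_std.exp()
```

A PPO-style trainer, as referenced in the quoted setup, would sample actions from a Normal distribution parameterized by the mean and standard deviation this network returns.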