Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance
Authors: Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Chao Yang, Bin Fang, Huaping Liu
AAAI 2020, pp. 5109-5116 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Considerable empirical evaluations on a comprehensive collection of benchmarks indicate our method attains consistent improvement over other RLfD counterparts. |
| Researcher Affiliation | Academia | (1) Beijing National Research Center for Information Science and Technology (BNRist), State Key Lab on Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; (2) Center for Vision, Cognition, Learning and Autonomy, Department of Computer Science, UCLA, CA 90095, USA |
| Pseudocode | Yes | Algorithm 1: RLfD with a Soft Constraint (a schematic sketch of the soft-constraint idea follows the table) |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | To simulate the sparse reward conditions using existing control tasks in Gym, we first propose several reward sparsification methods... We train expert policies... for each tested task with PPO (Schulman et al. 2017) based on the exact reward... We select six groups of demonstrations with different amounts from 50 to 5000 for comparison on the Half Cheetah task. (Duan et al. 2016; Brockman et al. 2016) A hypothetical sparsification wrapper is sketched after the table. |
| Dataset Splits | No | The paper does not provide explicit training/test/validation dataset splits, such as percentages or sample counts, for reproducing the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like 'PPO' and 'Gym' but does not specify their version numbers or other ancillary software dependencies with versions. |
| Experiment Setup | Yes | To simulate the sparse reward conditions using existing control tasks in Gym, we first propose several reward sparsification methods... the policies of all the methods and tasks are parameterized by the same neural network architecture with two hidden layers (300 and 400 units) and tanh activation functions. All the algorithms are evaluated within a fixed amount of environment steps, and for every single task we run each algorithm five times with different random seeds. We design four groups of parameters for the ablation experiments on the tolerance choice in the Half Cheetah task, where the annealing mechanism is disabled by keeping ϵ fixed at zero, and the initial tolerance d0 is chosen from {10^0, 10^-1, 10^-3, 10^-6}. The described network is sketched after the table. |
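
The Pseudocode row names "Algorithm 1: RLfD with a Soft Constraint" but the table does not reproduce it. The snippet below is only a schematic sketch of the soft-constraint idea, assuming a Lagrangian relaxation of a constraint that keeps the learned policy within a tolerance d_k of the expert (e.g., via a discriminator-based divergence estimate); the function names, the dual-ascent update, and the use of a discriminator are assumptions, not the authors' exact procedure.

```python
def soft_constrained_loss(policy_loss, divergence_estimate, lagrange_multiplier, tolerance):
    """Lagrangian relaxation of:  minimize policy_loss  s.t.  D(pi, pi_E) <= tolerance.

    policy_loss          -- the usual (to-be-minimized) RL surrogate loss
    divergence_estimate  -- differentiable estimate of the divergence to the expert,
                            e.g. from a GAIL-style discriminator (an assumption here)
    lagrange_multiplier  -- lambda >= 0, updated by the dual-ascent step below
    tolerance            -- current constraint level d_k (annealed during training)
    """
    return policy_loss + lagrange_multiplier * (divergence_estimate - tolerance)


def dual_ascent_step(lagrange_multiplier, divergence_value, tolerance, step_size=1e-3):
    """Grow lambda while the constraint is violated, shrink it otherwise."""
    new_value = lagrange_multiplier + step_size * (divergence_value - tolerance)
    return max(0.0, new_value)  # the multiplier must stay non-negative
```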
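The Open Datasets row quotes the paper's reward sparsification of standard Gym control tasks without giving the exact schemes. The wrapper below is a hypothetical illustration of one plausible scheme, releasing the accumulated dense reward only every few steps, assuming the classic (pre-0.26) `gym` step API; it is not the paper's sparsification method.

```python
import gym


class DelayedRewardWrapper(gym.Wrapper):
    """Hypothetical reward sparsification: accumulate the dense reward and
    release it only every `delay` steps (and at episode end).

    Illustrates the general idea of sparsifying a Gym control task; it is
    NOT the paper's exact scheme. Assumes env.step returns (obs, reward, done, info)."""

    def __init__(self, env, delay=20):
        super().__init__(env)
        self.delay = delay
        self._accumulated = 0.0
        self._steps = 0

    def reset(self, **kwargs):
        self._accumulated, self._steps = 0.0, 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._accumulated += reward
        self._steps += 1
        if done or self._steps % self.delay == 0:
            sparse_reward, self._accumulated = self._accumulated, 0.0
        else:
            sparse_reward = 0.0
        return obs, sparse_reward, done, info
```

A typical use would wrap an existing task, e.g. `DelayedRewardWrapper(gym.make("HalfCheetah-v2"), delay=20)`; the environment name and delay value here are illustrative only.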
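The Experiment Setup row specifies the policy architecture: two hidden layers of 300 and 400 units with tanh activations. The sketch below instantiates that description, assuming PyTorch and a Gaussian output head with a state-independent log-std; only the hidden sizes and the activation come from the quote, the rest is an assumption for illustration.

```python
import torch
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """MLP policy matching the quoted description: two hidden layers
    (300 and 400 units) with tanh activations. The layer ordering and the
    Gaussian head with a state-independent log-std are assumptions."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 300), nn.Tanh(),
            nn.Linear(300, 400), nn.Tanh(),
        )
        self.mean_head = nn.Linear(400, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        hidden = self.body(obs)
        return self.mean_head(hidden), self.log_std.exp()
```

A PPO-style trainer, as referenced in the quoted setup, would sample actions from a Normal distribution parameterized by the mean and standard deviation this network returns.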