Limited Preference Aided Imitation Learning from Imperfect Demonstrations

Authors: Xingchen Cao, Fan-Ming Luo, Junyin Ye, Tian Xu, Zhilong Zhang, Yang Yu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive empirical results across a synthetic task and two locomotion benchmarks show that PAIL surpasses baselines by 73.2% and breaks through the performance bottleneck of imperfect demonstrations.
Researcher Affiliation | Collaboration | National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu, China; School of Artificial Intelligence, Nanjing University, China; Polixir Technologies, Nanjing, Jiangsu, China. Correspondence to: Yang Yu <yuy@nju.edu.cn>.
Pseudocode | Yes | Algorithm 1 PAIL
Open Source Code | No | The paper states that components were developed by modifying existing codebases (the f-IRL and BPref codebases) but does not provide an explicit statement or link for an open-source release of the PAIL implementation described in the paper.
Open Datasets | Yes | We consider a synthetic task, i.e. Grid World, along with 5 locomotion tasks of Mujoco benchmark (Todorov et al., 2012), and 3 locomotion tasks of DMControl (DMC) benchmark (Tassa et al., 2018; Tunyasuvunakool et al., 2020).
Dataset Splits | No | The paper describes using 'imperfect demonstrations' and 'training stages of RL' but does not specify explicit training/validation/test splits with percentages or sample counts for reproduction.
Hardware Specification | No | No specific details of the hardware (e.g., GPU/CPU models or memory amounts) used to run the experiments are provided in the paper.
Software Dependencies | No | The paper mentions using SAC, TRPO, the f-IRL codebase, and the BPref codebase but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Appendix D.5 'Hyper-parameters' provides detailed tables for the entropy coefficient β (Table 6), preference reward r_p (Table 7), discriminator reward r_d (Table 8), and SAC (Table 9), including specific values for batch size, learning rate, number of layers, hidden dimensions, and activation functions.
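
The Pseudocode and Experiment Setup rows above refer to Algorithm 1 (PAIL), a preference reward r_p, a discriminator reward r_d, and SAC. Since the paper's own implementation is not released, the minimal PyTorch sketch below only illustrates how two such learned reward models might be combined into a single training signal for an off-the-shelf SAC learner; the network sizes, the additive combination with weight lambda_p, and all numeric values are assumptions for illustration, not the authors' method.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small fully connected reward network. Depth, width, and activation are
    placeholders, not the values from the paper's Appendix D.5 tables."""
    def __init__(self, in_dim, hidden_dim=256, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

obs_dim, act_dim = 17, 6           # placeholder sizes for a MuJoCo locomotion task
reward_d = MLP(obs_dim + act_dim)  # discriminator-based reward r_d
reward_p = MLP(obs_dim + act_dim)  # preference-based reward r_p
lambda_p = 1.0                     # mixing weight; an assumption, not from the paper

def combined_reward(obs, act):
    """Additive combination of r_d and r_p used as the RL training signal.
    PAIL's exact combination rule may differ; this is only a sketch."""
    sa = torch.cat([obs, act], dim=-1)
    return reward_d(sa) + lambda_p * reward_p(sa)

# Example: score a batch of transitions before an SAC policy update.
obs = torch.randn(256, obs_dim)    # batch size 256 is a placeholder
act = torch.randn(256, act_dim)
print(combined_reward(obs, act).shape)  # torch.Size([256, 1])
```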
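For the Experiment Setup row, Appendix D.5 tabulates hyper-parameters per component (Tables 6 to 9). A reproduction might gather them into a single configuration object as sketched below; every value shown is a hypothetical placeholder and must be replaced with the numbers reported in the paper's tables.

```python
# Hypothetical layout for the hyper-parameters tabulated in Appendix D.5.
# All values are placeholders; consult Tables 6-9 of the paper for the real ones.
pail_config = {
    "entropy_coefficient_beta": 0.1,   # Table 6
    "preference_reward": {             # Table 7 (r_p)
        "batch_size": 128,
        "learning_rate": 3e-4,
        "num_layers": 3,
        "hidden_dim": 256,
        "activation": "relu",
    },
    "discriminator_reward": {          # Table 8 (r_d)
        "batch_size": 128,
        "learning_rate": 3e-4,
        "num_layers": 2,
        "hidden_dim": 256,
        "activation": "tanh",
    },
    "sac": {                           # Table 9
        "batch_size": 256,
        "learning_rate": 3e-4,
        "num_layers": 2,
        "hidden_dim": 256,
        "activation": "relu",
    },
}
```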