Limited Preference Aided Imitation Learning from Imperfect Demonstrations

Authors: Xingchen Cao, Fan-Ming Luo, Junyin Ye, Tian Xu, Zhilong Zhang, Yang Yu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive empirical results across a synthetic task and two locomotion benchmarks show that PAIL surpasses baselines by 73.2% and breaks through the performance bottleneck of imperfect demonstrations.
Researcher Affiliation | Collaboration | National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu, China; School of Artificial Intelligence, Nanjing University, China; Polixir Technologies, Nanjing, Jiangsu, China. Correspondence to: Yang Yu <yuy@nju.edu.cn>.
Pseudocode | Yes | Algorithm 1 PAIL
Open Source Code | No | The paper states that components were developed by modifying existing codebases (the f-IRL and BPref codebases) but does not provide an explicit statement or link for an open-source release of the PAIL implementation described in the paper.
Open Datasets | Yes | We consider a synthetic task, i.e. Grid World, along with 5 locomotion tasks of Mujoco benchmark (Todorov et al., 2012), and 3 locomotion tasks of DMControl (DMC) benchmark (Tassa et al., 2018; Tunyasuvunakool et al., 2020).
Dataset Splits | No | The paper describes using 'imperfect demonstrations' and 'training stages of RL' but does not specify explicit training/validation/test splits with percentages or sample counts for reproduction.
Hardware Specification | No | No specific details of the hardware (e.g., GPU/CPU models or memory amounts) used to run the experiments are provided in the paper.
Software Dependencies | No | The paper mentions using SAC, TRPO, the f-IRL codebase, and the BPref codebase but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Appendix D.5 'Hyper-parameters' provides detailed tables for the entropy coefficient β (Table 6), preference reward r_p (Table 7), discriminator reward r_d (Table 8), and SAC (Table 9), including specific values for batch size, learning rate, number of layers, hidden dimensions, and activation functions.
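
The Pseudocode and Experiment Setup rows above refer to Algorithm 1 (PAIL), a preference reward r_p, a discriminator reward r_d, and SAC. Since the paper's own implementation is not released, the minimal PyTorch sketch below only illustrates how two such learned reward models might be combined into a single training signal for an off-the-shelf SAC learner; the network sizes, the additive combination with weight lambda_p, and all numeric values are assumptions for illustration, not the authors' method.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small fully connected reward network. Depth, width, and activation are
    placeholders, not the values from the paper's Appendix D.5 tables."""
    def __init__(self, in_dim, hidden_dim=256, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

obs_dim, act_dim = 17, 6           # placeholder sizes for a MuJoCo locomotion task
reward_d = MLP(obs_dim + act_dim)  # discriminator-based reward r_d
reward_p = MLP(obs_dim + act_dim)  # preference-based reward r_p
lambda_p = 1.0                     # mixing weight; an assumption, not from the paper

def combined_reward(obs, act):
    """Additive combination of r_d and r_p used as the RL training signal.
    PAIL's exact combination rule may differ; this is only a sketch."""
    sa = torch.cat([obs, act], dim=-1)
    return reward_d(sa) + lambda_p * reward_p(sa)

# Example: score a batch of transitions before an SAC policy update.
obs = torch.randn(256, obs_dim)    # batch size 256 is a placeholder
act = torch.randn(256, act_dim)
print(combined_reward(obs, act).shape)  # torch.Size([256, 1])
```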
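For the Experiment Setup row, Appendix D.5 tabulates hyper-parameters per component (Tables 6 to 9). A reproduction might gather them into a single configuration object as sketched below; every value shown is a hypothetical placeholder and must be replaced with the numbers reported in the paper's tables.

```python
# Hypothetical layout for the hyper-parameters tabulated in Appendix D.5.
# All values are placeholders; consult Tables 6-9 of the paper for the real ones.
pail_config = {
    "entropy_coefficient_beta": 0.1,   # Table 6
    "preference_reward": {             # Table 7 (r_p)
        "batch_size": 128,
        "learning_rate": 3e-4,
        "num_layers": 3,
        "hidden_dim": 256,
        "activation": "relu",
    },
    "discriminator_reward": {          # Table 8 (r_d)
        "batch_size": 128,
        "learning_rate": 3e-4,
        "num_layers": 2,
        "hidden_dim": 256,
        "activation": "tanh",
    },
    "sac": {                           # Table 9
        "batch_size": 256,
        "learning_rate": 3e-4,
        "num_layers": 2,
        "hidden_dim": 256,
        "activation": "relu",
    },
}
```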