Limited Preference Aided Imitation Learning from Imperfect Demonstrations
Authors: Xingchen Cao, Fan-Ming Luo, Junyin Ye, Tian Xu, Zhilong Zhang, Yang Yu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive empirical results across a synthetic task and two locomotion benchmarks show that PAIL surpasses baselines by 73.2% and breaks through the performance bottleneck of imperfect demonstrations. |
| Researcher Affiliation | Collaboration | National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu, China; School of Artificial Intelligence, Nanjing University, China; Polixir Technologies, Nanjing, Jiangsu, China. Correspondence to: Yang Yu <yuy@nju.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 (PAIL) in the main text; a hedged sketch of the reward components it combines appears below this table. |
| Open Source Code | No | The paper states that components were built by modifying existing codebases (the f-IRL and BPref codebases) but gives no explicit statement or link for an open-source release of the PAIL implementation itself. |
| Open Datasets | Yes | The paper evaluates on a synthetic Grid World task, 5 locomotion tasks from the MuJoCo benchmark (Todorov et al., 2012), and 3 locomotion tasks from the DMControl (DMC) benchmark (Tassa et al., 2018; Tunyasuvunakool et al., 2020). |
| Dataset Splits | No | The paper describes using 'imperfect demonstrations' and 'training stages of RL' but does not specify explicit training/validation/test splits (percentages or sample counts) needed for reproduction. |
| Hardware Specification | No | No hardware details (e.g., GPU/CPU models or memory) used to run the experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions using SAC, TRPO, the f-IRL codebase, and the BPref codebase but does not provide version numbers for any software dependency. |
| Experiment Setup | Yes | Appendix D.5 'Hyper-parameters' provides detailed tables for the entropy coefficient β (Table 6), the preference reward r_p (Table 7), the discriminator reward r_d (Table 8), and SAC (Table 9), covering batch size, learning rate, number of layers, hidden dimensions, and activation functions; a hedged config sketch appears below this table. |
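For orientation, the snippet below sketches the two learned reward signals the table rows refer to: a GAIL-style discriminator reward r_d and a Bradley-Terry preference reward r_p fit to pairwise labels, summed into the reward that SAC optimizes. The network sizes, the `RewardMLP` and `combined_reward` names, and the equal-weight sum are illustrative assumptions for this sketch, not the paper's exact Algorithm 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardMLP(nn.Module):
    """Two-hidden-layer MLP scoring (state, action) pairs; sizes are assumed."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def discriminator_reward(disc: RewardMLP, obs, act):
    """GAIL-style reward r_d = -log(1 - D(s, a)), with D = sigmoid(disc logit)."""
    return -F.logsigmoid(-disc(obs, act))


def preference_loss(r_p: RewardMLP, seg_a, seg_b, prefer_a: torch.Tensor):
    """Bradley-Terry loss on a segment pair; prefer_a is 1.0 if segment a won."""
    ret_a = r_p(*seg_a).sum(-1)  # predicted return of each segment
    ret_b = r_p(*seg_b).sum(-1)
    return F.binary_cross_entropy_with_logits(ret_a - ret_b, prefer_a)


def combined_reward(disc: RewardMLP, r_p: RewardMLP, obs, act):
    """Reward fed to SAC; the equal-weight sum is an assumption of this sketch."""
    return discriminator_reward(disc, obs, act) + r_p(obs, act)


if __name__ == "__main__":
    obs_dim, act_dim, seg_len = 11, 3, 25
    disc, r_p = RewardMLP(obs_dim, act_dim), RewardMLP(obs_dim, act_dim)
    seg_a = (torch.randn(seg_len, obs_dim), torch.randn(seg_len, act_dim))
    seg_b = (torch.randn(seg_len, obs_dim), torch.randn(seg_len, act_dim))
    loss = preference_loss(r_p, seg_a, seg_b, torch.tensor(1.0))
    loss.backward()  # one gradient step's worth of preference-reward learning
    print(combined_reward(disc, r_p, *seg_a).shape)  # torch.Size([25])
```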
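Similarly, here is a minimal sketch of how the Appendix D.5 tables could be gathered into a single config object. The field names mirror the quantities those tables report (batch size, learning rate, number of layers, hidden dimensions, activation), but every default value below is a placeholder rather than the paper's reported setting.

```python
from dataclasses import dataclass, field


@dataclass
class RewardModelConfig:
    """Fields matching the headings of Tables 7-8 (preference / discriminator
    reward); every default is a placeholder, not the paper's reported value."""
    batch_size: int = 256
    learning_rate: float = 3e-4
    num_layers: int = 2
    hidden_dim: int = 256
    activation: str = "relu"


@dataclass
class SACConfig:
    """Table 9 fields; defaults are common SAC settings, placeholders only."""
    batch_size: int = 256
    learning_rate: float = 3e-4
    discount: float = 0.99
    tau: float = 0.005  # target-network soft-update rate


@dataclass
class PAILConfig:
    entropy_coef_beta: float = 0.1  # Table 6; placeholder value
    preference_reward: RewardModelConfig = field(default_factory=RewardModelConfig)
    discriminator_reward: RewardModelConfig = field(default_factory=RewardModelConfig)
    sac: SACConfig = field(default_factory=SACConfig)


# Per-task settings would then be small deltas on the defaults, e.g.:
cfg = PAILConfig(entropy_coef_beta=0.05)
```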