Imitation Learning from Vague Feedback
Authors: Xin-Qiang Cai, Yu-Jie Zhang, Chao-Kai Chiang, Masashi Sugiyama
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our methods outperform standard and preference-based imitation learning methods on various tasks. |
| Researcher Affiliation | Academia | The University of Tokyo, Tokyo, Japan; RIKEN AIP, Tokyo, Japan |
| Pseudocode | Yes | Algorithm 1 Expert Ratio Estimation; Algorithm 2 COMPILER/COMPILER-E |
| Open Source Code | Yes | The code is available at https://github.com/caixq1996/COMPILER. |
| Open Datasets | No | The paper describes how the demonstration pool was generated from policies trained in MuJoCo environments, but it does not provide access information (a link, DOI, or specific citation to a public dataset) for this demonstration pool. MuJoCo itself is a physics engine, not a dataset in this context. |
| Dataset Splits | No | The paper describes the generation of the datasets Γ+ and Γ− from a demonstration pool, but it does not specify explicit train/validation/test splits (e.g., percentages or sample counts) for training the imitation learning agents (see the first sketch after the table for how such sets might be assembled). |
| Hardware Specification | No | The paper does not mention any specific hardware specifications such as GPU models (e.g., NVIDIA A100), CPU models, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | While various software components and algorithms are mentioned (e.g., PPO, Adam, DDPG, GAIL, AIRL, MuJoCo, OpenAI Baselines), none of them are accompanied by specific version numbers required for reproducibility. |
| Experiment Setup | Yes | We choose Proximal Policy Optimization (PPO) [45] as the basic RL algorithm, and set all hyper-parameters, update frequency, and network architectures of the policy part the same as [46]. Besides, the hyper-parameters of the discriminator for all methods were the same: The discriminator was updated using Adam with a decayed learning rate of 3 × 10⁻⁴; the batch size was 256. The ratio of update frequency between the learner and discriminator was 3:1. (A minimal configuration sketch of these settings appears as the second snippet after the table.) |
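
The dataset-splits row above refers to the two demonstration sets Γ+ and Γ−. As a minimal sketch of how such sets might be assembled from a trajectory pool, the snippet below mixes expert and non-expert trajectories at two different proportions; the function name `make_vague_sets`, the set size, and the ratios 0.7/0.3 are illustrative assumptions, not the paper's actual settings.

```python
import random

def make_vague_sets(expert_trajs, novice_trajs, n_per_set=100,
                    ratio_plus=0.7, ratio_minus=0.3, seed=0):
    """Assemble two demonstration sets with different expert proportions,
    mimicking the Gamma+ / Gamma- construction mentioned above.
    All defaults are illustrative, not the paper's settings."""
    rng = random.Random(seed)

    def mix(ratio):
        n_expert = int(round(ratio * n_per_set))
        batch = (rng.sample(expert_trajs, n_expert)
                 + rng.sample(novice_trajs, n_per_set - n_expert))
        rng.shuffle(batch)
        return batch

    return mix(ratio_plus), mix(ratio_minus)  # (Gamma+, Gamma-)
```

With the defaults above, Γ+ would contain 70 expert trajectories out of 100 and Γ− only 30, which is the kind of "vaguely better versus worse" split the paper studies.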
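The experiment-setup row can also be read as a configuration recipe. Below is a minimal PyTorch sketch of a GAIL-style discriminator trained with the reported settings (Adam, decayed learning rate of 3 × 10⁻⁴, batch size 256, 3:1 learner-to-discriminator update ratio). The network architecture, the linear decay schedule, and the dummy input tensors are assumptions; the paper states only the optimizer settings quoted above.

```python
import torch
from torch import nn

# Illustrative dimensions; the paper does not report the architecture here.
OBS_DIM, ACT_DIM, BATCH_SIZE, NUM_UPDATES = 17, 6, 256, 1000

disc = nn.Sequential(
    nn.Linear(OBS_DIM + ACT_DIM, 100),
    nn.Tanh(),
    nn.Linear(100, 1),
)

# Adam with a decayed learning rate of 3e-4 (from the paper); the exact
# decay schedule is not stated, so linear decay is an assumption.
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=1.0, end_factor=0.0, total_iters=NUM_UPDATES)
bce = nn.BCEWithLogitsLoss()

def discriminator_update(demo_batch, policy_batch):
    """One GAIL-style step: demonstrations labeled 1, learner samples 0."""
    logits = disc(torch.cat([demo_batch, policy_batch]))
    labels = torch.cat([torch.ones(len(demo_batch), 1),
                        torch.zeros(len(policy_batch), 1)])
    loss = bce(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
    return loss.item()

for step in range(NUM_UPDATES):
    # Three PPO learner updates would run here (3:1 update ratio);
    # random tensors stand in for real (state, action) batches.
    discriminator_update(torch.randn(BATCH_SIZE, OBS_DIM + ACT_DIM),
                         torch.randn(BATCH_SIZE, OBS_DIM + ACT_DIM))
```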