Imitation Learning from Vague Feedback

Authors: Xin-Qiang Cai, Yu-Jie Zhang, Chao-Kai Chiang, Masashi Sugiyama

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments show that our methods outperform standard and preference-based imitation learning methods on various tasks."
Researcher Affiliation | Academia | 1 The University of Tokyo, Tokyo, Japan; 2 RIKEN AIP, Tokyo, Japan
Pseudocode | Yes | Algorithm 1: Expert Ratio Estimation; Algorithm 2: COMPILER/COMPILER-E
Open Source Code | Yes | "The code is available on https://github.com/caixq1996/COMPILER."
Open Datasets | No | The paper describes how the demonstration pool was generated from policies trained in MuJoCo environments, but it provides no access information (link, DOI, or citation to a public dataset) for this demonstration pool. MuJoCo itself is a physics engine, not a dataset in this context.
Dataset Splits | No | The paper describes how the datasets Γ+ and Γ− are generated from a demonstration pool, but it does not specify explicit train/validation/test splits (e.g., percentages or sample counts) used to train the imitation learning agents.
Hardware Specification | No | The paper does not mention specific hardware such as GPU models (e.g., NVIDIA A100), CPU models, or cloud computing instance types used to run the experiments.
Software Dependencies | No | Although various software components and algorithms are mentioned (e.g., PPO, Adam, DDPG, GAIL, AIRL, MuJoCo, OpenAI Baselines), none are accompanied by the specific version numbers required for reproducibility.
Experiment Setup | Yes | "We choose Proximal Policy Optimization (PPO) [45] as the basic RL algorithm, and set all hyper-parameters, update frequency, and network architectures of the policy part the same as [46]. Besides, the hyper-parameters of the discriminator for all methods were the same: the discriminator was updated using Adam with a decayed learning rate of 3 × 10⁻⁴; the batch size was 256. The ratio of update frequency between the learner and discriminator was 3:1."
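For context, below is a minimal sketch of the reported discriminator setup (Adam with a decayed learning rate of 3 × 10⁻⁴, batch size 256, 3:1 learner-to-discriminator update ratio), assuming a PyTorch implementation. The network architecture, the exponential decay schedule, the `discriminator_step` helper, and the example observation/action sizes are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of the reported discriminator training setup.
# Only Adam, a decayed lr of 3e-4, batch size 256, and the 3:1 update ratio
# come from the paper; everything else below is an assumption.
import torch
import torch.nn as nn

BATCH_SIZE = 256
LEARNER_UPDATES_PER_DISC_UPDATE = 3  # 3:1 ratio reported in the paper

# Simple MLP discriminator over concatenated (state, action) pairs -- architecture assumed.
def make_discriminator(obs_dim: int, act_dim: int) -> nn.Module:
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, 256), nn.Tanh(),
        nn.Linear(256, 256), nn.Tanh(),
        nn.Linear(256, 1),
    )

disc = make_discriminator(obs_dim=17, act_dim=6)  # e.g., MuJoCo Walker2d sizes
optimizer = torch.optim.Adam(disc.parameters(), lr=3e-4)
# The paper only says "decayed learning rate"; an exponential schedule is assumed here.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(expert_batch: torch.Tensor, learner_batch: torch.Tensor) -> float:
    """One GAIL-style discriminator update: expert pairs labeled 1, learner pairs 0."""
    logits_expert = disc(expert_batch)
    logits_learner = disc(learner_batch)
    loss = bce(logits_expert, torch.ones_like(logits_expert)) + \
           bce(logits_learner, torch.zeros_like(logits_learner))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```

In an outer training loop, the learner (PPO) would be updated three times for every call to `discriminator_step`, matching the 3:1 ratio quoted above.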