Imitation Learning from Vague Feedback
Authors: Xin-Qiang Cai, Yu-Jie Zhang, Chao-Kai Chiang, Masashi Sugiyama
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our methods outperform standard and preference-based imitation learning methods on various tasks. |
| Researcher Affiliation | Academia | The University of Tokyo, Tokyo, Japan; RIKEN AIP, Tokyo, Japan |
| Pseudocode | Yes | Algorithm 1 Expert Ratio Estimation; Algorithm 2 COMPILER/COMPILER-E |
| Open Source Code | Yes | The code is available at https://github.com/caixq1996/COMPILER. |
| Open Datasets | No | The paper describes how the demonstration pool was generated from policies trained in MuJoCo environments, but it does not provide access information (a link, DOI, or specific citation to a public dataset) for this demonstration pool. MuJoCo itself is a physics engine, not a dataset in this context. |
| Dataset Splits | No | The paper describes the generation of the datasets Γ+ and Γ− from a demonstration pool, but it does not specify explicit train/validation/test splits (e.g., percentages or sample counts) for training the imitation learning agents (see the first sketch after the table for how such sets might be assembled). |
| Hardware Specification | No | The paper does not mention any specific hardware specifications such as GPU models (e.g., NVIDIA A100), CPU models, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | While various software components and algorithms are mentioned (e.g., PPO, Adam, DDPG, GAIL, AIRL, MuJoCo, OpenAI Baselines), none of them are accompanied by specific version numbers required for reproducibility. |
| Experiment Setup | Yes | We choose Proximal Policy Optimization (PPO) [45] as the basic RL algorithm, and set all hyper-parameters, update frequency, and network architectures of the policy part the same as [46]. Besides, the hyper-parameters of the discriminator for all methods were the same: The discriminator was updated using Adam with a decayed learning rate of 3 × 10⁻⁴; the batch size was 256. The ratio of update frequency between the learner and discriminator was 3:1. (A minimal configuration sketch of these settings appears as the second snippet after the table.) |
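
The dataset-splits row above refers to the two demonstration sets Γ+ and Γ−. As a minimal sketch of how such sets might be assembled from a trajectory pool, the snippet below mixes expert and non-expert trajectories at two different proportions; the function name `make_vague_sets`, the set size, and the ratios 0.7/0.3 are illustrative assumptions, not the paper's actual settings.

```python
import random

def make_vague_sets(expert_trajs, novice_trajs, n_per_set=100,
                    ratio_plus=0.7, ratio_minus=0.3, seed=0):
    """Assemble two demonstration sets with different expert proportions,
    mimicking the Gamma+ / Gamma- construction mentioned above.
    All defaults are illustrative, not the paper's settings."""
    rng = random.Random(seed)

    def mix(ratio):
        n_expert = int(round(ratio * n_per_set))
        batch = (rng.sample(expert_trajs, n_expert)
                 + rng.sample(novice_trajs, n_per_set - n_expert))
        rng.shuffle(batch)
        return batch

    return mix(ratio_plus), mix(ratio_minus)  # (Gamma+, Gamma-)
```

With the defaults above, Γ+ would contain 70 expert trajectories out of 100 and Γ− only 30, which is the kind of "vaguely better versus worse" split the paper studies.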
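The experiment-setup row can also be read as a configuration recipe. Below is a minimal PyTorch sketch of a GAIL-style discriminator trained with the reported settings (Adam, decayed learning rate of 3 × 10⁻⁴, batch size 256, 3:1 learner-to-discriminator update ratio). The network architecture, the linear decay schedule, and the dummy input tensors are assumptions; the paper states only the optimizer settings quoted above.

```python
import torch
from torch import nn

# Illustrative dimensions; the paper does not report the architecture here.
OBS_DIM, ACT_DIM, BATCH_SIZE, NUM_UPDATES = 17, 6, 256, 1000

disc = nn.Sequential(
    nn.Linear(OBS_DIM + ACT_DIM, 100),
    nn.Tanh(),
    nn.Linear(100, 1),
)

# Adam with a decayed learning rate of 3e-4 (from the paper); the exact
# decay schedule is not stated, so linear decay is an assumption.
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=1.0, end_factor=0.0, total_iters=NUM_UPDATES)
bce = nn.BCEWithLogitsLoss()

def discriminator_update(demo_batch, policy_batch):
    """One GAIL-style step: demonstrations labeled 1, learner samples 0."""
    logits = disc(torch.cat([demo_batch, policy_batch]))
    labels = torch.cat([torch.ones(len(demo_batch), 1),
                        torch.zeros(len(policy_batch), 1)])
    loss = bce(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
    return loss.item()

for step in range(NUM_UPDATES):
    # Three PPO learner updates would run here (3:1 update ratio);
    # random tensors stand in for real (state, action) batches.
    discriminator_update(torch.randn(BATCH_SIZE, OBS_DIM + ACT_DIM),
                         torch.randn(BATCH_SIZE, OBS_DIM + ACT_DIM))
```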