Unlabeled Imperfect Demonstrations in Adversarial Imitation Learning

Authors: Yunke Wang, Bo Du, Chang Xu

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on MuJoCo and RoboSuite platforms demonstrate the effectiveness of our method from different aspects. In this section, we conduct experiments to verify the effectiveness of UID in various benchmarks (i.e., MuJoCo (Todorov, Erez, and Tassa 2012) and Robosuite (Zhu et al. 2020)) under different settings.
Researcher Affiliation | Academia | 1 School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, China. 2 School of Computer Science, Faculty of Engineering, The University of Sydney, Australia.
Pseudocode | Yes | Algorithm 1: UID-GAIL (a hedged sketch of the adversarial training step appears after the table).
Open Source Code | Yes | https://github.com/yunke-wang/UID
Open Datasets | Yes | We first evaluate UID on three MuJoCo (Todorov, Erez, and Tassa 2012) locomotion tasks (i.e., Ant-v2, HalfCheetah-v2 and Walker2d-v2). We also conduct experiments on a robot control task in Robosuite (Zhu et al. 2020). We use real-world demonstrations by human operators from the RoboTurk website (https://roboturk.stanford.edu/dataset_sim.html). A minimal environment-setup sketch follows the table.
Dataset Splits | No | No explicit details about train/validation/test dataset splits (e.g., percentages, sample counts for each split, or references to predefined splits with citations) were found.
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed machine specifications) used for running the experiments were provided.
Software Dependencies | No | The paper mentions the MuJoCo and Robosuite platforms but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We evaluate the agent every 5,000 transitions during training, and the reported results are the average of the last 100 evaluations. We add Gaussian noise ξ to the action a of πo to form a non-optimal expert πn. The action of πn is modeled as N(a, ξ²), and we choose ξ ∈ {0.25, 0.4, 0.6} for these three non-optimal policies. This construction is transcribed in the final sketch after the table.
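
The pseudocode row points to Algorithm 1 (UID-GAIL). As a rough illustration only, here is a minimal GAIL-style discriminator update in PyTorch in which the unlabeled imperfect demonstrations enter the objective through a hypothetical mixing weight `eta`. The names `Discriminator`, `discriminator_step`, and `eta` are assumptions; the paper's actual UID objective re-weights the unlabeled samples differently and should be taken from the released code.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """GAIL-style discriminator over (state, action) pairs."""
    def __init__(self, obs_dim, act_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),  # raw logit: positive => expert-like
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def discriminator_step(disc, opt, expert_batch, unlabeled_batch,
                       policy_batch, eta=0.5):
    """One adversarial update. Optimal demos are positives and agent
    rollouts are negatives; the unlabeled imperfect demos contribute to
    both terms via the hypothetical weight `eta` (an assumption, not the
    paper's exact formulation)."""
    bce = nn.BCEWithLogitsLoss()
    d_exp = disc(*expert_batch)      # (obs, act) from optimal demos
    d_unl = disc(*unlabeled_batch)   # unlabeled imperfect demos
    d_pol = disc(*policy_batch)      # current agent rollouts
    loss = (bce(d_exp, torch.ones_like(d_exp))
            + bce(d_pol, torch.zeros_like(d_pol))
            + eta * bce(d_unl, torch.ones_like(d_unl))
            + (1.0 - eta) * bce(d_unl, torch.zeros_like(d_unl)))
    opt.zero_grad()
    loss.backward()
    opt.step()
    # The policy is then updated (e.g. with TRPO/PPO) on the surrogate
    # reward r(s, a) = -log(1 - sigmoid(D(s, a))), as in standard GAIL.
    return loss.item()
```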
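For the dataset and evaluation rows, a minimal sketch of constructing the three named MuJoCo tasks and of the stated evaluation cadence. It assumes the legacy `gym` + `mujoco-py` stack that the `-v2` task IDs imply; `policy` and `train_and_evaluate` are hypothetical stand-ins, and the agent-update step is elided.

```python
import gym
import numpy as np

# The three locomotion tasks named in the paper; "-v2" IDs belong to
# the legacy gym + mujoco-py stack (Gymnasium renamed these to "-v4").
TASKS = ["Ant-v2", "HalfCheetah-v2", "Walker2d-v2"]

def rollout_return(env, policy):
    """One evaluation episode; `policy` maps an observation to an action."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total

def train_and_evaluate(env, policy, total_transitions, eval_every=5_000):
    """Evaluate every 5,000 transitions, as the paper states, and report
    the average of the last 100 evaluations."""
    returns = []
    for step in range(1, total_transitions + 1):
        # ... collect one transition and update the agent here ...
        if step % eval_every == 0:
            returns.append(rollout_return(env, policy))
    return float(np.mean(returns[-100:]))
```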
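Finally, the experiment-setup row describes how the non-optimal experts are built, and the description transcribes almost directly into code. `optimal_policy` is a hypothetical stand-in for the trained expert πo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise scales for the three non-optimal demonstrators, as given in the paper.
XIS = [0.25, 0.4, 0.6]

def non_optimal_action(optimal_policy, obs, xi):
    """pi_n draws its action from N(a, xi^2), where a = pi_o(obs)."""
    a = np.asarray(optimal_policy(obs))
    return a + rng.normal(0.0, xi, size=a.shape)
```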