Off-Policy Imitation Learning from Observations

Authors: Zhuangdi Zhu, Kaixiang Lin, Bo Dai, Jiayu Zhou

NeurIPS 2020

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive empirical results on challenging locomotion tasks indicate that our approach is comparable with state-of-the-art in terms of both sample-efficiency and asymptotic performance." (Abstract) "We compare OPOLO against state-of-the-art LfD and LfO approaches on MuJoCo benchmarks, which are locomotion tasks in continuous state-action space." (Section 5, Experiments) |
| Researcher Affiliation | Collaboration | Zhuangdi Zhu (Michigan State University, zhuzhuan@msu.edu); Kaixiang Lin (Michigan State University, linkaixi@msu.edu); Bo Dai (Google Research, bodai@google.com); Jiayu Zhou (Michigan State University, jiayuz@msu.edu) |
| Pseudocode | Yes | "Algorithm 1: Off-POlicy Learning from Observations (OPOLO)" (the unusual capitalization spells out the acronym). A hedged sketch of this training loop appears below the table. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | "We compare OPOLO against state-of-the-art LfD and LfO approaches on MuJoCo benchmarks, which are locomotion tasks in continuous state-action space." |
| Dataset Splits | No | The paper states that experiments are conducted on "MuJoCo benchmarks" and that "For each task, we collect 4 trajectories from a pre-trained expert policy", but it does not specify explicit training, validation, or test splits (e.g., percentages or sample counts). A sketch of the collection step appears below the table. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed machine specifications) used for its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiments. |
| Experiment Setup | No | The paper mentions general aspects of the setup: 4 expert trajectories are collected, original rewards are removed, and results are evaluated across 5 random seeds; Algorithm 1 also lists a learning rate α. However, the main text gives no concrete hyperparameter values (e.g., learning rate, batch size, number of epochs), stating that "More experimental details can be found in the supplementary material." A seed-evaluation sketch appears below the table. |
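
To make the pseudocode row concrete, here is a minimal sketch of the training loop that Algorithm 1 describes. It is not the authors' code (none is released, per the table above): the component interfaces (`agent`, `discriminator`, `inverse_model`, `buffer`) and the batch size are assumptions, and a simple discriminator-based pseudo-reward stands in for the paper's actual off-policy objective. Only the overall structure follows the paper: reward-free interaction, a discriminator over (s, s') transitions since expert actions are unobserved, an off-policy update on pseudo-rewards, and an inverse action model used to regularize the policy toward expert transitions.

```python
# Hedged sketch of the OPOLO training loop (Algorithm 1), not the authors' code.
# All component interfaces below are hypothetical stand-ins.
import numpy as np

def opolo_train(env, agent, discriminator, inverse_model,
                buffer, expert_data, num_steps=1_000_000, batch_size=256):
    state, _ = env.reset()
    for _ in range(num_steps):
        # Reward-free interaction: the environment reward is discarded.
        action = agent.act(state)
        next_state, _, terminated, truncated, _ = env.step(action)
        buffer.add(state, action, next_state)
        state = env.reset()[0] if (terminated or truncated) else next_state

        batch = buffer.sample(batch_size)
        # The discriminator separates expert from agent (s, s') pairs;
        # expert actions are unobserved in the LfO setting.
        discriminator.update(expert_data.sample(batch_size), batch)
        # Its output becomes a pseudo-reward for an off-policy RL update
        # (a GAIfO-style stand-in for the paper's objective).
        pseudo_reward = -np.log(
            discriminator.prob_agent(batch.state, batch.next_state) + 1e-8)
        agent.update(batch.state, batch.action, pseudo_reward, batch.next_state)
        # An inverse action model, fit on agent data, infers expert actions
        # so the policy can be regularized on expert state transitions.
        inverse_model.update(batch.state, batch.action, batch.next_state)
        demo = expert_data.sample(batch_size)
        agent.behavior_clone(demo.state,
                             inverse_model.predict(demo.state, demo.next_state))
```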
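
The dataset-splits row quotes the paper's demonstration protocol: 4 trajectories are collected from a pre-trained expert, with rewards removed. Below is a minimal sketch of that collection step, assuming a Gymnasium MuJoCo environment; the environment id and the `expert_policy` callable are placeholders, not the authors' setup.

```python
# Hedged sketch: collecting reward-free expert trajectories on MuJoCo.
import gymnasium as gym

def collect_trajectories(expert_policy, env_id="HalfCheetah-v4", n_traj=4):
    env = gym.make(env_id)
    trajectories = []
    for _ in range(n_traj):
        obs, _ = env.reset()
        traj, done = [], False
        while not done:
            action = expert_policy(obs)
            next_obs, _reward, terminated, truncated, _ = env.step(action)
            # Store only (s, s'): in the LfO setting the learner never sees
            # expert actions, and the original rewards are dropped entirely.
            traj.append((obs, next_obs))
            obs = next_obs
            done = terminated or truncated
        trajectories.append(traj)
    env.close()
    return trajectories
```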
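
Finally, the experiment-setup row notes that results are evaluated across 5 random seeds. A small sketch of that protocol, where `train_and_evaluate` is a hypothetical function wrapping one full training run and returning its final evaluation return:

```python
# Hedged sketch of the 5-seed evaluation protocol reported in the paper.
import numpy as np

def evaluate_over_seeds(train_and_evaluate, seeds=(0, 1, 2, 3, 4)):
    # One full training job per seed; report mean and standard deviation.
    returns = [train_and_evaluate(seed=s) for s in seeds]
    return float(np.mean(returns)), float(np.std(returns))
```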