Off-Policy Imitation Learning from Observations
Authors: Zhuangdi Zhu, Kaixiang Lin, Bo Dai, Jiayu Zhou
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 'Extensive empirical results on challenging locomotion tasks indicate that our approach is comparable with state-of-the-art in terms of both sample-efficiency and asymptotic performance.' (Abstract) and 'We compare OPOLO against state-of-the-art LfD and LfO approaches on MuJoCo benchmarks, which are locomotion tasks in continuous state-action space.' (Section 5, Experiments) |
| Researcher Affiliation | Collaboration | Zhuangdi Zhu (Michigan State University, zhuzhuan@msu.edu); Kaixiang Lin (Michigan State University, linkaixi@msu.edu); Bo Dai (Google Research, bodai@google.com); Jiayu Zhou (Michigan State University, jiayuz@msu.edu) |
| Pseudocode | Yes | Algorithm 1 Off-POlicy Learning from Observations (OPOLO). (A generic, hedged off-policy LfO skeleton is sketched after this table.) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | We compare OPOLO against state-of-the-art LfD and LfO approaches on MuJoCo benchmarks, which are locomotion tasks in continuous state-action space. |
| Dataset Splits | No | The paper states that experiments are conducted on 'MuJoCo benchmarks' and that 'For each task, we collect 4 trajectories from a pre-trained expert policy', but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | No | The paper describes general aspects of the experimental setup, such as collecting 4 expert trajectories, removing the original rewards, and evaluating results across 5 random seeds; Algorithm 1 also lists a learning rate α as an input. However, it does not give specific numerical values for hyperparameters (e.g., learning rate, batch size, number of epochs) in the main text, stating instead that 'More experimental details can be found in the supplementary material'. (A hedged sketch of the stated protocol follows this table.) |
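The Pseudocode row above only confirms that Algorithm 1 is given in the paper. For orientation, the sketch below shows the generic adversarial learning-from-observations pattern that OPOLO belongs to: a discriminator trained on state-transition pairs (s, s') supplies a pseudo-reward to an off-policy RL learner. This is not the paper's exact OPOLO objective; the network sizes, the `-log(1 - D)` reward form, and the use of PyTorch are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionDiscriminator(nn.Module):
    """Scores (s, s') transition pairs. Generic adversarial-LfO component,
    not the exact objective used in the OPOLO paper."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

def discriminator_step(disc, opt, expert_batch, agent_batch):
    """One adversarial update: expert transitions labeled 1, agent transitions labeled 0."""
    bce = nn.BCEWithLogitsLoss()
    logits_e = disc(*expert_batch)   # expert_batch = (s, s') tensors from demonstrations
    logits_a = disc(*agent_batch)    # agent_batch  = (s, s') tensors from the replay buffer
    loss = bce(logits_e, torch.ones_like(logits_e)) + bce(logits_a, torch.zeros_like(logits_a))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def pseudo_reward(disc, s, s_next):
    """Reward surrogate fed to an off-policy RL learner (e.g., a SAC/TD3-style agent).
    -log(1 - D(s, s')) is one common choice, used here purely for illustration."""
    with torch.no_grad():
        return -F.logsigmoid(-disc(s, s_next))
```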
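The Dataset Splits and Experiment Setup rows quote the paper's protocol: 4 trajectories collected from a pre-trained expert policy per task, original rewards removed, and results reported over 5 random seeds. Below is a minimal sketch of that protocol, assuming the classic Gym API (4-tuple `env.step`, `env.seed`), a MuJoCo task name such as "HalfCheetah-v2", and hypothetical `expert_policy` / `train_fn` callables; none of these specifics come from the paper.

```python
import numpy as np
import gym  # classic Gym API (obs, reward, done, info), as commonly used circa 2020

def collect_state_only_trajectories(env, expert_policy, n_traj=4, max_steps=1000):
    """Roll out a pretrained expert and keep only the states (no actions, no rewards),
    matching the learning-from-observations setting the paper describes."""
    trajectories = []
    for _ in range(n_traj):
        states = [env.reset()]
        for _ in range(max_steps):
            action = expert_policy(states[-1])                    # hypothetical expert
            next_state, _reward, done, _info = env.step(action)   # reward is discarded
            states.append(next_state)
            if done:
                break
        trajectories.append(np.asarray(states))
    return trajectories

def evaluate_over_seeds(train_fn, env_name="HalfCheetah-v2", seeds=(0, 1, 2, 3, 4)):
    """Repeat training and evaluation across 5 random seeds and report mean/std,
    mirroring the paper's statement that results are averaged over 5 seeds."""
    returns = []
    for seed in seeds:
        env = gym.make(env_name)
        env.seed(seed)
        np.random.seed(seed)
        returns.append(train_fn(env, seed))                        # hypothetical training routine
    return float(np.mean(returns)), float(np.std(returns))
```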