Policy Optimization with Demonstrations
Authors: Bingyi Kang, Zequn Jie, Jiashi Feng
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement. Furthermore, it can be combined with policy gradient methods to produce state-of-the-art results, as demonstrated experimentally on a range of popular benchmark sparse-reward tasks, even when the demonstrations are few and imperfect. |
| Researcher Affiliation | Collaboration | 1Department of Electrical and Computer Engineering, National University of Singapore, Singapore 2Tencent AI Lab, China. |
| Pseudocode | Yes | Algorithm 1: Policy optimization with demonstrations (a hedged code sketch of this procedure appears after the table). |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code for its methodology. |
| Open Datasets | Yes | To comprehensively assess our method, we conduct extensive experiments on eight widely used physical control tasks, ranging from low-dimensional ones such as cartpole (Barto et al., 1983) and mountain car (Moore, 1990) to high-dimensional and naturally sparse environments based on OpenAI Gym (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012). |
| Dataset Splits | No | The paper does not describe explicit dataset splits (e.g., train/validation/test) for its experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments. |
| Software Dependencies | No | Implementation Details Due to space limit, we defer implementation details to the supplementary material. |
| Experiment Setup | No | Implementation Details Due to space limit, we defer implementation details to the supplementary material. |
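The Research Type and Pseudocode rows above describe POfD as implicit dynamic reward shaping from demonstrations combined with policy gradient methods (Algorithm 1). The paper itself does not release code, so the following is only a minimal sketch of one plausible reading of that scheme: a GAIL-style discriminator separates agent transitions from demonstration transitions, the environment reward is reshaped with a demonstration-matching bonus of the form r'(s, a) = r(s, a) − λ·log D(s, a), and the policy is updated on the reshaped return. The names (`policy_net`, `disc_net`, `lam`), the exact sign convention, and the use of a plain REINFORCE update instead of the trust-region policy-gradient update in the paper are all assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of POfD-style training; not the authors' code.
# Assumption: agent pairs get label 1, demonstration pairs label 0,
# and the shaped reward is r'(s, a) = r(s, a) - lam * log D(s, a).
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, lam, gamma = 4, 2, 0.1, 0.99  # illustrative sizes/hyperparameters

policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
disc_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
pi_opt = torch.optim.Adam(policy_net.parameters(), lr=3e-4)
d_opt = torch.optim.Adam(disc_net.parameters(), lr=3e-4)

def disc_prob(s, a_onehot):
    """Probability (per the assumed convention) that a (state, action) pair came from the agent."""
    return torch.sigmoid(disc_net(torch.cat([s, a_onehot], dim=-1)))

def update(agent_s, agent_a, env_r, demo_s, demo_a):
    """One POfD-style update from an agent rollout and a batch of demonstration pairs."""
    a_onehot = F.one_hot(agent_a, act_dim).float()
    e_onehot = F.one_hot(demo_a, act_dim).float()

    # 1) Discriminator step: agent pairs -> 1, demonstration pairs -> 0.
    d_loss = F.binary_cross_entropy(disc_prob(agent_s, a_onehot),
                                    torch.ones(len(agent_s), 1)) + \
             F.binary_cross_entropy(disc_prob(demo_s, e_onehot),
                                    torch.zeros(len(demo_s), 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Reward shaping: sparse environment reward plus -log D(s, a) bonus.
    with torch.no_grad():
        shaped_r = env_r - lam * torch.log(disc_prob(agent_s, a_onehot).squeeze(-1) + 1e-8)

    # Discounted returns over one trajectory (simplified; no baseline).
    returns, g = torch.zeros_like(shaped_r), 0.0
    for t in reversed(range(len(shaped_r))):
        g = shaped_r[t] + gamma * g
        returns[t] = g

    # 3) Policy-gradient step (REINFORCE stand-in for the paper's trust-region update).
    logp = torch.distributions.Categorical(logits=policy_net(agent_s)).log_prob(agent_a)
    pi_loss = -(logp * returns).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```

In this reading, the demonstration-matching bonus is what makes the shaping "dynamic": the discriminator is retrained alongside the policy, so the auxiliary reward changes as the agent's occupancy measure moves toward the demonstrations, while the environment reward term keeps the true task objective in place.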