On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations
Authors: Tim G. J. Rudner, Cong Lu, Michael A. Osborne, Yarin Gal, Yee Whye Teh
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically that the pathology occurs for commonly chosen behavioral policy classes and demonstrate its impact on sample efficiency and online policy performance. We carry out a comparative empirical evaluation of our proposed approach vis-à-vis related methods that integrate offline data into online training. We perform experiments on the MuJoCo benchmark suite and the substantially more challenging dexterous hand manipulation tasks. |
| Researcher Affiliation | Academia | Tim G. J. Rudner, University of Oxford; Cong Lu, University of Oxford; Michael A. Osborne, University of Oxford; Yarin Gal, University of Oxford; Yee Whye Teh, University of Oxford |
| Pseudocode | Yes | Algorithmic details. In our experiments, we use a KL-regularized objective with a standard actor-critic implementation and Double DQN [14]. Pseudocode is provided in Appendix C.1. (A hedged sketch of such a KL-regularized actor update is given below the table.) |
| Open Source Code | Yes | Code and visualizations of our results can be found at https://sites.google.com/view/nppac. |
| Open Datasets | Yes | We use the expert data from Nair et al. [27], every experiment uses six random seeds, and we use a fixed KL-temperature for each environment class. MuJoCo locomotion tasks. We evaluate N-PPAC on three representative tasks: Ant-v2, HalfCheetah-v2, and Walker2d-v2. |
| Dataset Splits | No | The paper specifies the number of expert demonstration trajectories used (e.g., '15 demonstration trajectories collected by a pre-trained expert, each containing 1,000 steps' or '25 expert demonstrations...each consisting of 200 environment steps'), but it does not provide explicit training/validation/test dataset splits for this data. |
| Hardware Specification | Yes | computed on a GeForce RTX 3080 GPU. |
| Software Dependencies | No | The paper mentions simulation environments like MuJoCo and algorithms such as Double DQN and SAC, but it does not specify software versions for libraries, frameworks, or programming languages used in the implementation (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We use the expert data from Nair et al. [27], every experiment uses six random seeds, and we use a fixed KL-temperature for each environment class. For further implementation details, see Appendix C.2. Specifically, we set the behavioral policy's predictive variance to different constant values in the set {1×10⁻³, 5×10⁻³, 1×10⁻²} (following a similar implementation in Nair et al. [27]). (A constant-variance prior sketch is given below the table.) |
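
The Pseudocode row describes a KL-regularized objective with a standard actor-critic implementation. The sketch below illustrates what such an actor loss can look like: maximize the Q-value of sampled actions while penalizing KL divergence from a behavioral prior fit to expert demonstrations. This is a minimal sketch under assumed interfaces, not the authors' implementation (their code is linked above); `policy`, `behavioral_prior`, `q_network`, and `kl_temperature` are hypothetical placeholder names.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Hedged sketch of a KL-regularized actor loss:
#   maximize  E[ Q(s, a) - alpha * KL( pi(.|s) || pi_b(.|s) ) ]
# where pi_b is a behavioral prior trained offline on expert demonstrations.
# All network interfaces below are assumptions for illustration.

def kl_regularized_actor_loss(policy, behavioral_prior, q_network, states, kl_temperature):
    """Actor loss = temperature-weighted KL to the prior minus the Q-value (sketch)."""
    # Current (diagonal Gaussian) policy over actions.
    mean, log_std = policy(states)
    pi = Normal(mean, log_std.exp())

    # Behavioral prior, frozen during online training.
    with torch.no_grad():
        prior_mean, prior_log_std = behavioral_prior(states)
    pi_b = Normal(prior_mean, prior_log_std.exp())

    # Reparameterized sample so gradients flow through the Q-network.
    actions = pi.rsample()
    q_values = q_network(states, actions).squeeze(-1)

    # KL(pi || pi_b), summed over action dimensions.
    kl = kl_divergence(pi, pi_b).sum(dim=-1)

    # Minimize the negated regularized objective.
    return (kl_temperature * kl - q_values).mean()
```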
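The Experiment Setup row states that the behavioral policy's predictive variance is fixed to a constant from {1×10⁻³, 5×10⁻³, 1×10⁻²}. A minimal sketch of such a constant-variance behavioral prior follows; `mean_network` and `FIXED_VARIANCE` are assumed names, and the chosen value is only an example taken from that grid, not the paper's selected setting.

```python
import torch
from torch.distributions import Normal

# Hedged sketch: a behavioral prior whose mean comes from a network fit to
# expert demonstrations, with the predictive variance pinned to a constant.
FIXED_VARIANCE = 5e-3  # assumed example value from the grid above

def behavioral_prior(mean_network, states, variance=FIXED_VARIANCE):
    """Return a Normal prior with a learned mean and a fixed constant variance."""
    with torch.no_grad():
        mean = mean_network(states)
    std = torch.full_like(mean, variance ** 0.5)  # standard deviation = sqrt(variance)
    return Normal(mean, std)
```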