On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations

Authors: Tim G. J. Rudner, Cong Lu, Michael A. Osborne, Yarin Gal, Yee Whye Teh

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically that the pathology occurs for commonly chosen behavioral policy classes and demonstrate its impact on sample efficiency and online policy performance. We carry out a comparative empirical evaluation of our proposed approach vis-à-vis related methods that integrate offline data into online training. We perform experiments on the MuJoCo benchmark suite and the substantially more challenging dexterous hand manipulation tasks.
Researcher Affiliation | Academia | Tim G. J. Rudner (University of Oxford), Cong Lu (University of Oxford), Michael A. Osborne (University of Oxford), Yarin Gal (University of Oxford), Yee Whye Teh (University of Oxford)
Pseudocode | Yes | Algorithmic details. In our experiments, we use a KL-regularized objective with a standard actor-critic implementation and Double DQN [14]. Pseudocode is provided in Appendix C.1.
Open Source Code | Yes | Code and visualizations of our results can be found at https://sites.google.com/view/nppac.
Open Datasets | Yes | We use the expert data from Nair et al. [27], every experiment uses six random seeds, and we use a fixed KL-temperature for each environment class. MuJoCo locomotion tasks: we evaluate N-PPAC on three representative tasks, Ant-v2, HalfCheetah-v2, and Walker2d-v2.
Dataset Splits | No | The paper specifies the number of expert demonstration trajectories used (e.g., '15 demonstration trajectories collected by a pre-trained expert, each containing 1,000 steps' or '25 expert demonstrations...each consisting of 200 environment steps'), but it does not provide explicit training/validation/test dataset splits for this data.
Hardware Specification | Yes | computed on a GeForce RTX 3080 GPU.
Software Dependencies | No | The paper mentions simulation environments such as MuJoCo and algorithms such as Double DQN and SAC, but it does not specify software versions for the libraries, frameworks, or programming languages used in the implementation (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We use the expert data from Nair et al. [27], every experiment uses six random seeds, and we use a fixed KL-temperature for each environment class. For further implementation details, see Appendix C.2. Specifically, we set the behavioral policy's predictive variance to different constant values in the set {1 × 10^-3, 5 × 10^-3, 1 × 10^-2} (following a similar implementation in Nair et al. [27]).
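
To make the quoted algorithmic and setup details concrete, the following is a minimal, illustrative sketch (not the authors' released code) of a KL-regularized actor loss in PyTorch, where the behavioral prior is a diagonal Gaussian whose mean is predicted from expert demonstrations and whose predictive variance is fixed to one of the constants listed above. All names (policy, prior_mean_net, q_net, kl_temperature, prior_variance) are assumptions made for illustration.

```python
# Minimal, illustrative sketch of a KL-regularized actor loss (assumed PyTorch);
# not the authors' implementation. The behavioral prior pi_0 is a diagonal
# Gaussian whose mean is predicted from expert demonstrations and whose
# predictive variance is fixed to a constant (e.g. 1e-3, 5e-3, or 1e-2).
import torch


def diag_gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, std_p^2) || N(mu_q, std_q^2) ) for diagonal Gaussians."""
    var_p, var_q = std_p.pow(2), std_q.pow(2)
    kl_per_dim = (torch.log(std_q / std_p)
                  + (var_p + (mu_p - mu_q).pow(2)) / (2.0 * var_q)
                  - 0.5)
    return kl_per_dim.sum(dim=-1)


def kl_regularized_actor_loss(policy, prior_mean_net, q_net, states,
                              kl_temperature=0.1, prior_variance=1e-3):
    """Actor loss for max_pi E[ Q(s, a) ] - alpha * KL( pi(.|s) || pi_0(.|s) ).

    `policy`, `prior_mean_net`, and `q_net` are hypothetical callables:
    policy(states) -> (mean, std) of the current Gaussian policy,
    prior_mean_net(states) -> mean of the behavioral prior,
    q_net(states, actions) -> Q-value estimates.
    """
    mu, std = policy(states)                      # current policy parameters
    actions = mu + std * torch.randn_like(std)    # reparameterized action sample
    q_values = q_net(states, actions)

    prior_mu = prior_mean_net(states)             # prior mean from demonstrations
    prior_std = torch.full_like(std, prior_variance ** 0.5)  # fixed predictive std

    kl = diag_gaussian_kl(mu, std, prior_mu, prior_std)
    # Descending this loss maximizes Q while penalizing divergence from the
    # behavioral prior, weighted by the fixed KL temperature.
    return (kl_temperature * kl - q_values.squeeze(-1)).mean()
```

The closed-form diagonal-Gaussian KL is used here only because both the current policy and the fixed-variance behavioral prior are Gaussian in this sketch; other behavioral policy classes discussed in the paper would require a different (e.g. sampled) KL estimate.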