On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations
Authors: Tim G. J. Rudner, Cong Lu, Michael A. Osborne, Yarin Gal, Yee Whye Teh
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically that the pathology occurs for commonly chosen behavioral policy classes and demonstrate its impact on sample efficiency and online policy performance. We carry out a comparative empirical evaluation of our proposed approach vis-à-vis related methods that integrate offline data into online training. We perform experiments on the MuJoCo benchmark suite and the substantially more challenging dexterous hand manipulation tasks. |
| Researcher Affiliation | Academia | Tim G. J. Rudner, University of Oxford; Cong Lu, University of Oxford; Michael A. Osborne, University of Oxford; Yarin Gal, University of Oxford; Yee Whye Teh, University of Oxford |
| Pseudocode | Yes | Algorithmic details. In our experiments, we use a KL-regularized objective with a standard actor-critic implementation and Double DQN [14]. Pseudocode is provided in Appendix C.1. (A hedged sketch of such a KL-regularized actor update is given below the table.) |
| Open Source Code | Yes | Code and visualizations of our results can be found at https://sites.google.com/view/nppac. |
| Open Datasets | Yes | We use the expert data from Nair et al. [27], every experiment uses six random seeds, and we use a fixed KL-temperature for each environment class. MuJoCo locomotion tasks. We evaluate N-PPAC on three representative tasks: Ant-v2, HalfCheetah-v2, and Walker2d-v2. |
| Dataset Splits | No | The paper specifies the number of expert demonstration trajectories used (e.g., '15 demonstration trajectories collected by a pre-trained expert, each containing 1,000 steps' or '25 expert demonstrations...each consisting of 200 environment steps'), but it does not provide explicit training/validation/test dataset splits for this data. |
| Hardware Specification | Yes | computed on a GeForce RTX 3080 GPU. |
| Software Dependencies | No | The paper mentions simulation environments like MuJoCo and algorithms such as Double DQN and SAC, but it does not specify software versions for libraries, frameworks, or programming languages used in the implementation (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We use the expert data from Nair et al. [27], every experiment uses six random seeds, and we use a fixed KL-temperature for each environment class. For further implementation details, see Appendix C.2. Specifically, we set the behavioral policy's predictive variance to different constant values in the set {1×10⁻³, 5×10⁻³, 1×10⁻²} (following a similar implementation in Nair et al. [27]). (A constant-variance prior sketch is given below the table.) |
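
The Pseudocode row describes a KL-regularized objective with a standard actor-critic implementation. The sketch below illustrates what such an actor loss can look like: maximize the Q-value of sampled actions while penalizing KL divergence from a behavioral prior fit to expert demonstrations. This is a minimal sketch under assumed interfaces, not the authors' implementation (their code is linked above); `policy`, `behavioral_prior`, `q_network`, and `kl_temperature` are hypothetical placeholder names.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Hedged sketch of a KL-regularized actor loss:
#   maximize  E[ Q(s, a) - alpha * KL( pi(.|s) || pi_b(.|s) ) ]
# where pi_b is a behavioral prior trained offline on expert demonstrations.
# All network interfaces below are assumptions for illustration.

def kl_regularized_actor_loss(policy, behavioral_prior, q_network, states, kl_temperature):
    """Actor loss = temperature-weighted KL to the prior minus the Q-value (sketch)."""
    # Current (diagonal Gaussian) policy over actions.
    mean, log_std = policy(states)
    pi = Normal(mean, log_std.exp())

    # Behavioral prior, frozen during online training.
    with torch.no_grad():
        prior_mean, prior_log_std = behavioral_prior(states)
    pi_b = Normal(prior_mean, prior_log_std.exp())

    # Reparameterized sample so gradients flow through the Q-network.
    actions = pi.rsample()
    q_values = q_network(states, actions).squeeze(-1)

    # KL(pi || pi_b), summed over action dimensions.
    kl = kl_divergence(pi, pi_b).sum(dim=-1)

    # Minimize the negated regularized objective.
    return (kl_temperature * kl - q_values).mean()
```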
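The Experiment Setup row states that the behavioral policy's predictive variance is fixed to a constant from {1×10⁻³, 5×10⁻³, 1×10⁻²}. A minimal sketch of such a constant-variance behavioral prior follows; `mean_network` and `FIXED_VARIANCE` are assumed names, and the chosen value is only an example taken from that grid, not the paper's selected setting.

```python
import torch
from torch.distributions import Normal

# Hedged sketch: a behavioral prior whose mean comes from a network fit to
# expert demonstrations, with the predictive variance pinned to a constant.
FIXED_VARIANCE = 5e-3  # assumed example value from the grid above

def behavioral_prior(mean_network, states, variance=FIXED_VARIANCE):
    """Return a Normal prior with a learned mean and a fixed constant variance."""
    with torch.no_grad():
        mean = mean_network(states)
    std = torch.full_like(mean, variance ** 0.5)  # standard deviation = sqrt(variance)
    return Normal(mean, std)
```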