Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning
Authors: Rujie Zhong, Duohan Zhang, Lukas Schäfer, Stefano Albrecht, Josiah Hanna
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct policy evaluation experiments in four domains covering discrete and continuous state and action spaces: a multi-armed bandit problem [Sutton and Barto, 1998], Gridworld [Thomas and Brunskill, 2016], Cart Pole, and Continuous Cart Pole [Brockman et al., 2016]. Since these domains are widely used, we defer their descriptions to Appendix C. Our primary baseline for comparison is on-policy sampling (OS) of i.i.d. trajectories with the Monte Carlo estimator used to compute the final policy value estimate (denoted OS-MC). We also compare to BPG which finds a minimum variance behavior policy for the ordinary importance sampling (OIS) policy value estimator [Hanna et al., 2017] (denoted BPG-OIS). |
| Researcher Affiliation | Academia | Rujie Zhong¹, Duohan Zhang², Lukas Schäfer¹, Stefano V. Albrecht¹, Josiah P. Hanna³. ¹ School of Informatics, University of Edinburgh; ² Statistics Department, University of Wisconsin-Madison; ³ Computer Sciences Department, University of Wisconsin-Madison |
| Pseudocode | Yes | Algorithm 1 Robust On-Policy Sampling. |
| Open Source Code | Yes | We provide an open-source implementation of ROS and all experimental data at https://github.com/uoe-agents/robust_onpolicy_data_collection. |
| Open Datasets | Yes | We conduct policy evaluation experiments in four domains covering discrete and continuous state and action spaces: a multi-armed bandit problem [Sutton and Barto, 1998], Gridworld [Thomas and Brunskill, 2016], Cart Pole, and Continuous Cart Pole [Brockman et al., 2016]. |
| Dataset Splits | No | The paper describes how data is collected and combined (e.g., initial data + additional steps) but does not specify traditional train/validation/test splits of a static dataset, as it focuses on dynamic data collection in RL environments. |
| Hardware Specification | No | Experiments use RL domains and algorithms that can be run on a typical personal computer. Minimal compute resources are required to reproduce any experiment in the paper. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies. It cites PyTorch and NumPy but without version details. |
| Experiment Setup | Yes | The hyper-parameter settings for all experiments are presented in Appendix E. |
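
The two baseline estimators named in the Research Type row can be illustrated on a toy multi-armed bandit. The following is a minimal sketch, not the paper's code: the arm means, the policies, the sample size, and all function names are hypothetical choices made for illustration. The behavior policy is simply uniform, so this is not the minimum-variance policy that BPG would find, and the sketch does not implement ROS.

```python
import numpy as np


def sample_bandit(policy, arm_means, n, rng):
    """Draw n (action, reward) pairs with actions sampled i.i.d. from `policy`."""
    actions = rng.choice(len(policy), size=n, p=policy)
    # Assumption: unit-variance Gaussian rewards around each arm's mean.
    rewards = rng.normal(loc=arm_means[actions], scale=1.0)
    return actions, rewards


def mc_estimate(rewards):
    """Monte Carlo estimate: the plain average of on-policy rewards."""
    return float(np.mean(rewards))


def ois_estimate(actions, rewards, eval_policy, behavior_policy):
    """Ordinary importance sampling: reweight off-policy rewards by pi(a) / b(a)."""
    weights = eval_policy[actions] / behavior_policy[actions]
    return float(np.mean(weights * rewards))


rng = np.random.default_rng(0)

# Hypothetical three-armed bandit and policies.
arm_means = np.array([1.0, 2.0, 3.0])
eval_policy = np.array([0.2, 0.3, 0.5])
behavior_policy = np.array([1 / 3, 1 / 3, 1 / 3])

# Exact value of the evaluation policy, used as the reference.
true_value = float(eval_policy @ arm_means)

# On-policy sampling with the Monte Carlo estimator (OS-MC in the table above).
_, rewards_on = sample_bandit(eval_policy, arm_means, 10_000, rng)
os_mc = mc_estimate(rewards_on)

# Off-policy sampling with ordinary importance sampling (OIS).
actions_off, rewards_off = sample_bandit(behavior_policy, arm_means, 10_000, rng)
ois = ois_estimate(actions_off, rewards_off, eval_policy, behavior_policy)

print(f"true value: {true_value:.3f}")
print(f"OS-MC estimate: {os_mc:.3f}")
print(f"OIS estimate: {ois:.3f}")
```

Both estimators are unbiased for the evaluation policy's value in this setting; they differ in variance, which depends on how the data-collecting policy is chosen.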