Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning
Authors: Rujie Zhong, Duohan Zhang, Lukas Schäfer, Stefano V. Albrecht, Josiah Hanna
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct policy evaluation experiments in four domains covering discrete and continuous state and action spaces: a multi-armed bandit problem [Sutton and Barto, 1998], Gridworld [Thomas and Brunskill, 2016], Cart Pole, and Continuous Cart Pole [Brockman et al., 2016]. Since these domains are widely used, we defer their descriptions to Appendix C. Our primary baseline for comparison is on-policy sampling (OS) of i.i.d. trajectories with the Monte Carlo estimator used to compute the final policy value estimate (denoted OS-MC). We also compare to BPG which finds a minimum variance behavior policy for the ordinary importance sampling (OIS) policy value estimator [Hanna et al., 2017] (denoted BPG-OIS). *(The OS-MC estimator is sketched after the table.)* |
| Researcher Affiliation | Academia | Rujie Zhong¹, Duohan Zhang², Lukas Schäfer¹, Stefano V. Albrecht¹, Josiah P. Hanna³; ¹School of Informatics, University of Edinburgh; ²Statistics Department, University of Wisconsin–Madison; ³Computer Sciences Department, University of Wisconsin–Madison |
| Pseudocode | Yes | Algorithm 1: Robust On-Policy Sampling. *(Sketched after the table.)* |
| Open Source Code | Yes | We provide an open-source implementation of ROS and all experimental data at https://github.com/uoe-agents/robust_onpolicy_data_collection. |
| Open Datasets | Yes | We conduct policy evaluation experiments in four domains covering discrete and continuous state and action spaces: a multi-armed bandit problem [Sutton and Barto, 1998], Gridworld [Thomas and Brunskill, 2016], Cart Pole, and Continuous Cart Pole [Brockman et al., 2016]. |
| Dataset Splits | No | The paper describes how data is collected and combined (e.g., initial data + additional steps) but does not specify traditional train/validation/test splits of a static dataset, as it focuses on dynamic data collection in RL environments. |
| Hardware Specification | No | Experiments use RL domains and algorithms that can be run on a typical personal computer; minimal compute resources are required to reproduce any experiment in the paper. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies. It cites PyTorch and NumPy but without version details. |
| Experiment Setup | Yes | The hyper-parameter settings for all experiments are presented in Appendix E. |
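
The OS-MC baseline quoted in the "Research Type" row pairs i.i.d. on-policy trajectories with the ordinary Monte Carlo estimator: the policy value estimate is simply the average discounted return over the collected trajectories. A minimal sketch, assuming trajectories are stored as lists of (state, action, reward) tuples; the function name and data layout here are illustrative, not taken from the authors' codebase:

```python
import numpy as np

def monte_carlo_estimate(trajectories, gamma=1.0):
    """OS-MC style estimate: average discounted return over trajectories.

    Each trajectory is a list of (state, action, reward) tuples collected
    by running the evaluation policy in the environment.
    """
    returns = []
    for traj in trajectories:
        g, discount = 0.0, 1.0
        for _state, _action, reward in traj:
            g += discount * reward
            discount *= gamma
        returns.append(g)
    return float(np.mean(returns))
```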
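
The "Pseudocode" row points to Algorithm 1, Robust On-Policy Sampling (ROS). As we read it, ROS perturbs the evaluation policy's parameters by one gradient step against the average log-likelihood of the data gathered so far, so that under-sampled actions become more likely to be drawn than under i.i.d. sampling. Below is a minimal sketch specialized to a softmax policy on the multi-armed bandit domain; the bandit specialization, function names, and step size are illustrative assumptions, and the authors' open-source implementation remains the definitive reference.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def ros_bandit_action(theta, counts, alpha=0.1, rng=np.random):
    """One ROS-style action selection for a k-armed bandit with a
    softmax evaluation policy pi_theta (illustrative sketch).

    counts[a] is how many times arm a has been pulled so far.
    """
    n = counts.sum()
    pi = softmax(theta)
    # For a softmax policy, grad_theta log pi(a) = onehot(a) - pi, so the
    # gradient of the average log-likelihood of past pulls is:
    grad = counts / max(n, 1) - pi
    # Step against that gradient: under-sampled arms gain probability.
    theta_b = theta - alpha * grad
    return rng.choice(len(theta), p=softmax(theta_b))
```

Drawing actions with `ros_bandit_action`, incrementing `counts` after each pull, and finally averaging the observed rewards (e.g., with `monte_carlo_estimate` above) yields an ROS-MC style estimate in this toy setting.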