Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning

Authors: Rujie Zhong, Duohan Zhang, Lukas Schäfer, Stefano Albrecht, Josiah Hanna

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct policy evaluation experiments in four domains covering discrete and continuous state and action spaces: a multi-armed bandit problem [Sutton and Barto, 1998], Gridworld [Thomas and Brunskill, 2016], Cart Pole, and Continuous Cart Pole [Brockman et al., 2016]. Since these domains are widely used, we defer their descriptions to Appendix C. Our primary baseline for comparison is on-policy sampling (OS) of i.i.d. trajectories with the Monte Carlo estimator used to compute the final policy value estimate (denoted OS-MC). We also compare to BPG which finds a minimum variance behavior policy for the ordinary importance sampling (OIS) policy value estimator [Hanna et al., 2017] (denoted BPG-OIS).
Researcher Affiliation | Academia | Rujie Zhong (1), Duohan Zhang (2), Lukas Schäfer (1), Stefano V. Albrecht (1), Josiah P. Hanna (3). 1: School of Informatics, University of Edinburgh; 2: Statistics Department, University of Wisconsin-Madison; 3: Computer Sciences Department, University of Wisconsin-Madison.
Pseudocode | Yes | Algorithm 1: Robust On-Policy Sampling.
Open Source Code | Yes | We provide an open-source implementation of ROS and all experimental data at https://github.com/uoe-agents/robust_onpolicy_data_collection.
Open Datasets | Yes | We conduct policy evaluation experiments in four domains covering discrete and continuous state and action spaces: a multi-armed bandit problem [Sutton and Barto, 1998], Gridworld [Thomas and Brunskill, 2016], Cart Pole, and Continuous Cart Pole [Brockman et al., 2016].
Dataset Splits | No | The paper describes how data is collected and combined (e.g., initial data + additional steps) but does not specify traditional train/validation/test splits of a static dataset, as it focuses on dynamic data collection in RL environments.
Hardware Specification | No | Experiments use RL domains and algorithms that can be run on a typical personal computer. Minimal compute resources are required to reproduce any experiment in the paper.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies. It cites PyTorch and NumPy but does not give version details.
Experiment Setup | Yes | The hyper-parameter settings for all experiments are presented in Appendix E.
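
For concreteness, here is a minimal sketch of the two policy-value estimators named in the Research Type row above: the Monte Carlo estimate used with on-policy sampling (OS-MC) and the ordinary importance sampling estimate used with a learned behavior policy (BPG-OIS). This is not the authors' code; the trajectory format and function names are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the OS-MC and BPG-OIS estimators.
import numpy as np

def monte_carlo_estimate(returns):
    """OS-MC: average the returns of trajectories sampled i.i.d. from the
    evaluation policy itself."""
    return float(np.mean(returns))

def ois_estimate(trajectories, pi_e, pi_b):
    """BPG-OIS: reweight returns of trajectories sampled from a behavior
    policy by the per-trajectory importance ratio with respect to the
    evaluation policy.

    Assumed (hypothetical) format: each trajectory is a dict with keys
    'states', 'actions', and 'return'; pi_e(s, a) and pi_b(s, a) return
    action probabilities under the evaluation and behavior policies.
    """
    estimates = []
    for traj in trajectories:
        weight = 1.0
        for s, a in zip(traj["states"], traj["actions"]):
            weight *= pi_e(s, a) / pi_b(s, a)  # ordinary importance weight
        estimates.append(weight * traj["return"])
    return float(np.mean(estimates))
```

When the behavior policy equals the evaluation policy, every importance weight is 1 and `ois_estimate` reduces to `monte_carlo_estimate`.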
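
The Pseudocode row names Algorithm 1 (Robust On-Policy Sampling) without reproducing it. The sketch below is only our schematic reading of the idea for a softmax bandit policy, not the paper's pseudocode: before each action, the behavior policy takes one gradient step of size alpha against the average log-likelihood of the data collected so far, which lowers the probability of actions that are currently over-sampled relative to the target policy. The bandit setting, the step size, and all names are assumptions.

```python
# Schematic sketch of robust on-policy data collection for a softmax bandit
# policy; written from our reading of the paper, not copied from Algorithm 1.
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def ros_collect_bandit(theta, alpha, n_steps, rng=None):
    """Collect n_steps bandit actions whose empirical frequencies track the
    target softmax policy pi_theta more closely than i.i.d. sampling would."""
    rng = np.random.default_rng(0) if rng is None else rng
    counts = np.zeros_like(theta, dtype=float)
    actions = []
    for i in range(n_steps):
        # Gradient of the average log-likelihood of past actions w.r.t. theta
        # for a softmax policy: empirical action frequencies minus pi_theta.
        grad = counts / i - softmax(theta) if i > 0 else np.zeros_like(theta)
        # One corrective step away from actions that are over-sampled so far.
        behavior_logits = theta - alpha * grad
        a = rng.choice(len(theta), p=softmax(behavior_logits))
        counts[a] += 1
        actions.append(a)
    return actions
```

Setting alpha to 0 recovers ordinary on-policy sampling, which is a useful sanity check for the sketch.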