Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning
Authors: Rujie Zhong, Duohan Zhang, Lukas Schäfer, Stefano V. Albrecht, Josiah Hanna
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct policy evaluation experiments in four domains covering discrete and continuous state and action spaces: a multi-armed bandit problem [Sutton and Barto, 1998], Gridworld [Thomas and Brunskill, 2016], Cart Pole, and Continuous Cart Pole [Brockman et al., 2016]. Since these domains are widely used, we defer their descriptions to Appendix C. Our primary baseline for comparison is on-policy sampling (OS) of i.i.d. trajectories with the Monte Carlo estimator used to compute the final policy value estimate (denoted OS-MC). We also compare to BPG which finds a minimum variance behavior policy for the ordinary importance sampling (OIS) policy value estimator [Hanna et al., 2017] (denoted BPG-OIS). *(The OS-MC estimator is sketched after the table.)* |
| Researcher Affiliation | Academia | Rujie Zhong¹, Duohan Zhang², Lukas Schäfer¹, Stefano V. Albrecht¹, Josiah P. Hanna³; ¹School of Informatics, University of Edinburgh; ²Statistics Department, University of Wisconsin–Madison; ³Computer Sciences Department, University of Wisconsin–Madison |
| Pseudocode | Yes | Algorithm 1: Robust On-Policy Sampling. *(Sketched after the table.)* |
| Open Source Code | Yes | We provide an open-source implementation of ROS and all experimental data at https://github.com/uoe-agents/robust_onpolicy_data_collection. |
| Open Datasets | Yes | We conduct policy evaluation experiments in four domains covering discrete and continuous state and action spaces: a multi-armed bandit problem [Sutton and Barto, 1998], Gridworld [Thomas and Brunskill, 2016], Cart Pole, and Continuous Cart Pole [Brockman et al., 2016]. |
| Dataset Splits | No | The paper describes how data is collected and combined (e.g., initial data + additional steps) but does not specify traditional train/validation/test splits of a static dataset, as it focuses on dynamic data collection in RL environments. |
| Hardware Specification | No | Experiments use RL domains and algorithms that can be run on a typical personal computer; minimal compute resources are required to reproduce any experiment in the paper. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies. It cites PyTorch and NumPy but without version details. |
| Experiment Setup | Yes | The hyper-parameter settings for all experiments are presented in Appendix E. |
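
The OS-MC baseline quoted in the "Research Type" row pairs i.i.d. on-policy trajectories with the ordinary Monte Carlo estimator: the policy value estimate is simply the average discounted return over the collected trajectories. A minimal sketch, assuming trajectories are stored as lists of (state, action, reward) tuples; the function name and data layout here are illustrative, not taken from the authors' codebase:

```python
import numpy as np

def monte_carlo_estimate(trajectories, gamma=1.0):
    """OS-MC style estimate: average discounted return over trajectories.

    Each trajectory is a list of (state, action, reward) tuples collected
    by running the evaluation policy in the environment.
    """
    returns = []
    for traj in trajectories:
        g, discount = 0.0, 1.0
        for _state, _action, reward in traj:
            g += discount * reward
            discount *= gamma
        returns.append(g)
    return float(np.mean(returns))
```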
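
The "Pseudocode" row points to Algorithm 1, Robust On-Policy Sampling (ROS). As we read it, ROS perturbs the evaluation policy's parameters by one gradient step against the average log-likelihood of the data gathered so far, so that under-sampled actions become more likely to be drawn than under i.i.d. sampling. Below is a minimal sketch specialized to a softmax policy on the multi-armed bandit domain; the bandit specialization, function names, and step size are illustrative assumptions, and the authors' open-source implementation remains the definitive reference.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def ros_bandit_action(theta, counts, alpha=0.1, rng=np.random):
    """One ROS-style action selection for a k-armed bandit with a
    softmax evaluation policy pi_theta (illustrative sketch).

    counts[a] is how many times arm a has been pulled so far.
    """
    n = counts.sum()
    pi = softmax(theta)
    # For a softmax policy, grad_theta log pi(a) = onehot(a) - pi, so the
    # gradient of the average log-likelihood of past pulls is:
    grad = counts / max(n, 1) - pi
    # Step against that gradient: under-sampled arms gain probability.
    theta_b = theta - alpha * grad
    return rng.choice(len(theta), p=softmax(theta_b))
```

Drawing actions with `ros_bandit_action`, incrementing `counts` after each pull, and finally averaging the observed rewards (e.g., with `monte_carlo_estimate` above) yields an ROS-MC style estimate in this toy setting.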