Data-Efficient Policy Evaluation Through Behavior Policy Search

Authors: Josiah P. Hanna, Philip S. Thomas, Peter Stone, Scott Niekum

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "This section presents an empirical study of variance reduction through behavior policy search. We design our experiments to answer the following questions: Can behavior policy search with BPG reduce policy evaluation MSE compared to on-policy estimates in both tabular and continuous domains? ... We present a behavior policy search algorithm and empirically demonstrate its effectiveness in lowering the mean squared error of policy performance estimates."
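For context on the quantity whose MSE the paper targets, below is a minimal sketch of the ordinary importance-sampling (IS) estimator of the evaluation policy's expected return. The function and argument names are illustrative only (not from the paper's code), and it assumes trajectories are stored as (state, action, reward) triples.

```python
import numpy as np

def is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance-sampling estimate of the evaluation policy's
    expected return from trajectories generated by a behavior policy.

    trajectories: list of trajectories, each a list of (s, a, r) triples
    pi_e, pi_b:   callables mapping (state, action) -> action probability
    """
    weighted_returns = []
    for trajectory in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for s, a, r in trajectory:
            weight *= pi_e(s, a) / pi_b(s, a)  # cumulative likelihood ratio
            ret += discount * r
            discount *= gamma
        weighted_returns.append(weight * ret)
    # The estimator is unbiased, so its MSE equals its variance; behavior
    # policy search chooses pi_b to shrink that variance.
    return float(np.mean(weighted_returns))
```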
Researcher Affiliation | Academia | Josiah P. Hanna (1), Philip S. Thomas (2, 3), Peter Stone (1), Scott Niekum (1). (1) The University of Texas at Austin, Austin, Texas, USA; (2) The University of Massachusetts, Amherst, Massachusetts, USA; (3) Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
Pseudocode | Yes | "Algorithm 1 Behavior Policy Gradient"
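The paper's Algorithm 1 performs stochastic gradient descent on the MSE of the IS estimator with respect to the behavior policy parameters. Below is a minimal sketch of one such update under our reading of that algorithm, assuming undiscounted returns and the ordinary IS estimator; pi_e, pi_theta, and grad_log_pi_theta are hypothetical helpers for the evaluation policy, the parameterized behavior policy, and its score function, not the authors' implementation.

```python
import numpy as np

def bpg_step(theta, trajectory, pi_e, pi_theta, grad_log_pi_theta, alpha):
    """One stochastic-gradient update of the behavior policy parameters.

    For the unbiased IS estimator, the per-trajectory MSE gradient is
    -IS(H)^2 * sum_t grad log pi_theta(a_t | s_t), so descending the MSE
    adds alpha * IS(H)^2 * (score of the trajectory) to theta.
    """
    weight, ret = 1.0, 0.0
    score = np.zeros_like(theta)
    for s, a, r in trajectory:
        weight *= pi_e(s, a) / pi_theta(theta, s, a)
        ret += r
        score += grad_log_pi_theta(theta, s, a)
    is_return = weight * ret            # per-trajectory IS estimate
    return theta + alpha * (is_return ** 2) * score
```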
Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of its source code.
Open Datasets | No | The paper describes using "4x4 Gridworld", "Cartpole Swing Up", and "Acrobot tasks implemented within RLLAB (Duan et al., 2016)". These are standard reinforcement learning environments rather than static datasets, and the paper provides no download links or data-file citations for them.
Dataset Splits | No | The paper operates in an incremental reinforcement learning setting where trajectories are sampled sequentially, rather than using pre-defined train/validation/test splits from a static dataset: "We consider an incremental setting where, at iteration i, we sample a single trajectory Hi with a policy πθi and add {Hi, θi} to a set D."
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments.
Software Dependencies | No | The paper mentions the "Cartpole Swing Up and Acrobot tasks implemented within RLLAB (Duan et al., 2016)" but does not give version numbers for RLLAB or any other software dependency needed for replication.
Experiment Setup | Yes | "Algorithm 1 Behavior Policy Gradient. Input: Evaluation policy parameters, θe, batch size k, a step-size for each iteration, αi, and number of iterations n. ... We use a constant learning rate of 10^-5 for all values of p and run BPG for 500 iterations. ... For a batch size of 100 trajectories."
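As a rough illustration of how the reported settings (batch size 100, 500 iterations, constant learning rate 10^-5) could drive the outer loop of Algorithm 1, here is a hedged sketch under our reading of the incremental setting; sample_trajectory and step are hypothetical callables (the latter in the spirit of bpg_step above, already bound to the policy helpers), not the authors' code.

```python
def run_bpg(theta_e, sample_trajectory, step, k=100, n_iterations=500, alpha=1e-5):
    """Hypothetical outer loop of Algorithm 1: start the behavior policy at the
    evaluation policy, collect k trajectories per iteration, store each with the
    parameters that generated it, and take one gradient step per trajectory."""
    theta = theta_e.copy()
    dataset = []                         # D: (trajectory, behavior parameters) pairs
    for _ in range(n_iterations):
        batch = [sample_trajectory(theta) for _ in range(k)]
        dataset.extend((h, theta.copy()) for h in batch)
        for h in batch:
            theta = step(theta, h, alpha)
    return dataset, theta                # D is reused by the off-policy estimator
```

Storing each trajectory together with the behavior parameters that produced it matches the quoted "{Hi, θi}" bookkeeping, since the IS estimator needs to know which policy generated each trajectory.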