Data-Efficient Policy Evaluation Through Behavior Policy Search
Authors: Josiah P. Hanna, Philip S. Thomas, Peter Stone, Scott Niekum
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents an empirical study of variance reduction through behavior policy search. We design our experiments to answer the following questions: Can behavior policy search with BPG reduce policy evaluation MSE compared to on-policy estimates in both tabular and continuous domains? ... We present a behavior policy search algorithm and empirically demonstrate its effectiveness in lowering the mean squared error of policy performance estimates. |
| Researcher Affiliation | Academia | Josiah P. Hanna (The University of Texas at Austin, Austin, Texas, USA); Philip S. Thomas (The University of Massachusetts, Amherst, Massachusetts, USA; Carnegie Mellon University, Pittsburgh, Pennsylvania, USA); Peter Stone (The University of Texas at Austin); Scott Niekum (The University of Texas at Austin). |
| Pseudocode | Yes | Algorithm 1 Behavior Policy Gradient (a hedged code sketch of one BPG iteration follows the table). |
| Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of its source code. |
| Open Datasets | No | The paper describes using "4x4 Gridworld", "Cartpole Swing Up", and "Acrobot tasks implemented within RLLAB (Duan et al., 2016)". These are standard reinforcement learning environments rather than static datasets, and the paper provides no access links or data-file citations for a downloadable, publicly available dataset. |
| Dataset Splits | No | The paper operates in an incremental reinforcement learning setting where trajectories are sampled sequentially, rather than using pre-defined train/validation/test splits from a static dataset. "We consider an incremental setting where, at iteration i, we sample a single trajectory Hi with a policy πθi and add {Hi, θi} to a set D." |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. |
| Software Dependencies | No | The paper mentions that "Cartpole Swing Up and Acrobot tasks implemented within RLLAB (Duan et al., 2016)" but does not provide specific version numbers for RLLAB or any other software dependencies needed for replication. |
| Experiment Setup | Yes | Algorithm 1 Behavior Policy Gradient. Input: Evaluation policy parameters θₑ, batch size k, a step-size for each iteration αᵢ, and number of iterations n. ... We use a constant learning rate of 10⁻⁵ for all values of p and run BPG for 500 iterations. ... For a batch size of 100 trajectories. (An illustrative driver loop using these hyperparameters appears after the table.) |
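
The report does not reproduce Algorithm 1 itself. As a reading aid, the following is a minimal Python sketch of one Behavior Policy Gradient (BPG) iteration, assuming a tabular softmax behavior policy and a score-function estimate of the MSE gradient of the form −IS(H)² · Σₜ ∇ log π_θ(Aₜ|Sₜ); the `sample_trajectory` helper and all variable names are illustrative and not taken from the authors' code.

```python
import numpy as np

def softmax_probs(theta, s):
    """Action probabilities of a tabular softmax policy in state s."""
    prefs = theta[s] - theta[s].max()      # subtract max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def bpg_iteration(theta_b, theta_e, sample_trajectory, alpha):
    """One BPG step: sample a trajectory with the behavior policy, form the
    importance-sampled return IS(H), and descend an estimate of the MSE gradient."""
    states, actions, rewards = sample_trajectory(theta_b)

    log_ratio = 0.0                        # log of prod_t pi_e(A_t|S_t) / pi_b(A_t|S_t)
    score = np.zeros_like(theta_b)         # sum_t d/d(theta_b) log pi_b(A_t|S_t)
    for s, a in zip(states, actions):
        p_b = softmax_probs(theta_b, s)
        p_e = softmax_probs(theta_e, s)
        log_ratio += np.log(p_e[a]) - np.log(p_b[a])
        grad_log = -p_b                    # d log softmax(a|s) / d preferences(s, .)
        grad_log[a] += 1.0
        score[s] += grad_log

    # Importance-sampled (undiscounted) return: an unbiased estimate of v(pi_e).
    is_return = np.exp(log_ratio) * np.sum(rewards)

    # Assumed score-function estimate of the MSE gradient: -IS(H)^2 * score.
    grad_mse = -(is_return ** 2) * score
    theta_b = theta_b - alpha * grad_mse   # gradient-descent step on estimator MSE
    return theta_b, is_return
```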
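Continuing the sketch above, an illustrative driver loop wires in the hyperparameters quoted in the Experiment Setup row (constant learning rate 10⁻⁵, 500 BPG iterations). The paper's experiments use batches of trajectories per iteration (e.g., 100), whereas this sketch processes a single trajectory per step for brevity; `run_bpg` is a hypothetical helper, not part of the paper.

```python
def run_bpg(theta_e, sample_trajectory, alpha=1e-5, n_iterations=500):
    """Illustrative BPG driver: the behavior policy starts at the evaluation
    policy and is adapted for n_iterations gradient steps."""
    theta_b = theta_e.copy()
    estimates = []
    for _ in range(n_iterations):
        theta_b, is_return = bpg_iteration(theta_b, theta_e,
                                           sample_trajectory, alpha)
        estimates.append(is_return)
    # Each IS(H_i) is unbiased for the evaluation policy's expected return,
    # so their mean serves as the final (lower-variance) performance estimate.
    return float(np.mean(estimates)), theta_b
```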