Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
Authors: Philip Thomas, Emma Brunskill
ICML 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results show that MAGIC can produce estimates with orders of magnitude lower mean squared error than the estimates produced by existing algorithms. In this section we present an empirical comparison of different OPE methods. We used three domains: 1) a 4×4 gridworld previously constructed specifically for evaluating OPE methods (Thomas, 2015b, Section 2.5); 2) Model Fail...; and 3) Model Win.... Figures 1 and 2 show the mean squared error of different estimators as n, the number of episodes in D, increases. |
| Researcher Affiliation | Academia | Philip S. Thomas (PHILIPT@CS.CMU.EDU), Emma Brunskill (EBRUN@CS.CMU.EDU) |
| Pseudocode | Yes | Algorithm 1 MAGIC(D). 1: Input: historical data D, evaluation policy π_e, an approximate model, and a set of return-lengths J. 2: Compute the \|J\| × \|J\| matrix Ω̂_n according to (5). 3: Compute a 90% confidence interval, [l, u], on WDR(D) using the percentile bootstrap method. 4: Compute the \|J\| × 1 vector b̂_n, where b̂_n(j) = dist(g^(J_j)(D), [l, u]). 5: x* ∈ arg min_{x ∈ Δ^\|J\|} xᵀ[Ω̂_n + b̂_n b̂_nᵀ]x. 6: Return (x*)ᵀ g^J(D). (A minimal code sketch of steps 4–6 appears after this table.) |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the described methodology or links to code repositories. |
| Open Datasets | Yes | We used three domains: 1) a 4×4 gridworld previously constructed specifically for evaluating OPE methods (Thomas, 2015b, Section 2.5); 2) Model Fail, a partially observable, deterministic, 4-state domain with horizon L = 2 and in which 3 of the states are aliased (appear identical to the agent), which means that the agent's observations are not Markovian and thus that the approximate (MDP) model is incorrect, even asymptotically; and 3) Model Win, a stochastic 4-state MDP with L = 20, where the model that we use can perfectly represent the true MDP. |
| Dataset Splits | No | The paper describes the data used (historical trajectories) and the number of trials, but does not specify any training, validation, or test splits for model development or evaluation, as typically reported in ML experiments. |
| Hardware Specification | No | The paper does not provide any details about the specific hardware used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | No | The paper describes the domains used in experiments and evaluation metrics but does not provide specific hyperparameters or system-level training settings for the OPE algorithms themselves. |
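
The MAGIC pseudocode quoted above combines the |J| partial importance-sampling estimators g^(J_j)(D) into one estimate by choosing weights on the simplex that minimize an estimated MSE surrogate. The Python sketch below illustrates steps 4–6 only, assuming the per-length estimates, the covariance estimate Ω̂_n from (5), and the bootstrap interval [l, u] on WDR(D) have already been computed; the function names (`magic_blend`, `dist_to_interval`) and the toy numbers are illustrative and not taken from the authors' (unreleased) code.

```python
# Minimal sketch of MAGIC's blending step (Algorithm 1, steps 4-6).
# Assumes g, omega_hat, and the interval [lo, hi] on WDR(D) are precomputed;
# names and values are illustrative, not from the paper's implementation.
import numpy as np
from scipy.optimize import minimize


def dist_to_interval(y, lo, hi):
    """Distance from y to the interval [lo, hi]; zero if y lies inside it."""
    if y < lo:
        return lo - y
    if y > hi:
        return y - hi
    return 0.0


def magic_blend(g, omega_hat, lo, hi):
    """Blend the |J| partial-return estimators g^(J_j)(D) with simplex weights x
    chosen to minimize x^T [omega_hat + b_hat b_hat^T] x (omega_hat assumed symmetric)."""
    g = np.asarray(g, dtype=float)
    k = g.shape[0]

    # Step 4: bias proxy = distance of each estimator to the bootstrap interval on WDR(D).
    b_hat = np.array([dist_to_interval(gj, lo, hi) for gj in g])
    A = omega_hat + np.outer(b_hat, b_hat)

    # Step 5: minimize x^T A x subject to x >= 0 and sum(x) = 1 (the simplex constraint).
    x0 = np.full(k, 1.0 / k)
    res = minimize(
        lambda x: x @ A @ x,
        x0,
        jac=lambda x: 2.0 * (A @ x),
        bounds=[(0.0, 1.0)] * k,
        constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}],
        method="SLSQP",
    )
    x_star = res.x

    # Step 6: the MAGIC estimate is the weighted combination of the estimators.
    return float(x_star @ g), x_star


if __name__ == "__main__":
    # Toy illustration only; these values are not from the paper's experiments.
    g = np.array([0.8, 1.1, 1.3])            # g^(J_1)(D), ..., g^(J_3)(D)
    omega_hat = np.diag([0.50, 0.20, 0.05])  # stand-in for the estimate from (5)
    estimate, weights = magic_blend(g, omega_hat, lo=1.0, hi=1.2)
    print(estimate, weights)
```

The SLSQP call is just one convenient way to handle the non-negativity and sum-to-one constraints; any quadratic-program solver over the simplex would serve the same purpose in this sketch.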