Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

Authors: Philip Thomas, Emma Brunskill

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results show that MAGIC can produce estimates with orders of magnitude lower mean squared error than the estimates produced by existing algorithms. In this section we present an empirical comparison of different OPE methods. We used three domains: 1) a 4 x 4 gridworld previously constructed specifically for evaluating OPE methods (Thomas, 2015b, Section 2.5); 2) Model Fail...; and 3) Model Win.... Figures 1 and 2 show the mean squared error of different estimators as n, the number of episodes in D, increases. (A sketch of this MSE-versus-n protocol appears after the table.)
Researcher Affiliation | Academia | Philip S. Thomas (PHILIPT@CS.CMU.EDU), Emma Brunskill (EBRUN@CS.CMU.EDU)
Pseudocode | Yes | Algorithm 1 MAGIC(D). 1: Input: historical data D, evaluation policy π_e, an approximate model, and a set of return lengths J. 2: Compute the |J| × |J| matrix Ω̂_n according to (5). 3: Compute a 90% confidence interval, [l, u], on WDR(D) using the percentile bootstrap method. 4: Compute the |J| × 1 vector b̂_n, where b̂_n(j) = dist(g^(J_j)(D), [l, u]). 5: x* ∈ arg min_{x ∈ Δ^|J|} xᵀ [Ω̂_n + b̂_n b̂_nᵀ] x. 6: Return (x*)ᵀ g^J(D). (A hedged Python sketch of this blending step appears after the table.)
Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the described methodology or links to code repositories.
Open Datasets | Yes | We used three domains: 1) a 4 x 4 gridworld previously constructed specifically for evaluating OPE methods (Thomas, 2015b, Section 2.5); 2) Model Fail, a partially observable, deterministic, 4-state domain with horizon L = 2 in which 3 of the states are aliased (they appear identical to the agent), which means that the agent's observations are not Markovian and thus that the approximate (MDP) model is incorrect, even asymptotically; and 3) Model Win, a stochastic 4-state MDP with L = 20, where the model that we use can perfectly represent the true MDP. (An illustrative stand-in for a small MDP of this kind appears after the table.)
Dataset Splits | No | The paper describes the data used (historical trajectories) and the number of trials, but does not specify training, validation, or test splits of this data of the kind typically used for model development and evaluation in ML experiments.
Hardware Specification | No | The paper does not provide any details about the specific hardware used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies or their version numbers.
Experiment Setup | No | The paper describes the experimental domains and evaluation metrics, but does not provide specific hyperparameters or system-level settings for the OPE algorithms themselves.
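
The MAGIC blending step quoted in the Pseudocode row can be illustrated in a few lines. The sketch below assumes NumPy and SciPy; the matrix `g` of per-episode partial estimates (one column per return length J_j, with the last column taken to be the WDR estimate), the sample-covariance proxy used for Ω̂_n, and the bootstrap size are assumptions made for illustration rather than the paper's exact definitions (the paper specifies Ω̂_n via its equation (5)).

```python
import numpy as np
from scipy.optimize import minimize

def magic_combine(g, n_boot=200, alpha=0.10, rng=None):
    """Hedged sketch of the MAGIC blending step.

    g : (n, k) array -- g[i, j] is the j-th partial off-policy estimate
        computed from episode i; column means give the k candidate
        estimates g^J(D), and the last column is assumed to be WDR.
    Returns the blended estimate (x*)^T g^J(D).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, k = g.shape
    g_bar = g.mean(axis=0)                      # g^J(D): the k candidate estimates

    # Step 2 (proxy): covariance of the k mean estimators, standing in for Omega_hat_n.
    omega = np.atleast_2d(np.cov(g.T, bias=True)) / n

    # Step 3: percentile-bootstrap confidence interval on the WDR column.
    boot = np.array([g[rng.integers(0, n, n), -1].mean() for _ in range(n_boot)])
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])

    # Step 4: bias proxy b_hat_n(j) = distance from g^(J_j)(D) to [lo, hi].
    b = np.maximum(0.0, np.maximum(lo - g_bar, g_bar - hi))

    # Step 5: minimise x^T (Omega_hat_n + b_hat_n b_hat_n^T) x over the simplex.
    A = omega + np.outer(b, b)
    cons = ({'type': 'eq', 'fun': lambda x: x.sum() - 1.0},)
    res = minimize(lambda x: x @ A @ x, np.full(k, 1.0 / k),
                   bounds=[(0.0, 1.0)] * k, constraints=cons, method='SLSQP')

    # Step 6: return the blended estimate.
    return float(res.x @ g_bar)
```

Any quadratic-programming routine over the probability simplex could replace the SLSQP call; the solver choice here is purely for illustration.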
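
The Model Win description quoted above (a stochastic 4-state MDP with horizon L = 20) can be mimicked by a small tabular simulator for experimentation. The class below is only an illustrative stand-in under that structural description: the class name, the Dirichlet-sampled transition probabilities, and the Gaussian rewards are placeholders, not the paper's actual ModelWin dynamics.

```python
import numpy as np

class TinyStochasticMDP:
    """Illustrative 4-state, 2-action, horizon-20 MDP in the spirit of Model Win.

    The number of states, the horizon, and the stochasticity match the quoted
    description; the specific transition probabilities and rewards below are
    placeholders, NOT the paper's Model Win specification.
    """
    n_states, n_actions, horizon = 4, 2, 20

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        # P[s, a] is a distribution over next states; R[s, a] is a deterministic reward.
        self.P = self.rng.dirichlet(np.ones(self.n_states),
                                    size=(self.n_states, self.n_actions))
        self.R = self.rng.normal(size=(self.n_states, self.n_actions))

    def rollout(self, policy):
        """Generate one episode as (state, action, reward) triples; policy[s] is
        a distribution over actions."""
        s, episode = 0, []
        for _ in range(self.horizon):
            a = self.rng.choice(self.n_actions, p=policy[s])
            episode.append((s, a, self.R[s, a]))
            s = self.rng.choice(self.n_states, p=self.P[s, a])
        return episode
```

Rolling out a behaviour policy such as `np.full((4, 2), 0.5)` (uniform over the two actions) a few hundred times would then produce the kind of historical data set D that the OPE estimators consume.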
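
Finally, the MSE-versus-n comparison summarised in the Research Type row follows a standard trial loop: for each data-set size n, repeatedly draw n behaviour-policy episodes, run each estimator, and average the squared error against a ground-truth value of the evaluation policy. The sketch below assumes NumPy; the callables `sample_episodes` and `estimators`, and the particular values of `ns` and `trials`, are hypothetical interfaces chosen for illustration, not the paper's code.

```python
import numpy as np

def mse_curve(sample_episodes, estimators, true_value,
              ns=(8, 32, 128, 512), trials=128, seed=0):
    """Hedged sketch of an MSE-versus-n evaluation protocol.

    sample_episodes(n, rng) -> list of n behaviour-policy episodes
    estimators : dict mapping a name to a function (episodes -> estimate of v(pi_e))
    true_value : ground-truth v(pi_e), e.g. from a large Monte Carlo rollout of pi_e
    Returns {name: [MSE at each n]}.
    """
    rng = np.random.default_rng(seed)
    curves = {name: [] for name in estimators}
    for n in ns:
        errs = {name: [] for name in estimators}
        for _ in range(trials):
            D = sample_episodes(n, rng)          # historical data for this trial
            for name, est in estimators.items():
                errs[name].append((est(D) - true_value) ** 2)
        for name in estimators:
            curves[name].append(float(np.mean(errs[name])))
    return curves
```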