High-Confidence Off-Policy Evaluation

Authors: Philip Thomas, Georgios Theocharous, Mohammad Ghavamzadeh

AAAI 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments we show how our algorithm can be applied to a digital marketing problem where confidence bounds are necessary to motivate the potentially risky gamble of executing a new policy. Although the off-policy evaluation problem has been solved efficiently in the multi-arm bandit case (Li et al. 2011), it is still an open question for sequential decision problems. Existing methods for estimating the performance of the evaluation policy using trajectories from behavior policies do not provide confidence bounds (Maei and Sutton 2010; Liu, Mahadevan, and Liu 2012; Mandel et al. 2014). [...] Experiments: Mountain Car: We used the mountain car data from Fig. 1 to compare the lower bounds found when using Thm. 1, CH, MPeB, and AM. The results are provided in Table 1. (A minimal sketch of this style of concentration-based lower bound appears after the table.)
Researcher Affiliation | Collaboration | Philip S. Thomas (1,2), Georgios Theocharous (1), Mohammad Ghavamzadeh (1,3); affiliations: 1 Adobe Research, 2 University of Massachusetts Amherst, 3 INRIA Lille
Pseudocode | No | The paper does not include a clearly labeled pseudocode or algorithm block. Theorem 1 presents a mathematical formula and its proof, not a step-by-step algorithm.
Open Source Code | No | The paper does not include any statement about making its source code openly available or provide a link to a code repository.
Open Datasets | No | For our second case study we used real data, captured with permission from the website of a Fortune 50 company that receives hundreds of thousands of visitors per day and which uses Adobe Target, to train a simulator using a proprietary in-house system identification tool at Adobe. The simulator produces a vector of 31 real-valued features that provide a compressed representation of all of the available information about a user.
Dataset Splits | Yes | Thm. 1 requires the thresholds, c_i, to be fixed, i.e., they should not be computed using realizations of any X_i. So, we partition the data set, D, into two sets, D_pre and D_post. D_pre is used to estimate the optimal threshold, c, and D_post is used to compute the lower bound (the RHS of (2)). [...] From our preliminary experiments we found that using 1/20 of the samples in D_pre and the remaining 19/20 in D_post works well. (See the data-splitting sketch after the table.)
Hardware Specification | No | The paper states: 'We used the mountain car data from Fig. 1 to compare the lower bounds...' and 'to train a simulator using a proprietary in-house system identification tool at Adobe.' However, it does not provide specific hardware details such as CPU/GPU models or memory specifications.
Software Dependencies | No | The paper does not provide specific software names with version numbers that would be required to reproduce the experiments. It mentions using 'Adobe Marketing Cloud' and 'Adobe Target' but without version details.
Experiment Setup | Yes | We selected T = 20 and γ = 1. This is a particularly challenging problem because the reward signal is sparse: if each action is selected with probability 0.5 always, only about 0.38% of the transitions are rewarding, since users usually do not click on the advertisements. [...] From our preliminary experiments we found that using 1/20 of the samples in D_pre and the remaining 19/20 in D_post works well. (An end-to-end toy example follows the table.)
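
The CH (Chernoff-Hoeffding) baseline named in the first row is a standard concentration-inequality lower bound on the mean of bounded random variables, here applied to per-trajectory importance-sampled returns. The sketch below is an illustration of that baseline, not the paper's Theorem 1: the is_return helper, the pi_e/pi_b interfaces, and the assumption that returns lie in [0, b] are ours, introduced only to make the idea concrete.

    import numpy as np

    def is_return(traj, pi_e, pi_b, gamma=1.0):
        # Per-trajectory importance-sampled return. `traj` is a list of
        # (state, action, reward) tuples; pi_e(s, a) and pi_b(s, a) return
        # action probabilities under the evaluation and behavior policies.
        # (Hypothetical interface, for illustration only.)
        rho, g = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(s, a) / pi_b(s, a)  # importance weight
            g += (gamma ** t) * r           # discounted return
        return rho * g

    def ch_lower_bound(x, b, delta=0.05):
        # One-sided Chernoff-Hoeffding bound: with probability >= 1 - delta,
        # the true mean of i.i.d. samples in [0, b] is at least this value.
        x = np.asarray(x, dtype=float)
        n = len(x)
        return x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))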
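The D_pre/D_post split quoted in the Dataset Splits row can be illustrated the same way. This is an assumption-laden reconstruction, not the paper's procedure: split_pre_post, choose_threshold, and the grid of candidate thresholds are hypothetical names, and the bound predicted on D_pre reuses the CH form above rather than the paper's Theorem 1. Truncating each return at c keeps the bound valid for the true mean, since min(X, c) <= X implies E[min(X, c)] <= E[X].

    def split_pre_post(returns, pre_frac=1 / 20, seed=0):
        # Partition the returns into D_pre (threshold selection) and D_post
        # (bound computation), matching the reported 1/20 vs 19/20 split.
        returns = np.asarray(returns, dtype=float)
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(returns))
        n_pre = max(1, int(pre_frac * len(returns)))
        return returns[idx[:n_pre]], returns[idx[n_pre:]]

    def choose_threshold(pre, candidates, n_post, delta=0.05):
        # Pick the truncation threshold c that maximizes the lower bound we
        # predict D_post will yield (a hypothetical grid search; the paper
        # instead estimates the optimal c from D_pre directly).
        best_c, best_lb = None, -np.inf
        for c in candidates:
            lb = np.minimum(pre, c).mean() - c * np.sqrt(
                np.log(1.0 / delta) / (2.0 * n_post))
            if lb > best_lb:
                best_c, best_lb = c, lb
        return best_c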
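A hypothetical end-to-end run under the 1/20 vs 19/20 split described in the Experiment Setup row, using synthetic lognormal samples as a stand-in for heavy-tailed importance-sampled returns (the data here is synthetic, not from the paper's mountain-car or digital-marketing experiments):

    # Uses split_pre_post, choose_threshold, ch_lower_bound defined above.
    rng = np.random.default_rng(1)
    x = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)  # synthetic returns
    pre, post = split_pre_post(x, pre_frac=1 / 20)
    c = choose_threshold(pre, np.quantile(pre, [0.9, 0.99, 0.999, 1.0]),
                         n_post=len(post), delta=0.05)
    # Truncated samples lie in [0, c], so the CH bound applies; it remains a
    # valid (if conservative) lower bound on the untruncated mean.
    print(ch_lower_bound(np.minimum(post, c), c, delta=0.05))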