High-Confidence Off-Policy Evaluation
Authors: Philip Thomas, Georgios Theocharous, Mohammad Ghavamzadeh
AAAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments we show how our algorithm can be applied to a digital marketing problem where confidence bounds are necessary to motivate the potentially risky gamble of executing a new policy. Although the off-policy evaluation problem has been solved efficiently in the multi-armed bandit case (Li et al. 2011), it is still an open question for sequential decision problems. Existing methods for estimating the performance of the evaluation policy using trajectories from behavior policies do not provide confidence bounds (Maei and Sutton 2010; Liu, Mahadevan, and Liu 2012; Mandel et al. 2014). [...] Experiments Mountain Car We used the mountain car data from Fig. 1 to compare the lower bounds found when using Thm. 1, CH, MPeB, and AM. The results are provided in Table 1. |
| Researcher Affiliation | Collaboration | Philip S. Thomas1,2 Georgios Theocharous1 Mohammad Ghavamzadeh1,3 1Adobe Research, 2University of Massachusetts Amherst, 3INRIA Lille |
| Pseudocode | No | The paper does not include a clearly labeled pseudocode block or algorithm block. Theorem 1 presents a mathematical formula and its proof, not a step-by-step algorithm. |
| Open Source Code | No | The paper does not include any statement about making its source code openly available or provide a link to a code repository. |
| Open Datasets | No | For our second case study we used real data, captured with permission from the website of a Fortune 50 company that receives hundreds of thousands of visitors per day and which uses Adobe Target, to train a simulator using a proprietary in-house system identification tool at Adobe. The simulator produces a vector of 31 real-valued features that provide a compressed representation of all of the available information about a user. |
| Dataset Splits | Yes | Thm. 1 requires the thresholds, ci, to be fixed, i.e., they should not be computed using realizations of any Xi. So, we partition the data set, D, into two sets, Dpre and Dpost. Dpre is used to estimate the optimal threshold, c, and Dpost is used to compute the lower bound (the RHS of (2)). [...] From our preliminary experiments we found that using 1/20 of the samples in Dpre and the remaining 19/20 in Dpost works well. |
| Hardware Specification | No | The paper states: 'We used the mountain car data from Fig. 1 to compare the lower bounds...' and 'to train a simulator using a proprietary in-house system identification tool at Adobe.' However, it does not provide specific hardware details such as CPU/GPU models or memory specifications. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers that would be required to reproduce the experiments. It mentions using 'Adobe Marketing Cloud' and 'Adobe Target' but without version details. |
| Experiment Setup | Yes | We selected T = 20 and γ = 1. This is a particularly challenging problem because the reward signal is sparse: if each action is selected with probability 0.5 always, only about 0.38% of the transitions are rewarding, since users usually do not click on the advertisements. [...] From our preliminary experiments we found that using 1/20 of the samples in Dpre and the remaining 19/20 in Dpost works well. |
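The off-policy evaluation setting the table describes estimates the performance of an evaluation policy from trajectories generated by a behavior policy, typically via importance sampling. A minimal sketch of a per-trajectory importance-sampled return (this is not the paper's code; the function and variable names are illustrative assumptions):

```python
def importance_weighted_return(trajectory, pi_e, pi_b, gamma=1.0):
    """Per-trajectory importance-sampled return: the discounted return
    reweighted by the likelihood ratio of the evaluation policy pi_e
    to the behavior policy pi_b.

    trajectory: list of (state, action, reward) tuples
    pi_e, pi_b: functions mapping (state, action) -> action probability
    """
    weight, ret = 1.0, 0.0
    for t, (s, a, r) in enumerate(trajectory):
        weight *= pi_e(s, a) / pi_b(s, a)  # cumulative likelihood ratio
        ret += (gamma ** t) * r            # discounted return
    return weight * ret
```

When pi_e and pi_b coincide, the weight is 1 and the estimate reduces to the observed return; when they differ, the weight can grow large, which is why the heavy tail of these estimates motivates the concentration bounds compared in the paper.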
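The Dataset Splits row quotes the paper's Dpre/Dpost partition, where Dpre (1/20 of the samples) tunes a threshold and Dpost computes a confidence lower bound. As a hedged illustration, here is the split plus a Chernoff-Hoeffding (CH) lower bound, one of the baselines the paper compares against; this is a generic CH bound on bounded random variables, not the paper's Theorem 1, and the split fraction is the only detail taken from the quoted text:

```python
import numpy as np

def chernoff_hoeffding_lower_bound(x, b, delta=0.05):
    """1 - delta confidence lower bound on the mean of i.i.d.
    random variables bounded in [0, b] (Hoeffding's inequality)."""
    n = len(x)
    return np.mean(x) - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

def split_pre_post(returns, pre_fraction=1.0 / 20.0, seed=0):
    """Partition a data set D into D_pre (used to tune parameters
    such as a truncation threshold) and D_post (used for the bound)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(returns))
    n_pre = int(len(returns) * pre_fraction)
    return returns[idx[:n_pre]], returns[idx[n_pre:]]
```

Because CH's bound scales with the range b of the variables, it degrades badly when importance weights (and hence b) are large, which is the weakness the paper's threshold-based bound addresses.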