Off-Policy Evaluation for Large Action Spaces via Embeddings
Authors: Yuta Saito, Thorsten Joachims
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions." and "We first evaluate MIPS on synthetic data to identify the situations where it enables a more accurate OPE. Second, we validate real-world applicability on data from an online fashion store." (Section 4, Empirical Evaluation) |
| Researcher Affiliation | Academia | Yuta Saito and Thorsten Joachims, Department of Computer Science, Cornell University, Ithaca, NY, USA. Correspondence to: Yuta Saito <ys552@cornell.edu>, Thorsten Joachims <tj@cs.cornell.edu>. |
| Pseudocode | Yes | Algorithm 1 An Experimental Procedure to Evaluate an OPE Estimator with Real-World Bandit Data (Appendix D.3) |
| Open Source Code | Yes | Our experiment implementation is available at https://github.com/usaito/icml2022-mips. |
| Open Datasets | Yes | We use the Open Bandit Dataset (OBD) (Saito et al., 2020), a publicly available logged bandit dataset collected on a large-scale fashion e-commerce platform. |
| Dataset Splits | No | The paper describes generating synthetic data, bootstrap sampling for the real-world evaluation, and cross-fitting for internal model estimation, but does not specify a fixed training, validation, and test split for the overall OPE evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments. |
| Software Dependencies | No | The paper mentions software such as scikit-learn and models such as Categorical Naive Bayes, but does not specify version numbers. |
| Experiment Setup | Yes | "To summarize, we first sample context x and define the expected reward q(x, e) as in Eq. (5). We then sample discrete action a from π0 based on Eq. (6). Given action a, we sample categorical action embedding e based on Eq. (4). Finally, we sample the reward from a normal distribution with mean q(x, e) and standard deviation σ = 2.5. Iterating this procedure n times generates logged data D with n independent copies of (x, a, e, r)." and "In the main text, we use β = 1, and additional results for other values of β can be found in Appendix D.2." and "In the main text, we set ϵ = 0.05, which produces a near-optimal and near-deterministic target policy." A minimal sketch of this data-generating process is given below the table. |
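To make the sampling steps quoted above concrete, the following is a minimal Python sketch of the synthetic data-generating process, not the paper's implementation. The dimensions, the linear form of q(x, e), the softmax logging policy, and the Dirichlet-drawn embedding distribution p(e | a) are illustrative assumptions standing in for Eqs. (4)–(6); only the sampling order, β = 1, and σ = 2.5 come from the quoted setup.

```python
# Sketch of the synthetic logged-data generation described in the Experiment Setup row.
# The concrete forms of q(x, e) (Eq. 5), pi_0 (Eq. 6), and p(e | a) (Eq. 4) are
# placeholders here, NOT the paper's exact parameterizations.
import numpy as np

rng = np.random.default_rng(0)

n = 10_000            # number of logged interactions (assumed)
dim_x = 10            # context dimension (assumed)
n_actions = 1000      # size of the action space (assumed)
n_cat, n_val = 3, 10  # embedding: 3 categorical dimensions, 10 values each (assumed)
sigma = 2.5           # reward noise std, as stated in the paper
beta = 1.0            # logging-policy inverse temperature (paper uses beta = 1)

# Fixed random parameters standing in for Eqs. (4)-(6) (assumptions).
W_q = rng.normal(size=(dim_x, n_cat, n_val))                   # reward weights
W_pi = rng.normal(size=(dim_x, n_actions))                     # logging-policy weights
P_e = rng.dirichlet(np.ones(n_val), size=(n_actions, n_cat))   # p(e_k | a)

def expected_reward(x, e):
    # q(x, e): placeholder linear form over the categorical embedding.
    return sum(x @ W_q[:, k, e[k]] for k in range(n_cat))

logged_data = []
for _ in range(n):
    x = rng.normal(size=dim_x)                        # sample context x
    logits = beta * (x @ W_pi)                        # softmax logging policy pi_0(.|x)
    pi_0 = np.exp(logits - logits.max())
    pi_0 /= pi_0.sum()
    a = rng.choice(n_actions, p=pi_0)                 # sample discrete action a ~ pi_0
    e = np.array([rng.choice(n_val, p=P_e[a, k])      # sample categorical embedding e ~ p(.|a)
                  for k in range(n_cat)])
    r = rng.normal(loc=expected_reward(x, e), scale=sigma)  # Gaussian reward, sigma = 2.5
    logged_data.append((x, a, e, r))
```

The resulting `logged_data` list plays the role of the logged dataset D, with n independent copies of (x, a, e, r) that an OPE estimator would consume.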