Contextual Multi-armed Bandit Algorithm for Semiparametric Reward Model
Authors: Gi-Soo Kim, Myunghee Cho Paik
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed and existing algorithms are evaluated via simulation and also applied to Yahoo! news article recommendation log data. Simulation studies show that in most cases, the cumulative reward of the proposed method increases faster than that of existing methods that assume the same nonstationary reward model. Application to the Yahoo! news article recommendation log data shows that the proposed method increases the user click rate compared to algorithms that assume a stationary reward model. |
| Researcher Affiliation | Academia | Department of Statistics, Seoul National University, Seoul, Korea. |
| Pseudocode | Yes | Algorithm 1 Proposed TS algorithm |
| Open Source Code | No | The paper does not provide an explicit statement or link for the source code of the described methodology. |
| Open Datasets | Yes | We present the results of the proposed and existing methods using the R6A dataset provided by Yahoo! Webscope. The dataset is observational log data of user clicks from May 1st, 2009 to May 10th, 2009, which corresponds to 45,811,883 user visits. Yahoo! Webscope. Yahoo! Front Page Today Module User Click Log Dataset, version 1.0. http://webscope.sandbox.yahoo.com. Accessed: 09/01/2019. |
| Dataset Splits | No | The paper mentions using "data of May 1st, 2009 as tuning data to choose the optimal exploration parameter v" which implies a validation step, but it does not specify a clear dataset split (e.g., percentages or predefined splits) for validation/tuning. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU, GPU models, memory) used for the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | We set N = 2 or 6 and d = 10. We let the first action be the base action, i.e., b_1(t) = 0_d for all t, and form the other context vectors as b_i(t) = [I(i=2) z_i(t)^T, …, I(i=N) z_i(t)^T]^T, where z_i(t) ∈ R^{d′}, d′ = d/(N−1), and z_i(t) is generated uniformly at random from the d′-dimensional unit sphere. We generate η_i(t) i.i.d. N(0, 0.1²) and the rewards from (5), where we set µ = [0.55, 0.666, 0.09, 0.232, 0.244, 0.55, 0.666, 0.09, 0.232, 0.244]^T and consider four cases for ν(t): (i) ν(t) = 0, (ii) ν(t) = b_{a∗(t)}(t)^T µ, (iii) ν(t) = log(t+1), (iv) ν(t) = cos(tπ/5000)·log(t+1). We conduct 50 replications in total for each case. We use data of May 1st, 2009 as tuning data to choose the optimal exploration parameter v for the TS algorithm and the proposed algorithm, respectively. Then we conduct the main analysis on data from May 2nd to May 10th, 2009. We fix the value of T to T = 1,900,000 a priori, and run the evaluation algorithm 10 times on the same data for each policy. |
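
The Pseudocode row above points to the paper's Algorithm 1, a Thompson Sampling (TS) procedure. As a rough illustration only, the sketch below shows a generic linear-TS action-selection step of the kind such algorithms build on; it is not the paper's Algorithm 1 (which additionally handles the nonparametric intercept ν(t)), and all function and variable names here are hypothetical.

```python
import numpy as np

def lin_ts_select(B, y, contexts, v, rng):
    """Generic linear Thompson Sampling step (illustrative only).

    B        : (d, d) regularized Gram matrix of past contexts
    y        : (d,)   running sum of context * reward
    contexts : (N, d) candidate context vectors b_i(t)
    v        : exploration scale (the tuning parameter "v" in the table)
    """
    mu_hat = np.linalg.solve(B, y)                    # ridge-style point estimate
    cov = v ** 2 * np.linalg.inv(B)                   # posterior-like covariance
    mu_tilde = rng.multivariate_normal(mu_hat, cov)   # sample a parameter vector
    return int(np.argmax(contexts @ mu_tilde))        # act greedily w.r.t. the sample

# Example usage with random data (d = 10, N = 6 as in the setup row):
rng = np.random.default_rng(0)
B = np.eye(10)
y = np.zeros(10)
ctx = rng.standard_normal((6, 10))
arm = lin_ts_select(B, y, ctx, v=0.1, rng=rng)
```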
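
The Experiment Setup row describes the simulation design. The following sketch, under my reading of that description, generates contexts and rewards for one round; the helper names, the placeholder µ, the inline reward form ν(t) + b_a(t)^T µ + η, and the choice of case (iii) for ν(t) are assumptions for illustration, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 10                       # number of actions and context dimension
d_sub = d // (N - 1)               # d' = d / (N - 1), per-action block size
mu = rng.uniform(-1, 1, size=d)    # placeholder; the paper fixes a specific µ (see table row)

def make_contexts(t):
    """b_1(t) = 0_d (base action); b_i(t) places a unit-sphere block z_i(t)
    in the (i-1)-th block of the d-dimensional context, zeros elsewhere."""
    contexts = np.zeros((N, d))
    for i in range(1, N):
        z = rng.standard_normal(d_sub)
        z /= np.linalg.norm(z)                     # uniform draw on the unit sphere
        contexts[i, (i - 1) * d_sub:i * d_sub] = z
    return contexts

def nu(t):
    """Case (iii) of the nonparametric intercept: ν(t) = log(t + 1)."""
    return np.log(t + 1)

def reward(t, contexts, action):
    """Semiparametric reward: ν(t) + b_a(t)^T µ + η with η ~ N(0, 0.1^2)."""
    eta = rng.normal(0.0, 0.1)
    return nu(t) + contexts[action] @ mu + eta
```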
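
For the Yahoo! log-data analysis, the Experiment Setup row mentions running "the evaluation algorithm" 10 times per policy. Logged bandit data of this kind is commonly evaluated offline with a replay-style procedure: process events in order and count a click only when the evaluated policy picks the same article the (uniformly random) logging policy displayed. The sketch below shows that general idea; whether this is the exact evaluator used, and the event and policy interfaces shown, are assumptions.

```python
def replay_evaluate(events, policy):
    """Replay-style offline evaluation on logged bandit data (illustrative).

    events : iterable of (context, displayed_article, click) tuples, where the
             displayed article was chosen uniformly at random by the logger
             (as in the Yahoo! R6A log).
    policy : object with .select(context) -> article and
             .update(context, article, click) methods (hypothetical API).
    Returns the click-through rate over the matched events.
    """
    clicks, matches = 0, 0
    for context, displayed, click in events:
        chosen = policy.select(context)
        if chosen == displayed:              # keep only matched events
            policy.update(context, chosen, click)
            clicks += click
            matches += 1
    return clicks / max(matches, 1)
```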