Contextual Multi-armed Bandit Algorithm for Semiparametric Reward Model
Authors: Gi-Soo Kim, Myunghee Cho Paik
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed and existing algorithms are evaluated via simulation and also applied to Yahoo! news article recommendation log data. Simulation studies show that in most cases, the cumulative reward of the proposed method increases faster than that of existing methods that assume the same nonstationary reward model. Application to the Yahoo! news article recommendation log data shows that the proposed method increases the user click rate compared to algorithms that assume a stationary reward model. |
| Researcher Affiliation | Academia | Department of Statistics, Seoul National University, Seoul, Korea. |
| Pseudocode | Yes | Algorithm 1 Proposed TS algorithm |
| Open Source Code | No | The paper does not provide an explicit statement or link for the source code of the described methodology. |
| Open Datasets | Yes | We present the results of the proposed and existing methods using the R6A dataset provided by Yahoo! Webscope. The dataset is observational log data of user clicks from May 1st, 2009 to May 10th, 2009, which corresponds to 45,811,883 user visits. Yahoo! Webscope. Yahoo! Front Page Today Module User Click Log Dataset, version 1.0. http://webscope.sandbox.yahoo.com. Accessed: 09/01/2019. |
| Dataset Splits | No | The paper mentions using "data of May 1st, 2009 as tuning data to choose the optimal exploration parameter v" which implies a validation step, but it does not specify a clear dataset split (e.g., percentages or predefined splits) for validation/tuning. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU, GPU models, memory) used for the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | We set N = 2 or 6 and d = 10. We let the first action be the base action, i.e., b_1(t) = 0_d for all t, and form the other context vectors as b_i(t) = [I(i=2) z_i(t)^T, …, I(i=N) z_i(t)^T]^T, where z_i(t) ∈ R^{d′}, d′ = d/(N−1), and z_i(t) is generated uniformly at random from the d′-dimensional unit sphere. We generate η_i(t) i.i.d. N(0, 0.1²) and the rewards from (5), where we set µ = [0.55, 0.666, 0.09, 0.232, 0.244, 0.55, 0.666, 0.09, 0.232, 0.244]^T and consider four cases for ν(t): (i) ν(t) = 0, (ii) ν(t) = b_{a∗(t)}(t)^T µ, (iii) ν(t) = log(t+1), (iv) ν(t) = cos(tπ/5000)·log(t+1). We conduct 50 replications in total for each case. We use data of May 1st, 2009 as tuning data to choose the optimal exploration parameter v for the TS algorithm and the proposed algorithm, respectively. Then we conduct the main analysis on data from May 2nd to May 10th, 2009. We fix the value of T to T = 1,900,000 a priori, and run the evaluation algorithm 10 times on the same data for each policy. |
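
The Pseudocode row above points to the paper's Algorithm 1, a Thompson Sampling (TS) procedure. As a rough illustration only, the sketch below shows a generic linear-TS action-selection step of the kind such algorithms build on; it is not the paper's Algorithm 1 (which additionally handles the nonparametric intercept ν(t)), and all function and variable names here are hypothetical.

```python
import numpy as np

def lin_ts_select(B, y, contexts, v, rng):
    """Generic linear Thompson Sampling step (illustrative only).

    B        : (d, d) regularized Gram matrix of past contexts
    y        : (d,)   running sum of context * reward
    contexts : (N, d) candidate context vectors b_i(t)
    v        : exploration scale (the tuning parameter "v" in the table)
    """
    mu_hat = np.linalg.solve(B, y)                    # ridge-style point estimate
    cov = v ** 2 * np.linalg.inv(B)                   # posterior-like covariance
    mu_tilde = rng.multivariate_normal(mu_hat, cov)   # sample a parameter vector
    return int(np.argmax(contexts @ mu_tilde))        # act greedily w.r.t. the sample

# Example usage with random data (d = 10, N = 6 as in the setup row):
rng = np.random.default_rng(0)
B = np.eye(10)
y = np.zeros(10)
ctx = rng.standard_normal((6, 10))
arm = lin_ts_select(B, y, ctx, v=0.1, rng=rng)
```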
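
The Experiment Setup row describes the simulation design. The following sketch, under my reading of that description, generates contexts and rewards for one round; the helper names, the placeholder µ, the inline reward form ν(t) + b_a(t)^T µ + η, and the choice of case (iii) for ν(t) are assumptions for illustration, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 10                       # number of actions and context dimension
d_sub = d // (N - 1)               # d' = d / (N - 1), per-action block size
mu = rng.uniform(-1, 1, size=d)    # placeholder; the paper fixes a specific µ (see table row)

def make_contexts(t):
    """b_1(t) = 0_d (base action); b_i(t) places a unit-sphere block z_i(t)
    in the (i-1)-th block of the d-dimensional context, zeros elsewhere."""
    contexts = np.zeros((N, d))
    for i in range(1, N):
        z = rng.standard_normal(d_sub)
        z /= np.linalg.norm(z)                     # uniform draw on the unit sphere
        contexts[i, (i - 1) * d_sub:i * d_sub] = z
    return contexts

def nu(t):
    """Case (iii) of the nonparametric intercept: ν(t) = log(t + 1)."""
    return np.log(t + 1)

def reward(t, contexts, action):
    """Semiparametric reward: ν(t) + b_a(t)^T µ + η with η ~ N(0, 0.1^2)."""
    eta = rng.normal(0.0, 0.1)
    return nu(t) + contexts[action] @ mu + eta
```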
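
For the Yahoo! log-data analysis, the Experiment Setup row mentions running "the evaluation algorithm" 10 times per policy. Logged bandit data of this kind is commonly evaluated offline with a replay-style procedure: process events in order and count a click only when the evaluated policy picks the same article the (uniformly random) logging policy displayed. The sketch below shows that general idea; whether this is the exact evaluator used, and the event and policy interfaces shown, are assumptions.

```python
def replay_evaluate(events, policy):
    """Replay-style offline evaluation on logged bandit data (illustrative).

    events : iterable of (context, displayed_article, click) tuples, where the
             displayed article was chosen uniformly at random by the logger
             (as in the Yahoo! R6A log).
    policy : object with .select(context) -> article and
             .update(context, article, click) methods (hypothetical API).
    Returns the click-through rate over the matched events.
    """
    clicks, matches = 0, 0
    for context, displayed, click in events:
        chosen = policy.select(context)
        if chosen == displayed:              # keep only matched events
            policy.update(context, chosen, click)
            clicks += click
            matches += 1
    return clicks / max(matches, 1)
```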