Stochastic Bandits with Context Distributions

Authors: Johannes Kirschner, Andreas Krause

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed method on a synthetic example as well as on two benchmarks that we construct from real-world data. Our focus is on understanding the effect of the sample size L used to define the context set Ψ_t^l. We compare three observational modes with decreasing amounts of information available to the learner. First, in the exact setting, we allow the algorithm to observe the context realization before choosing an action, akin to the usual contextual bandit setting. Note that this variant can obtain negative reward on the regret objective (1), because x_t is computed to maximize the expected reward over the context distribution, independent of c_t. Second, in the observed setting, decisions are based on the context distribution, but the regression uses the exact context realization. Last, in the hidden setting, only the context distribution is used. We evaluate the effect of the sample sizes L = 10, 100 and compare to the variant that uses the exact expectation of the features. As is common practice, we treat the confidence parameter β_T as a tuning parameter that we choose to minimize the regret after T = 1000 steps. Below we provide details on the experimental setup; the evaluation is shown in Figure 1. In all experiments, the exact version significantly outperforms the distributional variants, or even achieves negative regret as anticipated. Consistent with our theory, observing the exact context after the action choice improves performance compared to the unobserved variant. The sample-based algorithm is competitive with the expected features already for L = 100 samples.
Researcher Affiliation | Academia | Johannes Kirschner, Department of Computer Science, ETH Zurich (jkirschner@inf.ethz.ch); Andreas Krause, Department of Computer Science, ETH Zurich (krausea@ethz.ch)
Pseudocode | Yes | Algorithm 1: UCB for linear stochastic bandits with context distributions; Algorithm 2 (Appendix B); Algorithm 3 (Appendix C).
Open Source Code | No | The paper does not include an unambiguous statement or a direct link to a source-code repository for the methodology described.
Open Datasets | Yes | Movielens data: "Using matrix factorization we construct 6-dimensional features for user ratings of movies in the movielens-1m dataset (Harper and Konstan, 2016)." Crop yield data: "We use a wheat yield dataset that was systematically collected by the Agroscope institute in Switzerland over 15 years on 10 different sites."
Dataset Splits | No | The paper describes the datasets used but does not provide specific train/validation/test splits, percentages, or a cross-validation setup.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | As a simple synthetic benchmark we set the reward function to f(x, c) = Σ_{i=1}^{5} (x_i c_i)^2, where both actions and contexts are vectors in R^5. ... The action set consists of k = 100 elements that we sample at the beginning of each trial from a standard Gaussian distribution. For the context distribution, we first sample a random element m_t ∈ R^5, again from a multivariate normal distribution, and then set µ_t = N(m_t, 1). Observation noise is Gaussian with standard deviation 0.1. ... We evaluate the effect of the sample sizes L = 10, 100 and compare to the variant that uses the exact expectation of the features. As is common practice, we treat the confidence parameter β_T as a tuning parameter that we choose to minimize the regret after T = 1000 steps.
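The synthetic benchmark described in the Experiment Setup row can be sketched as a small simulation environment. This is a minimal sketch under assumptions: the names make_synthetic_bandit and step are hypothetical helpers, and the reward implements the sum exactly as it appears in the extracted text, f(x, c) = Σ_i (x_i c_i)^2, where an operator may have been lost in PDF extraction.

```python
import numpy as np

def make_synthetic_bandit(k=100, d=5, noise_std=0.1, seed=0):
    """Sketch of the described benchmark: k actions drawn once per
    trial from a standard Gaussian; each round the context
    distribution is N(m_t, I) with m_t itself standard Gaussian."""
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal((k, d))  # sampled at the start of the trial

    def f(x, c):
        # Reward as written in the extracted text; a sign or operator
        # may have been lost in extraction.
        return np.sum((x * c) ** 2)

    def step(x):
        m_t = rng.standard_normal(d)          # mean of the context distribution
        c_t = m_t + rng.standard_normal(d)    # context realization from N(m_t, 1)
        y = f(x, c_t) + noise_std * rng.standard_normal()
        return m_t, c_t, y

    return actions, step

actions, step = make_synthetic_bandit()
m_t, c_t, y = step(actions[0])
```

Each call to step draws a fresh context-distribution mean m_t and a realization c_t, and returns a reward observation with the stated Gaussian noise of standard deviation 0.1.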
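The sample-based mechanism described in the Research Type row, estimating the expected features from L sampled contexts and running linear UCB on them, can be sketched as follows. This is an illustrative sketch, not the paper's Algorithm 1: the feature map phi(x, c) = x * c (elementwise), the dimensions, and the ridge-regression update are stand-in assumptions. The update shown corresponds to the "observed" mode (regressing on the realized context); the "hidden" mode would regress on psi[i] instead.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, L, T, beta = 3, 5, 10, 50, 1.0

def phi(x, c):
    # Stand-in feature map (elementwise product); the paper's
    # benchmarks use different features.
    return x * c

theta = rng.normal(size=d)          # unknown reward parameter
actions = rng.normal(size=(k, d))
V, b = np.eye(d), np.zeros(d)       # ridge-regression statistics

for t in range(T):
    m_t = rng.normal(size=d)                     # mean of the context distribution
    samples = m_t + rng.normal(size=(L, d))      # L sampled contexts
    # Monte-Carlo estimate of the expected features psi(x) = E_c[phi(x, c)].
    psi = np.stack([phi(x, samples).mean(axis=0) for x in actions])

    # Linear UCB on the expected features:
    # argmax_x <psi(x), theta_hat> + sqrt(beta) * ||psi(x)||_{V^-1}.
    theta_hat = np.linalg.solve(V, b)
    V_inv = np.linalg.inv(V)
    width = np.sqrt(np.einsum('ij,jk,ik->i', psi, V_inv, psi))
    i = int(np.argmax(psi @ theta_hat + np.sqrt(beta) * width))

    c_t = m_t + rng.normal(size=d)               # context realization
    y = phi(actions[i], c_t) @ theta + 0.1 * rng.normal()

    # "Observed" mode: regress on the exact feature phi(x_i, c_t);
    # the "hidden" mode would use psi[i] here instead.
    z = phi(actions[i], c_t)
    V += np.outer(z, z)
    b += y * z
```

Increasing L tightens the Monte-Carlo estimate of the expected features, which is the effect the quoted experiments measure by comparing L = 10, L = 100, and the exact expectation.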