Minimax Regret for Cascading Bandits

Authors: Daniel Vial, Sujay Sanghavi, Sanjay Shakkottai, R. Srikant

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work provides problem-independent (i.e., gap-free) regret bounds for cascading bandits that strictly improve the state-of-the-art. In the case of unstructured rewards, our results provide the first minimax-optimal regret bounds (up to log terms). Our key insight is that, compared to the standard bandit problem, the reward variance plays an outsized role in the gap-free analysis. (...) Before closing, we conduct experiments on both synthetic and real data. Some details regarding experimental setup are deferred to Appendix C. Code is available in the supplementary material.
Researcher Affiliation | Collaboration | Daniel Vial (UT Austin & UIUC, dvial@utexas.edu), Sujay Sanghavi (UT Austin & Amazon, sanghavi@mail.utexas.edu), Sanjay Shakkottai (UT Austin, sanjay.shakkottai@utexas.edu), R. Srikant (UIUC, rsrikant@illinois.edu)
Pseudocode | Yes | Algorithm 1: General UCB algorithm for tabular cascading bandits (...) Algorithm 2: CascadeWOFUL for linear cascading bandits (a hedged sketch of such a UCB loop appears after this table)
Open Source Code | Yes | Code is available in the supplementary material. (...) Complete code to recreate all plots is available in the supplementary material.
Open Datasets | Yes | Real data. We replicate the first experiment from Zong et al. [2016] on the MovieLens-1M dataset (grouplens.org/datasets/movielens/1m/), which contains user ratings for L ≈ 4000 movies.
Dataset Splits | No | The paper states, "we divide the ratings into train and test sets," but it does not explicitly mention or detail a validation set or its split percentages/counts.
Hardware Specification | Yes | The experiments were run over several hours on a laptop, so no significant resources were used.
Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., Python version, PyTorch version, etc.).
Experiment Setup | Yes | Some details regarding experimental setup are deferred to Appendix C. Code is available in the supplementary material. (...) First, we use their default choices d = 20 and K = 4. Next, we divide the ratings into train and test sets based on the user who provided the rating. From the training data and a rank-d SVD approximation, we learn a feature mapping φ from movies to the probability that a uniformly random training user rated the movie more than three stars.
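
The Pseudocode row above refers to Algorithm 1, a general UCB algorithm for tabular cascading bandits. The following is a minimal Python sketch of such a loop under a simulated Bernoulli click model; the function name `cascade_ucb`, the exploration constant, and the simulation details are illustrative assumptions, not the authors' Algorithm 1 or the supplementary code.

```python
import numpy as np

def cascade_ucb(attraction_probs, K, T, seed=None):
    """Sketch of a UCB loop for a tabular cascading bandit (not the authors' code).

    attraction_probs: length-L array of true click probabilities (used only to simulate feedback).
    K: list length shown per round; T: horizon. Returns cumulative expected regret per round.
    """
    rng = np.random.default_rng(seed)
    attraction_probs = np.asarray(attraction_probs, dtype=float)
    L = len(attraction_probs)
    counts = np.ones(L)                                   # observations per item (init 1)
    means = rng.binomial(1, attraction_probs).astype(float)  # one initial sample per item
    opt_reward = 1 - np.prod(1 - np.sort(attraction_probs)[::-1][:K])
    regret = np.zeros(T)

    for t in range(T):
        ucb = means + np.sqrt(1.5 * np.log(t + 2) / counts)   # UCB index per item
        action = np.argsort(ucb)[::-1][:K]                    # show the K highest-index items
        # Cascade feedback: the user scans the list and clicks the first attractive item.
        clicks = rng.random(K) < attraction_probs[action]
        stop = int(np.argmax(clicks)) if clicks.any() else K - 1
        examined = action[:stop + 1]                          # only examined items are observed
        feedback = clicks[:stop + 1].astype(float)
        means[examined] = (means[examined] * counts[examined] + feedback) / (counts[examined] + 1)
        counts[examined] += 1
        regret[t] = opt_reward - (1 - np.prod(1 - attraction_probs[action]))
    return np.cumsum(regret)

# Example: 20 items, recommend K = 4, horizon 10,000 rounds.
cum_regret = cascade_ucb(np.random.default_rng(0).uniform(0.05, 0.4, 20), K=4, T=10_000)
```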
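
The Experiment Setup row describes learning a feature mapping φ from a rank-d SVD of the training ratings. The sketch below shows one way such a mapping might be built, assuming a binary user-by-movie matrix of "rated above three stars" indicators; `learn_feature_map` and its signature are hypothetical and are not taken from the paper or its supplementary code.

```python
import numpy as np

def learn_feature_map(train_ratings, n_users, n_movies, d=20):
    """Sketch: rank-d feature map phi from training ratings (illustrative only).

    train_ratings: iterable of (user_id, movie_id, stars) from the training split.
    Returns phi (n_movies x d) and theta (d,) such that phi @ theta approximates
    the probability a random training user rated the movie above three stars.
    """
    # Binary matrix: 1 if the user rated the movie more than three stars.
    liked = np.zeros((n_users, n_movies))
    seen = np.zeros((n_users, n_movies))
    for u, m, s in train_ratings:
        seen[u, m] = 1.0
        liked[u, m] = 1.0 if s > 3 else 0.0

    # Empirical per-movie probability the linear model should recover.
    p = liked.sum(axis=0) / np.maximum(seen.sum(axis=0), 1.0)

    # Rank-d SVD of the binary matrix; right singular vectors give movie embeddings.
    U, S, Vt = np.linalg.svd(liked, full_matrices=False)
    phi = Vt[:d].T * S[:d]                     # shape (n_movies, d)

    # Least-squares fit so that phi @ theta approximates p.
    theta, *_ = np.linalg.lstsq(phi, p, rcond=None)
    return phi, theta
```

With the paper's default choices d = 20 and K = 4, a feature map of this kind would feed a linear cascading bandit such as CascadeWOFUL (Algorithm 2).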