Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts

Authors: Giulia Romano, Andrea Agostini, Francesco Trovò, Nicola Gatti, Marcello Restelli

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also empirically evaluate their performance across a wide range of settings, both synthetically generated and from a real-world media recommendation problem.
Researcher Affiliation | Academia | Politecnico di Milano, Piazza Leonardo da Vinci 32, I-20133, Milan, Italy; {giulia.romano, francesco1.trovo, nicola.gatti, marcello.restelli}@polimi.it, andrea1.agostini@mail.polimi.it
Pseudocode | Yes | Algorithm 1 TP-UCB-FR ... Algorithm 2 TP-UCB-EW (the pseudocode itself is not reproduced in this report; a generic UCB index sketch follows the table)
Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | Spotify Setting. We apply the TP-MAB approach to solve the user recommendation problem presented in Example 1, using a dataset by Spotify [Brost et al., 2019].
Dataset Splits | No | The paper does not specify training, validation, or test dataset splits. It describes experimental settings and how data is sampled (e.g., for the Spotify setting, 'reward realizations x^i_t for the first N = 20 songs is sampled from the listening sessions of that playlist').
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud computing instances with their specs) used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | Setting #1. At first, we evaluate the influence of the parameter α. We model K = 10 arms, whose maximum reward is s.t. R̄^i = 100 i. The reward is collected over τmax = 100 rounds, the smoothness parameter is α = 20, and the aggregated rewards are s.t. Z^i_{t,k} ∼ (R̄^i / α) · U([0, 1]), for each k ∈ [α]. We run the algorithms over a time horizon of T = 10^5 and average the results over 50 independent runs. ... Spotify Setting. We select the K = 6 most played playlists as the arms to be recommended, and each time a playlist i is selected, the corresponding reward realizations x^i_t for the first N = 20 songs are sampled from the listening sessions of that playlist contained in the dataset. ... the maximum delay is τmax = 4N = 80, and the smoothness parameter is α = 20. More details on the setting and the distributions of the reward for each playlist are provided in Appendix C. We average the results over 50 independent runs. (A simulation sketch of Setting #1 follows the table.)
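
Pseudocode row: the two listings named above (TP-UCB-FR and TP-UCB-EW) are not reproduced in this report. As orientation only, the sketch below shows the classic UCB1 index rule that UCB-style bandit policies build on; it is a generic illustration, not the paper's pseudocode, and every name in it is hypothetical.

```python
import math

def ucb1_pick_arm(pull_counts, reward_sums, t):
    """Classic UCB1 index rule (Auer et al., 2002): pick the arm maximizing
    empirical mean + sqrt(2 ln t / n_i). Illustrative only; the paper's
    TP-UCB-FR / TP-UCB-EW instead build their indices from the
    temporally-partitioned (partial) reward feedback."""
    best_arm, best_index = 0, -math.inf
    for arm, n in enumerate(pull_counts):
        if n == 0:
            return arm  # pull each arm once before relying on the index
        index = reward_sums[arm] / n + math.sqrt(2.0 * math.log(t) / n)
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm
```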
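
Experiment Setup row: to make the quoted Setting #1 concrete, the sketch below samples one pull's aggregated rewards under the stated distribution. Only the numeric parameters and the distribution come from the excerpt; the function and variable names, and the assumption about how each of the α chunks maps onto the τmax delayed rounds, are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Setting #1 parameters as quoted above.
K = 10                               # number of arms
alpha = 20                           # smoothness parameter (number of reward chunks)
tau_max = 100                        # rounds over which a pull's reward is spread
T = 10 ** 5                          # time horizon
R_bar = 100.0 * np.arange(1, K + 1)  # maximum reward of arm i: R̄^i = 100 i

def sample_aggregated_rewards(arm: int) -> np.ndarray:
    """One pull of `arm` (0-indexed): draw the alpha aggregated rewards
    Z^i_{t,k} ~ (R̄^i / alpha) * U([0, 1]), k = 1..alpha. How the k-th chunk is
    spread over its tau_max / alpha delayed rounds is an assumption for
    illustration, not a detail quoted from the paper."""
    return (R_bar[arm] / alpha) * rng.uniform(size=alpha)

# Example: total reward of one pull of the last arm (expected value R̄^K / 2 = 500).
print(sample_aggregated_rewards(arm=K - 1).sum())
```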