Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts
Authors: Giulia Romano, Andrea Agostini, Francesco Trovò, Nicola Gatti, Marcello Restelli
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also empirically evaluate their performance across a wide range of settings, both synthetically generated and from a real-world media recommendation problem. |
| Researcher Affiliation | Academia | Politecnico di Milano, Piazza Leonardo da Vinci 32, I-20133, Milan, Italy {giulia.romano, francesco1.trovo, nicola.gatti, marcello.restelli}@polimi.it, andrea1.agostini@mail.polimi.it |
| Pseudocode | Yes | Algorithm 1 TP-UCB-FR ... Algorithm 2 TP-UCB-EW (a hedged sketch of such an index appears after this table) |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Spotify Setting. We apply the TP-MAB approach to solve the user recommendation problem presented in Example 1, using a dataset by Spotify [Brost et al., 2019]. |
| Dataset Splits | No | The paper does not specify training, validation, or test dataset splits. It describes experimental settings and how data is sampled (e.g., for the Spotify setting, 'reward realizations x^i_t for the first N = 20 songs are sampled from the listening sessions of that playlist'). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud computing instances with their specs) used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. |
| Experiment Setup | Yes | Setting #1. At first, we evaluate the influence of the parameter α. We model K = 10 arms, whose maximum reward is s.t. R̄^i = 100·i. The reward is collected over τ_max = 100 rounds, the smoothness parameter is α = 20, and the aggregated rewards are s.t. Z^i_{t,k} ∼ (R̄^i/α) · U([0, 1]), for each k ∈ [α]. We run the algorithms over a time horizon of T = 10^5 and average the results over 50 independent runs. ... Spotify Setting. We select the K = 6 most played playlists as the arms to be recommended, and each time a playlist i is selected, the corresponding reward realizations x^i_t for the first N = 20 songs are sampled from the listening sessions of that playlist contained in the dataset. ... the maximum delay is τ_max = 4N = 80, and the smoothness parameter is α = 20. More details on the setting and the distributions of the reward for each playlist are provided in Appendix C. We average the results over 50 independent runs. (A runnable simulation of Setting #1's reward generator is sketched below.) |
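
As referenced in the Pseudocode row, the paper's Algorithm 1 (TP-UCB-FR) builds optimistic estimates from rewards that are only partially observed at decision time. The sketch below is a simplified, hypothetical rendering of that idea in Python, not the paper's pseudocode: the class name `PartialFeedbackUCB`, the optimistic-completion rule (pending reward pieces filled at the maximum per-round rate), and the Hoeffding-style exploration bonus are all illustrative assumptions.

```python
import numpy as np

class PartialFeedbackUCB:
    """Hypothetical UCB-style learner for temporally-partitioned rewards.

    Each pull of arm i releases its reward in pieces over the next tau_max
    rounds. Pulls whose window is still open are completed optimistically,
    filling the unseen rounds at the maximum per-round rate r_max[i]/tau_max.
    This mirrors the spirit of TP-UCB-FR but is NOT the paper's algorithm.
    """

    def __init__(self, n_arms, r_max, tau_max):
        self.n_arms, self.tau_max = n_arms, tau_max
        self.r_max = np.asarray(r_max, dtype=float)  # per-arm max cumulative reward
        self.pulls = np.zeros(n_arms, dtype=int)
        self.pull_log = []  # one [arm, round_pulled, reward_seen_so_far] per pull
        self.t = 0

    def select_arm(self):
        self.t += 1
        # Initialisation: pull every arm once before using the index.
        for i in range(self.n_arms):
            if self.pulls[i] == 0:
                return self._record(i)
        total = np.zeros(self.n_arms)
        for arm, t0, seen in self.pull_log:
            pending = max(self.tau_max - (self.t - t0), 0)
            # Optimistic completion of the still-unobserved reward pieces.
            total[arm] += seen + pending * self.r_max[arm] / self.tau_max
        mean = total / self.pulls
        bonus = self.r_max * np.sqrt(2.0 * np.log(self.t) / self.pulls)
        return self._record(int(np.argmax(mean + bonus)))

    def observe(self, pull_index, partial_reward):
        """Credit a newly revealed reward piece to a past pull."""
        self.pull_log[pull_index][2] += partial_reward

    def _record(self, arm):
        self.pulls[arm] += 1
        self.pull_log.append([arm, self.t, 0.0])
        return arm
```

In a full simulation loop, the environment would call `observe` once per round for every past pull whose reward window is still open, so each pull's estimate tightens as its pieces arrive.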
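
The synthetic Setting #1 quoted in the Experiment Setup row is concrete enough to simulate. The minimal sketch below generates one pull's temporally-partitioned reward stream under that setting; splitting each group's aggregate uniformly over its φ = τ_max/α rounds is an assumption made here for illustration, since the quoted setup only constrains the per-group aggregates Z^i_{t,k}.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10                               # number of arms
TAU_MAX = 100                        # rounds over which one pull's reward arrives
ALPHA = 20                           # smoothness parameter (number of reward groups)
PHI = TAU_MAX // ALPHA               # rounds per group: phi = tau_max / alpha = 5
R_MAX = 100.0 * np.arange(1, K + 1)  # maximum cumulative reward, R̄^i = 100·i

def sample_reward_stream(arm):
    """One pull's per-round reward vector of length tau_max.

    Each of the alpha groups draws an aggregate Z ~ (R̄^i / alpha) * U([0, 1]),
    matching the quoted setting; the uniform split of Z over the group's phi
    rounds is an illustrative assumption.
    """
    z = (R_MAX[arm] / ALPHA) * rng.uniform(size=ALPHA)  # group aggregates Z^i_{t,k}
    return np.repeat(z / PHI, PHI)                      # spread each group over phi rounds

stream = sample_reward_stream(arm=9)  # arm 10 in the paper's 1-based indexing
print(stream.shape, stream.sum())     # (100,), total <= R̄^10 = 1000
```

Summing a stream recovers the pull's cumulative reward, which by construction never exceeds the arm's maximum R̄^i.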