Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts

Authors: Giulia Romano, Andrea Agostini, Francesco Trovò, Nicola Gatti, Marcello Restelli

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also empirically evaluate their performance across a wide range of settings, both synthetically generated and from a real-world media recommendation problem.
Researcher Affiliation | Academia | Politecnico di Milano, Piazza Leonardo da Vinci 32, I-20133, Milan, Italy; {giulia.romano, francesco1.trovo, nicola.gatti, marcello.restelli}@polimi.it, andrea1.agostini@mail.polimi.it
Pseudocode | Yes | Algorithm 1 TP-UCB-FR ... Algorithm 2 TP-UCB-EW (the pseudocode itself is not reproduced in this report; a generic UCB index sketch follows the table)
Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the described methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | Spotify Setting. We apply the TP-MAB approach to solve the user recommendation problem presented in Example 1, using a dataset by Spotify [Brost et al., 2019].
Dataset Splits | No | The paper does not specify training, validation, or test dataset splits. It describes experimental settings and how data is sampled (e.g., for the Spotify setting, 'reward realizations x^i_t for the first N = 20 songs is sampled from the listening sessions of that playlist').
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud computing instances with their specs) used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | Setting #1. At first, we evaluate the influence of the parameter α. We model K = 10 arms, whose maximum reward is s.t. R̄^i = 100 i. The reward is collected over τmax = 100 rounds, the smoothness parameter is α = 20, and the aggregated rewards are s.t. Z^i_{t,k} ∼ (R̄^i / α) · U([0, 1]), for each k ∈ [α]. We run the algorithms over a time horizon of T = 10^5 and average the results over 50 independent runs. ... Spotify Setting. We select the K = 6 most played playlists as the arms to be recommended, and each time a playlist i is selected, the corresponding reward realizations x^i_t for the first N = 20 songs are sampled from the listening sessions of that playlist contained in the dataset. ... the maximum delay is τmax = 4N = 80, and the smoothness parameter is α = 20. More details on the setting and the distributions of the reward for each playlist are provided in Appendix C. We average the results over 50 independent runs. (A simulation sketch of Setting #1 follows the table.)
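
Pseudocode row: the two listings named above (TP-UCB-FR and TP-UCB-EW) are not reproduced in this report. As orientation only, the sketch below shows the classic UCB1 index rule that UCB-style bandit policies build on; it is a generic illustration, not the paper's pseudocode, and every name in it is hypothetical.

```python
import math

def ucb1_pick_arm(pull_counts, reward_sums, t):
    """Classic UCB1 index rule (Auer et al., 2002): pick the arm maximizing
    empirical mean + sqrt(2 ln t / n_i). Illustrative only; the paper's
    TP-UCB-FR / TP-UCB-EW instead build their indices from the
    temporally-partitioned (partial) reward feedback."""
    best_arm, best_index = 0, -math.inf
    for arm, n in enumerate(pull_counts):
        if n == 0:
            return arm  # pull each arm once before relying on the index
        index = reward_sums[arm] / n + math.sqrt(2.0 * math.log(t) / n)
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm
```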
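
Experiment Setup row: to make the quoted Setting #1 concrete, the sketch below samples one pull's aggregated rewards under the stated distribution. Only the numeric parameters and the distribution come from the excerpt; the function and variable names, and the assumption about how each of the α chunks maps onto the τmax delayed rounds, are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Setting #1 parameters as quoted above.
K = 10                               # number of arms
alpha = 20                           # smoothness parameter (number of reward chunks)
tau_max = 100                        # rounds over which a pull's reward is spread
T = 10 ** 5                          # time horizon
R_bar = 100.0 * np.arange(1, K + 1)  # maximum reward of arm i: R̄^i = 100 i

def sample_aggregated_rewards(arm: int) -> np.ndarray:
    """One pull of `arm` (0-indexed): draw the alpha aggregated rewards
    Z^i_{t,k} ~ (R̄^i / alpha) * U([0, 1]), k = 1..alpha. How the k-th chunk is
    spread over its tau_max / alpha delayed rounds is an assumption for
    illustration, not a detail quoted from the paper."""
    return (R_bar[arm] / alpha) * rng.uniform(size=alpha)

# Example: total reward of one pull of the last arm (expected value R̄^K / 2 = 500).
print(sample_aggregated_rewards(arm=K - 1).sum())
```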