Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Influential Bandits: Pulling an Arm May Change the Environment
Authors: Ryoma Sato, Shinji Ito
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on both synthetic and real-world datasets demonstrate the presence of inter-arm influence and confirm the superior performance of our method compared to conventional bandit algorithms. |
| Researcher Affiliation | Academia | Ryoma Sato (EMAIL), National Institute of Informatics; Shinji Ito (EMAIL), The University of Tokyo and RIKEN |
| Pseudocode | Yes | Algorithm 1: Influential LCB |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the methodology described is publicly available. |
| Open Datasets | Yes | We use the MovieLens-32M dataset [12]. We view each user as solving a multi-armed bandit problem, treating each user as an independent instance. A user selects a movie, watches it, and the user's preference is revealed as the loss. Since there are too many movies, and each user selects a movie at most once, we consider a movie genre as an arm rather than an individual movie. Therefore K = #genres = 20. Since a movie can have multiple genres, we randomly select one of its genres and assign it to the movie. We define the loss as (5 − rating), since the 5-star rating is recorded in this dataset. The history of genre selections and ratings defines log data of a bandit problem. [12] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):19:1–19:19, 2016. |
| Dataset Splits | Yes | We use a leave-one-out validation approach. Specifically, we sort each user's history in chronological order and fit the model using all but the last rating. The final rating is then predicted using the trained model. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not mention any specific software dependencies with version numbers. |
| Experiment Setup | Yes | In contrast, the influential bandits model assumes that l(t) ≈ l(1) + A x(t), and we fit l(1) ∈ R^K and A ∈ R^{K×K} to the data. Since A is positive semi-definite, we rewrite the model as l(t) ≈ l(1) + BB^T x(t) and optimize l(1) and B ∈ R^{K×K} as parameters. The number of parameters to estimate is thus (K + K^2) = 420. We minimize the squared error between the observed values and the predicted values using gradient descent with momentum. |
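The dataset construction quoted in the Open Datasets and Dataset Splits rows can be sketched as follows: each user is one bandit instance, arms are the 20 movie genres (a multi-genre movie is assigned one of its genres uniformly at random), the loss is (5 − rating), and validation is leave-one-out over the chronologically sorted history. The record layout and function name here are assumptions for illustration, not the MovieLens-32M schema.

```python
import random

def build_bandit_log(ratings, seed=0):
    """ratings: list of (timestamp, movie_genres, rating) tuples for one user,
    where movie_genres lists the genres tagged on the rated movie.
    Returns the user's bandit log as (arm, loss) pairs in chronological order."""
    rng = random.Random(seed)
    log = []
    for ts, genres, rating in sorted(ratings):  # chronological order
        arm = rng.choice(genres)                # pick one genre as the arm
        loss = 5 - rating                       # 5-star rating -> loss
        log.append((arm, loss))
    return log

# Toy usage with made-up records.
log = build_bandit_log([
    (1, ["Comedy", "Romance"], 4.0),
    (2, ["Action"], 2.5),
])

# Leave-one-out split: fit on all but the last rating, predict the last one.
train, heldout = log[:-1], log[-1]
```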
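The fitting procedure in the Experiment Setup row can be sketched as below: model the loss vector as l(t) ≈ l(1) + BB^T x(t) (so A = BB^T is positive semi-definite) and fit l(1) ∈ R^K and B ∈ R^{K×K} by gradient descent with momentum on the squared error, with K = 20 giving K + K^2 = 420 parameters. The synthetic data, step size, and iteration count are assumptions for demonstration, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 20, 500                                  # K = 20 genres, T logged rounds

# Ground truth for the simulation: base losses and a PSD influence matrix.
l1_true = rng.uniform(0.0, 5.0, size=K)
B_true = 0.05 * rng.standard_normal((K, K))
A_true = B_true @ B_true.T

# Logged data: arms[t] is the pulled arm, x[t] the pull counts before round t,
# y[t] the observed loss of the pulled arm at round t.
arms = rng.integers(0, K, size=T)
x = np.zeros((T, K))
counts = np.zeros(K)
for t, a in enumerate(arms):
    x[t] = counts
    counts[a] += 1
y = l1_true[arms] + (x @ A_true.T)[np.arange(T), arms]

# Gradient descent with momentum on the (scaled) squared error.
l1, B = np.zeros(K), 0.01 * rng.standard_normal((K, K))
v_l1, v_B = np.zeros(K), np.zeros((K, K))
lr, mom = 3e-3, 0.9
for _ in range(2000):
    resid = l1[arms] + (x @ (B @ B.T).T)[np.arange(T), arms] - y
    g_l1 = np.zeros(K)
    np.add.at(g_l1, arms, resid / T)            # gradient w.r.t. l(1)
    E = np.zeros((K, K))
    np.add.at(E, arms, resid[:, None] * x / T)  # gradient w.r.t. A = B B^T
    g_B = (E + E.T) @ B                         # chain rule through A = B B^T
    v_l1 = mom * v_l1 - lr * g_l1
    v_B = mom * v_B - lr * g_B
    l1 += v_l1
    B += v_B

mse = np.mean((l1[arms] + (x @ (B @ B.T).T)[np.arange(T), arms] - y) ** 2)
```

Parameterizing A through B keeps the fitted influence matrix symmetric positive semi-definite by construction, at the cost of a non-convex objective.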