Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MultiScale Contextual Bandits for Long Term Objectives
Authors: Richa Rastogi, Yuta Saito, Thorsten Joachims
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on three scenarios related to conversational systems (Anthropic Helpful Assistant [4] and a new simulator), and a conventional recommendation system (KuaiRand video streaming benchmark) for two and three timescale levels. |
| Researcher Affiliation | Academia | Richa Rastogi, Yuta Saito, Thorsten Joachims; Department of Computer Science, Cornell University |
| Pseudocode | Yes | We can now summarize our approach to learning a nested set of policies across multiple levels in Algorithm 1. The algorithm uses off-policy contextual bandits [10, 38, 33] at each level. Algorithm 1 is limited to two levels for conciseness of notation, but the full recursive procedure for an arbitrary number of levels is given in Appendix D.2. In the experiments, we will explore policy spaces with two and three levels. |
| Open Source Code | Yes | The simulator code can be found at https://github.com/RichRast/mspl. |
| Open Datasets | Yes | We conduct experiments on three scenarios related to conversational systems (Anthropic Helpful Assistant [4] and a new simulator), and a conventional recommendation system (KuaiRand video streaming benchmark). ... We use the KuaiRand dataset [12] to simulate two levels of short and long term feedback. |
| Dataset Splits | Yes | We use a training dataset of size 14,265 and test randomly selected 1,771 users. |
| Hardware Specification | Yes | We use NVIDIA RTX A6000-48GB GPUs for experiments in conversational systems and 24GB NVIDIA GeForce RTX 3090 for experiments in conventional recommender system. |
| Software Dependencies | No | No specific software dependencies with version numbers (like Python, PyTorch, or library versions) are mentioned. The paper mentions using Llama-2-7b-chat [39] and Hugging Face reward models, but not the versions of the frameworks or libraries used. |
| Experiment Setup | Yes | For $\pi^{L2}(\cdot \mid x^{L2})$, we use a 5 layer neural network with hidden dimension 256. We train using AdamW optimizer, with a batch size of 256, learning rate 1e-3, weight decay 1e-1 for 1000 epochs. We use the inverse temperature parameter β = 0.8 in Eq. (11). |
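
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch in plain Python, not the authors' code: the class name `UpperLevelPolicyConfig`, the helpers `layer_shapes` and `parameter_count`, and the input and output sizes (64 and 10) are hypothetical, and reading "5 layer" as five linear layers (four hidden layers of width 256 plus an output layer) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpperLevelPolicyConfig:
    """Values quoted in the Experiment Setup row (names are illustrative)."""
    num_layers: int = 5
    hidden_dim: int = 256
    optimizer: str = "AdamW"
    batch_size: int = 256
    learning_rate: float = 1e-3
    weight_decay: float = 1e-1
    epochs: int = 1000
    beta: float = 0.8  # inverse temperature used in Eq. (11) of the paper


def layer_shapes(cfg, input_dim, output_dim):
    """Return (in_features, out_features) for each linear layer.

    Assumption: "5 layer" means five linear layers in total, i.e. four
    hidden layers of width cfg.hidden_dim followed by one output layer.
    """
    dims = [input_dim] + [cfg.hidden_dim] * (cfg.num_layers - 1) + [output_dim]
    return list(zip(dims[:-1], dims[1:]))


def parameter_count(shapes):
    """Count weights plus biases over all linear layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


cfg = UpperLevelPolicyConfig()
# input_dim and output_dim are placeholders; the table does not report them.
shapes = layer_shapes(cfg, input_dim=64, output_dim=10)
total_params = parameter_count(shapes)
```

Keeping the values in one frozen dataclass makes it easy to log the exact setup next to results, which partly compensates for the missing software version information noted in the table.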