Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

MultiScale Contextual Bandits for Long Term Objectives

Authors: Richa Rastogi, Yuta Saito, Thorsten Joachims

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on three scenarios related to conversational systems (Anthropic Helpful Assistant [4] and a new simulator), and a conventional recommendation system (KuaiRand video streaming benchmark) for two and three timescale levels.
Researcher Affiliation | Academia | Richa Rastogi, Yuta Saito, Thorsten Joachims; Department of Computer Science, Cornell University
Pseudocode | Yes | We can now summarize our approach to learning a nested set of policies across multiple levels in Algorithm 1. The algorithm uses off-policy contextual bandits [10, 38, 33] at each level. Algorithm 1 is limited to two levels for conciseness of notation, but the full recursive procedure for an arbitrary number of levels is given in Appendix D.2. In the experiments, we will explore policy spaces with two and three levels.
Open Source Code | Yes | The simulator code can be found at https://github.com/RichRast/mspl.
Open Datasets | Yes | We conduct experiments on three scenarios related to conversational systems (Anthropic Helpful Assistant [4] and a new simulator), and a conventional recommendation system (KuaiRand video streaming benchmark). ... We use the KuaiRand dataset [12] to simulate two levels of short and long term feedback.
Dataset Splits | Yes | We use a training dataset of size 14,265 and test randomly selected 1,771 users.
Hardware Specification | Yes | We use NVIDIA RTX A6000-48GB GPUs for experiments in conversational systems and 24GB NVIDIA GeForce RTX 3090 for experiments in conventional recommender system.
Software Dependencies | No | No specific software dependencies with version numbers (like Python, PyTorch, or library versions) are mentioned. The paper mentions using Llama-2-7b-chat [39] and huggingface reward models, but not the software versions of the frameworks or libraries used.
Experiment Setup | Yes | For πL2(.|xL2), we use a 5 layer neural network with hidden dimension 256. We train using AdamW optimizer, with a batch size of 256, learning rate 1e-3, weight decay 1e-1 for 1000 epochs. We use the inverse temperature parameter β = 0.8 in Eq. (11).
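
For readers who want to carry the reported setup into their own scripts, the quoted hyperparameters can be collected in one small configuration object. The following is a minimal sketch, not the authors' code: the names `ExperimentSetup`, `steps_per_epoch`, and `total_steps` are hypothetical, and no training framework is assumed because the paper does not report one. Pairing this setup with the training set size of 14,265 from the Dataset Splits row is also an assumption made only for illustration, as is keeping the final partial batch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""

    num_layers: int = 5           # "5 layer neural network"
    hidden_dim: int = 256         # "hidden dimension 256"
    optimizer: str = "AdamW"      # optimizer name only; no framework is implied
    batch_size: int = 256
    learning_rate: float = 1e-3
    weight_decay: float = 1e-1
    epochs: int = 1000
    beta: float = 0.8             # inverse temperature referenced in Eq. (11) of the paper

    def steps_per_epoch(self, num_examples: int) -> int:
        # Assumption: the final partial batch is kept rather than dropped.
        return math.ceil(num_examples / self.batch_size)

    def total_steps(self, num_examples: int) -> int:
        # Optimizer updates over the whole run under the same assumption.
        return self.epochs * self.steps_per_epoch(num_examples)


setup = ExperimentSetup()

# Assumption: the 14,265-example training set belongs to the same experiment.
print(setup.steps_per_epoch(14_265))  # 56
print(setup.total_steps(14_265))      # 56000
```

Under these assumptions the run amounts to 56 updates per epoch and 56,000 updates in total; a different batching policy or dataset would change both figures.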