Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MultiScale Contextual Bandits for Long Term Objectives

Authors: Richa Rastogi, Yuta Saito, Thorsten Joachims

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct experiments on three scenarios related to conversational systems (Anthropic Helpful Assistant [4] and a new simulator), and a conventional recommendation system (KuaiRand video streaming benchmark) for two and three timescale levels.
Researcher Affiliation	Academia	Richa Rastogi Yuta Saito Thorsten Joachims Department of Computer Science Cornell University
Pseudocode	Yes	We can now summarize our approach to learning a nested set of policies across multiple levels in Algorithm 1. The algorithm uses off-policy contextual bandits [10, 38, 33] at each level. Algorithm 1 is limited to two levels for conciseness of notation, but the full recursive procedure for an arbitrary number of levels is given in Appendix D.2. In the experiments, we will explore policy spaces with two and three levels.
Open Source Code	Yes	The simulator code can be found at https://github.com/RichRast/mspl.
Open Datasets	Yes	We conduct experiments on three scenarios related to conversational systems (Anthropic Helpful Assistant [4] and a new simulator), and a conventional recommendation system (KuaiRand video streaming benchmark). ... We use the KuaiRand dataset [12] to simulate two levels of short and long term feedback.
Dataset Splits	Yes	We use a training dataset of size 14,265 and test randomly selected 1,771 users.
Hardware Specification	Yes	We use NVIDIA RTX A6000-48GB GPUs for experiments in conversational systems and 24GB NVIDIA GeForce RTX 3090 for experiments in conventional recommender system.
Software Dependencies	No	No specific software dependencies with version numbers (like Python, PyTorch, or library versions) are mentioned. The paper mentions using Llama-2-7b-chat [39] and huggingface reward models, but not the software versions of the frameworks or libraries used.
Experiment Setup	Yes	For πL2(. | xL2), we use a 5 layer neural network with hidden dimension 256. We train using AdamW optimizer, with a batch size of 256, learning rate 1e-3, weight decay 1e-1 for 1000 epochs. We use the inverse temperature parameter β = 0.8 in Eq. (11).
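
For readers who want to re-run the setup, the hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object. The sketch below is illustrative only and is not taken from the authors' repository. The class name `L2PolicyConfig`, the helper functions, and the input and output dimensions are hypothetical. It also assumes that "5 layer" means five linear layers and that the 14,265-example training set from the Dataset Splits row is the one used with this configuration, which the quoted text does not confirm.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class L2PolicyConfig:
    # Values as quoted in the Experiment Setup row.
    num_layers: int = 5
    hidden_dim: int = 256
    optimizer: str = "AdamW"
    batch_size: int = 256
    learning_rate: float = 1e-3
    weight_decay: float = 1e-1
    epochs: int = 1000
    beta: float = 0.8  # inverse temperature parameter in Eq. (11)


def layer_dims(cfg, input_dim, output_dim):
    """Return (in, out) sizes per linear layer.

    Assumption: "5 layer" means five linear layers, i.e. four hidden
    widths of cfg.hidden_dim between the input and the output.
    """
    widths = [input_dim] + [cfg.hidden_dim] * (cfg.num_layers - 1) + [output_dim]
    return list(zip(widths[:-1], widths[1:]))


def total_steps(cfg, num_examples):
    """Optimizer steps over the full run, keeping the last partial batch."""
    return math.ceil(num_examples / cfg.batch_size) * cfg.epochs


cfg = L2PolicyConfig()
# input_dim and output_dim are hypothetical; the row does not state them.
dims = layer_dims(cfg, input_dim=64, output_dim=10)
# Assumption: the 14,265 training examples belong to this experiment.
steps = total_steps(cfg, num_examples=14_265)
```

Under these assumptions, each epoch has 56 optimizer steps (55 full batches plus one partial batch), so the run would take 56,000 steps in total.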