Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Contextual Thompson Sampling via Generation of Missing Data

Authors: Kelly W Zhang, Tianhui Cai, Hongseok Namkoong, Daniel Russo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate a practical implementation of our framework in Sections 4 and 6. 6 Experiments. Problem setting. Throughout, T = 500, |A| = 10, outcomes Y are binary, R(y) = y, and Z has separate components Z^(a) ∈ ℝ² for each action. Our SYNTHETIC setting uses a Bayesian logistic regression data-generating process with contexts X ∈ ℝ⁵. Our SEMI-SYNTHETIC setting mimics a cold-start, news recommendation setting using the MIcrosoft News Dataset [Wu et al., 2020]; Z^(a) consists of article headline text, contexts X ∈ ℝ⁵ are user features, and Y ∈ {0, 1} represents whether user click on a recommendation. See Appendix B.1 for details. Results. As seen in Figure 6, TS-Gen outperforms other algorithms in both the SYNTHETIC and SEMI-SYNTHETIC settings. TS-Gen's superior performance compared to other algorithms that use the same pθ model (Greedy, Epsilon-Greedy, TS-Neural-Linear) validates the benefit of our generative approach to uncertainty quantification and decision-making. We conjecture TS-Gen's advantage compared to LinUCB and TS-Linear is attributable to our pretraining procedure and the [...] Figure 6: Cumulative regret averaged over 500 bandit tasks. Regret is against the best fitting policy in Π (logistic for synthetic and MLP-based for semi-synthetic). TS-Gen outperforms methods that use the same pθ model (Greedy, ϵ-Greedy, TS-Neural-Linear). Error bars (barely visible) denote 1 s.e.
Researcher Affiliation	Academia	Kelly W. Zhang, Department of Mathematics, Imperial College London; Tiffany (Tianhui) Cai, Department of Statistics, Columbia University; Hongseok Namkoong, Decision, Risk, and Operations, Columbia Business School; Daniel Russo, Decision, Risk, and Operations, Columbia Business School
Pseudocode	Yes	Algorithm 1: Generative Thompson Sampling; Algorithm 2: Offline training of a sequence model; Algorithm 3: Posterior sampling via autoregressive generation
Open Source Code	Yes	We also provide code in the supplementary materials and at https://github.com/namkoong-lab/ts-gen. We include code for our experiments in the supplementary materials and also at https://github.com/namkoong-lab/ts-gen.
Open Datasets	Yes	Our SEMI-SYNTHETIC setting mimics a cold-start, news recommendation setting using the MIcrosoft News Dataset [Wu et al., 2020]; Z^(a) consists of article headline text, contexts X ∈ ℝ⁵ are user features, and Y ∈ {0, 1} represents whether user click on a recommendation. See Appendix B.1 for details.
Dataset Splits	Yes	The dataset is split into training and validation sets where 10k actions are in each set. The training set is used for training pθ via gradient descent for 100 epochs, with loss from display (6). Note that, for approximating the distribution of X_t, we use the empirical distribution of 1000 contexts X's from the training set (no gradient descent training). In each training batch, we use bootstrap resampling, specifically, Algorithm 4. The validation set is for choosing best hyperparameters and training epoch. We optimize weights in pθ with the AdamW optimizer. We try learning rates {0.1, 0.01, 0.001} and choose the learning rate with the lowest validation loss, which is 0.01. We set weight decay to 0.01. The batch size is 500 actions a per batch. For offline training of pθ, we sample independent task action datasets {Z^(a), X_{1:N^(a)}, Y^(a)_{1:N^(a)}}. For Z^(a)'s, we use 104k headlines from the MIND dataset [Wu et al., 2020]; 20k are used for the training set, 10k are used for validation, and 74k are used for bandit evaluation.
Hardware Specification	Yes	We use a CPU cluster at Columbia GSB and request at most 50GB of memory per job. The semi-synthetic data generating process also involves evaluating two pre-trained text classifiers, and then caching their outputs and/or embeddings (DistilBERT embeddings + text classifier outputs in the semi-synthetic setting); this was done once on a single GPU at negligible time cost (several minutes). For online decision-making, we also use a CPU cluster at Columbia GSB and for each job we request at most 10GB of memory.
Software Dependencies	No	The paper mentions software like scikit-learn, DistilBERT, and other text classifiers but does not provide specific version numbers for these components, which are required for a 'Yes' answer to this question.
Experiment Setup	Yes	Problem setting. Throughout, T = 500, |A| = 10, outcomes Y are binary, R(y) = y, and Z has separate components Z^(a) ∈ ℝ² for each action. Our SYNTHETIC setting uses a Bayesian logistic regression data-generating process with contexts X ∈ ℝ⁵. Our SEMI-SYNTHETIC setting mimics a cold-start, news recommendation setting using the MIcrosoft News Dataset [Wu et al., 2020]; Z^(a) consists of article headline text, contexts X ∈ ℝ⁵ are user features, and Y ∈ {0, 1} represents whether user click on a recommendation. See Appendix B.1 for details. For offline training of pθ, we sample 20k independent task action datasets {Z^(a), X_{1:N^(a)}, Y^(a)_{1:N^(a)}} according to the data generating process from Appendix B.1.1; specifically, we use N^(a) = 1000 for all a. This dataset is split into training and validation sets where 10k actions are in each set. The training set is used for training pθ via gradient descent for 100 epochs, with loss from display (6). Note that, for approximating the distribution of X_t, we use the empirical distribution of 1000 contexts X's from the training set (no gradient descent training). In each training batch, we use bootstrap resampling, specifically, Algorithm 4. The validation set is for choosing best hyperparameters and training epoch. We optimize weights in pθ with the AdamW optimizer. We try learning rates {0.1, 0.01, 0.001} and choose the learning rate with the lowest validation loss, which is 0.01. We set weight decay to 0.01. The batch size is 500 actions a per batch. For MLP-based policies, we use the default MLP classifier implementation (including hyperparameters), also from scikit-learn [Pedregosa et al., 2011]. This is an MLP with one hidden layer of width 100, with ReLU activation, trained with Adam optimizer, with initial learning rate 0.001, and batch size 200. There is no early stopping or additional validation split.
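
The semi-synthetic split quoted in the Dataset Splits row (104k MIND headlines: 20k for training, 10k for validation, 74k for bandit evaluation) can be sketched as a disjoint index partition. This is a minimal illustration, not the authors' code: the function name `split_headline_indices`, the random shuffle, and the seed are assumptions, since the table does not say how headlines were assigned to each subset.

```python
import random

# Sizes quoted in the Dataset Splits row (MIND headlines).
N_TOTAL = 104_000
N_TRAIN = 20_000
N_VAL = 10_000
N_EVAL = 74_000


def split_headline_indices(n_total, n_train, n_val, n_eval, seed=0):
    """Partition range(n_total) into disjoint train/validation/evaluation lists.

    Hypothetical helper: the shuffle and the seed are assumptions made for
    illustration; the assignment procedure used in the paper is not given here.
    """
    if n_train + n_val + n_eval != n_total:
        raise ValueError("split sizes must sum to n_total")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    evaluation = indices[n_train + n_val:]
    return train, val, evaluation


train_idx, val_idx, eval_idx = split_headline_indices(
    N_TOTAL, N_TRAIN, N_VAL, N_EVAL
)
print(len(train_idx), len(val_idx), len(eval_idx))  # 20000 10000 74000
```

Keeping the evaluation headlines disjoint from the training and validation headlines means that the bandit tasks are built from articles that were not used to fit pθ, which matches the cold-start framing in the quoted text.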