Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Pessimism for Offline Linear Contextual Bandits using $\ell_p$ Confidence Sets
Authors: Gene Li, Cong Ma, Nati Srebro
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Figure 2, we consider a simple offline linear contextual bandit in which there is a single state and the feature set is $\mathbb{B}^d_2$; thus the policy learning problem is equivalent to finding a vector $\pi \in \mathbb{S}^{d-1}$ that maximizes $V(\pi) := \pi^\top \theta^\star$. We vary the offline dataset distribution and the hidden parameter $\theta^\star$. When $\theta^\star$ is basis-aligned, we have $C_1 \ll d\,C_2$; when $\theta^\star$ is not basis-aligned, the two quantities are on the same order. |
| Researcher Affiliation | Academia | Gene Li, Toyota Technological Institute at Chicago; Cong Ma, Department of Statistics, University of Chicago; Nathan Srebro, Toyota Technological Institute at Chicago |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states in its checklist: "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We describe the experiment we ran in sufficient detail; it can be replicated in a few dozen lines of code." However, the paper gives no URL or explicit statement that code is available, either in the main text or in any accessible supplementary material. |
| Open Datasets | No | The paper mentions using "fixed historical data" in the offline setting, denoted $\mathcal{D} := \{(s_i, a_i, r_i)\}_{i=1}^n$, but it does not name or provide access information for any publicly available dataset. |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits, percentages, or sample counts. |
| Hardware Specification | Yes | Experiments were run on a laptop. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or programming languages used. |
| Experiment Setup | Yes | In Figure 2, we consider a simple offline linear contextual bandit in which there is a single state and the feature set is $\mathbb{B}^d_2$; thus the policy learning problem is equivalent to finding a vector $\pi \in \mathbb{S}^{d-1}$ that maximizes $V(\pi) := \pi^\top \theta^\star$. We vary the offline dataset distribution and the hidden parameter $\theta^\star$. When $\theta^\star$ is basis-aligned, we have $C_1 \ll d\,C_2$; when $\theta^\star$ is not basis-aligned, the two quantities are on the same order. (a) $\phi_i \sim \mathcal{N}(0, QDQ^\top)$ and $\theta^\star = Qe_{20}$, where $Q$ is a random rotation matrix and $D$ is a diagonal matrix with entries $D_{ii} = i^{-1} / \sum_j j^{-1}$. (b) $\phi_i \sim \mathcal{N}(0, D)$ and $\theta^\star = e_{20}$. (c) Computed average values for $C_1$ and $d\,C_2$. The quantity $C_2$ is identical in both plots (a) and (b). For (a), $C_1 \approx d\,C_2$, while for (b), $C_1 \ll d\,C_2$. Values are averaged over 100 trials, with 90% confidence intervals. |
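The Figure 2 caption specifies the feature distributions precisely enough to sketch the data generation for panels (a) and (b). The following is a minimal NumPy sketch under stated assumptions, not the authors' code: the dimension `d = 50` and sample size `n = 1000` are hypothetical (the caption only implies $d \ge 20$ through $e_{20}$), the QR-based construction of the random matrix $Q$ is our choice, and rewards and the quantities $C_1$, $C_2$ are omitted because their definitions are not reproduced on this page. The helper names are hypothetical. The sketch also encodes the standard fact that, on the unit sphere, $V(\pi) = \pi^\top \theta^\star$ is maximized by $\pi^\star = \theta^\star / \lVert \theta^\star \rVert_2$ (Cauchy-Schwarz).

```python
import numpy as np


def harmonic_diagonal(d):
    """Diagonal entries D_ii = i^{-1} / sum_j j^{-1}, i = 1..d (they sum to 1)."""
    inv = 1.0 / np.arange(1, d + 1)
    return inv / inv.sum()


def random_orthogonal(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix.

    Assumption: the source only says "random rotation matrix"; flipping column
    signs by sign(diag(R)) makes Q Haar-distributed over orthogonal matrices.
    """
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # scales column j by sign(R_jj)


def make_features(n, d, rotated, rng):
    """Sample n offline feature vectors and the hidden parameter theta*.

    rotated=True  -> panel (a): phi_i ~ N(0, Q D Q^T), theta* = Q e_20
    rotated=False -> panel (b): phi_i ~ N(0, D),       theta* = e_20
    """
    if d < 20:
        raise ValueError("e_20 requires d >= 20")
    e20 = np.zeros(d)
    e20[19] = 1.0  # e_20 in the paper's 1-based indexing
    # Scaling standard normals by sqrt(D_ii) gives rows with covariance D.
    phi = rng.standard_normal((n, d)) * np.sqrt(harmonic_diagonal(d))
    theta = e20
    if rotated:
        q = random_orthogonal(d, rng)
        phi = phi @ q.T  # row covariance becomes Q D Q^T
        theta = q @ e20
    return phi, theta


def optimal_policy(theta):
    """Maximizer of V(pi) = pi^T theta over the unit sphere."""
    return theta / np.linalg.norm(theta)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for rotated in (True, False):
        phi, theta = make_features(n=1000, d=50, rotated=rotated, rng=rng)
        pi_star = optimal_policy(theta)
        print("rotated" if rotated else "aligned", phi.shape, float(pi_star @ theta))
```

For both panels, $V(\pi^\star)$ equals $\lVert \theta^\star \rVert_2 = 1$, since $Q$ is orthogonal; the panels differ only in whether $\theta^\star$ lines up with a coordinate axis of the feature covariance, which is the basis-alignment property the caption contrasts.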