Pessimism for Offline Linear Contextual Bandits using $\ell_p$ Confidence Sets
Authors: Gene Li, Cong Ma, Nati Srebro
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Figure 2, we consider a simple offline linear contextual bandit in which there is a single state and the feature set is $\mathbb{B}_2^d$; thus the policy learning problem is equivalent to finding a vector $\pi \in \mathbb{S}^{d-1}$ that maximizes $V(\pi) := \pi^\top \theta^\star$. We vary the offline dataset distribution and the hidden parameter $\theta^\star$. When $\theta^\star$ is basis-aligned, we have $C_1 \ll \sqrt{d}\, C_2$; when $\theta^\star$ is not basis-aligned, the two quantities are on the same order. |
| Researcher Affiliation | Academia | Gene Li (Toyota Technological Institute at Chicago, gene@ttic.edu); Cong Ma (Department of Statistics, University of Chicago, congm@uchicago.edu); Nathan Srebro (Toyota Technological Institute at Chicago, nati@ttic.edu) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states in the NeurIPS paper checklist: "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We describe the experiment we ran in sufficient detail; it can be replicated in a few dozen lines of code." However, no specific URL or explicit statement of code availability is provided in the main paper or in any accessible supplementary section. |
| Open Datasets | No | The paper mentions using "fixed historical data" in the offline setting, denoted $\mathcal{D} := \{(s_i, a_i, r_i)\}_{i=1}^{n}$, but it does not specify or provide access information for any publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits, percentages, or sample counts. |
| Hardware Specification | Yes | Experiments were run on a laptop. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or programming languages used. |
| Experiment Setup | Yes | In Figure 2, we consider a simple offline linear contextual bandit in which there is a single state and the feature set is $\mathbb{B}_2^d$; thus the policy learning problem is equivalent to finding a vector $\pi \in \mathbb{S}^{d-1}$ that maximizes $V(\pi) := \pi^\top \theta^\star$. We vary the offline dataset distribution and the hidden parameter $\theta^\star$. When $\theta^\star$ is basis-aligned, we have $C_1 \ll \sqrt{d}\, C_2$; when $\theta^\star$ is not basis-aligned, the two quantities are on the same order. (a) $\phi_i \sim \mathcal{N}(0, Q D Q^\top)$ and $\theta^\star = Q e_{20}$, where $Q$ is a random rotation matrix and $D$ is a diagonal matrix with entries $D_{ii} = i^{-1} / (\sum_j j^{-1})$. (b) $\phi_i \sim \mathcal{N}(0, D)$ and $\theta^\star = e_{20}$. (c) Computed average values for $C_1$ and $\sqrt{d}\, C_2$; the quantity $C_2$ is identical in plots (a) and (b). For (a), $C_1 \asymp \sqrt{d}\, C_2$, while for (b), $C_1 \ll \sqrt{d}\, C_2$. Values are averaged over 100 trials, with 90% confidence intervals. (A minimal data-generation sketch for this setup appears after the table.) |
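The "Experiment Setup" row fully specifies the data-generating process for Figure 2, and the paper's checklist response notes the experiment "can be replicated in a few dozen lines of code." Below is a minimal sketch of that data-generation step only, not the authors' code: the dimension $d = 20$ (so that $e_{20}$ is the last basis vector), the sample size $n$, and drawing $Q$ from a QR decomposition are illustrative assumptions, and the coverage quantities $C_1$ and $C_2$ plotted in Figure 2(c) are not implemented because their definitions are not quoted in this report.

```python
import numpy as np

# Sketch of the Figure 2 data-generating process (not the authors' code).
# Assumed here, not stated in the excerpt: d = 20, n = 1000, and Q obtained
# from the QR decomposition of a Gaussian matrix (a random orthogonal matrix,
# standing in for the paper's "random rotation matrix").
rng = np.random.default_rng(0)
d, n = 20, 1000

# Diagonal covariance with D_ii = i^{-1} / sum_j j^{-1}, i = 1, ..., d.
inv = 1.0 / np.arange(1, d + 1)
D = np.diag(inv / inv.sum())

# Random orthogonal matrix Q and the basis vector e_20 (last coordinate when d = 20).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
e_last = np.eye(d)[d - 1]

def make_instance(basis_aligned: bool):
    """Return (offline features, theta*) for setting (b) if basis_aligned, else (a)."""
    if basis_aligned:                    # setting (b): phi_i ~ N(0, D), theta* = e_20
        cov, theta_star = D, e_last
    else:                                # setting (a): phi_i ~ N(0, Q D Q^T), theta* = Q e_20
        cov, theta_star = Q @ D @ Q.T, Q @ e_last
    phi = rng.multivariate_normal(np.zeros(d), cov, size=n)
    return phi, theta_star

def value(pi, theta_star):
    """Single-state linear bandit value V(pi) = pi^T theta* for pi on the unit sphere."""
    return float(pi @ theta_star)

phi_a, theta_a = make_instance(basis_aligned=False)
phi_b, theta_b = make_instance(basis_aligned=True)

# Empirical feature covariance of the offline data; any coverage quantity such as
# C_1 or C_2 (defined in the paper, not reproduced here) would be built from this.
Sigma_hat_a = phi_a.T @ phi_a / n

# The best policy is the unit vector aligned with theta*, so V(pi*) = ||theta*||_2 = 1.
print(value(theta_a / np.linalg.norm(theta_a), theta_a))
print(value(theta_b / np.linalg.norm(theta_b), theta_b))
```

The QR-based draw of $Q$ is just a convenient way to obtain a random orthogonal matrix; any sampling method would do, since the point of setting (a) is only that $\theta^\star$ and the covariance eigenbasis are rotated away from the standard basis.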