Pessimism for Offline Linear Contextual Bandits using $\ell_p$ Confidence Sets

Authors: Gene Li, Cong Ma, Nati Srebro

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Figure 2, we consider a simple offline linear contextual bandit in which there is a single state and the feature set is the unit ball $\mathcal{B}_2^d$; thus the policy learning problem is equivalent to finding a vector $\pi \in \mathcal{S}^{d-1}$ that maximizes $V(\pi) := \pi^\top \theta^\star$. We vary the offline dataset distribution and the hidden parameter $\theta^\star$. When $\theta^\star$ is basis-aligned, we have $C_1 \ll d\,C_2$; when $\theta^\star$ is not basis-aligned, the two quantities are on the same order.
Researcher Affiliation | Academia | Gene Li, Toyota Technological Institute at Chicago (gene@ttic.edu); Cong Ma, Department of Statistics, University of Chicago (congm@uchicago.edu); Nathan Srebro, Toyota Technological Institute at Chicago (nati@ttic.edu)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper checklist states: "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We describe the experiment we ran in sufficient detail; it can be replicated in a few dozen lines of code." However, no specific URL or explicit statement of code availability is provided in the main paper or in any accessible supplementary section.
Open Datasets | No | The paper mentions learning from "fixed historical data" in the offline setting, denoted $\mathcal{D} := \{(s_i, a_i, r_i)\}_{i=1}^{n}$, but it does not specify or provide access information for any publicly available or open dataset.
Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits, percentages, or sample counts.
Hardware Specification | Yes | Experiments were run on a laptop.
Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or programming languages used.
Experiment Setup | Yes | In Figure 2, we consider a simple offline linear contextual bandit in which there is a single state and the feature set is the unit ball $\mathcal{B}_2^d$; thus the policy learning problem is equivalent to finding a vector $\pi \in \mathcal{S}^{d-1}$ that maximizes $V(\pi) := \pi^\top \theta^\star$. We vary the offline dataset distribution and the hidden parameter $\theta^\star$. When $\theta^\star$ is basis-aligned, we have $C_1 \ll d\,C_2$; when $\theta^\star$ is not basis-aligned, the two quantities are on the same order. (a) $\phi_i \sim \mathcal{N}(0, Q D Q^\top)$ and $\theta^\star = Q e_{20}$, where $Q$ is a random rotation matrix and $D$ is a diagonal matrix with entries $D_{ii} = i^{-1} / (\sum_i i^{-1})$. (b) $\phi_i \sim \mathcal{N}(0, D)$ and $\theta^\star = e_{20}$. (c) Computed average values for $C_1$ and $d\,C_2$. The quantity $C_2$ is identical in both plots (a) and (b). For (a), $C_1 \asymp d\,C_2$, while for (b), $C_1 \ll d\,C_2$. Averaged over 100 trials, with 90% confidence intervals.
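
The checklist response quoted above says the experiment "can be replicated in a few dozen lines of code." Below is a minimal sketch of the data-generating setup from the Experiment Setup row, assuming NumPy and illustrative values for the dimension $d$ and the sample size $n$ (neither is stated in the text above). The paper's exact definitions of the coverage coefficients $C_1$ and $C_2$ are not reproduced here; the printed $\ell_2$-style coverage quantity is only an illustrative stand-in that happens to be rotation-invariant, consistent with the remark that $C_2$ is identical in settings (a) and (b).

```python
# Minimal sketch (not the authors' code) of the single-state offline linear
# bandit setup described above. d and n are assumed values.
import numpy as np

rng = np.random.default_rng(0)
d, n = 30, 10_000  # assumed dimension and dataset size (not stated above)

# Diagonal covariance with entries D_ii = i^{-1} / sum_i i^{-1}, as in the caption.
inv_i = 1.0 / np.arange(1, d + 1)
D = np.diag(inv_i / inv_i.sum())

# Random orthogonal matrix Q (QR decomposition of a Gaussian matrix).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Setting (a): phi_i ~ N(0, Q D Q^T), theta* = Q e_20  (not basis-aligned).
# Setting (b): phi_i ~ N(0, D),       theta* = e_20    (basis-aligned).
e20 = np.zeros(d)
e20[19] = 1.0
settings = {"a": (Q @ D @ Q.T, Q @ e20), "b": (D, e20)}

for name, (Sigma, theta_star) in settings.items():
    # Offline feature dataset phi_1, ..., phi_n and its empirical covariance.
    phi = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    Sigma_hat = phi.T @ phi / n
    # Single state: the value of a policy pi on the unit sphere is
    # V(pi) = pi^T theta*, maximized by pi* = theta* / ||theta*||.
    pi_star = theta_star / np.linalg.norm(theta_star)
    # theta*^T Sigma_hat^{-1} theta* is a standard l2-style single-policy
    # coverage quantity; it is rotation-invariant, so it coincides across
    # (a) and (b).  (Illustrative stand-in only, not necessarily the paper's C_2.)
    cov2 = theta_star @ np.linalg.solve(Sigma_hat, theta_star)
    print(f"setting ({name}): V(pi*) = {pi_star @ theta_star:.3f}, "
          f"theta*^T Sigma_hat^-1 theta* = {cov2:.2f}")
```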