Pessimism for Offline Linear Contextual Bandits using $\ell_p$ Confidence Sets
Authors: Gene Li, Cong Ma, Nati Srebro
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Figure 2, we consider a simple offline linear contextual bandit in which there is a single state and the feature set is $\mathbb{B}_2^d$; thus the policy learning problem is equivalent to finding a vector $\pi \in \mathbb{S}^{d-1}$ that maximizes $V(\pi) := \pi^\top \theta^\star$. We vary the offline dataset distribution and the hidden parameter $\theta^\star$. When $\theta^\star$ is basis-aligned, we have $C_1 \ll \sqrt{d}\, C_2$; when $\theta^\star$ is not basis-aligned, the two quantities are on the same order. |
| Researcher Affiliation | Academia | Gene Li (Toyota Technological Institute at Chicago, gene@ttic.edu); Cong Ma (Department of Statistics, University of Chicago, congm@uchicago.edu); Nathan Srebro (Toyota Technological Institute at Chicago, nati@ttic.edu) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states in the NeurIPS paper checklist: "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We describe the experiment we ran in sufficient detail; it can be replicated in a few dozen lines of code." However, no specific URL or explicit statement of code availability is provided in the main paper or in any accessible supplementary section. |
| Open Datasets | No | The paper mentions using "fixed historical data" in the offline setting, denoted $\mathcal{D} := \{(s_i, a_i, r_i)\}_{i=1}^{n}$, but it does not specify or provide access information for any publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits, percentages, or sample counts. |
| Hardware Specification | Yes | Experiments were run on a laptop. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or programming languages used. |
| Experiment Setup | Yes | In Figure 2, we consider a simple offline linear contextual bandit in which there is a single state and the feature set is $\mathbb{B}_2^d$; thus the policy learning problem is equivalent to finding a vector $\pi \in \mathbb{S}^{d-1}$ that maximizes $V(\pi) := \pi^\top \theta^\star$. We vary the offline dataset distribution and the hidden parameter $\theta^\star$. When $\theta^\star$ is basis-aligned, we have $C_1 \ll \sqrt{d}\, C_2$; when $\theta^\star$ is not basis-aligned, the two quantities are on the same order. (a) $\phi_i \sim \mathcal{N}(0, Q D Q^\top)$ and $\theta^\star = Q e_{20}$, where $Q$ is a random rotation matrix and $D$ is a diagonal matrix with entries $D_{ii} = i^{-1} / (\sum_j j^{-1})$. (b) $\phi_i \sim \mathcal{N}(0, D)$ and $\theta^\star = e_{20}$. (c) Computed average values for $C_1$ and $\sqrt{d}\, C_2$; the quantity $C_2$ is identical in plots (a) and (b). For (a), $C_1 \asymp \sqrt{d}\, C_2$, while for (b), $C_1 \ll \sqrt{d}\, C_2$. Values are averaged over 100 trials, with 90% confidence intervals. (A minimal data-generation sketch for this setup appears after the table.) |
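The "Experiment Setup" row fully specifies the data-generating process for Figure 2, and the paper's checklist response notes the experiment "can be replicated in a few dozen lines of code." Below is a minimal sketch of that data-generation step only, not the authors' code: the dimension $d = 20$ (so that $e_{20}$ is the last basis vector), the sample size $n$, and drawing $Q$ from a QR decomposition are illustrative assumptions, and the coverage quantities $C_1$ and $C_2$ plotted in Figure 2(c) are not implemented because their definitions are not quoted in this report.

```python
import numpy as np

# Sketch of the Figure 2 data-generating process (not the authors' code).
# Assumed here, not stated in the excerpt: d = 20, n = 1000, and Q obtained
# from the QR decomposition of a Gaussian matrix (a random orthogonal matrix,
# standing in for the paper's "random rotation matrix").
rng = np.random.default_rng(0)
d, n = 20, 1000

# Diagonal covariance with D_ii = i^{-1} / sum_j j^{-1}, i = 1, ..., d.
inv = 1.0 / np.arange(1, d + 1)
D = np.diag(inv / inv.sum())

# Random orthogonal matrix Q and the basis vector e_20 (last coordinate when d = 20).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
e_last = np.eye(d)[d - 1]

def make_instance(basis_aligned: bool):
    """Return (offline features, theta*) for setting (b) if basis_aligned, else (a)."""
    if basis_aligned:                    # setting (b): phi_i ~ N(0, D), theta* = e_20
        cov, theta_star = D, e_last
    else:                                # setting (a): phi_i ~ N(0, Q D Q^T), theta* = Q e_20
        cov, theta_star = Q @ D @ Q.T, Q @ e_last
    phi = rng.multivariate_normal(np.zeros(d), cov, size=n)
    return phi, theta_star

def value(pi, theta_star):
    """Single-state linear bandit value V(pi) = pi^T theta* for pi on the unit sphere."""
    return float(pi @ theta_star)

phi_a, theta_a = make_instance(basis_aligned=False)
phi_b, theta_b = make_instance(basis_aligned=True)

# Empirical feature covariance of the offline data; any coverage quantity such as
# C_1 or C_2 (defined in the paper, not reproduced here) would be built from this.
Sigma_hat_a = phi_a.T @ phi_a / n

# The best policy is the unit vector aligned with theta*, so V(pi*) = ||theta*||_2 = 1.
print(value(theta_a / np.linalg.norm(theta_a), theta_a))
print(value(theta_b / np.linalg.norm(theta_b), theta_b))
```

The QR-based draw of $Q$ is just a convenient way to obtain a random orthogonal matrix; any sampling method would do, since the point of setting (a) is only that $\theta^\star$ and the covariance eigenbasis are rotated away from the standard basis.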