Off-Policy Risk Assessment in Contextual Bandits

Authors: Audrey Huang, Liu Leqi, Zachary Lipton, Kamyar Azizzadenesheli

NeurIPS 2021

Reproducibility assessment (each entry lists the variable, its result, and the supporting LLM response):

Research Type: Experimental
"In this section, we give empirical evidence for the effectiveness of the doubly robust (DR) CDF and risk estimates, in comparison to the importance sampling (IS), weighted importance sampling (WIS), and direct method (DM) estimates. ... Finally, we present experiments that demonstrate the practical applicability [of] our estimators." (A hedged sketch of the IS and DR CDF estimators appears after the entries.)

Researcher Affiliation: Academia
Audrey Huang, Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, audreyh5@illinois.edu; Liu Leqi, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, leqil@cs.cmu.edu; Zachary C. Lipton, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, zlipton@cmu.edu; Kamyar Azizzadenesheli, Department of Computer Science, Purdue University, West Lafayette, IN 47907, kamyar@purdue.edu

Pseudocode: Yes
Algorithm 1: Off-Policy Risk Assessment (OPRA). (A hedged plug-in reading of the OPRA pipeline is sketched after the entries.)

Open Source Code: No
The paper does not contain a statement indicating that the source code for the described methodology is publicly available, nor does it provide a link to a code repository.

Open Datasets: Yes
"We apply this process to the Page Blocks and Opt Digits datasets [18] ..." [18] Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017. (A hedged dataset-loading sketch appears after the entries.)

Dataset Splits: No
The paper states that 'the dataset is divided into two splits, with each of the two splits used to calculate G via regression, which is then used with the other split to calculate the estimator', but it does not specify explicit training, validation, and test splits with percentages or sample counts for the main evaluation setup. (A sketch of this two-fold cross-fitting scheme appears after the entries.)

Hardware Specification: No
The paper does not specify any hardware details such as GPU or CPU models, memory, or specific computing environments used for running experiments.

Software Dependencies: Yes
"We use OPRA to evaluate the risks of a target policy for diabetes treatment in the Simglucose simulator [59]." [59] Jinyu Xie. Simglucose v0.2.1 (2018).

Experiment Setup: Yes
"The behavior policy is defined as β = απ + (1 − α)π_UNIF, where π_UNIF is a uniform policy over the actions, for some α ∈ (0, 1]. We apply this process to the Page Blocks and Opt Digits datasets [18], which have dimensions d and actions k, using α = 0.1 (Figure 1)." (A sketch of this mixture policy and the resulting importance weights appears after the entries.)
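
Sketches referenced above. None of the following is the authors' code; each block is a minimal illustration of the technique named in the corresponding entry, with all variable names and interfaces assumed. First, the IS and DR CDF estimators compared in the Research Type excerpt: the IS estimate importance-weights the empirical CDF, while the DR estimate adds an importance-weighted correction to a regression model G of the conditional CDF.

```python
import numpy as np

def is_cdf(rewards, weights, ys):
    """Importance-sampling CDF estimate: F(y) ~ (1/n) * sum_i w_i * 1{r_i <= y},
    where w_i = pi(a_i | x_i) / beta(a_i | x_i) is the importance weight."""
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(weights, dtype=float)
    return np.array([np.mean(w * (r <= y)) for y in ys])

def dr_cdf(rewards, weights, g_logged, g_target, ys):
    """Doubly robust CDF estimate: importance-weighted residual of a regression
    model G of the conditional CDF, plus G's prediction averaged over the
    target policy's actions.

    g_logged[i, j]: estimate of G(ys[j] | x_i, a_i) at the logged action.
    g_target[i, j]: estimate of E_{a ~ pi(. | x_i)} G(ys[j] | x_i, a).
    """
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(weights, dtype=float)
    indicators = r[:, None] <= np.asarray(ys, dtype=float)[None, :]
    correction = w[:, None] * (indicators - np.asarray(g_logged, dtype=float))
    return np.mean(correction + np.asarray(g_target, dtype=float), axis=0)
```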
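
The Pseudocode entry records only the name of Algorithm 1 (OPRA). A plausible plug-in reading of the pipeline, assumed rather than quoted, is: estimate the reward CDF off-policy (e.g., with dr_cdf above), then evaluate risk functionals of that CDF. The grid handling and the particular functionals below are illustrative choices.

```python
import numpy as np

def risks_from_cdf(ys, cdf, alpha=0.05):
    """Plug-in risk functionals of an estimated reward CDF on the grid ys.
    An IS/DR CDF estimate need not be a proper CDF, so we first monotonize
    it and renormalize its jump masses."""
    ys = np.asarray(ys, dtype=float)
    F = np.clip(np.maximum.accumulate(np.asarray(cdf, dtype=float)), 0.0, 1.0)
    dF = np.diff(np.concatenate(([0.0], F)))
    dF = dF / max(dF.sum(), 1e-12)
    mean = float(np.sum(ys * dF))
    variance = float(np.sum((ys - mean) ** 2 * dF))
    # alpha-quantile (VaR) and lower-tail CVaR at level alpha
    q_idx = min(int(np.searchsorted(F, alpha)), len(ys) - 1)
    tail = np.arange(len(ys)) <= q_idx
    cvar = float(np.sum(ys[tail] * dF[tail]) / max(dF[tail].sum(), 1e-12))
    return {"mean": mean, "variance": variance, "var": ys[q_idx], "cvar": cvar}
```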
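
The Dataset Splits entry quotes a two-split scheme for fitting the regression model G. A minimal sketch of that cross-fitting loop, where fit_g and dr_estimate are hypothetical callables standing in for the paper's regression and estimator steps:

```python
import numpy as np

def cross_fit(data, fit_g, dr_estimate, seed=0):
    """Two-fold cross-fitting: fit G on one split, evaluate the DR estimator
    on the other split, then average the two directions."""
    idx = np.random.default_rng(seed).permutation(len(data))
    half = len(data) // 2
    split_a, split_b = idx[:half], idx[half:]
    estimates = []
    for fit_idx, eval_idx in ((split_a, split_b), (split_b, split_a)):
        g = fit_g([data[i] for i in fit_idx])          # regression on one split
        estimates.append(dr_estimate([data[i] for i in eval_idx], g))
    return np.mean(estimates, axis=0)
```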
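
The Open Datasets entry names Page Blocks and Opt Digits from the UCI repository. The loader below assumes the standard classification-to-bandit conversion (actions are class labels; reward 1 for the true label, 0 otherwise) and assumes the datasets are mirrored on OpenML under the names shown; neither detail is quoted in the assessment.

```python
import numpy as np
from sklearn.datasets import fetch_openml

def load_bandit_dataset(name="optdigits"):
    """Load a UCI classification dataset via OpenML ('optdigits' and
    'page-blocks' are assumed OpenML names) and expose it as a contextual
    bandit: contexts are feature vectors, actions are the k class labels,
    and the reward for action a on example i is 1{a == y_i}."""
    bunch = fetch_openml(name=name, version=1, as_frame=False)
    X = np.asarray(bunch.data, dtype=float)
    labels, y = np.unique(bunch.target, return_inverse=True)  # actions 0..k-1
    def reward(i, action):
        return float(action == y[i])
    return X, y, len(labels), reward
```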
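
Finally, the Experiment Setup entry defines the behavior policy as a uniform mixture. Below is a minimal sketch of that construction and of the importance weights the IS/DR estimators consume; the convention that pi(x) returns a probability vector over the k actions is an assumption.

```python
import numpy as np

def make_behavior_policy(pi, alpha=0.1):
    """beta(. | x) = alpha * pi(. | x) + (1 - alpha) * uniform(k), matching
    the quoted setup (the experiments use alpha = 0.1)."""
    def beta(x):
        p = np.asarray(pi(x), dtype=float)
        return alpha * p + (1.0 - alpha) * np.full_like(p, 1.0 / p.size)
    return beta

def importance_weight(pi, beta, x, a):
    """IS weight w = pi(a | x) / beta(a | x); the uniform mixture keeps
    beta(a | x) >= (1 - alpha) / k > 0, so the weights stay bounded."""
    return float(np.asarray(pi(x), dtype=float)[a] / np.asarray(beta(x), dtype=float)[a])
```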