Robust Tests in Online Decision-Making

Authors: Gi-Soo Kim, Jane P. Kim, Hyun-Joon Yang
Pages: 10016-10024

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we propose a modified actor-critic algorithm which is robust to critic misspecification and derive a novel testing procedure for the actor parameters in this case. We conduct experiments on synthetic data and real data and show that our testing procedure appropriately assesses the significance of the parameters.
Researcher Affiliation | Academia | Gi-Soo Kim (1), Jane P. Kim (2), Hyun-Joon Yang (2). (1) Department of Industrial Engineering & Artificial Intelligence Graduate School, UNIST. (2) Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine.
Pseudocode | Yes | Algorithm 2: Actor-Improper Critic algorithm
Open Source Code | No | The paper does not state that the source code is available and does not link to a code repository.
Open Datasets | No | The paper mentions generating synthetic data and using the 'Recovery Record Dataset' but does not provide access information (link, DOI, formal citation for public access) for either. For the Recovery Record Dataset, it states: 'The Recovery Record Dataset contained patients' adherence behaviors to their therapy for eating disorders (daily meal monitoring) and interactions with their linked clinicians on the app.'
Dataset Splits | No | The paper discusses synthetic data and a real-world dataset but does not specify how these datasets were split into training, validation, or test sets. It mentions '30 bootstrap samples' for evaluation in the data application section, but this does not describe a standard data split.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper describes implementing the algorithms but does not name any software or version numbers.
Experiment Setup | Yes | We set $N = 2$ and $d = 4$. We generate the context vectors $b_{t,i}$ from a multivariate normal distribution $\mathcal{N}(0_d, I_{d \times d})$ and truncate them to have $L_2$-norm 1. We generate the reward from a model nonlinear in $b_{t,i}$, $r_{t,i} = b_{t,i}^T \mu - \max(b_{t,1}^T \mu, b_{t,2}^T \mu) + \eta_{t,i}$, where $\mu = (-0.577, 0.577, 0.577, 0)^T$ and $\eta_{t,i}$ is generated from $\mathcal{N}(0, 0.01^2)$ independently over arms and time. We set the exploration parameter $\lambda$ in the AC and Proposed algorithms to 0.001. We run the bandit algorithms until time horizon $T = 50$ with 100 repetitions. We repeated the evaluations on 30 bootstrap samples.
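
The synthetic setup quoted in the Experiment Setup row can be sketched in a few lines of Python. This is only an illustration of the data-generating process, not the authors' code, and it rests on several assumptions: the reward subtracts the max term, the first entry of µ is negative, the noise distribution has standard deviation 0.01, and "truncate" means rescaling a context onto the unit ball only when its L2-norm exceeds 1. All function and constant names are hypothetical.

```python
import math
import random

N_ARMS = 2                          # N in the quoted setup
DIM = 4                             # d in the quoted setup
MU = (-0.577, 0.577, 0.577, 0.0)    # assumption: first entry is negative
NOISE_SD = 0.01                     # assumption: noise variance is 0.01^2


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def draw_context(rng):
    """Draw b ~ N(0, I_d); rescale to norm 1 if its L2-norm exceeds 1.

    Assumption: this is one reading of "truncate them to have L2-norm 1".
    """
    b = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    norm = math.sqrt(dot(b, b))
    if norm > 1.0:
        b = [x / norm for x in b]
    return b


def simulate_round(rng):
    """Return (contexts, rewards) for one time step, one entry per arm."""
    contexts = [draw_context(rng) for _ in range(N_ARMS)]
    means = [dot(b, MU) for b in contexts]
    best = max(means)  # shared max term makes the reward nonlinear in b
    rewards = [m - best + rng.gauss(0.0, NOISE_SD) for m in means]
    return contexts, rewards


def simulate(horizon=50, seed=0):
    """Generate `horizon` rounds of contexts and rewards for all arms."""
    rng = random.Random(seed)
    return [simulate_round(rng) for _ in range(horizon)]
```

Calling `simulate(horizon=50, seed=s)` for 100 different seeds would mirror the quoted horizon and repetition count. The sketch returns rewards for every arm; a bandit algorithm would observe only the reward of the arm it pulls, and the actor-critic updates and the testing procedure are not reproduced here.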