Robust Tests in Online Decision-Making
Authors: Gi-Soo Kim, Jane P Kim, Hyun-Joon Yang (pp. 10016-10024)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose a modified actor-critic algorithm which is robust to critic misspecification and derive a novel testing procedure for the actor parameters in this case. We conduct experiments on synthetic data and real data and show that our testing procedure appropriately assesses the significance of the parameters. |
| Researcher Affiliation | Academia | Gi-Soo Kim (1), Jane P. Kim (2), Hyun-Joon Yang (2). (1) Department of Industrial Engineering & Artificial Intelligence Graduate School, UNIST; (2) Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine |
| Pseudocode | Yes | Algorithm 2: Actor-Improper Critic algorithm |
| Open Source Code | No | The paper does not provide any explicit statements about making the source code available or include a link to a code repository. |
| Open Datasets | No | The paper mentions generating synthetic data and using the 'Recovery Record Dataset' but does not provide access information (link, DOI, or formal citation for public access) for either. For the Recovery Record Dataset, it states: 'The Recovery Record Dataset contained patients' adherence behaviors to their therapy for eating disorders (daily meal monitoring) and interactions with their linked clinicians on the app.' |
| Dataset Splits | No | The paper discusses synthetic data and a real-world dataset but does not explicitly specify how these datasets were split into training, validation, or test sets. It mentions '30 bootstrap samples' for evaluation in the data application section, but this does not describe a standard data split. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions implementing algorithms and using methods, but it does not specify any software names with version numbers. |
| Experiment Setup | Yes | We set $N = 2$ and $d = 4$. We generate the context vectors $b_{t,i}$ from a multivariate normal distribution $\mathcal{N}(0_d, I_{d \times d})$ and truncate them to have $L_2$-norm 1. We generate the reward from a model nonlinear in $b_{t,i}$, $r_{t,i} = b_{t,i}^T \mu - \max(b_{t,1}^T \mu, b_{t,2}^T \mu) + \eta_{t,i}$, where $\mu = (-0.577, 0.577, 0.577, 0)^T$ and $\eta_{t,i}$ is generated from $\mathcal{N}(0, 0.01^2)$ independently over arms and time. We set the exploration parameter $\lambda$ in the AC and Proposed algorithms to 0.001. We run the bandit algorithms until time horizon $T = 50$ with 100 repetitions. We repeated the evaluations on 30 bootstrap samples. |
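
The synthetic data generation described in the Experiment Setup row can be sketched roughly as follows. This is a minimal illustration, not the authors' code (the paper releases none). It makes several assumptions: that "truncate" means rescaling only those context vectors whose norm exceeds 1, that the reward is the arm's linear term minus the largest linear term across arms, that the first entry of µ is negative, and that the noise standard deviation is 0.01. The function names `sample_context`, `sample_rewards`, and `simulate` are hypothetical.

```python
import math
import random

N_ARMS, DIM = 2, 4                     # N = 2 arms, d = 4 context dimensions
MU = (-0.577, 0.577, 0.577, 0.0)       # assumed sign pattern for mu
NOISE_SD = 0.01                        # assumed: noise variance 0.01^2


def sample_context(rng, dim=DIM):
    """Draw b ~ N(0, I_d) and cap its L2 norm at 1."""
    b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in b))
    # Assumed reading of "truncate": rescale only when the norm exceeds 1.
    scale = max(norm, 1.0)
    return [x / scale for x in b]


def sample_rewards(rng, contexts, mu=MU, noise_sd=NOISE_SD):
    """Reward per arm: its linear term minus the best linear term, plus noise."""
    linear = [sum(x * m for x, m in zip(b, mu)) for b in contexts]
    best = max(linear)
    return [v - best + rng.gauss(0.0, noise_sd) for v in linear]


def simulate(horizon=50, seed=0):
    """Generate contexts and rewards for all arms over one run of length T."""
    rng = random.Random(seed)
    contexts, rewards = [], []
    for _ in range(horizon):
        b_t = [sample_context(rng) for _ in range(N_ARMS)]
        contexts.append(b_t)
        rewards.append(sample_rewards(rng, b_t))
    return contexts, rewards
```

Calling `simulate(horizon=50, seed=s)` for 100 different seeds would mirror the stated 100 repetitions at T = 50. The bandit algorithms themselves (AC and Proposed, with exploration parameter 0.001) are not sketched here.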