Factored DRO: Factored Distributionally Robust Policies for Contextual Bandits
Authors: Tong Mu, Yash Chandak, Tatsunori B. Hashimoto, Emma Brunskill
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also introduce a practical algorithm and demonstrate promising empirical results in environments based on real-world datasets, such as voting outcomes and scene classification. We test and analyze our algorithm empirically in three settings, including two derived from real data. For all experiments, we use the practical algorithm discussed in Section 5. While our method works for any policy that can be optimized, we consider learning the parameters of a stochastic softmax linear contextual bandit policy of the form: ... We compare against three baselines: ... (a hedged sketch of a softmax linear policy appears after the table). |
| Researcher Affiliation | Academia | Tong Mu, Stanford University, tongm@cs.stanford.edu; Yash Chandak, University of Massachusetts, ychandak@cs.umass.edu; Tatsunori Hashimoto, Stanford University, thashim@stanford.edu; Emma Brunskill, Stanford University, ebrun@cs.stanford.edu |
| Pseudocode | Yes | Algorithm 1 in the Appendix provides our full algorithm for policy evaluation and optimization, that considers both the discrete binned case (Section 4.2) and the function approximation case just discussed. |
| Open Source Code | Yes | Code: https://github.com/StanfordAI4HI/FactoredDRO |
| Open Datasets | Yes | Scene Setting: We additionally test in a setting derived from the multiclass-supervised learning Scene classification dataset from the LibSVM [Chang and Lin, 2011] repository. Voting Setting: The voting dataset by Gerber et al. [2008] contains data collected from a randomized controlled trial-style study... |
| Dataset Splits | No | Table 1 provides 'n_total, (n_train, n_test)' splits for the datasets (e.g., '40K, (20K, 20K)'), indicating train and test sets, but there is no explicit mention of a separate validation split or of how hyperparameters were tuned. |
| Hardware Specification | No | Experiments were run on an internal cluster running GPUs. The total amount of compute time was about 200 GPU hours. This statement mentions GPUs but does not specify the exact GPU models (e.g., NVIDIA A100, Tesla V100) or other specific hardware components required to precisely replicate the environment. |
| Software Dependencies | No | The paper does not explicitly state the version numbers for any software dependencies, such as programming languages, libraries (e.g., PyTorch, TensorFlow), or other specific tools used for the experiments. |
| Experiment Setup | Yes | For the Adam optimizer we use an initial learning rate of 1e-3, with an exponential decay rate of 0.999 per epoch and a batch size of 100. For the policy we use a softmax temperature parameter of 1.0. For the approximate Taylor Expansion (Section 5.1), we use an order of 5, and use an Adam optimizer with learning rate 1e-4 and batch size 100. (A hedged sketch of this configuration appears after the table.) |
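
The Research Type row quotes the paper's use of a stochastic softmax linear contextual bandit policy but elides its exact expression. As context only, below is a minimal sketch of a generic softmax linear policy; the names `softmax_linear_policy`, `W`, and `context`, and the assumed shapes, are illustrative and not the paper's notation or code.

```python
import numpy as np

def softmax_linear_policy(context, W, temperature=1.0):
    """Action probabilities pi(a | x) for a generic softmax linear policy.

    Assumed shapes (illustrative only): context is a (d,) feature vector and
    W is an (n_actions, d) matrix of learnable policy parameters.
    """
    logits = (W @ context) / temperature
    logits = logits - logits.max()          # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()          # normalize to a probability vector
```

With the reported temperature of 1.0, this is simply a softmax over the linear scores W·x; the paper's actual parameterization may include additional features or bias terms.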
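
The Experiment Setup row lists the reported optimizer hyperparameters. A minimal PyTorch sketch of such a configuration is given below, assuming the linear softmax policy sketched above; the problem sizes and the `taylor_params` placeholder are assumptions, since the paper's released code is not reproduced here.

```python
import torch

# Assumed problem sizes (illustrative, not from the paper).
d, n_actions = 10, 4
temperature = 1.0                      # reported softmax temperature

# Learnable weights of a linear softmax policy.
W = torch.zeros(n_actions, d, requires_grad=True)

# Reported policy optimizer: Adam with initial learning rate 1e-3,
# exponential decay of 0.999 per epoch, and batch size 100.
policy_opt = torch.optim.Adam([W], lr=1e-3)
policy_sched = torch.optim.lr_scheduler.ExponentialLR(policy_opt, gamma=0.999)
batch_size = 100

# Reported optimizer for the order-5 approximate Taylor expansion (Section 5.1):
# Adam with learning rate 1e-4 and batch size 100. `taylor_params` is a
# placeholder for whatever parameters that approximation introduces.
taylor_order = 5
taylor_params = [torch.zeros(taylor_order, requires_grad=True)]
taylor_opt = torch.optim.Adam(taylor_params, lr=1e-4)
```

Calling `policy_sched.step()` once per epoch applies the 0.999 decay to the policy learning rate.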