Bandit Learning with Delayed Impact of Actions

Authors: Wei Tang, Chien-Ju Ho, Yang Liu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose an algorithm that achieves a regret of O(KT^{2/3}) and show a matching regret lower bound of Ω(KT^{2/3}), where K is the number of arms and T is the learning horizon. Our results complement the bandit literature by adding techniques to deal with actions with long-term impacts and have implications in designing fair algorithms. Finally, we conduct a series of simulations showing that our algorithms compare favorably to other state-of-the-art methods proposed in other application domains.
Researcher Affiliation | Academia | Wei Tang, Chien-Ju Ho, and Yang Liu; Washington University in St. Louis and University of California, Santa Cruz; {w.tang, chienju.ho}@wustl.edu, yangliu@ucsc.edu
Pseudocode | Yes | Algorithm 1: Action-Dependent UCB; Algorithm 2: Reduction Template; Algorithm 3: History-Dependent UCB
Open Source Code | No | The paper does not provide any links to, or explicit statements about the availability of, its source code.
Open Datasets | No | The paper refers to 'simulations' but does not name any publicly available dataset or provide concrete access information for data.
Dataset Splits | No | The paper mentions 'simulations' but does not provide the dataset-split information (e.g., percentages, sample counts, or cross-validation details) needed to reproduce data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details, such as the exact CPU or GPU models used to run its experiments.
Software Dependencies | No | The paper does not name ancillary software, such as libraries or solvers with version numbers, needed to replicate the experiments.
Experiment Setup | No | The paper states 'The detailed setups and discussion are in Appendix I', but Appendix I is not provided; the main text does not contain specific experimental setup details such as hyperparameter values or training configurations.
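The paper's pseudocode itself is not reproduced on this page. For orientation only, the sketch below shows a standard UCB1 index policy, the generic family that the paper's Action-Dependent and History-Dependent variants build on; the function names, reward model, and parameters here are illustrative assumptions, not the paper's algorithms, which additionally handle delayed impact of actions.

```python
import math
import random

def ucb1(pull, K, T):
    """Generic UCB1 sketch (NOT the paper's Action-Dependent UCB).

    pull(arm) -> reward in [0, 1]; K arms; horizon T.
    Returns the sequence of arms played.
    """
    counts = [0] * K      # number of pulls per arm
    means = [0.0] * K     # running empirical mean reward per arm
    history = []
    for t in range(T):
        if t < K:
            arm = t  # initialization: play each arm once
        else:
            # UCB index = empirical mean + exploration bonus
            arm = max(
                range(K),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t + 1) / counts[a]),
            )
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
        history.append(arm)
    return history

# Usage: two Bernoulli arms; the better arm (index 1) should dominate the plays.
random.seed(0)
probs = [0.2, 0.8]
plays = ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0, K=2, T=2000)
```

In this stationary setting the exploration bonus shrinks as an arm is sampled, so pulls concentrate on the empirically best arm; the paper's setting differs in that an action's reward depends on how it was chosen in the past, which standard UCB does not account for.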