Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Combinatorial Bandits for Maximum Value Reward Function under Value-Index Feedback

Authors: Yiliu Wang, Wei Chen, Milan Vojnović

ICLR 2024 | Venue PDF | LLM Run Details

Reproducibility variables, each with its classification result and the supporting LLM response:
Research Type: Experimental
LLM response: We demonstrate the effectiveness of our algorithm through experimental results. ... Our numerical results are presented in Section 5. ... We perform experiments to evaluate the performance of Algorithm 3.1 on specific problem instances. We compare our algorithm with three baseline methods: the well-known UCB algorithm treating each set of size k as one arm, the standard CUCB algorithm for semi-bandit CMAB (Chen et al., 2013), and DART (Agarwal et al., 2021b) for full-bandit CMAB. ... Figure 1 presents the regrets of Algorithm 3.1 and the three baseline methods across the three scenarios.
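As background for the comparison above, here is a minimal sketch of the value-index feedback model the experiments operate under: the learner plays a set S of k arms and observes only the maximum realized value together with the index of an arm attaining it. The two-point arm model below is an illustrative assumption, not the paper's exact D1-D3 instances from Appendix F.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-point arm model (an assumption for this sketch):
# arm i realizes value v[i] with probability p[i], and 0 otherwise.
v = np.arange(0.1, 1.0, 0.1)   # n = 9 arms with values 0.1, ..., 0.9
p = np.full(9, 0.3)

def play(S):
    """Play a set S of arm indices; return value-index feedback:
    the maximum realized value and the index of an arm attaining it."""
    S = np.asarray(S)
    realized = np.where(rng.random(S.size) < p[S], v[S], 0.0)
    j = int(np.argmax(realized))
    return float(realized[j]), int(S[j])

max_value, winner = play([2, 5, 8])   # feedback for the set {2, 5, 8}
```

Nothing beyond the (max value, index) pair is revealed, which is what separates this setting from semi-bandit feedback, where every selected arm's value would be observed.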
Researcher Affiliation: Collaboration
LLM response: Yiliu Wang, Allen Institute, Seattle, WA 98109, USA; Wei Chen, Microsoft Research, Beijing, China; Milan Vojnović, Department of Statistics, LSE, London, United Kingdom
Pseudocode: Yes
LLM response: Algorithm 3.1 CUCB algorithm for unknown ordering of values ... Algorithm 4.1 CUCB algorithm for arbitrary distributions with finite supports
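As orientation for these entries, the following is a generic sketch of the CUCB template that Algorithms 3.1 and 4.1 build on. The confidence radius, top-k oracle, and per-arm update below are standard textbook choices, not the paper's exact rules, which are tailored to value-index feedback; `sample_arm` is a hypothetical callback returning one draw of an arm's value.

```python
import numpy as np

def cucb(n, k, T, sample_arm):
    """Generic CUCB loop (a sketch of the template, not the paper's
    Algorithm 3.1): maintain per-arm UCB indices and hand them to an
    offline oracle each round. With a max-type reward and one scalar
    estimate per arm, top-k selection is a natural oracle choice."""
    counts = np.zeros(n)
    means = np.zeros(n)
    for t in range(1, T + 1):
        radius = np.sqrt(3.0 * np.log(t) / (2.0 * np.maximum(counts, 1)))
        ucb = np.where(counts > 0, means + radius, np.inf)  # unplayed arms first
        S = np.argsort(ucb)[-k:]          # offline oracle: top-k UCB arms
        for i in S:
            x = sample_arm(i)             # semi-bandit-style update; the
            counts[i] += 1                # paper's value-index feedback
            means[i] += (x - means[i]) / counts[i]  # reveals strictly less
    return means
```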
Open Source Code: Yes
LLM response: The code we use is available at: https://github.com/Sketch-EXP/kmax.
Open Datasets: No
LLM response: The paper defines three custom distributions (D1, D2, D3) for its experiments in Appendix F but does not state that these or any other datasets are publicly available, nor does it provide a link or citation for them. It describes the setup of arm distributions (e.g., v = (0.1, 0.2, ..., 0.9) and p such that p_i = 0.3).
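Although the exact D1-D3 parameters are deferred to Appendix F, the reported form (a value vector v and activation probabilities p) is enough to sketch how such an instance could be built and how the optimal set S* can be found by brute force at this problem size. The parameters below mirror the quoted example and are assumptions, not the paper's scenarios.

```python
from itertools import combinations
import numpy as np

# Two-point arms mirroring the quoted example (an assumption; the
# paper's D1-D3 scenarios differ and live in Appendix F).
v = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
p = np.full(9, 0.3)

def expected_max(S):
    """E[max of realized values over S] for independent two-point arms:
    the j-th largest value is the max exactly when its arm fires and
    every strictly larger arm in S realizes 0."""
    total, larger_all_zero = 0.0, 1.0
    for i in sorted(S, key=lambda i: v[i], reverse=True):
        total += larger_all_zero * p[i] * v[i]
        larger_all_zero *= 1.0 - p[i]
    return total

# Brute-force optimal set S* for k = 3 (feasible at n = 9: C(9,3) = 84).
S_star = max(combinations(range(9), 3), key=expected_max)
```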
Dataset Splits: No
LLM response: The paper does not specify explicit training/validation/test splits. It mentions running experiments for a time horizon T = 5000 and repeating them 20 times to average cumulative regrets, but it does not describe any partitioning of the data itself (arm outcomes) into training, validation, or test sets that would allow the data partitioning to be reproduced.
Hardware Specification: No
LLM response: The paper does not provide any specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies: No
LLM response: The paper does not provide specific version numbers for any software dependencies. It only mentions using 'the well-known UCB algorithm', 'the standard CUCB algorithm', and 'DART' without specifying their software implementations or versions.
Experiment Setup: Yes
LLM response: We consider settings with n = 9 arms and sets of cardinality k = 3. ... We run each experiment for a time horizon of T = 5000. In each round, we select arms according to the offline oracle and sample their values for updates. We compare the reward to that of the optimal set S*. We repeat the experiments 20 times and average the cumulative regrets. ... We tested on three different arm distributions such that a single arm changes from one scenario to another. We expect the algorithm to perform well in all three cases. For space reasons, we show the detailed setup in Appendix F.
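To make the quoted protocol concrete, here is a hedged sketch of the measurement loop: T = 5000 rounds of cumulative expected regret against the optimal set S*, repeated 20 times and averaged. It reuses the illustrative two-point arms from the earlier snippets in place of the Appendix F instances, and a random placeholder policy in place of the paper's Algorithm 3.1.

```python
from itertools import combinations
import numpy as np

# Illustrative two-point arms, as in the earlier sketches (assumption).
v = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
p = np.full(9, 0.3)

def expected_max(S):
    """E[max over S] for independent two-point arms (see earlier sketch)."""
    total, larger_all_zero = 0.0, 1.0
    for i in sorted(S, key=lambda i: v[i], reverse=True):
        total += larger_all_zero * p[i] * v[i]
        larger_all_zero *= 1.0 - p[i]
    return total

opt = max(expected_max(S) for S in combinations(range(9), 3))  # value of S*

def run_trial(select_set, T=5000):
    """One trial: T rounds of cumulative expected regret. `select_set(t)`
    is a hypothetical policy callback returning k = 3 arm indices for
    round t (the paper would use Algorithm 3.1 here)."""
    cum, regret = [], 0.0
    for t in range(T):
        regret += opt - expected_max(select_set(t))
        cum.append(regret)
    return np.array(cum)

# Placeholder policy: a fresh uniformly random size-3 set each round.
rng = np.random.default_rng(0)
avg_regret = np.mean([run_trial(lambda t: rng.choice(9, 3, replace=False))
                      for _ in range(20)], axis=0)  # 20 repetitions, averaged
```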