Combinatorial Bandits with Relative Feedback
Authors: Aadirupa Saha, Aditya Gopalan
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical results are corroborated with empirical evaluations. We also provide extensive numerical evaluations supporting our theoretical findings. In this section we present the empirical evaluations of our proposed algorithm Max Min-UCB (abbreviated as MM) on different synthetic datasets, and also compare them with different algorithms. All results are reported as averages across 50 runs along with the standard deviations. For this we use 7 different MNL(n, θ) environments as described below: MNL(n, θ) Environments. 1. g1, 2. g4, 3. arith, 4. geo, 5. har, all with n = 16, and two larger models, 6. arithb and 7. geob, with n = 50 items in both. Details are moved to Appendix E. We compare our proposed methods with the following two baselines, which most closely apply to our problem setup. Note, as discussed in Sec. 1, none of the existing work exactly addresses our problem. Algorithms. 1. BD: The Battling-Duel algorithm of [36] with the RUCB algorithm [47] as the dueling-bandit blackbox, 2. Sp-TS: The Self-Sparring algorithm of [39] with Thompson Sampling [1], and 3. MM: Our proposed method Max Min-UCB for Winner-regret (Alg. 1). Comparing Winner-regret with Top-m-ranking Feedback (Fig. 1): We first compare the regret performances for k = 10 and m = 5. From Fig. 1 it clearly follows that in all cases Max Min-UCB uniformly outperforms the other two algorithms, taking advantage of the Top-m-ranking Feedback, which the other two fail to exploit: they both allow repetitions in the played subsets and thus cannot use the rank-ordered feedback to its full extent. Furthermore, the Thompson-sampling-based Sp-TS in general exhibits a much higher variance compared to the rest due to its Bayesian nature. Also, as expected, g1 and g4 being comparatively easier instances, i.e. with larger gap Δ̂_max (see Thm. 3, 4, 5, 6 etc. for a formal justification), our algorithm converges much faster on these models. Figure 1: Comparative performances on Winner-regret for k = 10, m = 5. Comparing Top-k-regret performances for Top-k-ranking Feedback (Fig. 2): We are not aware of any existing algorithm for the Top-k-regret objective with Top-k-ranking Feedback. We thus use a modified version of the Sp-TS algorithm [39] described above for this purpose: it simply draws k items without repetition and uses Rank-Breaking updates to maintain the Beta posteriors. Here again, we see that our method Rec-Max Min-UCB (Rec-MM) uniformly outperforms Sp-TS in all cases, and as before Sp-TS shows higher variability as well. Interestingly, our algorithm converges fastest on g4, it being the easiest model with the largest gap Δ(k) between the kth and (k+1)th best items (see Thm. 7, 8, 9 etc.), and takes the longest on har since it has the smallest Δ(k). Figure 2: Comparative performances on Top-k-regret for k = 10. Effect of varying m with fixed k (Fig. 3): We also study our algorithm Max Min-UCB with varying-size rank-ordered feedback (m), keeping the subset size (k) fixed, both for the Winner-regret and Top-k-regret objectives, on the larger models arithb and geob, which have n = 50 items. As expected, in both cases the regret scales down with increasing m (justifying the bounds in Thm. 5, 6, 8, 9). Figure 3: Regret with varying m with fixed k = 40 (on our proposed algorithm Max Min-UCB) |
| Researcher Affiliation | Academia | Aadirupa Saha Indian Institute of Science, Bangalore aadirupa@iisc.ac.in Aditya Gopalan Indian Institute of Science, Bangalore aditya@iisc.ac.in |
| Pseudocode | Yes | The complete algorithm is presented in Alg. 1, Appendix C.1. The set-building rule build_S is at the heart of Max Min-UCB, which builds the subset St from a set of potential Condorcet winners (Ct) of round t: by recursively picking the strongest opponents of the already selected items using a max-min selection strategy on uij. We next propose a UCB-based algorithm (Alg. 3) for the same, along with a matching upper-bound regret analysis (Thm. 8, 9) showing optimality of our proposed algorithm. The algorithm is described in Alg. 3, Appendix D.2. |
| Open Source Code | No | The paper does not provide any specific links or statements about the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper mentions using "MNL(n, θ) Environments" such as "g1, g4, arith, geo, har, arithb, and geob" but does not provide concrete access information (link, DOI, formal citation with authors/year) for these datasets, which appear to be synthetic and generated for the experiments rather than being pre-existing public datasets with defined access points. |
| Dataset Splits | No | The paper mentions running experiments across 50 runs and varying parameters like 'm' and 'k', but it does not specify any training, validation, or test dataset splits in terms of percentages, sample counts, or references to predefined splits. |
| Hardware Specification | No | The paper does not specify any details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions implementing algorithms and using baselines, but it does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, specific library versions). |
| Experiment Setup | No | The paper describes the synthetic MNL(n, θ) environments, the algorithms compared, and the number of runs (50), but it does not report a complete experiment setup with concrete hyperparameters or implementation details in the main text. |
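The MNL(n, θ) environments with Top-m-ranking feedback quoted in the Research Type row can be simulated directly: a subset of k items is played, and a rank ordering of the top m items is drawn position by position with probability proportional to each item's MNL score (the Plackett-Luce sampling scheme). A minimal sketch, assuming a score vector `theta`; the function name and the geometric `theta` used in the example are illustrative, not taken from the paper (the paper's g1, g4, arith, geo, har, arithb, geob instances correspond to specific θ configurations detailed in its Appendix E):

```python
import numpy as np

def top_m_ranking_feedback(theta, subset, m, rng=None):
    """Draw Top-m-ranking feedback from an MNL(n, theta) environment.

    At each of the m positions, one of the remaining items in the played
    subset is chosen with probability proportional to its theta score
    (Plackett-Luce / multinomial-logit choice model).
    """
    rng = np.random.default_rng() if rng is None else rng
    remaining = list(subset)          # items still unranked
    ranking = []
    for _ in range(min(m, len(remaining))):
        scores = np.array([theta[i] for i in remaining], dtype=float)
        probs = scores / scores.sum() # MNL choice probabilities
        pick = rng.choice(len(remaining), p=probs)
        ranking.append(remaining.pop(pick))
    return ranking

# Example: a 16-item instance with geometrically decaying scores,
# playing a subset of k = 10 items and observing m = 5 ranked items.
theta = 0.8 ** np.arange(16)
feedback = top_m_ranking_feedback(theta, subset=range(10), m=5)
```

Averaging regret over repeated draws of such feedback (50 runs in the paper) is what produces the curves summarized in Figs. 1–3.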