Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Bits and Bandits: Quantifying the Regret-Information Trade-off

Authors: Itai Shufaro, Nadav Merlis, Nir Weinberger, Shie Mannor

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "5 EXPERIMENTS. 5.1 Stochastic Bayesian Bandit: We begin by demonstrating the trade-off in a simple Bayesian MAB problem. ... Results are presented in Figure 2. 5.2 Question Answering with Large Language Models: In the following experiment, we show how the regret-information trade-off can be utilized in a practical setting. ... Table 2 describes the mean regret of every policy under a different number of large LLM deployments."
Researcher Affiliation | Collaboration | Itai Shufaro, Technion, EMAIL; Nadav Merlis, ENSAE, EMAIL; Nir Weinberger, Technion, EMAIL; Shie Mannor, Technion and NVIDIA Research, EMAIL
Pseudocode | No | The paper describes its methods through mathematical formulations and theoretical derivations (e.g., Theorem 3.4, Proposition 4.1) but does not include any explicit pseudocode blocks or algorithms formatted with numbered steps.
Open Source Code | Yes | "Code is available here." (footnote link: https://github.com/itaishufaro/bitsandbandits)
Open Datasets | Yes | "To generate the multiple-choice questions, we used the MMLU dataset, which contains multiple-choice questions in a variety of topics, such as algebra, business, etc. (Hendrycks et al., 2021b;a)."
Dataset Splits | No | "We used 10 different seeds to generate 10 sets of 200 questions from MMLU randomly." This describes data generation for the rounds but does not specify traditional training/validation/test dataset splits.
Hardware Specification | Yes | "The experiments using Mistral 7B, Gemma 2B, Gemma 7B and Falcon 7B were evaluated using a machine with RTX4090. The experiments using Mixtral 8x7B and Llama3-70B were performed using a machine with 3x A40."
Software Dependencies | No | "We applied 4-bit quantization (Dettmers & Zettlemoyer, 2023) for all models and flash attention (Dao et al., 2022) for all models excluding Falcon 7B." While specific techniques and LLM models are mentioned, the paper does not list specific version numbers for core software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "We also fix K = 8, ε = 0.1 and a uniform prior over the problems and compare three different bandit algorithms... We receive a positive reward of 1 for answering the question correctly. If we choose to deploy the large LLM we also incur a negative reward of 0.1. ... The threshold value for the bits-based policy is selected to ensure that we query the large LLM the same number of times as the random policy. We used 10 different seeds to generate 10 sets of 200 questions from MMLU randomly."
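The reward structure of the question-answering experiment quoted above can be sketched in plain Python. This is an illustrative reconstruction, not the authors' released code: only the +1 reward for a correct answer and the 0.1 cost per large-LLM deployment come from the quoted setup, while the entropy-based deployment rule, function names, and example probabilities are assumptions made for the sketch.

```python
import math

LARGE_LLM_COST = 0.1  # negative reward for deploying the large LLM (from the quoted setup)

def entropy(probs):
    """Shannon entropy (in bits) of an answer distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def bits_based_policy(small_probs, threshold):
    """Hypothetical rule: deploy the large LLM only when the small model's
    answer distribution is uncertain (entropy above a threshold)."""
    return entropy(small_probs) > threshold

def round_reward(correct, used_large_llm):
    """+1 for a correct answer, minus 0.1 if the large LLM was deployed."""
    return float(correct) - (LARGE_LLM_COST if used_large_llm else 0.0)
```

Per the quoted setup, the threshold for the bits-based policy would then be tuned so that the large LLM is queried the same number of times as under the random policy.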