Incentivized Learning in Principal-Agent Bandit Games

Authors: Antoine Scheid, Daniil Tiapkin, Etienne Boursier, Aymeric Capitaine, Eric Moulines, Michael Jordan, El-Mahdi El-Mhamdi, Alain Oliviero Durmus

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we support our theoretical guarantees through numerical experiments.
Researcher Affiliation | Academia | 1 Centre de Mathématiques Appliquées, CNRS, École polytechnique, Institut Polytechnique de Paris, route de Saclay, 91128 Palaiseau cedex; 2 Université Paris-Saclay, CNRS, Laboratoire de mathématiques d'Orsay, 91405, Orsay, France; 3 INRIA, Université Paris-Saclay, LMO, Orsay, France; 4 University of California, Berkeley; 5 Inria, École Normale Supérieure, PSL Research University.
Pseudocode | Yes | Algorithm 1 (IPA), Algorithm 2 (Contextual IPA), Algorithm 3 (Binary Search Subroutine), Algorithm 4 (UCB Subroutine), Algorithm 5 (Projected Volume).
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | No | We ran the experiments in Figure 2 for a horizon T = 10 000, averaged over 100 runs, on a five-arm bandit. We plotted the standard error across the different runs. The expected rewards for the principal (θ) and the agent (s) are given in Table 3. The principal's rewards X_a(t) are drawn i.i.d. from X_a(t) ~ N(θ_a, 1) for any a ∈ [K], t ∈ [T]. The paper describes how the data for the 'toy example' experiments was generated, including the specific parameters in Table 3, but does not provide a link or formal citation to a publicly available dataset.
Dataset Splits | No | The paper does not specify any training, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions comparing with the 'Principal's ε-Greedy algorithm of Dogan et al. (2023b)' and using a 'UCB instance', but does not specify any software names with version numbers (e.g., Python version, specific libraries or frameworks).
Experiment Setup | Yes | We ran the experiments in Figure 2 for a horizon T = 10 000, averaged over 100 runs, on a five-arm bandit. The expected rewards for the principal (θ) and the agent (s) are given in Table 3. The principal's rewards X_a(t) are drawn i.i.d. from X_a(t) ~ N(θ_a, 1) for any a ∈ [K], t ∈ [T]. For the Principal's ε-Greedy algorithm, we use the hyperparameters α = 1 and m = 500.
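
The following is a minimal Python sketch of the simulation protocol quoted in the setup rows above (horizon T = 10 000, 100 runs, five arms, X_a(t) ~ N(θ_a, 1)). The θ values are hypothetical placeholders, since the paper's Table 3 is not reproduced here, and the uniform-random arm choice is only a stand-in policy, not the paper's IPA or ε-Greedy algorithms.

# Sketch of the reward-generation and averaging protocol described above.
# theta is a hypothetical placeholder for the principal's expected rewards;
# the actual values come from Table 3 of the paper, not reproduced here.
import numpy as np

T = 10_000        # horizon, as stated in the paper
N_RUNS = 100      # number of independent runs averaged in Figure 2
K = 5             # five-arm bandit

theta = np.array([0.9, 0.7, 0.5, 0.3, 0.1])  # hypothetical principal means
rng = np.random.default_rng(0)

cumulative = np.zeros((N_RUNS, T))
for run in range(N_RUNS):
    total = 0.0
    for t in range(T):
        a = rng.integers(K)                  # placeholder policy, not IPA
        total += rng.normal(theta[a], 1.0)   # X_a(t) ~ N(theta_a, 1)
        cumulative[run, t] = total

mean_curve = cumulative.mean(axis=0)                       # average over runs
std_err = cumulative.std(axis=0) / np.sqrt(N_RUNS)         # standard error across runs
print(f"final mean cumulative reward: {mean_curve[-1]:.1f} +/- {std_err[-1]:.1f}")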