Scalable Representation Learning in Linear Contextual Bandits with Constant Regret Guarantees
Authors: Andrea Tirinzoni, Matteo Papini, Ahmed Touati, Alessandro Lazaric, Matteo Pirotta
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that BANDITSRL can be paired with any no-regret algorithm and achieve constant regret whenever an HLS representation is available. Furthermore, BANDITSRL can be easily combined with deep neural networks and we show how regularizing towards HLS representations is beneficial in standard benchmarks. |
| Researcher Affiliation | Collaboration | Andrea Tirinzoni (Meta, tirinzoni@meta.com); Matteo Papini (Universitat Pompeu Fabra, matteo.papini@upf.edu); Ahmed Touati (Meta, atouati@meta.com); Alessandro Lazaric (Meta, lazaric@meta.com); Matteo Pirotta (Meta, pirotta@meta.com) |
| Pseudocode | Yes | Algorithm 1 BANDITSRL |
| Open Source Code | No | The paper states 'The code is available at the following URL.' but provides a placeholder 'URL' instead of a concrete link. |
| Open Datasets | Yes | The dataset-based problems statlog, magic, covertype, and mushroom [34, 37] are obtained from the standard multiclass-to-bandit conversion [6, 27] (a hedged sketch of this conversion appears after the table). |
| Dataset Splits | No | The paper mentions using 'standard benchmarks' and 'dataset-based problems' but does not specify exact train/validation/test split percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processors, or memory used for running experiments. The checklist provided with the paper also states 'No' for this information. |
| Software Dependencies | No | The paper mentions 'Pytorch' in the bibliography but does not specify its version or list other software dependencies and the versions used in the experiments. |
| Experiment Setup | Yes | In all the problems the reward function is highly non-linear w.r.t. contexts and actions and we use a network composed of layers of dimension [50, 50, 50, 50, 10] and ReLU activation to learn the representation (i.e., d = 10). For the baseline algorithms (NEURALUCB, IGW) we report the regret of the best configuration on each individual dataset, while for NN-BANDITSRL we fix the parameters across datasets (i.e., α_GLRT = 5). (Hedged sketches of this network and of a GLRT-style exploit test appear after the table.) |
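The four dataset-based problems rely on the standard multiclass-to-bandit conversion: each class label becomes an arm, the context is the example's feature vector, and the learner receives reward 1 if the pulled arm matches the true label and 0 otherwise. Below is a minimal sketch of that conversion, assuming a generic `policy(context, n_arms)` callable (hypothetical, not from the paper):

```python
import numpy as np

def multiclass_to_bandit(X: np.ndarray, y: np.ndarray, policy, seed: int = 0):
    """Standard multiclass-to-bandit conversion: contexts are feature vectors,
    arms are class labels, and the reward is 1 iff the pulled arm equals the
    true label. `policy(context, n_arms)` is a hypothetical arm-selection rule.
    """
    rng = np.random.default_rng(seed)
    n_arms = int(y.max()) + 1
    rewards = []
    for t in rng.permutation(len(X)):  # stream the examples in random order
        arm = policy(X[t], n_arms)
        rewards.append(1.0 if arm == y[t] else 0.0)  # bandit feedback only
    return np.array(rewards)

# Toy usage with a uniformly random policy on synthetic data:
X = np.random.randn(100, 8)
y = np.random.randint(0, 4, size=100)
print(multiclass_to_bandit(X, y, lambda c, k: np.random.randint(k)).mean())
```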
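The experiment-setup row pins down the representation network: layer widths [50, 50, 50, 50, 10] with ReLU activations and a 10-dimensional output playing the role of φ(x, a). Here is a minimal PyTorch sketch under those assumptions; the input dimension and the placement of ReLU only between layers are guesses not stated in the excerpt:

```python
import torch
import torch.nn as nn

class RepresentationNet(nn.Module):
    """Feature extractor with the reported layer widths [50, 50, 50, 50, 10].

    `in_dim` (the size of the context-action encoding) is an assumption, as is
    applying ReLU only between layers. The 10-dim output is phi(x, a), which
    feeds a linear reward head as in linear contextual bandits.
    """

    def __init__(self, in_dim: int):
        super().__init__()
        widths = [50, 50, 50, 50, 10]
        layers, prev = [], in_dim
        for i, w in enumerate(widths):
            layers.append(nn.Linear(prev, w))
            if i < len(widths) - 1:  # no activation after the final embedding (assumption)
                layers.append(nn.ReLU())
            prev = w
        self.phi = nn.Sequential(*layers)            # representation phi(x, a), d = 10
        self.head = nn.Linear(prev, 1, bias=False)   # linear reward estimate theta^T phi

    def forward(self, xa: torch.Tensor) -> torch.Tensor:
        # Predicted scalar reward for a batch of encoded context-action pairs.
        return self.head(self.phi(xa))
```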
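Finally, α_GLRT = 5 is the threshold of the generalized likelihood ratio test that BANDITSRL uses to decide when to play greedily rather than defer to the base no-regret algorithm. The following is a schematic, not the paper's Algorithm 1: a GLRT-style test in its simplest linear form, exploiting only when the empirically best arm's estimated gap over every other arm exceeds a confidence width scaled by α_GLRT. All names here are illustrative.

```python
import numpy as np

def glrt_exploit_or_explore(theta_hat, V, arm_features, alpha_glrt=5.0):
    """GLRT-style exploit test (illustrative, not the paper's exact statistic):
    return the greedy arm when its estimated advantage over every other arm
    dominates the corresponding confidence width; otherwise return None to
    signal that the base no-regret algorithm should choose the action.
    """
    V_inv = np.linalg.inv(V)                 # inverse design matrix
    values = arm_features @ theta_hat        # estimated rewards per arm
    best = int(values.argmax())
    for a in range(len(arm_features)):
        if a == best:
            continue
        diff = arm_features[best] - arm_features[a]
        width = np.sqrt(alpha_glrt * diff @ V_inv @ diff)
        if values[best] - values[a] <= width:
            return None                      # test fails: keep exploring
    return best                              # test passes: play greedily

# Toy usage: 3 arms in R^2 with a well-estimated model (prints 0).
theta_hat = np.array([1.0, 0.0])
V = 100 * np.eye(2)                          # large design matrix -> tight widths
arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(glrt_exploit_or_explore(theta_hat, V, arms))
```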