Model-free Posterior Sampling via Learning Rate Randomization

Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Ménard

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our empirical study shows that RandQL outperforms existing approaches on baseline exploration environments." (Section 5, Experiments:) "In this section we present the experiments we conducted for tabular environments using the rlberry library [Domingues et al., 2021a]. We also provide experiments in a non-tabular environment in Appendix I."
Researcher Affiliation | Collaboration | 1 CMAP, École Polytechnique; 2 HSE University; 3 Duisburg-Essen University; 4 Google DeepMind; 5 Mohamed Bin Zayed University of AI, UAE; 6 IDEMIA; 7 ENS Lyon
Pseudocode | Yes | "Algorithm 1: Tabular Staged-RandQL"
Open Source Code | No | "In this section we present the experiments we conducted for tabular environments using the rlberry library [Domingues et al., 2021a]."
Open Datasets | No | "Environment: We use a grid-world environment with 100 states (i, j) ∈ [10] × [10] and 4 actions (left, right, up and down)... The second one is a chain environment described by Osband et al. [2016] with L = 15 states and 2 actions (left or right)... We use a ball environment with the 2-dimensional unit Euclidean ball as state-space S = {s ∈ ℝ² : ‖s‖₂ ≤ 1} and of horizon H = 30."
Dataset Splits | No | The paper refers to training an agent in an environment but does not provide dataset split information (e.g., percentages, sample counts for train/validation/test sets, or citations to predefined splits).
Hardware Specification | Yes | "For all experiments we used 2 CPUs (Intel Xeon CPU 2.20GHz), and no GPU was used."
Software Dependencies | No | "In this section we present the experiments we conducted for tabular environments using the rlberry library [Domingues et al., 2021a]."
Experiment Setup | Yes | "For these algorithms we used the same parameters: posterior inflation κ = 1.0, n0 = 1/S prior samples (same as PSRL, see below), ensemble size J = 10. For DQN and Boot DQN we use as network a 2-layer multilayer perceptron (MLP) with hidden layer size equal to 64."
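The paper's titular idea, model-free posterior sampling induced by randomizing the Q-learning step size, can be illustrated with a short sketch. This is not the paper's Algorithm 1; the `Beta(H, n + 1)` parameters and the target computation below are illustrative placeholders showing the mechanism (a random learning rate in place of a deterministic one):

```python
import random

def randomized_q_update(q, target, n, H=10, rng=random.Random(0)):
    """One tabular Q-learning step where a deterministic learning rate
    (e.g. 1 / (n + 1)) is replaced by a Beta-distributed random one, so
    that repeated independent updates behave like posterior samples.
    The Beta(H, n + 1) choice here is illustrative, not the paper's."""
    lr = rng.betavariate(H, n + 1)  # random step size in (0, 1)
    return (1 - lr) * q + lr * target  # convex combination of old value and target

# Example: an ensemble of J = 10 randomized Q-estimates for one (s, a) pair,
# matching the quoted ensemble size J = 10 from the experiment setup.
rng = random.Random(42)
ensemble = [randomized_q_update(q=0.0, target=1.0, n=5, rng=rng) for _ in range(10)]
```

Because each ensemble member draws its own learning rate, the J estimates spread out and the spread shrinks as the visit count n grows, which is the exploration signal the method relies on.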
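The quoted setup specifies a 2-layer MLP with hidden size 64 for the DQN and Boot DQN baselines. A minimal pure-Python sketch of such a Q-network follows; the input dimension, action count, and weight initialization are placeholder assumptions, not taken from the paper:

```python
import math
import random

def init_mlp(in_dim, hidden=64, n_actions=4, seed=0):
    """Initialize a 2-layer MLP (one hidden layer of size 64, as quoted).
    The uniform init bound is an assumption; the paper does not specify it."""
    rng = random.Random(seed)
    def layer(n_in, n_out):
        bound = 1.0 / math.sqrt(n_in)
        w = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        return w, b
    return layer(in_dim, hidden), layer(hidden, n_actions)

def forward(params, state):
    """Return one Q-value per action for a given state vector."""
    (w1, b1), (w2, b2) = params
    h = [max(0.0, sum(wi * si for wi, si in zip(row, state)) + bi)  # ReLU hidden layer
         for row, bi in zip(w1, b1)]
    return [sum(wi * hi for wi, hi in zip(row, h)) + bi for row, bi in zip(w2, b2)]

# Hypothetical usage: a 2-dimensional state (e.g. the ball environment) and 4 actions.
params = init_mlp(in_dim=2, hidden=64, n_actions=4)
q_values = forward(params, [0.1, -0.3])
```

A Boot DQN baseline would keep an ensemble of such networks, one per bootstrap head, consistent with the quoted ensemble size J = 10.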