Model-free Posterior Sampling via Learning Rate Randomization
Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Ménard
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study shows that RandQL outperforms existing approaches on baseline exploration environments. Section 5 (Experiments): In this section we present the experiments we conducted for tabular environments using the rlberry library [Domingues et al., 2021a]. We also provide experiments in a non-tabular environment in Appendix I. |
| Researcher Affiliation | Collaboration | 1 CMAP, École Polytechnique; 2 HSE University; 3 Duisburg-Essen University; 4 Google DeepMind; 5 Mohamed Bin Zayed University of AI, UAE; 6 IDEMIA; 7 ENS Lyon |
| Pseudocode | Yes | Algorithm 1: Tabular Staged-RandQL (a hedged sketch of the learning-rate-randomization update appears below the table) |
| Open Source Code | No | In this section we present the experiments we conducted for tabular environments using the rlberry library [Domingues et al., 2021a]. |
| Open Datasets | No | Environment: We use a grid-world environment with 100 states (i, j) ∈ [10] × [10] and 4 actions (left, right, up and down)... The second one is a chain environment described by Osband et al. [2016] with L = 15 states and 2 actions (left or right); a minimal sketch of this chain appears below the table... We use a ball environment with the 2-dimensional unit Euclidean ball as state-space S = {s ∈ ℝ², ‖s‖₂ ≤ 1} and horizon H = 30. |
| Dataset Splits | No | The paper refers to training an agent in an environment but does not provide specific dataset split information (e.g., percentages, sample counts for train/validation/test sets, or citations to predefined splits). |
| Hardware Specification | Yes | For all experiments we used 2 CPUs (Intel Xeon CPU 2.20GHz), and no GPU was used. |
| Software Dependencies | No | In this section we present the experiments we conducted for tabular environments using the rlberry library [Domingues et al., 2021a]. |
| Experiment Setup | Yes | For these algorithms we used the same parameters: posterior inflation κ = 1.0, n0 = 1/S prior samples (same as PSRL, see below), ensemble size J = 10. For DQN and Boot DQN we use a 2-layer multilayer perceptron (MLP) with hidden layer size equal to 64 as the network (a PyTorch sketch of this MLP appears below the table). |
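
On the pseudocode row: the paper's core mechanism is to drive exploration by randomizing the Q-learning rate instead of adding optimistic bonuses. Below is a minimal, illustrative sketch of that mechanism for the tabular case, not a transcription of the paper's Algorithm 1 (Tabular Staged-RandQL): the Beta(H, n) learning-rate distribution, the optimistic initialization at H, and the use of the ensemble maximum as the behavior value are our simplifying assumptions, and the actual algorithm additionally proceeds in stages and uses the n0 prior samples and posterior inflation κ quoted in the experiment-setup row.

```python
import numpy as np

def randql_update(q_ensemble, n_visits, s, a, r, s_next, H, rng):
    """One RandQL-style update (simplified, non-staged sketch).

    Each of the J ensemble members mixes the TD target in with its own
    randomly drawn learning rate w ~ Beta(H, n), where n is the visit
    count of (s, a). The randomness of w plays the role of posterior
    sampling, replacing explicit exploration bonuses.
    """
    J = q_ensemble.shape[0]
    n_visits[s, a] += 1
    n = n_visits[s, a]
    # Next-state value: greedy over actions of the ensemble maximum
    # (an assumed stand-in for the paper's policy Q-function).
    v_next = q_ensemble[:, s_next, :].max(axis=0).max()
    target = r + v_next
    # One independent Beta-distributed learning rate per ensemble member.
    w = rng.beta(H, n, size=J)
    q_ensemble[:, s, a] = (1.0 - w) * q_ensemble[:, s, a] + w * target

# Usage with the reported ensemble size J = 10:
rng = np.random.default_rng(0)
J, S, A, H = 10, 100, 4, 20
q_ensemble = np.full((J, S, A), float(H))   # optimistic initialization at H
n_visits = np.zeros((S, A), dtype=int)
randql_update(q_ensemble, n_visits, s=0, a=1, r=0.0, s_next=5, H=H, rng=rng)
```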
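
On the datasets row: the chain environment of Osband et al. [2016] is small enough to reproduce directly. The sketch below follows the usual formulation of this benchmark (deterministic moves, a small reward at the leftmost state, reward 1 at the rightmost); only L = 15 and the two actions are taken from the quoted excerpt, while the horizon and reward values here are assumptions.

```python
class Chain:
    """Chain environment with L states and 2 actions (0 = left, 1 = right).

    L = 15 and the action set come from the paper; the deterministic
    dynamics and the reward values below are assumptions based on the
    usual formulation of this benchmark. Reaching the rightmost state
    requires moving right for L - 1 consecutive steps, which makes the
    task a classic hard-exploration problem.
    """

    def __init__(self, L=15, horizon=20, small_reward=1e-3):
        self.L, self.horizon, self.small_reward = L, horizon, small_reward

    def reset(self):
        self.state, self.t = 0, 0
        return self.state

    def step(self, action):
        # Deterministic move, clipped to the state range [0, L - 1].
        self.state = (min(self.state + 1, self.L - 1) if action == 1
                      else max(self.state - 1, 0))
        if self.state == self.L - 1:
            reward = 1.0
        elif self.state == 0:
            reward = self.small_reward
        else:
            reward = 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.state, reward, done
```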
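
On the experiment-setup row: the quoted excerpt pins down only the size of the DQN / Boot DQN network. A PyTorch sketch consistent with it is below; we read "2-layer MLP with hidden layer size 64" as two hidden layers of width 64, and the ReLU activation is our assumption, since neither choice is confirmed by the excerpt.

```python
import torch.nn as nn

def make_q_network(obs_dim: int, n_actions: int, hidden: int = 64) -> nn.Module:
    """Q-network matching the reported size: a 2-layer MLP, width 64.

    obs_dim and n_actions are environment-dependent placeholders; ReLU
    and the two-hidden-layer reading of "2-layer" are assumptions.
    """
    return nn.Sequential(
        nn.Linear(obs_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, hidden),
        nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )
```

For Boot DQN, one would instantiate an ensemble of such networks (e.g. J = 10, matching the ensemble size reported for the other algorithms in the setup row) and train each member on its own bootstrapped targets.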