Probabilistic Inference in Reinforcement Learning Done Right
Authors: Jean Tarbouriech, Tor Lattimore, Brendan O'Donoghue
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 8, Numerical Experiments (Grid World): We first study empirically how well VAPOR approximates $\lambda^{P^\star}$. Since $\mathbb{E}[\lambda^{\mathrm{TS}}] = \lambda^{P^\star}$, we estimate the latter by averaging over 1000 samples of the (random) TS occupancy measure (we denote it by $\lambda^{\mathrm{TS}(1000)}$). We design simple 10×10 Grid World MDPs with four cardinal actions, known dynamics and randomly generated reward. Figure 2 suggests that VAPOR and the TS average output similar approximations of $\lambda^{P^\star}$, thus showing the accuracy of our variational approximation in this domain. (A sketch of this Monte Carlo estimate appears after the table.) |
| Researcher Affiliation | Industry | Jean Tarbouriech, Google DeepMind, jtarbouriech@google.com; Tor Lattimore, Google DeepMind, lattimore@google.com; Brendan O'Donoghue, Google DeepMind, bodonoghue@google.com |
| Pseudocode | Yes | Algorithm 1 (VAPOR learning algorithm): For episode $t = 1, 2, \dots$ do: 1. Compute expected rewards $\mathbb{E}_t r$, transitions $\mathbb{E}_t P$, uncertainty measure $\sigma_t$. 2. Solve the VAPOR optimization problem $\lambda_t \in \arg\max_{\lambda \in \Lambda(\mathbb{E}_t P)} V_{\sigma_t}(\lambda)$ from Equation (3). 3. Execute policy $\pi_t^l(s, a) \propto \lambda_t^l(s, a)$, for $l = 1, \dots, L$. End for. (A hedged solver sketch of step 2 appears after the table.) |
| Open Source Code | No | The paper does not provide any explicit statement about open-source code release or a link to a code repository. |
| Open Datasets | Yes | We consider the Deep Sea domain where instead of using a tabular state representation, we feed a one-hot representation of the agent location into a neural network, using bsuite [58]. Finally, we investigate the performance of VAPOR-lite on the Atari benchmark [5]. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (percentages, sample counts, or specific citations to predefined splits) to reproduce the experiments. |
| Hardware Specification | No | The paper does not specify any particular GPU, CPU, or TPU models used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'CVXPY [11], specifically the ECOS solver [12]' but does not provide explicit version numbers for these software components. |
| Experiment Setup | Yes | Table 2: Hyperparameters used in the Atari experiments. Common hyperparameters: discount factor 0.995; replay buffer size 1e5; replay fraction 0.9; replay prioritization exponent 1.0; Adam step size 1e-4; λ of V-Trace(λ) 0.9. Algorithm-specific hyperparameters, per Figure 10 variant (None, Fixed scalar, Tuned scalar, Tuned state-action): entropy regularization τ: n/a, 0.01, n/a, n/a; uncertainty scale σ_scale: 0.01, 0.01, 0.01, 0.005; τ_min: n/a, n/a, 0.005, n/a; τ_max: n/a, n/a, 10, n/a; τ_init: n/a, n/a, 0.02, n/a; τ step size: n/a, n/a, 1e-4, n/a. (A hypothetical config encoding of the common values appears after the table.) |
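
The Grid World experiment estimates $\lambda^{P^\star}$ as the average of sampled Thompson-sampling occupancy measures. The sketch below illustrates that Monte Carlo estimate, assuming a discounted tabular formulation for brevity (the paper uses finite-horizon layered MDPs); `sample_posterior` is a hypothetical callback that returns one posterior draw of the dynamics and rewards.

```python
import numpy as np

def greedy_policy(P, r, gamma=0.99, iters=500):
    """Value iteration on one sampled MDP; returns the greedy deterministic policy."""
    S, A, _ = P.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        q = r + gamma * P @ q.max(axis=1)      # Bellman optimality backup
    return q.argmax(axis=1)

def occupancy(P, pi, mu0, gamma=0.99):
    """Discounted state-action occupancy measure of a deterministic policy."""
    S, A, _ = P.shape
    P_pi = P[np.arange(S), pi]                 # (S, S): next-state probs under pi
    # Solve (I - gamma * P_pi^T) d = (1 - gamma) * mu0 for the state occupancy d
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu0)
    lam = np.zeros((S, A))
    lam[np.arange(S), pi] = d                  # mass sits on the chosen actions
    return lam

def lambda_ts_average(sample_posterior, mu0, n_samples=1000, gamma=0.99):
    """Monte Carlo estimate of E[lambda^TS] = lambda^{P*} over posterior samples."""
    draws = []
    for _ in range(n_samples):
        P, r = sample_posterior()              # one Thompson-sampling draw
        pi = greedy_policy(P, r, gamma)
        draws.append(occupancy(P, pi, mu0, gamma))
    return np.mean(draws, axis=0)
```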
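Step 2 of Algorithm 1 is a convex program that the paper solves with CVXPY and the ECOS solver. The sketch below is in that spirit but is not the exact Equation (3) objective: it maximizes an optimistic return $\langle \lambda, \bar r + \sigma \rangle$ plus an entropy term over the occupancy-measure polytope (closer to the VAPOR-lite simplification described in the paper), on a small random discounted MDP. All sizes, values, and the regularization weight `tau` are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

S, A = 10, 4                                    # illustrative state/action counts
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a, :] = transition distribution
r_bar = rng.uniform(size=(S, A))                # posterior-mean rewards (E_t r)
sigma = 0.01 * np.ones((S, A))                  # uncertainty measure (sigma_t)
mu0 = np.full(S, 1.0 / S)                       # initial-state distribution
gamma, tau = 0.99, 0.01                         # discount, entropy weight (assumed)

lam = cp.Variable((S, A), nonneg=True)          # occupancy measure lambda
# Flow conservation: outflow of each state equals initial mass plus discounted inflow
inflow = cp.hstack([cp.sum(cp.multiply(P[:, :, s], lam)) for s in range(S)])
flow = [cp.sum(lam, axis=1) == (1 - gamma) * mu0 + gamma * inflow]
# Optimistic value plus entropy regularization (cp.entr(x) = -x log x)
obj = cp.Maximize(cp.sum(cp.multiply(lam, r_bar + sigma)) + tau * cp.sum(cp.entr(lam)))
cp.Problem(obj, flow).solve(solver=cp.ECOS)     # ECOS handles the exponential cone

# Step 3 of Algorithm 1: the executed policy is proportional to the occupancy measure
pi = lam.value / lam.value.sum(axis=1, keepdims=True)
```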
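For completeness, the common Atari hyperparameters of Table 2 can be collected into a config. Only the values come from the paper; the key names are assumptions for illustration.

```python
# Common Atari hyperparameters from Table 2 (key names assumed, values from the paper)
ATARI_COMMON_HPARAMS = {
    "discount_factor": 0.995,
    "replay_buffer_size": int(1e5),
    "replay_fraction": 0.9,
    "replay_prioritization_exponent": 1.0,
    "adam_step_size": 1e-4,
    "vtrace_lambda": 0.9,
}
```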