Probabilistic Inference in Reinforcement Learning Done Right

Authors: Jean Tarbouriech, Tor Lattimore, Brendan O'Donoghue

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 8 (Numerical Experiments), Grid World: We first study empirically how well VAPOR approximates P . Since E[ TS] = P , we estimate the latter by averaging over 1000 samples of the (random) TS occupancy measure (we denote it by TS (1000)). We design simple 10 × 10 Grid World MDPs with four cardinal actions, known dynamics and randomly generated reward. Figure 2 suggests that VAPOR and the TS average output similar approximations of P , thus showing the accuracy of our variational approximation in this domain.
Researcher Affiliation | Industry | Jean Tarbouriech, Google DeepMind, jtarbouriech@google.com; Tor Lattimore, Google DeepMind, lattimore@google.com; Brendan O'Donoghue, Google DeepMind, bodonoghue@google.com
Pseudocode | Yes | Algorithm 1 (VAPOR learning algorithm). For episode t = 1, 2, ... do: 1. Compute expected rewards Et r, transitions Et P, uncertainty measure p t. 2. Solve VAPOR optimization problem t argmax (Et P ) V p t( ) from Equation (3). 3. Execute policy t l(s, a) t l(s, a), for l = 1, ..., L. End for.
Open Source Code | No | The paper does not provide any explicit statement about open-source code release or a link to a code repository.
Open Datasets | Yes | We consider the Deep Sea domain where instead of using a tabular state representation, we feed a one-hot representation of the agent location into a neural network, using bsuite [58]. Finally, we investigate the performance of VAPOR-lite on the Atari benchmark [5].
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (percentages, sample counts, or specific citations to predefined splits) to reproduce the experiments.
Hardware Specification | No | The paper does not specify any particular GPU, CPU, or TPU models used for running the experiments.
Software Dependencies | No | The paper mentions using 'CVXPY [11], specifically the ECOS solver [12]' but does not provide explicit version numbers for these software components.
Experiment Setup | Yes | Table 2: Hyperparameters used in the Atari experiments (reproduced below).

Common Hyperparameter | Value
Discount factor | 0.995
Replay buffer size | 1e5
Replay fraction | 0.9
Replay prioritization exponent | 1.0
Adam step size | 1e-4
λ of V-Trace(λ) | 0.9

Algorithm-specific Hyperparameter (Figure 10) | None | Fixed, scalar | Tuned, scalar | Tuned, state-action
Entropy regularization | / | 0.01 | / | /
Uncertainty scale | 0.01 | 0.01 | 0.01 | 0.005
min | / | / | 0.005 | /
max | / | / | 10 | /
init | / | / | 0.02 | /
step size | / | / | 1e-4 | /
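
The Grid World evidence quoted in the Research Type row estimates a quantity by averaging the occupancy measure of Thompson sampling (TS) over many posterior samples. The following is a minimal sketch of that averaging step, not the authors' code. It assumes a small grid with deterministic known moves, an independent Gaussian posterior over each reward, a fixed start state and a finite horizon; the function names (grid_transitions, optimal_occupancy, ts_average_occupancy) and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np


def grid_transitions(n):
    """Next-state table nxt[s, a] for an n x n grid (up, down, left, right).

    Assumption: moves are deterministic and walls clamp the agent in place.
    """
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    nxt = np.zeros((n * n, 4), dtype=int)
    for r in range(n):
        for c in range(n):
            for a, (dr, dc) in enumerate(moves):
                rr = min(max(r + dr, 0), n - 1)
                cc = min(max(c + dc, 0), n - 1)
                nxt[r * n + c, a] = rr * n + cc
    return nxt


def optimal_occupancy(reward, nxt, horizon, start=0):
    """Occupancy lam[l, s, a] of a greedy optimal policy for one reward table."""
    n_states, n_actions = reward.shape
    value = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    # Backward induction over the finite horizon.
    for l in reversed(range(horizon)):
        q = reward + value[nxt]  # q[s, a] = r(s, a) + V_{l+1}(next state)
        policy[l] = q.argmax(axis=1)
        value = q.max(axis=1)
    # Forward pass: with deterministic moves the occupancy is a single path.
    lam = np.zeros((horizon, n_states, n_actions))
    s = start
    for l in range(horizon):
        a = policy[l, s]
        lam[l, s, a] = 1.0
        s = nxt[s, a]
    return lam


def ts_average_occupancy(mean, std, nxt, horizon, n_samples, rng):
    """Average the TS occupancy over n_samples posterior reward samples."""
    total = np.zeros((horizon,) + mean.shape)
    for _ in range(n_samples):
        sampled_reward = rng.normal(mean, std)  # one posterior sample (assumed Gaussian)
        total += optimal_occupancy(sampled_reward, nxt, horizon)
    return total / n_samples


# Hypothetical settings: 4 x 4 grid, horizon 6, random posterior means.
rng = np.random.default_rng(0)
n, horizon = 4, 6
nxt = grid_transitions(n)
mean = rng.normal(size=(n * n, 4))
std = np.ones((n * n, 4))
lam_bar = ts_average_occupancy(mean, std, nxt, horizon, 1000, rng)
```

Each layer of lam_bar sums to one, so it can be read as an empirical visitation distribution per step. The paper's experiment uses 10 × 10 grids and 1000 samples; the smaller grid here only keeps the sketch fast.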