Making Sense of Reinforcement Learning and Probabilistic Inference
Authors: Brendan O'Donoghue, Ian Osband, Catalin Ionescu
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4: Computational Experiments; Figure 1: Regret scaling on Problem 1; Figure 3: Learning times for Deep Sea experiments. |
| Researcher Affiliation | Industry | DeepMind, London, UK, {bodonoghue,iosband,cdi}@google.com |
| Pseudocode | Yes | Table 1: Model-based Thompson sampling.; Table 2: Soft Q-learning.; Table 3: K-learning. |
| Open Source Code | No | The paper does not provide an explicit statement of code release or a link to a repository containing the source code for the methodology described. |
| Open Datasets | Yes | Our next set of experiments considers the Deep Sea MDPs introduced by Osband et al. (2017).; We then evaluate all of the algorithms on bsuite: A suite of benchmark tasks designed to highlight key issues in RL (Osband et al., 2019). |
| Dataset Splits | No | The paper does not explicitly provide train/validation/test dataset splits (e.g., percentages or sample counts per split); such splits are not typical for the reinforcement learning problems studied here, where agents learn online in environments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud computing instance types) used for running experiments. |
| Software Dependencies | No | The paper mentions the 'Adam optimizer' but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | All three algorithms used the same neural network architecture consisting of an MLP (multilayer perceptron) with a single hidden layer with 50 hidden units. All three algorithms used a replay buffer of the most recent 10^4 transitions to allow re-use of data. For all three, the Adam optimizer (Kingma & Ba, 2014) was used with learning rate 10^-3 and batch size 128, and learning is performed at every time-step. For both K-learning and soft Q-learning the temperature was set at β^-1 = 0.01. For Bootstrapped DQN we chose an ensemble of size 20 and used randomized prior functions (Osband et al., 2018) with scale 3. For K-learning, in order to estimate the cumulant generating function of the reward, we used an ensemble of neural networks predicting the reward for each state and action and used these to calculate the empirical cumulant generating function over them. Each of these was a single-hidden-layer MLP with 10 hidden units. Finally, we noted that training a small ensemble of K-networks performed better than a single network; we used an ensemble of size 10 for this purpose, as well as randomized priors with scale 1.0 to encourage diversity between the elements of the ensemble. The K-learning policy was the Boltzmann policy over all the ensemble K-values at each state. |
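
For concreteness, the hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The paper releases no code, so this is purely illustrative: the class and function names below are assumptions, and PyTorch is used only as a convenient stand-in for whatever framework the authors used.

```python
# Illustrative reconstruction of the reported setup; names and framework are assumptions.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class AgentConfig:
    hidden_units: int = 50            # single-hidden-layer MLP
    replay_capacity: int = 10_000     # most recent 10^4 transitions
    learning_rate: float = 1e-3       # Adam
    batch_size: int = 128
    inverse_temperature: float = 1.0 / 0.01  # beta, since beta^-1 = 0.01
    bootstrap_ensemble_size: int = 20        # Bootstrapped DQN ensemble
    bootstrap_prior_scale: float = 3.0       # randomized prior functions
    k_ensemble_size: int = 10                # ensemble of K-networks
    k_prior_scale: float = 1.0
    reward_mlp_hidden_units: int = 10        # reward models for the empirical CGF


def make_value_mlp(obs_dim: int, num_actions: int, cfg: AgentConfig) -> nn.Module:
    """Single-hidden-layer MLP mapping observations to per-action values."""
    return nn.Sequential(
        nn.Linear(obs_dim, cfg.hidden_units),
        nn.ReLU(),
        nn.Linear(cfg.hidden_units, num_actions),
    )


if __name__ == "__main__":
    cfg = AgentConfig()
    net = make_value_mlp(obs_dim=8, num_actions=4, cfg=cfg)
    optimizer = torch.optim.Adam(net.parameters(), lr=cfg.learning_rate)
    print(net)
```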
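
The setup also states that the K-learning policy is the Boltzmann policy over the ensemble K-values at each state. Below is a minimal sketch of one plausible reading, in which the ensemble's K-values are averaged before applying a softmax with inverse temperature β; the averaging step is an assumption, since the quoted text does not specify how the ensemble is combined.

```python
import numpy as np


def k_learning_boltzmann_policy(k_values: np.ndarray, beta: float = 100.0) -> np.ndarray:
    """Boltzmann (softmax) action distribution over ensemble K-values.

    k_values: shape (ensemble_size, num_actions), each member's K-value
        estimates at the current state.
    beta: inverse temperature; the setup above uses beta^-1 = 0.01, i.e. beta = 100.

    Averaging over the ensemble is an assumption made for this sketch.
    """
    mean_k = k_values.mean(axis=0)
    logits = beta * mean_k
    logits -= logits.max()           # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Example: 10 ensemble members, 4 actions.
rng = np.random.default_rng(0)
action_probs = k_learning_boltzmann_policy(rng.normal(size=(10, 4)))
```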