Efficient Exploration via Epistemic-Risk-Seeking Policy Optimization
Authors: Brendan O’Donoghue
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conclude with some results showing good performance of a deep RL agent using the technique on the challenging Deep Sea environment, showing significant performance improvements even over other efficient exploration techniques, as well as improved performance on the Atari benchmark. |
| Researcher Affiliation | Industry | Google DeepMind, London. Correspondence to: Brendan O'Donoghue <bodonoghue85@gmail.com>. |
| Pseudocode | Yes | Algorithm 1 Epistemic-risk-seeking actor-critic (ERSAC) |
| Open Source Code | No | The paper does not provide any explicit statements or links indicating that the source code for its methodology is open-source or publicly available. |
| Open Datasets | Yes | Finally, we compare ERSAC + replay to an Actor-critic + replay agent on the Atari benchmark (Bellemare et al., 2012). |
| Dataset Splits | No | The paper describes the environments and how the agent interacts with them (e.g., 'agent finds itself at the top left of an L × L grid' for Deep Sea, 'actors generating experience and sending them to a learner' for Atari), which is typical for reinforcement learning. However, it does not provide specific training/test/validation dataset splits (e.g., percentages or counts) as defined for static datasets in supervised learning. |
| Hardware Specification | No | The paper states that Bootstrapped DQN 'required a GPU to run efficiently' but does not specify the model or any other details of the GPU, CPU, or overall hardware configuration used for experiments. |
| Software Dependencies | No | The paper does not list specific version numbers for any software components, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | The off-policy agent here used a batch size of 16 with an offline-data fraction of 0.97 per batch. Replay was prioritized by TD-error and when sampling the replay prioritization exponent was 1.0 (Schaul et al., 2015). The replay noise parameter was ρ = 0.1. All other settings were identical to the on-policy variant. ... In our experiments we augmented the neural network with an ensemble of reward prediction heads with randomized prior functions (Osband et al., 2018), and used the variance of the ensemble predictions as the uncertainty signal. A sketch of this ensemble-based uncertainty signal appears below the table. |
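
The experiment-setup excerpt above describes the uncertainty signal as the variance of an ensemble of reward-prediction heads with randomized prior functions (Osband et al., 2018). The sketch below is an illustrative, non-authoritative rendering of that idea, not the authors' code: the ensemble size, layer widths, prior scale, and class names (`PriorRewardHead`, `RewardEnsemble`) are assumptions for demonstration only.

```python
# Hedged sketch (assumed hyperparameters, not the paper's implementation):
# each ensemble member is a trainable reward head plus a fixed, randomly
# initialized "prior" network added at a fixed scale; the variance of the
# members' reward predictions serves as the epistemic-uncertainty signal.
import torch
import torch.nn as nn


def mlp(in_dim: int, hidden: int = 64, out_dim: int = 1) -> nn.Sequential:
    # Small MLP used for both the trainable head and the frozen prior head.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))


class PriorRewardHead(nn.Module):
    """One ensemble member: trainable head + frozen randomized prior."""

    def __init__(self, feat_dim: int, prior_scale: float = 1.0):
        super().__init__()
        self.trainable = mlp(feat_dim)
        self.prior = mlp(feat_dim)
        for p in self.prior.parameters():  # the prior network is never trained
            p.requires_grad_(False)
        self.prior_scale = prior_scale

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.trainable(features) + self.prior_scale * self.prior(features)


class RewardEnsemble(nn.Module):
    """Ensemble of reward heads; prediction variance is the uncertainty proxy."""

    def __init__(self, feat_dim: int, num_heads: int = 10):
        super().__init__()
        self.heads = nn.ModuleList(PriorRewardHead(feat_dim) for _ in range(num_heads))

    def forward(self, features: torch.Tensor):
        preds = torch.stack([h(features) for h in self.heads], dim=0)  # [K, B, 1]
        mean = preds.mean(dim=0)
        uncertainty = preds.var(dim=0)  # epistemic-uncertainty signal
        return mean, uncertainty


if __name__ == "__main__":
    ensemble = RewardEnsemble(feat_dim=32)
    feats = torch.randn(16, 32)  # batch of 16 feature vectors (illustrative)
    reward_mean, reward_uncertainty = ensemble(feats)
    print(reward_mean.shape, reward_uncertainty.shape)
```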