Efficient Exploration via Epistemic-Risk-Seeking Policy Optimization
Authors: Brendan O’Donoghue
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conclude with some results showing good performance of a deep RL agent using the technique on the challenging Deep Sea environment, showing significant performance improvements even over other efficient exploration techniques, as well as improved performance on the Atari benchmark. |
| Researcher Affiliation | Industry | Google DeepMind, London. Correspondence to: Brendan O'Donoghue <bodonoghue85@gmail.com>. |
| Pseudocode | Yes | Algorithm 1 Epistemic-risk-seeking actor-critic (ERSAC) |
| Open Source Code | No | The paper does not provide any explicit statements or links indicating that the source code for its methodology is open-source or publicly available. |
| Open Datasets | Yes | Finally, we compare ERSAC + replay to an Actor-critic + replay agent on the Atari benchmark (Bellemare et al., 2012). |
| Dataset Splits | No | The paper describes the environments and how the agent interacts with them (e.g., 'agent finds itself at the top left of an L × L grid' for Deep Sea, 'actors generating experience and sending them to a learner' for Atari), which is typical for reinforcement learning. However, it does not provide specific training/test/validation dataset splits (e.g., percentages or counts) as defined for static datasets in supervised learning. |
| Hardware Specification | No | The paper states that Bootstrapped DQN 'required a GPU to run efficiently' but does not specify the model or any other details of the GPU, CPU, or overall hardware configuration used for experiments. |
| Software Dependencies | No | The paper does not list specific version numbers for any software components, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | The off-policy agent here used a batch size of 16 with an offline-data fraction of 0.97 per batch. Replay was prioritized by TD-error and when sampling the replay prioritization exponent was 1.0 (Schaul et al., 2015). The replay noise parameter was ρ = 0.1. All other settings were identical to the on-policy variant. ... In our experiments we augmented the neural network with an ensemble of reward prediction heads with randomized prior functions (Osband et al., 2018), and used the variance of the ensemble predictions as the uncertainty signal. A sketch of this ensemble-based uncertainty signal appears below the table. |
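
The experiment-setup excerpt above describes the uncertainty signal as the variance of an ensemble of reward-prediction heads with randomized prior functions (Osband et al., 2018). The sketch below is an illustrative, non-authoritative rendering of that idea, not the authors' code: the ensemble size, layer widths, prior scale, and class names (`PriorRewardHead`, `RewardEnsemble`) are assumptions for demonstration only.

```python
# Hedged sketch (assumed hyperparameters, not the paper's implementation):
# each ensemble member is a trainable reward head plus a fixed, randomly
# initialized "prior" network added at a fixed scale; the variance of the
# members' reward predictions serves as the epistemic-uncertainty signal.
import torch
import torch.nn as nn


def mlp(in_dim: int, hidden: int = 64, out_dim: int = 1) -> nn.Sequential:
    # Small MLP used for both the trainable head and the frozen prior head.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))


class PriorRewardHead(nn.Module):
    """One ensemble member: trainable head + frozen randomized prior."""

    def __init__(self, feat_dim: int, prior_scale: float = 1.0):
        super().__init__()
        self.trainable = mlp(feat_dim)
        self.prior = mlp(feat_dim)
        for p in self.prior.parameters():  # the prior network is never trained
            p.requires_grad_(False)
        self.prior_scale = prior_scale

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.trainable(features) + self.prior_scale * self.prior(features)


class RewardEnsemble(nn.Module):
    """Ensemble of reward heads; prediction variance is the uncertainty proxy."""

    def __init__(self, feat_dim: int, num_heads: int = 10):
        super().__init__()
        self.heads = nn.ModuleList(PriorRewardHead(feat_dim) for _ in range(num_heads))

    def forward(self, features: torch.Tensor):
        preds = torch.stack([h(features) for h in self.heads], dim=0)  # [K, B, 1]
        mean = preds.mean(dim=0)
        uncertainty = preds.var(dim=0)  # epistemic-uncertainty signal
        return mean, uncertainty


if __name__ == "__main__":
    ensemble = RewardEnsemble(feat_dim=32)
    feats = torch.randn(16, 32)  # batch of 16 feature vectors (illustrative)
    reward_mean, reward_uncertainty = ensemble(feats)
    print(reward_mean.shape, reward_uncertainty.shape)
```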