Exploration via Epistemic Value Estimation

Authors: Simon Schmitt, John Shawe-Taylor, Hado van Hasselt

AAAI 2023

Reproducibility assessment — each entry lists the variable, the extracted result, and the supporting LLM response:
Research Type: Experimental. "Experiments confirm that the EVE recipe facilitates efficient exploration in hard exploration tasks." and, from Section 6 (Experiments): "We have proposed a general recipe for epistemic value estimation (EVE), derived a simple example agent from it in Section 4 and empirically evaluate it here. In Figure 2 we observe competitive performance on the Bsuite benchmarks (Osband et al. 2020), where our agent matches the state-of-the-art results from Osband, Aslanides, and Cassirer (2018) that employs an ensemble of 20 independent neural network copies each with their own copy of a random prior function, target network and optimizer state."
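For context, the baseline matched in that quote follows the randomized-prior-function recipe of Osband, Aslanides, and Cassirer (2018): each ensemble member adds a frozen, randomly initialized network to its trainable one, and disagreement across members serves as the epistemic signal. A minimal NumPy sketch, assuming a toy two-layer network; names such as PriorQNetwork and the prior scale beta are illustrative, not from either paper:

```python
import numpy as np

class MLP:
    """Tiny two-layer network; weights are fixed at construction."""
    def __init__(self, in_dim, hidden, out_dim, rng):
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden))
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, out_dim))
    def __call__(self, x):
        return np.maximum(x @ self.W1, 0.0) @ self.W2  # ReLU hidden layer

class PriorQNetwork:
    """A trainable network plus a frozen random prior; the member's
    value estimate is trainable(x) + beta * prior(x)."""
    def __init__(self, in_dim, hidden, n_actions, beta, rng):
        self.trainable = MLP(in_dim, hidden, n_actions, rng)  # updated by TD learning
        self.prior = MLP(in_dim, hidden, n_actions, rng)      # never updated
        self.beta = beta
    def q_values(self, x):
        return self.trainable(x) + self.beta * self.prior(x)

rng = np.random.default_rng(0)
# 20 independent members, mirroring the quoted baseline configuration.
ensemble = [PriorQNetwork(10, 32, 2, beta=3.0, rng=rng) for _ in range(20)]
obs = rng.normal(size=10)
q_samples = np.stack([m.q_values(obs) for m in ensemble])  # shape (20, 2)
print(q_samples.std(axis=0))  # member disagreement ~ epistemic uncertainty
```

The practical appeal of EVE, per the quote above, is matching this ensemble's exploration performance without maintaining 20 separate networks, priors, target networks, and optimizer states.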
Researcher Affiliation: Collaboration. Simon Schmitt (DeepMind; University College London, UK), John Shawe-Taylor (University College London, UK), Hado van Hasselt (DeepMind). Contact: suschmitt@google.com
Pseudocode: Yes. "Algorithm 1: Standard Q-Learning with ϵ-greedy exploration.", "Algorithm 2: Epistemic Q-Learning using EVE with diagonal Fisher approximation.", and "Algorithm 3: Epistemic Q-Learning using EVE with diagonal Fisher approximation, burn-in, target networks, and ADAM optimization (omitting mini-batching details)."
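The core idea of Algorithm 2 — a single Q-network whose parameter uncertainty is tracked with a diagonal Fisher approximation — can be illustrated in a tabular setting, where each per-parameter gradient is an indicator and the squared-gradient Fisher update collapses to a visit-count accumulator. A minimal sketch under those assumptions, reusing the hyper-parameter names quoted below; this is an illustrative reconstruction, not the paper's exact Algorithm 2:

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99       # regular parameters: learning rate, discount
sigma_return_sq = 1.0          # return-variance hyper-parameter
eps_reg = 1e-8                 # Fisher regularization ϵ

rng = np.random.default_rng(0)
q = np.zeros((n_states, n_actions))
fisher = np.zeros((n_states, n_actions))  # diagonal Fisher estimate

def step(s, a):
    """Toy chain: action 1 moves right; reward only at the far end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == n_states - 1)

s = 0
for t in range(2000):
    # Thompson-style action selection: sample values from the Gaussian
    # posterior N(q, sigma_return_sq / (fisher + eps_reg)).
    q_sample = q[s] + rng.normal(size=n_actions) * np.sqrt(
        sigma_return_sq / (fisher[s] + eps_reg))
    a = int(np.argmax(q_sample))
    s_next, r = step(s, a)
    q[s, a] += alpha * (r + gamma * q[s_next].max() - q[s, a])
    fisher[s, a] += 1.0  # count-style accumulator; the paper instead
                         # updates the Fisher with a learning rate β
    s = 0 if r > 0 else s_next
print(np.round(q, 2))
```

With function approximation the accumulator becomes a running estimate of the squared gradients per parameter, which is what makes the diagonal Fisher cheap enough to maintain alongside ADAM-style optimizer state, as Algorithm 3's caption suggests.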
Open Source Code: No. The paper does not explicitly state that its code is open-source or provide a link to a code repository.
Open Datasets: Yes. "We use the behaviour suite benchmark (Bsuite) with special focus on the Deep Sea environment to empirically analyze our epistemic Q-Learning example agent from Section 4." and "Behaviour Suite: Bsuite was introduced by Osband et al. (2020) to facilitate the comparison of agents not just in terms of total score but across meaningful capabilities (such as exploration, credit assignment, memory, and generalization with function approximators)."
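Bsuite is openly available (github.com/deepmind/bsuite), so the evaluation environments can be loaded directly. A minimal sketch, assuming `pip install bsuite` and Bsuite's sweep-id naming, where `deep_sea/0` is one configuration of the Deep Sea environment:

```python
import bsuite

env = bsuite.load_from_id('deep_sea/0')  # returns a dm_env.Environment
timestep = env.reset()
episode_return = 0.0
while not timestep.last():
    timestep = env.step(0)  # placeholder action; an agent would choose here
    episode_return += timestep.reward
print('episode return:', episode_return)
```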
Dataset Splits: No. The paper evaluates agents on Bsuite environments but does not define explicit train/validation/test dataset splits.
Hardware Specification: No. The paper does not provide specific hardware details (e.g., CPU/GPU models, memory amounts, or cloud instances) used for running its experiments.
Software Dependencies: No. The paper mentions software components such as a neural network architecture, optimizer, target networks, Leaky-ReLU activations, ADAM optimization, and automatic differentiation frameworks, but it does not specify concrete version numbers for these or any other software dependencies.
Experiment Setup: Yes. "Exploration hyper-parameters were tuned among possible powers of 10 in the range [10^-15, 10^10]." and "Exploration Parameters: exploration scale ω, return variance σ²_Return, Fisher learning rate β, Fisher regularization ϵ, burn-in steps K_burnin. Regular Parameters: learning rate α, target network period K_target, neural network q_θ, discount γ."
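The quoted sweep corresponds to a log-spaced grid of 26 candidate values per exploration hyper-parameter; a minimal sketch of generating it, with the training loop left as a placeholder:

```python
# Powers of 10 spanning [10^-15, 10^10], as quoted above.
candidates = [10.0 ** k for k in range(-15, 11)]  # 26 values
for omega in candidates:
    ...  # hypothetical: train and evaluate the agent with exploration scale omega
```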