Model-Value Inconsistency as a Signal for Epistemic Uncertainty

Authors: Angelos Filos, Eszter Vértes, Zita Marinho, Gregory Farquhar, Diana Borsa, Abram Friesen, Feryal Behbahani, Tom Schaul, Andre Barreto, Simon Osindero

ICML 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We provide empirical evidence in both tabular and function approximation settings from pixels that self-inconsistency is useful (i) as a signal for exploration, (ii) for acting safely under distribution shifts, and (iii) for robustifying value-based planning with a learned model.
Researcher Affiliation Collaboration 1DeepMind, 2University of Oxford.
Pseudocode No The paper contains mathematical equations and diagrams to describe the methods, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code No The paper notes that the implementation builds on existing open-source libraries (JAX, TensorFlow) and prior agents (Muesli, VPN, Dreamer), but it does not state that the authors' own source code for the methodology described in this paper is released.
Open Datasets Yes In the deep RL experiments, we use a selection of 5 tasks from the procgen suite (Cobbe et al., 2019)... We also use a modification of the walker walk task from the DeepMind Control Suite (Tunyasuvunakool et al., 2020)... Lastly, we use the original minatar (Young & Tian, 2019) suite for fast experimentation with value-based agents (Mnih et al., 2013).
Dataset Splits Yes Figure 4 (left) reports the final performance of the agent evaluated on an additional 10M frames on the train and test levels. Values are normalised by the min and max scores for each game. Right: σ-IVE(5) computed using the model of the Muesli agent while evaluating on both training and unseen test levels, for different numbers of unique levels seen during training.
Hardware Specification No The paper mentions the software libraries used (JAX, TensorFlow) but does not specify any hardware details like CPU/GPU models, memory, or cloud instance types used for experiments.
Software Dependencies No The paper mentions key software components such as Python, JAX, TensorFlow, and Matplotlib along with their respective citations, but it does not specify the version numbers for any of these software dependencies.
Experiment Setup Yes We use an empty 5x5 gridworld, and collect data by rolling out a uniformly random policy, initialised at the bottom right cell. We use the Dreamer agent's default hyperparameters. For the self-inconsistency-seeking variant, i.e., µ + σ-IVE(5), we used a scalar weighting factor β_IVE = 0.1 to balance the mean and standard deviation across the ensemble members, tuned with grid search in {0.05, 0.1, 0.2, 1.0, 10.0}. The Adam (Kingma & Ba, 2014) optimiser with learning rate 5e-5 is used, and all losses converge after 10,000 epochs of stochastic gradient descent with batch size 128. We train Muesli for 100M environment frames and set the fraction of replay data in each batch to 0.8.
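For reference, the σ-IVE(K) signal quoted throughout this report is the standard deviation of the "implicit value ensemble", i.e., the k-step model-based value estimates for k = 0, ..., K, and the self-inconsistency-seeking objective combines it with the ensemble mean as µ + β_IVE · σ. A minimal NumPy sketch of that combination (the function name and array shapes are our own illustration, not from the paper):

```python
import numpy as np

def sigma_ive(k_step_values, beta=0.1):
    """Hypothetical sketch of the mu + beta * sigma objective.

    k_step_values: sequence of length K+1 holding the k-step
    model-based value estimates v^0(s), ..., v^K(s) for one state
    (the implicit value ensemble).
    Returns (mu, sigma, score), where sigma is the self-inconsistency
    signal and score = mu + beta * sigma is the exploration objective
    with weighting factor beta (beta_IVE = 0.1 in the paper's setup).
    """
    v = np.asarray(k_step_values, dtype=float)
    mu = v.mean()      # ensemble mean value estimate
    sigma = v.std()    # self-inconsistency: spread across the ensemble
    return mu, sigma, mu + beta * sigma
```

A perfectly self-consistent model (all k-step estimates equal) yields σ = 0, so the objective reduces to the mean value; disagreement between the estimates adds a β-weighted bonus.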