Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Approximating Shapley Explanations in Reinforcement Learning

Authors: Daniel Beechey, Ozgur Simsek

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We illustrate the use of Fast SVERL in multiple domains, guided by three questions on accuracy, efficiency, and scalability: (1) How well can the proposed models learn to approximate characteristic functions and Shapley values? (2) How many training updates are required to reach a given level of approximation error? (3) How does the computational cost of the approximation scale with the number of states and features in an environment? We start by focusing on outcome explanations, a natural choice because they depend on three components (the behaviour characteristic, the outcome characteristic, and Shapley values) that together span the key parametric models and loss functions introduced in Sections 3.1 and 3.2. We approximate all three components for a DQN agent in the Mastermind-222 domain used by Beechey et al. [2]. In the main paper, we present experiments on a subset of explanation types and domains, focusing on Mastermind-222, with eight features and 53 states, due to its tractability for exact Shapley value computation. In the appendix, analogous results for all explanation types, additional domains, and complete domain descriptions are provided, including experiments in larger domains where exact computation remains feasible only for behaviour and prediction explanations. Figure 1 shows the mean squared error (MSE) between predicted and exact values for (a) the behaviour characteristic, (b) outcome characteristic, and (c) outcome Shapley values, averaged over all states and features, plotted against training updates.
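The quoted passage relies on exact Shapley values being tractable in a domain with eight features, and on MSE against those exact values as the accuracy measure. The following is a minimal sketch of both ideas, not the authors' implementation: `exact_shapley`, `mse`, `toy_value`, and the `approx` numbers are hypothetical names and data chosen for illustration, and the characteristic function is a made-up three-feature game rather than anything from the paper.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (O(2^n) calls)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(n_features):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for coalition in combinations(others, size):
                without_i = frozenset(coalition)
                with_i = without_i | {i}
                phi[i] += weight * (value_fn(with_i) - value_fn(without_i))
    return phi


def mse(predicted, exact):
    """Mean squared error between two equally sized lists of values."""
    return sum((p - e) ** 2 for p, e in zip(predicted, exact)) / len(exact)


# Hypothetical toy game: worth 1 only when features 0 and 1 are both known.
def toy_value(coalition):
    return 1.0 if {0, 1} <= coalition else 0.0


exact = exact_shapley(toy_value, n_features=3)
approx = [0.4, 0.6, 0.0]  # made-up approximation, for illustration only
error = mse(approx, exact)
```

Enumeration touches every subset of features, so the cost doubles with each added feature; this is why exact values are practical at eight features but motivate learned approximations in larger domains.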
Researcher Affiliation Academia Daniel Beechey University of Bath United Kingdom EMAIL Özgür Şimşek University of Bath United Kingdom EMAIL
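Pseudocode No The paper describes methods and equations (e.g., $L(\theta) = \mathbb{E}_{p^{\pi}(s)} \, \mathbb{E}_{\mathrm{Unif}(a)} \, \mathbb{E}_{p(C)} \left| \pi_s^a(C) - \pi_s^a(\emptyset) - \sum_{i \in C} \hat{\phi}_i(s, a; \theta) \right|^2$) but does not present a clearly labeled pseudocode block or algorithm.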
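Since the paper gives no pseudocode, the loss quoted in the Pseudocode row can be read as follows: for a sampled coalition C, the predicted contributions of the features in C should sum to the change in the characteristic value relative to the empty coalition. Below is a hedged sketch of that residual for a single fixed state-action pair. The names `shapley_residual`, `estimate_loss`, and `additive_value` are hypothetical, the coalition distribution (each feature included with probability 0.5) is an assumption that may differ from the paper's p(C), and a fixed list stands in for the parametric model.

```python
import random


def shapley_residual(value_fn, phi, coalition):
    """Squared residual |v(C) - v(empty) - sum_{i in C} phi_i|^2 for one coalition."""
    predicted_gain = sum(phi[i] for i in coalition)
    actual_gain = value_fn(coalition) - value_fn(frozenset())
    return (actual_gain - predicted_gain) ** 2


def estimate_loss(value_fn, phi, n_features, n_samples=1000, seed=0):
    """Monte Carlo estimate of the loss for one fixed (state, action) pair.

    Assumption: coalitions are drawn by including each feature with
    probability 0.5; the paper's p(C) may differ.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        coalition = frozenset(i for i in range(n_features) if rng.random() < 0.5)
        total += shapley_residual(value_fn, phi, coalition)
    return total / n_samples


# Hypothetical additive game: each known feature contributes a fixed amount.
weights = [0.5, -0.25, 1.0]


def additive_value(coalition):
    return 2.0 + sum(weights[i] for i in coalition)


loss_at_truth = estimate_loss(additive_value, weights, n_features=3)
loss_at_zero = estimate_loss(additive_value, [0.0, 0.0, 0.0], n_features=3)
```

In an additive game the Shapley values equal the per-feature weights, so the residual vanishes at `weights` and is positive for the all-zero guess. This makes the toy case a convenient sanity check before swapping in a learned model.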
Open Source Code Yes Fast SVERL code is available at: https://github.com/djeb20/fastsverl.
Open Datasets Yes We illustrate the use of Fast SVERL in multiple domains ... We approximate all three components for a DQN agent in the Mastermind-222 domain used by Beechey et al. [2]. ... Appendix G provides full domain descriptions for Gridworld and Mastermind, outlining their configurations and rules, allowing for their reproduction.
Dataset Splits No The paper discusses sampling from the steady-state distribution of the policy being explained, or from a replay buffer. This describes data generation and usage in reinforcement learning, but not explicit, predefined training/test/validation dataset splits typically found in supervised learning.
Hardware Specification Yes All experiments were conducted on a local workstation equipped with the following specifications: Processor: Intel i9-14900K (24 cores, up to 6.0GHz); GPU: NVIDIA RTX 4090 (24GB VRAM); Memory: 96GB DDR5 RAM; Storage: 2x1TB NVMe (Samsung 990 EVO) and 4TB SSD (Samsung 870 QVO).
Software Dependencies No The DQN agent used throughout the experiments is based on the implementation from Clean RL [11]. ... variability primarily stemming from sources typical to PyTorch-based training. While PyTorch and Clean RL are mentioned, specific version numbers are not provided.
Experiment Setup No The hyperparameters for all agents, characteristic models, and Shapley models were pragmatically chosen without tuning, as the experiments are intended to illustrate Fast SVERL's properties rather than benchmark against alternative methods. Initial values were selected, found to be sufficient for learning, and kept constant across experiments unless they were the specific subject of study or directly linked to design choices being evaluated. The only exception to this approach was the choice of the masking value used in behaviour and prediction characteristic models to represent unknown features. Although the theoretical framework permits any value outside the support of S, we found that large magnitude values hindered training stability, possibly due to amplified gradient magnitudes. In contrast, smaller magnitude values, closer to the support of S, resulted in smoother learning and were adopted for all experiments. While this describes the general approach to hyperparameters and one specific detail, it does not provide concrete numerical values for key hyperparameters such as learning rate, batch size, or optimizer settings.
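The masking step described in the quoted passage can be illustrated with a small sketch: features outside the known coalition are overwritten with a sentinel value that lies outside the support of the state space but close to it. The function name `mask_state`, the example state, and the choice of -1.0 are all assumptions made for illustration; the paper's actual masking value is not given in the quoted text.

```python
def mask_state(state, known_features, mask_value=-1.0):
    """Replace every feature outside `known_features` with `mask_value`."""
    return [x if i in known_features else mask_value
            for i, x in enumerate(state)]


# Hypothetical state whose features are assumed to lie in [0, 3], so -1.0 is
# outside the support while remaining small in magnitude.
state = [0.0, 2.0, 1.0, 3.0]
masked = mask_state(state, known_features={0, 2})
```

A sentinel such as -1000.0 would satisfy the same "outside the support" condition, but the quoted passage reports that large-magnitude values hindered training stability, which is the reason for preferring a value near the support.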