Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Decoupling Pixel Flipping and Occlusion Strategy for Consistent XAI Benchmarks
Authors: Stefan Bluecher, Johanna Vielhaben, Nils Strodthoff
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study proposes two complementary perspectives to resolve this disagreement problem. Firstly, we address the common criticism of occlusion-based XAI, that artificial samples lead to unreliable model evaluations. We propose to measure the reliability by the R(eference)-Out-of-Model-Scope (R-OMS) score. The R-OMS score enables a systematic comparison of occlusion strategies and resolves the disagreement problem by grouping consistent PF rankings. Secondly, we show that the insightfulness of MIF and LIF is conversely dependent on the R-OMS score. To leverage this, we combine the MIF and LIF measures into the symmetric relevance gain (SRG) measure. This breaks the inherent connection to the underlying occlusion strategy and leads to consistent rankings. This resolves the disagreement problem of PF benchmarks, which we verify for a set of 40 different occlusion strategies. |
| Researcher Affiliation | Academia | Stefan Blücher, BIFOLD Berlin Institute for the Foundations of Learning and Data, Machine Learning Group, TU Berlin; Johanna Vielhaben, Explainable Artificial Intelligence Group, Fraunhofer Heinrich-Hertz-Institute; Nils Strodthoff, Division AI4Health, Carl von Ossietzky Universität Oldenburg |
| Pseudocode | No | The paper describes methods and measures in narrative text and mathematical equations (e.g., Equation (1), Equation (2), Equation (3), Equation (4), Equation (5)) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/bluecher31/pixel-flipping. |
| Open Datasets | Yes | All results are based on 100 randomly selected ImageNet samples. |
| Dataset Splits | No | All results are based on 100 randomly selected ImageNet samples. Based on Section 3, we construct a diverse set of 40 occlusion strategies, varying all design choices (n: 25, 100, 500, 5000; imputer: mean, train set, histogram, cv2, diffusion; model: standard-ResNet50, timm-ResNet50). This text describes the samples used for evaluation but does not specify train/test/validation splits, nor how the 100 evaluation samples relate to the splits used to train the models on ImageNet. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or detailed computer specifications used for running its experiments. |
| Software Dependencies | No | Gradient-based attributions are calculated using captum (Kokhlikyan et al., 2020), LRP using zennit (Anders et al., 2021). These software libraries are mentioned without specific version numbers. |
| Experiment Setup | Yes | Setup This section explores the impact of different occlusion strategies on PF benchmarks. All results are based on 100 randomly selected ImageNet samples. Based on Section 3, we construct a diverse set of 40 occlusion strategies, varying all design choices (n: 25, 100, 500, 5000; imputer: mean, train set, histogram, cv2, diffusion; model: standard-ResNet50, timm-ResNet50). |
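The paper combines the MIF (most influential first) and LIF (least influential first) pixel-flipping curves into the symmetric relevance gain (SRG). A minimal sketch of the idea, assuming SRG is summarized as the difference between the LIF and MIF curve averages and using simple constant-value occlusion; the function names and the toy linear model below are illustrative, not the authors' implementation:

```python
import numpy as np

def pixel_flipping_curve(model, x, attribution, order="MIF", baseline=0.0, steps=10):
    """Occlude features in order of attribution and record the model score.

    order="MIF": most influential first; order="LIF": least influential first.
    Constant-value imputation is only one of the occlusion strategies the
    paper compares (mean, train set, histogram, cv2, diffusion).
    """
    ranking = np.argsort(attribution)          # ascending relevance
    if order == "MIF":
        ranking = ranking[::-1]                # descending relevance
    x_occluded = x.astype(float).copy()
    scores = [model(x_occluded)]               # unperturbed score first
    for chunk in np.array_split(ranking, steps):
        x_occluded[chunk] = baseline           # occlude the next group
        scores.append(model(x_occluded))
    return np.array(scores)

def srg_score(model, x, attribution, **kwargs):
    """Symmetric relevance gain: gap between the LIF and MIF curves
    (summarized here by their means as a simple hypothetical proxy)."""
    mif = pixel_flipping_curve(model, x, attribution, order="MIF", **kwargs)
    lif = pixel_flipping_curve(model, x, attribution, order="LIF", **kwargs)
    return lif.mean() - mif.mean()
```

For a faithful attribution, removing the most influential features first degrades the score fastest, so the LIF curve stays above the MIF curve and the SRG proxy is positive; an uninformative attribution yields a gap near zero regardless of the occlusion strategy.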