Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations
Authors: Siddhartha Gairola, Moritz Böhle, Francesco Locatello, Bernt Schiele
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we present an important finding that raises questions about the underlying assumptions of post-hoc attribution methods and their utility on downstream tasks. In particular, we find that the quality of attributions for pre-trained models can be highly dependent on how the classification head (i.e. the probe) is trained, even if the model backbone remains frozen. [...] We validate our findings across several visual pre-training frameworks (fully-supervised, self-supervised, contrastive vision-language training) and model architectures, and analyze how they impact explanations for a wide range of attribution methods on a diverse set of evaluation metrics. |
| Researcher Affiliation | Collaboration | 1Max Planck Institute for Informatics, Saarland Informatics Campus, Germany, 2Institute of Science and Technology Austria, 3Kyutai, France. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper includes mathematical equations describing models and transformations (e.g., equations for LCE, BCE, conventional MLP, B-cos layer) but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | Code available at: https://github.com/sidgairo18/how-to-probe. [...] We provide the complete code for pre-training, probing and evaluation of the trained models as well as for generating the quantitative and qualitative results of the explanation methods used. The code is well-documented with helper scripts to run the different parts of the pipeline and help with reproducibility. Additionally, we also make the entire pipeline available to the broader community by open-sourcing our software and provide the pre-trained model checkpoints which further helps in reproducing the results in the manuscript. [...] Code to reproduce all experiments: https://github.com/sidgairo18/how-to-probe |
| Open Datasets | Yes | For single-label classification (ImageNet (Russakovsky et al., 2015)), we employ the grid pointing game (GridPG) (Böhle et al., 2021; 2022; Zhang et al., 2018; Samek et al., 2017). [...] For multi-label classification (VOC (Everingham et al., 2009), COCO (Lin et al., 2014)), we rely on the bounding box annotations provided in the datasets and use the energy pointing game (EPG) (Wang et al., 2020). |
| Dataset Splits | Yes | ImageNet We use the ImageNet-1K dataset that is part of the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2015). This has 1000 classes, with roughly 1000 images belonging to each category. In total, there are 1,281,167 training images and 50,000 validation images. [...] VOC 2007 (Everingham et al., 2009) is a popularly used multi-label image classification dataset. It comprises 9,963 images in total and 20 object classes, and is split into the train-val set with 5,011 images and the test set with 4,952 images. [...] MS COCO 2014 Microsoft COCO (Lin et al., 2014) is another popular dataset generally used for image classification, segmentation, object detection and captioning tasks. We use COCO-2014 in our experiments, which has 82,081 training images, 40,137 validation images, and 80 object classes. |
| Hardware Specification | No | The paper mentions 'running large-scale training on GPUs' and 'efforts must be made to be more careful when using such resources' but does not specify any particular GPU models, CPU models, memory sizes, or other detailed hardware specifications used for the experiments. |
| Software Dependencies | No | For all methods except B-cos, LRP and LIME we use the implementations provided by the captum library (github.com/pytorch/captum). For IntGrad, similar to (Böhle et al., 2022), we set n_steps = 50 for integrating over the gradients and a batch size of 16 to accommodate for limited compute. For computing LRP attributions we rely on the zennit library (https://github.com/chr5tphr/zennit) and use the EpsilonGammaBox composite. [...] For LIME attributions we use the official implementation available at https://github.com/marcotcr/lime, and for B-cos attributions we use the author-provided implementation at https://github.com/B-cos/B-cos-v2/. We also evaluate on explanation methods developed specifically for Vision Transformers (Kolesnikov et al., 2021). In particular we use CGW1 (Chefer et al., 2020), Rollout (Abnar & Zuidema, 2020) and ViT-CX (CausalX, Xie et al., 2022). For CGW1 and Rollout we use the author-provided implementation at https://github.com/hila-chefer/Transformer-Explainability, and for ViT-CX we use the official implementation available at https://github.com/vaynexie/CausalX-ViT. The paper lists several libraries and tools, along with their source or associated papers, but does not specify version numbers for any of them. |
| Experiment Setup | Yes | We pretrain all models on the ImageNet dataset (Russakovsky et al., 2015). For each self-supervised pre-training framework, we follow the standard recipes as mentioned in their respective works. To keep the configuration consistent we use a batch size of 256 for all models, distributed over 4 GPUs, and train for 200 epochs. The learning rate for each SSL framework is updated following the linear scaling rule (Goyal et al., 2017): lr = 0.0005 × batch_size / 256. [...] For probing the pre-trained SSL features, on ImageNet we train the probes for 100 epochs as is standard (Caron et al., 2021; Grill et al., 2020), and for 50 epochs when training on COCO and VOC datasets. [...] For fully supervised models and probing the pre-trained SSL features we use the following training configuration: the Adam (Kingma & Ba, 2015) optimizer, a batch size of 256, and a cosine learning rate schedule with warmup. Weight decay of 0.0001 is applied only for end-to-end training of standard models (i.e. non-B-cos models). |
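The experiment setup quotes the linear scaling rule of Goyal et al. (2017) for setting the learning rate from the batch size. A minimal sketch of that rule (the function name is ours, not from the paper):

```python
def scaled_lr(batch_size: int, base_lr: float = 0.0005) -> float:
    """Linear scaling rule (Goyal et al., 2017): lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256

# At the paper's batch size of 256, this recovers the base rate.
print(scaled_lr(256))  # → 0.0005
print(scaled_lr(512))  # → 0.001
```

Doubling the batch size doubles the learning rate, which is the intended behavior of the rule.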
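The energy pointing game (EPG; Wang et al., 2020) cited under Open Datasets scores an attribution map by the fraction of positive attribution energy that falls inside the object's bounding box. A pure-Python sketch of that metric (function and argument names are ours; the paper's evaluation code is in its released repository):

```python
def energy_pointing_game(attribution, bbox):
    """EPG score: positive attribution mass inside `bbox` over total positive mass.

    attribution: 2D list of per-pixel attribution values (H x W).
    bbox: (row_min, row_max, col_min, col_max), half-open on the max side.
    """
    r0, r1, c0, c1 = bbox
    inside = total = 0.0
    for r, row in enumerate(attribution):
        for c, v in enumerate(row):
            if v > 0:  # EPG counts only positive evidence
                total += v
                if r0 <= r < r1 and c0 <= c < c1:
                    inside += v
    return inside / total if total > 0 else 0.0

# Toy 2x2 map: 3.0 of the 4.0 units of positive mass lie in the top-left pixel's box.
attr = [[3.0, 1.0], [-2.0, 0.0]]
print(energy_pointing_game(attr, (0, 1, 0, 1)))  # → 0.75
```

A score of 1.0 means all positive attribution energy is concentrated on the annotated object, which is the behavior the paper's probing comparisons measure.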