Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
BEE: Metric-Adapted Explanations via Baseline Exploration-Exploitation
Authors: Oren Barkan, Yehonatan Elisha, Jonathan Weill, Noam Koenigstein
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations across various model architectures showcase the superior performance of BEE in comparison to state-of-the-art explanation methods on a variety of objective evaluation metrics. Our experiments aim to address the following research questions (RQs): 1) Does the BEE method outperform state-of-the-art methods? 2) Does BEE finetuning improve upon pretraining? 3) Do different metrics favor different explanation maps and baselines? 4) How does the number of sampled baselines T affect BEE performance? 5) How does the performance of adaptive baseline sampling compare to non-adaptive sampling? 6) Does the learned baseline distribution obtained by BEE converge to the best-performing baseline distribution per metric? 7) Does integration on intermediate representation gradients improve upon integration on input gradients? 8) What is the contribution from context modeling in BEE? 9) Can other path-integration methods benefit from BEE? The primary manuscript addresses RQs 1-6 comprehensively. Specifically, RQs 1-2 are addressed in Tabs. 1 and 2, RQ 3 is addressed in Tab. 3 and Fig. 2, and RQs 4-6 are addressed in Fig. 2. Due to space limitations, experiments addressing RQs 7-9, along with additional analyses and ablation studies, are provided in the Appendix. |
| Researcher Affiliation | Academia | Oren Barkan1*, Yehonatan Elisha2*, Jonathan Weill2, Noam Koenigstein2 1The Open University, Israel 2Tel Aviv University, Israel |
| Pseudocode | No | The paper describes the steps of the BEE procedure in Section 3.2 using a numbered list: 1. For each z ∈ B, draw w_z from a normal distribution...; 2. Draw a baseline b...; 3. Compute the metric score...; 4. u*, θ* = argmin_{u,θ} −log σ(y_u c_θ(x)) + (1/2) Σ_{i=1}^{K} q_i^b (u_i − g_i^b)²; 5. g^b ← u*, θ ← θ*; 6. q_i^b ← q_i^b + σ(g^b c_θ(x)) σ(−g^b c_θ(x)) c_θ(x)_i². However, it does not use a dedicated 'Pseudocode' or 'Algorithm' block with formal pseudocode formatting. |
| Open Source Code | Yes | Code: https://github.com/yonisGit/BEE |
| Open Datasets | Yes | In accordance with previous works (Kapishnikov et al. 2019, 2021; Xu, Venugopalan, and Sundararajan 2020; Chefer, Gur, and Wolf 2021b) we use the ImageNet (Deng et al. 2009) ILSVRC 2012 (IN) validation set as our test set, which contains 50,000 images from 1,000 classes. |
| Dataset Splits | Yes | we use the ImageNet (Deng et al. 2009) ILSVRC 2012 (IN) validation set as our test set, which contains 50,000 images from 1,000 classes. For the pretraining phase, we used a separate training set of 5000 examples taken from the IN training set, avoiding overlap with the validation set used as a test set. |
| Hardware Specification | Yes | The experiments were conducted on an NVIDIA DGX 8x A100 Server. |
| Software Dependencies | No | The paper mentions that "Optimization in both the pretraining and finetuning phases was carried out using the Adam optimizer." but does not name software libraries or provide version numbers for any software components. |
| Experiment Setup | Yes | Unless stated otherwise, we sampled T = 8 baselines per test instance, and n = 10 interpolation steps in the integration process (Eq. 3). The integration was employed on the last convolutional / attention layer, i.e., we set I = {L} (Eq. 4). A comparison of various settings of I, including L−1 and L−2, is presented in the Appendix. The dimension of the context representation K was set to match the output dimension of each backbone separately. Optimization in both the pretraining and finetuning phases was carried out using the Adam optimizer. For precise optimization details, please refer to the Appendix and our GitHub repository. |
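The procedure quoted in the Pseudocode row is, at its core, an exploration-exploitation loop: repeatedly sample a baseline from a learned distribution, score the resulting explanation with the target metric, and shift the distribution toward high-scoring baselines. The sketch below illustrates that loop in its simplest form as a REINFORCE-style bandit over a finite set of candidate baselines; it is not the paper's exact update rule (which learns per-pixel baseline distributions and uses the context model c_θ), and the names `bee_bandit`, `metric_score`, and the toy score values are illustrative assumptions.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def bee_bandit(baselines, metric_score, T=8, lr=0.5, seed=0):
    """Generic exploration-exploitation over candidate baselines.

    For T rounds (the paper samples T = 8 baselines per test instance):
    sample one candidate in proportion to learned logits, score it with
    the target metric, and nudge the logits toward high-scoring arms via
    a policy-gradient (REINFORCE) update. Returns the best baseline seen,
    its score, and the final sampling distribution.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(baselines)
    best, best_score = None, -math.inf
    for _ in range(T):
        probs = softmax(logits)
        z = rng.choices(range(len(baselines)), weights=probs, k=1)[0]
        score = metric_score(baselines[z])
        # Increase the log-probability of the sampled arm in
        # proportion to its metric score.
        for i in range(len(logits)):
            grad = (1.0 if i == z else 0.0) - probs[i]
            logits[i] += lr * score * grad
        if score > best_score:
            best, best_score = baselines[z], score
    return best, best_score, softmax(logits)

# Toy usage: three constant-image baselines with a fixed, hypothetical
# metric score per baseline (a real metric would evaluate the explanation map).
scores = {"black": 0.2, "gray": 0.9, "blur": 0.5}
best, best_score, probs = bee_bandit(list(scores), scores.get, T=32)
```

Because different evaluation metrics favor different baselines (RQ 3 in the Research Type row), a loop of this kind is run per metric, letting the sampling distribution adapt to whichever baseline that metric rewards.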