Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions

Authors: Hubert Baniecki, Maximilian Muschalik, Fabian Fumagalli, Barbara Hammer, Eyke Hüllermeier, Przemyslaw Biecek

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on the MS COCO and ImageNet-1k benchmarks validate that second-order methods, such as FIXLIP, outperform first-order attribution methods. [...] In experiments, we empirically validate the performance of FIXLIP with three metrics defined in Section 4, measure its computational efficiency, and demonstrate its utility in visual explanation of VLEs. We mainly use the openly available pre-trained CLIP models [57] of two sizes: ViT-B/32 with 7 x 7 image patches and ViT-B/16 with 14 x 14. Moreover, we demonstrate the broader applicability of FIXLIP to explain SigLIP [75] and SigLIP-2 [66] up to the ViT-L/16 variant with 16 x 16 patches. We rely on two openly available datasets commonly used in explainability research: MS COCO [42] and ImageNet-1k [15]; the latter specifically to design the pointing game evaluation considering zero-shot classification.
Researcher Affiliation	Academia	Hubert Baniecki (University of Warsaw, Warsaw University of Technology); Maximilian Muschalik (LMU Munich, MCML); Fabian Fumagalli (Bielefeld University); Barbara Hammer (Bielefeld University); Eyke Hüllermeier (LMU Munich, MCML, DFKI); Przemyslaw Biecek (University of Warsaw, Warsaw University of Technology)
Pseudocode	No	The paper does not contain any structured pseudocode or algorithm blocks. It describes methodologies in narrative text and refers to figures for visual processes, but no formal pseudocode is present.
Open Source Code	Yes	Code: https://github.com/hbaniecki/fixlip [...] We provide additional details on reproducibility in the Appendix, as well as the code to reproduce all experiments in this paper is available at https://github.com/hbaniecki/fixlip.
Open Datasets	Yes	Experiments on the MS COCO and ImageNet-1k benchmarks validate that second-order methods, such as FIXLIP, outperform first-order attribution methods. [...] We rely on two openly available datasets commonly used in explainability research: MS COCO [42] and ImageNet-1k [15]; the latter specifically to design the pointing game evaluation considering zero-shot classification.
Dataset Splits	Yes	Experiments with the CLIP models are performed using 1000 image-text pairs from the MS COCO test set, for which each of the models predicted the highest similarity scores. Experiments with the SigLIP-2 models are performed using 100 image-text pairs to save computational resources, since we are not using it to compare with baseline methods. Regarding ImageNet-1k, we use all 50 images from each of the following 10 class labels for constructing the pointing game evaluation: goldfish (1), husky (248), cat (282), plane (404), church (497), ipod (605), ball (805), tractor (866), banana (954), pizza (963).
Hardware Specification	Yes	We set the batch size to 64 for the base models and to 32 for the large models, performing computation on A100 GPUs with 40GB VRAM. [...] Experiments described in Section 5 and Appendix D were computed on a cluster consisting of 4 AMD Rome 7742 CPUs (256 cores), 4TB of RAM, and 16 A100 GPUs for about 15 days combined.
Software Dependencies	No	The paper mentions using "openly available models from Hugging Face with default hyperparameters" and also "the clip Python library [57, MIT License]". However, it does not provide specific version numbers for these libraries, Python itself, or other key software components like PyTorch/TensorFlow or CUDA, which are necessary for reproducible software dependency information.
Experiment Setup	Yes	We set the batch size to 64 for the base models and to 32 for the large models [...] For FIXLIP-p, we use the cross-modal estimator with a budget of 2^21, whereas FIXLIP with Shapley interactions uses the model-agnostic estimator with a budget of 2^17, yielding approximately similar runtime. We mainly experiment with p ∈ {0.3, 0.5, 0.7} [...]
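
The settings quoted in the Dataset Splits and Experiment Setup rows imply a few derived quantities that are easy to check by hand. The sketch below only restates numbers from the table and does the arithmetic. All identifiers (`IMAGENET_CLASSES`, `BUDGETS`, `pointing_game_size`, `budget_ratio`) are hypothetical names chosen for illustration and do not come from the paper or its repository. It also assumes that the numbers in parentheses are ImageNet-1k class indices.

```python
# Hypothetical summary of settings quoted in the table above.
# Names are illustrative; they are not taken from the authors' code.

# Class label -> number given in parentheses in the Dataset Splits row
# (assumed here to be the ImageNet-1k class index).
IMAGENET_CLASSES = {
    "goldfish": 1, "husky": 248, "cat": 282, "plane": 404, "church": 497,
    "ipod": 605, "ball": 805, "tractor": 866, "banana": 954, "pizza": 963,
}
IMAGES_PER_CLASS = 50  # "all 50 images from each of the following 10 class labels"

# Estimator budgets quoted in the Experiment Setup row.
BUDGETS = {"cross_modal": 2**21, "model_agnostic": 2**17}

BATCH_SIZE = {"base": 64, "large": 32}
P_VALUES = (0.3, 0.5, 0.7)


def pointing_game_size(classes=IMAGENET_CLASSES, per_class=IMAGES_PER_CLASS):
    """Total number of ImageNet-1k images used in the pointing game."""
    return len(classes) * per_class


def budget_ratio(budgets=BUDGETS):
    """How many times larger the cross-modal budget is than the model-agnostic one."""
    return budgets["cross_modal"] // budgets["model_agnostic"]


if __name__ == "__main__":
    print(pointing_game_size())  # 10 classes * 50 images
    print(budget_ratio())        # 2**21 / 2**17
```

Under these assumptions the pointing game covers 500 images, and the cross-modal estimator's budget is 16 times the model-agnostic one, which the quoted text says yields approximately similar runtime.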