Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution

Authors: Fengyuan Liu, Nikhil Kandpal, Colin Raffel

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results across multiple model families and datasets are provided in Section 4.1, with related work in Section 5 and a conclusion in Section 6.
Researcher Affiliation | Academia | Fengyuan Liu, Nikhil Kandpal, Colin Raffel; University of Toronto, Vector Institute, Toronto, Ontario, Canada; EMAIL
Pseudocode | No | The paper describes methods like 'Key-Value Caching', 'Hierarchical Attribution', 'Proxy Modeling', and 'Proxy Model Pruning' in detail within Sections 3.1-3.4, but these descriptions are presented as explanatory text rather than structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We release a user-friendly and efficient implementation of AttriBoT to enable efficient LLM interpretability as well as encourage future development of efficient context attribution methods" (https://github.com/r-three/AttriBoT).
Open Datasets | Yes | "To this end, we focus on three open-book QA datasets: 1. SQuAD 2.0 (Rajpurkar et al., 2018): A reading comprehension benchmark... 2. Hotpot QA (Yang et al., 2018): A multi-hop question answering benchmark... 3. QASPER (Dasigi et al., 2021): A document-grounded, information-seeking question answering dataset..."
Dataset Splits | No | The paper samples 1,000 examples per dataset for evaluation after filtering (e.g., 'Randomly sample 1000 examples from the remaining set of examples' for SQuAD 2.0 and QASPER; 'Sample the first 1000 examples of the dataset' for Hotpot QA), but it does not specify training/validation/test splits for these examples or reference standard splits for its evaluation.
Hardware Specification | Yes | "All experiments were run on servers with 4 NVIDIA A100 SXM4 GPUs with 80GB of VRAM. For experiments involving models with fewer than 15 billion parameters, we use only a single GPU. For models with more parameters, we use model parallelism across multiple GPUs."
Software Dependencies | No | The paper mentions specific models such as the 'all-MiniLM-L6-v2' Sentence-BERT model, but it does not provide version numbers for the underlying software libraries, programming languages (e.g., Python), or frameworks (e.g., PyTorch, TensorFlow, CUDA) used for implementation.
Experiment Setup | Yes | "For our experiments we set the significance level α = 0.05 and the maximum number of outliers k = 50."
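Since the paper presents its methods as explanatory text rather than pseudocode (see the Pseudocode row above), the sketch below illustrates the exact leave-one-out baseline that AttriBoT approximates. This is not the authors' implementation: `score_fn` is a hypothetical stand-in for a model's score of the response given a context (e.g., a log-likelihood), and the toy scoring function exists only to make the example runnable.

```python
from typing import Callable, List

def leave_one_out_attribution(
    sources: List[str],
    score_fn: Callable[[List[str]], float],
) -> List[float]:
    """Attribute a model output to each context source by measuring how
    much the score drops when that source is ablated from the context.

    score_fn maps a list of context sources to a scalar score, e.g. the
    log-likelihood of the model's response given that context
    (hypothetical stand-in; not the paper's implementation).
    """
    full_score = score_fn(sources)
    attributions = []
    for i in range(len(sources)):
        ablated = sources[:i] + sources[i + 1:]
        # Attribution of source i = full-context score minus the score
        # with source i removed.
        attributions.append(full_score - score_fn(ablated))
    return attributions

# Toy example: the "score" simply counts relevant sources in the context.
docs = ["relevant fact", "filler", "relevant fact"]
toy_score = lambda ctx: float(sum("relevant" in s for s in ctx))
print(leave_one_out_attribution(docs, toy_score))  # → [1.0, 0.0, 1.0]
```

Exact leave-one-out requires one full forward pass per context source, which is the cost that AttriBoT's tricks (key-value caching, hierarchical attribution, proxy modeling) are designed to reduce.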