Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution

Authors: Fengyuan Liu, Nikhil Kandpal, Colin Raffel

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results across multiple model families and datasets are provided in Section 4.1, with related work in Section 5 and a conclusion in Section 6.
Researcher Affiliation | Academia | Fengyuan Liu, Nikhil Kandpal, Colin Raffel; University of Toronto, Vector Institute, Toronto, Ontario, Canada; EMAIL
Pseudocode | No | The paper describes methods like 'Key-Value Caching', 'Hierarchical Attribution', 'Proxy Modeling', and 'Proxy Model Pruning' in detail within Sections 3.1-3.4, but these descriptions are presented as explanatory text rather than structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We release a user-friendly and efficient implementation of AttriBoT to enable efficient LLM interpretability as well as encourage future development of efficient context attribution methods" (https://github.com/r-three/AttriBoT).
Open Datasets | Yes | "To this end, we focus on three open-book QA datasets: 1. SQuAD 2.0 (Rajpurkar et al., 2018): A reading comprehension benchmark... 2. Hotpot QA (Yang et al., 2018): A multi-hop question answering benchmark... 3. QASPER (Dasigi et al., 2021): A document-grounded, information-seeking question answering dataset..."
Dataset Splits | No | The paper samples 1,000 examples per dataset for evaluation after filtering (e.g., 'Randomly sample 1000 examples from the remaining set of examples' for SQuAD 2.0 and QASPER; 'Sample the first 1000 examples of the dataset' for Hotpot QA), but it does not specify training/validation/test splits for these examples or reference standard splits for its evaluation.
Hardware Specification | Yes | "All experiments were run on servers with 4 NVIDIA A100 SXM4 GPUs with 80GB of VRAM. For experiments involving models with fewer than 15 billion parameters, we use only a single GPU. For models with more parameters, we use model parallelism across multiple GPUs."
Software Dependencies | No | The paper mentions specific models such as the 'all-MiniLM-L6-v2' Sentence-BERT model, but it does not provide version numbers for the underlying software libraries, programming languages (e.g., Python), or frameworks (e.g., PyTorch, TensorFlow, CUDA) used for implementation.
Experiment Setup | Yes | "For our experiments we set the significance level α = 0.05 and the maximum number of outliers k = 50."
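Since the paper presents its methods as explanatory text rather than pseudocode (see the Pseudocode row above), the sketch below illustrates the exact leave-one-out baseline that AttriBoT approximates. This is not the authors' implementation: `score_fn` is a hypothetical stand-in for a model's score of the response given a context (e.g., a log-likelihood), and the toy scoring function exists only to make the example runnable.

```python
from typing import Callable, List

def leave_one_out_attribution(
    sources: List[str],
    score_fn: Callable[[List[str]], float],
) -> List[float]:
    """Attribute a model output to each context source by measuring how
    much the score drops when that source is ablated from the context.

    score_fn maps a list of context sources to a scalar score, e.g. the
    log-likelihood of the model's response given that context
    (hypothetical stand-in; not the paper's implementation).
    """
    full_score = score_fn(sources)
    attributions = []
    for i in range(len(sources)):
        ablated = sources[:i] + sources[i + 1:]
        # Attribution of source i = full-context score minus the score
        # with source i removed.
        attributions.append(full_score - score_fn(ablated))
    return attributions

# Toy example: the "score" simply counts relevant sources in the context.
docs = ["relevant fact", "filler", "relevant fact"]
toy_score = lambda ctx: float(sum("relevant" in s for s in ctx))
print(leave_one_out_attribution(docs, toy_score))  # → [1.0, 0.0, 1.0]
```

Exact leave-one-out requires one full forward pass per context source, which is the cost that AttriBoT's tricks (key-value caching, hierarchical attribution, proxy modeling) are designed to reduce.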