Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution
Authors: Fengyuan Liu, Nikhil Kandpal, Colin Raffel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across multiple model families and datasets are provided in Section 4.1, with related work in Section 5 and a conclusion in Section 6. |
| Researcher Affiliation | Academia | Fengyuan Liu, Nikhil Kandpal, Colin Raffel; University of Toronto, Vector Institute, Toronto, Ontario, Canada; EMAIL |
| Pseudocode | No | The paper describes methods like 'Key-Value Caching', 'Hierarchical Attribution', 'Proxy Modeling', and 'Proxy Model Pruning' in detail within Sections 3.1-3.4, but these descriptions are presented as explanatory text rather than structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release a user-friendly and efficient implementation of AttriBoT to enable efficient LLM interpretability as well as encourage future development of efficient context attribution methods: https://github.com/r-three/AttriBoT |
| Open Datasets | Yes | To this end, we focus on three open-book QA datasets: 1. SQuAD 2.0 (Rajpurkar et al., 2018): A reading comprehension benchmark... 2. Hotpot QA (Yang et al., 2018): A multi-hop question answering benchmark... 3. QASPER (Dasigi et al., 2021): A document-grounded, information-seeking question answering dataset... |
| Dataset Splits | No | The paper describes sampling 1000 examples from each dataset for evaluation after certain filtering steps (e.g., 'Randomly sample 1000 examples from the remaining set of examples' for SQuAD 2.0 and QASPER, and 'Sample the first 1000 examples of the dataset' for Hotpot QA), but it does not specify training/validation/test splits for these sampled examples or explicitly reference standard splits for its evaluation. |
| Hardware Specification | Yes | All experiments were run on servers with 4 NVIDIA A100 SXM4 GPUs with 80GB of VRAM. For experiments involving models with fewer than 15 billion parameters, we use only a single GPU. For models with more parameters, we use model parallelism across multiple GPUs. |
| Software Dependencies | No | The paper mentions specific models like 'all-MiniLM-L6-v2 Sentence BERT model' but does not provide specific version numbers for underlying software libraries, programming languages (e.g., Python), or frameworks (e.g., PyTorch, TensorFlow, CUDA) used for implementation. |
| Experiment Setup | Yes | For our experiments we set the significance level α = 0.05 and the maximum number of outliers k = 50. |
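For context, the quantity AttriBoT approximates is exact leave-one-out (LOO) context attribution: each context source is scored by how much the model's likelihood of the answer drops when that source is ablated. The sketch below illustrates the naive exact computation only; the function names and the toy scoring function are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of exact leave-one-out (LOO) context attribution.
# `score_fn(sources, question, answer)` is assumed to return the model's
# log-likelihood of `answer` given the context sources and question.

def loo_attribution(score_fn, sources, question, answer):
    """Return one attribution score per source: the drop in the answer's
    log-likelihood when that source is removed from the context."""
    full = score_fn(sources, question, answer)
    scores = []
    for i in range(len(sources)):
        ablated = sources[:i] + sources[i + 1:]
        scores.append(full - score_fn(ablated, question, answer))
    return scores


if __name__ == "__main__":
    # Toy "model": likelihood is high only if the answer string appears
    # somewhere in the context.
    def toy_score(context, question, answer):
        return 0.0 if any(answer in s for s in context) else -10.0

    srcs = ["Paris is the capital of France.", "Berlin is in Germany."]
    print(loo_attribution(toy_score, srcs, "Capital of France?", "Paris"))
    # The first source carries all the attribution; the second carries none.
```

Note the cost that motivates the paper: exact LOO requires one additional forward pass per context source, which AttriBoT's tricks (caching, hierarchy, proxy models) approximate far more cheaply.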