Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning
Authors: Zijian Zhou, Xiaoqiang Lin, Xinyi Xu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify the effectiveness of our approach for demonstration attribution while being computationally efficient. Leveraging the results, we then show how DETAIL can help improve model performance in real-world scenarios through demonstration reordering and curation. Finally, we experimentally prove the wide applicability of DETAIL by showing our attribution scores obtained on white-box models are transferable to black-box models in improving model performance. |
| Researcher Affiliation | Academia | 1Department of Computer Science, National University of Singapore, Singapore 2Institute for Infocomm Research, A*STAR, Singapore 3Singapore-MIT Alliance for Research and Technology Centre, Singapore 4CSAIL, MIT, USA |
| Pseudocode | Yes | Algorithm 1 DETAIL |
| Open Source Code | Yes | Python code for reproducibility is also included in the supplemental materials. The precise repository references and other dependencies can be found in the code provided in the supplemental materials. |
| Open Datasets | Yes | We use the MNIST dataset [22]... We primarily evaluate our method on AG News (4 classes) [77], SST-2 (2 classes) [57], Rotten Tomatoes (2 classes) [50], and Subj (2 classes) [18] datasets which all admit classification tasks. The NLP datasets are also obtained from Huggingface's datasets API [36]. All datasets are freely downloadable with downloading code snippets written in the code. |
| Dataset Splits | Yes | In each trial, we randomly pick 20 demonstrations to form an ICL dataset and another 20 demonstrations as the validation set. |
| Hardware Specification | Yes | All our experiments about 7B white-box models are conducted on a single L40 GPU. All experiments involving 13B white-box models are conducted on a single H100 GPU. |
| Software Dependencies | Yes | All our experiments are conducted using Python3.10 on a Ubuntu 22.04.4 LTS distribution. We use Jax [13] for the experiments in Sec. 5.1 and use PyTorch 2.1.0 [5] for the experiments in Sec. 5.2 and Sec. 5.3. |
| Experiment Setup | Yes | We consider (for white-box models) mainly a Vicuna-7b v1.3 [78] and also a Llama-2-13b [60] on some tasks using greedy decoding... We set a relatively large λ = 1.0 for LLMs and a relatively small λ = 0.01 for our custom transformer. When detecting noisy demonstrations, we may not want to regularize β too much because we wish to retain the information captured by the eigenvalues of the hessian H which can be eroded with a larger λ. As such, for the noisy demonstration detection task, we set a very small λ = 10⁻⁹ to retain most of the information captured by H while ensuring that it is invertible. |
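
The Experiment Setup row says that a larger λ erodes the information carried by the eigenvalues of H, while a very small λ keeps it and still makes the matrix invertible. The sketch below illustrates only that general effect. It is not taken from the paper's code, and it assumes that H is symmetric positive semi-definite and that the regularized matrix has the form H + λI. The function name `regularized_inverse_spectrum` and the example eigenvalues are hypothetical.

```python
def regularized_inverse_spectrum(eigenvalues, lam):
    """Eigenvalues of (H + lam * I)^-1, given the eigenvalues of a symmetric PSD H.

    Assumption: the regularized matrix is H + lam * I, so each eigenvalue h
    of H maps to 1 / (h + lam) in the inverse.
    """
    return [1.0 / (h + lam) for h in eigenvalues]


# Hypothetical spectrum with a zero eigenvalue: H alone would be singular.
hessian_eigenvalues = [0.0, 0.01, 1.0, 100.0]

# The three lambda values quoted in the Experiment Setup row.
for lam in (1.0, 0.01, 1e-9):
    inverse_spectrum = regularized_inverse_spectrum(hessian_eigenvalues, lam)
    # Ratio of largest to smallest inverse eigenvalue: a rough measure of how
    # much of the structure of H survives the regularization.
    spread = max(inverse_spectrum) / min(inverse_spectrum)
    print(f"lambda={lam:g}: spread={spread:.3g}")
```

In this toy spectrum, every λ > 0 makes the inverse well defined despite the zero eigenvalue, and the spread grows as λ shrinks. This is one way to read the statement that a small λ retains most of the information captured by H.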