DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning
Authors: Zijian Zhou, Xiaoqiang Lin, Xinyi Xu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify the effectiveness of our approach for demonstration attribution while being computationally efficient. Leveraging the results, we then show how DETAIL can help improve model performance in real-world scenarios through demonstration reordering and curation. Finally, we experimentally prove the wide applicability of DETAIL by showing our attribution scores obtained on white-box models are transferable to black-box models in improving model performance. |
| Researcher Affiliation | Academia | 1Department of Computer Science, National University of Singapore, Singapore 2Institute for Infocomm Research, A*STAR, Singapore 3Singapore-MIT Alliance for Research and Technology Centre, Singapore 4CSAIL, MIT, USA |
| Pseudocode | Yes (hedged sketch below) | Algorithm 1 DETAIL |
| Open Source Code | Yes | Python code for reproducibility is also included in the supplemental materials. The precise repository references and other dependencies can be found in the code provided in the supplemental materials. |
| Open Datasets | Yes (snippet below) | We use the MNIST dataset [22]... We primarily evaluate our method on AG News (4 classes) [77], SST-2 (2 classes) [57], Rotten Tomatoes (2 classes) [50], and Subj (2 classes) [18] datasets, which all admit classification tasks. The NLP datasets are also obtained from Hugging Face's `datasets` API [36]. All datasets are freely downloadable with downloading code snippets written in the code. |
| Dataset Splits | Yes (snippet below) | In each trial, we randomly pick 20 demonstrations to form an ICL dataset and another 20 demonstrations as the validation set. |
| Hardware Specification | Yes | All our experiments involving 7B white-box models are conducted on a single L40 GPU. All experiments involving 13B white-box models are conducted on a single H100 GPU. |
| Software Dependencies | Yes | All our experiments are conducted using Python 3.10 on an Ubuntu 22.04.4 LTS distribution. We use Jax [13] for the experiments in Sec. 5.1 and PyTorch 2.1.0 [5] for the experiments in Sec. 5.2 and Sec. 5.3. |
| Experiment Setup | Yes (illustration below) | We consider (for white-box models) mainly a Vicuna-7b v1.3 [78] and also a Llama-2-13b [60] on some tasks using greedy decoding... We set a relatively large λ = 1.0 for LLMs and a relatively small λ = 0.01 for our custom transformer. When detecting noisy demonstrations, we may not want to regularize β too much because we wish to retain the information captured by the eigenvalues of the Hessian H, which can be eroded with a larger λ. As such, for the noisy demonstration detection task, we set a very small λ = 10⁻⁹ to retain most of the information captured by H while ensuring that it is invertible. |
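The paper's λ discussion points at a regularized inverse (H + λI)⁻¹ at the core of the attribution computation. Below is a minimal sketch of leave-one-out attribution over a ridge surrogate in that spirit; the feature matrix `X` (a stand-in for the internal transformer representations the paper extracts), the squared-error surrogate loss, and all names are assumptions for illustration, not the authors' Algorithm 1.

```python
import numpy as np

def loo_attribution(X, y, X_val, y_val, lam=1.0):
    """Score each demonstration by how much removing it changes
    validation loss under a ridge surrogate (hypothetical stand-in
    for DETAIL's Algorithm 1).

    X, y         : (n, d), (n,)  demonstration features/targets
    X_val, y_val : (m, d), (m,)  validation features/targets
    lam          : the paper's lambda (1.0 for LLMs, ~1e-9 when
                   detecting noisy demonstrations)
    """
    d = X.shape[1]

    def fit(Xs, ys):
        # beta = (H + lam*I)^{-1} X^T y, with H = X^T X the Hessian
        # of the squared-error surrogate; lam keeps it invertible.
        H = Xs.T @ Xs + lam * np.eye(d)
        return np.linalg.solve(H, Xs.T @ ys)

    def val_loss(beta):
        return np.mean((X_val @ beta - y_val) ** 2)

    base = val_loss(fit(X, y))
    scores = np.empty(len(X))
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        # Positive score: removing demonstration i hurts validation,
        # so the demonstration was helpful.
        scores[i] = val_loss(fit(X[keep], y[keep])) - base
    return scores
```

Reordering or curating demonstrations by such scores mirrors the paper's downstream use of attribution, though the actual method computes them from the model's internals rather than a hand-built surrogate.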
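Since the table reports that all datasets are freely downloadable via the `datasets` API, a loading snippet of the kind the paper's code presumably contains would look like the following; the Hub identifiers, in particular `SetFit/subj`, are assumptions (Subj is mirrored under several names on the Hub):

```python
from datasets import load_dataset

ag_news = load_dataset("ag_news")          # 4-class topic classification
sst2 = load_dataset("sst2")                # 2-class sentiment (SST-2)
rotten = load_dataset("rotten_tomatoes")   # 2-class sentiment
subj = load_dataset("SetFit/subj")         # 2-class subjectivity
```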
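The per-trial split (20 demonstrations for the ICL prompt plus a disjoint 20-example validation set) is straightforward; here is one way to realize it, with the function name and seed handling chosen for illustration:

```python
import random

def sample_trial(pool, k=20, seed=0):
    """Draw one trial: k ICL demonstrations plus a disjoint
    k-example validation set from the candidate pool."""
    rng = random.Random(seed)
    picked = rng.sample(range(len(pool)), 2 * k)
    return [pool[i] for i in picked[:k]], [pool[i] for i in picked[k:]]
```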
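The rationale for λ = 10⁻⁹ in noisy-demonstration detection (keep H invertible without eroding its spectrum) can be seen numerically; the spectrum below is made up for illustration:

```python
import numpy as np

eigvals = np.array([1e-6, 1e-3, 1.0])  # hypothetical spectrum of H
for lam in (1.0, 1e-9):
    # Regularization shifts every eigenvalue by lam: lam = 1.0 nearly
    # flattens the spectrum, while lam = 1e-9 leaves it intact.
    print(f"lam={lam:g}:", eigvals + lam)
```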