ContextCite: Attributing Model Generation to Context
Authors: Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, Aleksander Madry
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CONTEXTCITE on up to 1,000 random validation examples from each of three representative benchmarks: (1) TyDi QA [30] is a question-answering dataset in which the context is an entire Wikipedia article. (2) HotpotQA [31] is a multi-hop question-answering dataset where answering the question requires reasoning over information from multiple documents. (3) CNN/DailyMail [28] is a dataset of news articles and headlines. |
| Researcher Affiliation | Academia | Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, Aleksander Madry (MIT) {bencw,harshay,krisgrg,madry}@mit.edu |
| Pseudocode | Yes | Algorithm 1 CONTEXTCITE |
| Open Source Code | Yes | We provide code for CONTEXTCITE at https://github.com/MadryLab/context-cite. |
| Open Datasets | Yes | TyDi QA [30] is a question-answering dataset... HotpotQA [31] is a multi-hop question-answering dataset... CNN/DailyMail [28] is a dataset of news articles and headlines... MS MARCO [67] is a question-answering dataset... Natural Questions [29] is a question-answering dataset... |
| Dataset Splits | No | We evaluate CONTEXTCITE on up to 1,000 random validation examples from each of three representative benchmarks... For each of these datasets, we evaluate the F1 score of instruction-tuned Llama-3-8B (Figure 6) on 1,000 randomly sampled examples from the validation set. While the paper mentions using validation sets and sampling randomly from them, it does not give explicit train/validation/test splits, such as percentages or counts, for the datasets used in its experiments. |
| Hardware Specification | Yes | We run all experiments on a cluster of A100 GPUs. |
| Software Dependencies | No | We use the scikit-learn [64] implementation of LASSO for CONTEXTCITE... We use the off-the-shelf sentence tokenizer from the nltk library [44]. Our implementation of CONTEXTCITE is available at https://github.com/MadryLab/context-cite. We use the implementations of language models from Hugging Face's transformers library [66]. The paper names specific software libraries but does not provide version numbers for scikit-learn, nltk, or transformers. |
| Experiment Setup | Yes | We use the scikit-learn [64] implementation of LASSO for CONTEXTCITE, always with the regularization parameter alpha set to 0.01. When splitting the context into sources or splitting a response into statements, we use the off-the-shelf sentence tokenizer from the nltk library [44]. ... We evaluate CONTEXTCITE with {32, 64, 128, 256} context ablations. (A hedged code sketch of this setup follows the table.) |
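
Putting the quoted details together, below is a minimal, hedged sketch of a CONTEXTCITE-style attribution loop: split the context into sentence-level sources with nltk, sample random context ablations, score the logit-scaled probability a causal LM assigns to the fixed response under each ablation, and fit a sparse linear surrogate with scikit-learn's Lasso at alpha=0.01. The prompt template, the keep-probability of 1/2 per source, and the helper names (`logit_prob_of_response`, `context_cite`) are illustrative assumptions, not the authors' implementation; see https://github.com/MadryLab/context-cite for the official code.

```python
# Hedged sketch of a CONTEXTCITE-style loop (illustrative, not the authors' code).
# Requires: pip install torch transformers scikit-learn nltk
#           python -c "import nltk; nltk.download('punkt')"
import numpy as np
import torch
from nltk.tokenize import sent_tokenize
from sklearn.linear_model import Lasso


def logit_prob_of_response(model, tokenizer, context, query, response):
    """Logit-scaled probability of `response` given `context` and `query`.

    The prompt template below is an assumption made for illustration.
    """
    prompt = f"Context: {context}\n\nQuery: {query}\n\nAnswer: "
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(
        response, return_tensors="pt", add_special_tokens=False
    ).input_ids
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Positions prompt_len-1 .. end-1 are the ones that predict response tokens.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    logp = log_probs.gather(1, response_ids[0].unsqueeze(1)).sum().item()
    logp = min(logp, -1e-9)  # guard log(1 - p) against p rounding to 1
    return logp - np.log1p(-np.exp(logp))  # log p - log(1 - p)


def context_cite(model, tokenizer, context, query, response,
                 n_ablations=64, seed=0):
    """Attribute `response` to sentence-level sources in `context`."""
    sources = sent_tokenize(context)  # nltk sentence tokenizer, as in the paper
    rng = np.random.default_rng(seed)
    # Random ablation masks; keeping each source with probability 1/2 is an assumption.
    masks = rng.integers(0, 2, size=(n_ablations, len(sources)))
    ys = []
    for mask in masks:
        ablated = " ".join(s for s, keep in zip(sources, mask) if keep)
        ys.append(logit_prob_of_response(model, tokenizer, ablated, query, response))
    # Sparse linear surrogate; alpha=0.01 matches the setup quoted above.
    surrogate = Lasso(alpha=0.01).fit(masks, np.array(ys))
    return dict(zip(sources, surrogate.coef_))  # attribution score per source
```

With, for example, an instruction-tuned Llama-3-8B loaded via transformers' `AutoModelForCausalLM` and `AutoTokenizer`, calling `context_cite` on an article, a question, and a generated answer returns one attribution score per sentence. The paper sweeps {32, 64, 128, 256} ablations; `n_ablations=64` here is just a mid-range default.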