ContextCite: Attributing Model Generation to Context
Authors: Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, Aleksander Madry
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CONTEXTCITE on up to 1,000 random validation examples from each of three representative benchmarks: (1) TyDi QA [30] is a question-answering dataset in which the context is an entire Wikipedia article. (2) HotpotQA [31] is a multi-hop question-answering dataset where answering the question requires reasoning over information from multiple documents. (3) CNN/DailyMail [28] is a dataset of news articles and headlines. |
| Researcher Affiliation | Academia | Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, Aleksander Madry (MIT) {bencw,harshay,krisgrg,madry}@mit.edu |
| Pseudocode | Yes | Algorithm 1 CONTEXTCITE |
| Open Source Code | Yes | We provide code for CONTEXTCITE at https://github.com/MadryLab/context-cite. |
| Open Datasets | Yes | TyDi QA [30] is a question-answering dataset... HotpotQA [31] is a multi-hop question-answering dataset... CNN/DailyMail [28] is a dataset of news articles and headlines... MS MARCO [67] is a question-answering dataset... Natural Questions [29] is a question-answering dataset... |
| Dataset Splits | No | We evaluate CONTEXTCITE on up to 1,000 random validation examples from each of three representative benchmarks... For each of these datasets, we evaluate the F1 score of instruction-tuned Llama-3-8B (Figure 6) on 1,000 randomly sampled examples from the validation set. While the paper mentions using validation sets and sampling randomly from them, it does not give explicit train/validation/test splits, such as percentages or counts, for the datasets used in its experiments. |
| Hardware Specification | Yes | We run all experiments on a cluster of A100 GPUs. |
| Software Dependencies | No | We use the scikit-learn [64] implementation of LASSO for CONTEXTCITE... We use the off-the-shelf sentence tokenizer from the nltk library [44]. Our implementation of CONTEXTCITE is available at https://github.com/MadryLab/context-cite. We use the implementations of language models from Hugging Face's transformers library [66]. The paper names specific software libraries but does not provide version numbers for scikit-learn, nltk, or transformers. |
| Experiment Setup | Yes | We use the scikit-learn [64] implementation of LASSO for CONTEXTCITE, always with the regularization parameter alpha set to 0.01. When splitting the context into sources or splitting a response into statements, we use the off-the-shelf sentence tokenizer from the nltk library [44]. ... We evaluate CONTEXTCITE with {32, 64, 128, 256} context ablations. (A hedged code sketch of this setup follows the table.) |
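
Putting the quoted details together, below is a minimal, hedged sketch of a CONTEXTCITE-style attribution loop: split the context into sentence-level sources with nltk, sample random context ablations, score the logit-scaled probability a causal LM assigns to the fixed response under each ablation, and fit a sparse linear surrogate with scikit-learn's Lasso at alpha=0.01. The prompt template, the keep-probability of 1/2 per source, and the helper names (`logit_prob_of_response`, `context_cite`) are illustrative assumptions, not the authors' implementation; see https://github.com/MadryLab/context-cite for the official code.

```python
# Hedged sketch of a CONTEXTCITE-style loop (illustrative, not the authors' code).
# Requires: pip install torch transformers scikit-learn nltk
#           python -c "import nltk; nltk.download('punkt')"
import numpy as np
import torch
from nltk.tokenize import sent_tokenize
from sklearn.linear_model import Lasso


def logit_prob_of_response(model, tokenizer, context, query, response):
    """Logit-scaled probability of `response` given `context` and `query`.

    The prompt template below is an assumption made for illustration.
    """
    prompt = f"Context: {context}\n\nQuery: {query}\n\nAnswer: "
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(
        response, return_tensors="pt", add_special_tokens=False
    ).input_ids
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Positions prompt_len-1 .. end-1 are the ones that predict response tokens.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    logp = log_probs.gather(1, response_ids[0].unsqueeze(1)).sum().item()
    logp = min(logp, -1e-9)  # guard log(1 - p) against p rounding to 1
    return logp - np.log1p(-np.exp(logp))  # log p - log(1 - p)


def context_cite(model, tokenizer, context, query, response,
                 n_ablations=64, seed=0):
    """Attribute `response` to sentence-level sources in `context`."""
    sources = sent_tokenize(context)  # nltk sentence tokenizer, as in the paper
    rng = np.random.default_rng(seed)
    # Random ablation masks; keeping each source with probability 1/2 is an assumption.
    masks = rng.integers(0, 2, size=(n_ablations, len(sources)))
    ys = []
    for mask in masks:
        ablated = " ".join(s for s, keep in zip(sources, mask) if keep)
        ys.append(logit_prob_of_response(model, tokenizer, ablated, query, response))
    # Sparse linear surrogate; alpha=0.01 matches the setup quoted above.
    surrogate = Lasso(alpha=0.01).fit(masks, np.array(ys))
    return dict(zip(sources, surrogate.coef_))  # attribution score per source
```

With, for example, an instruction-tuned Llama-3-8B loaded via transformers' `AutoModelForCausalLM` and `AutoTokenizer`, calling `context_cite` on an article, a question, and a generated answer returns one attribution score per sentence. The paper sweeps {32, 64, 128, 256} ablations; `n_ablations=64` here is just a mid-range default.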