Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Sufficient Context: A New Lens on Retrieval Augmented Generation Systems
Authors: Hailey Joren, Jianyi Zhang, Chun-Sung Ferng, Da-Cheng Juan, Ankur Taly, Cyrus Rashtchian
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To shed light on this, we develop a new notion of sufficient context, along with a method to classify instances that have enough information to answer the query. We then use sufficient context to analyze several models and datasets. By stratifying errors based on context sufficiency, we find that larger models with higher baseline performance (Gemini 1.5 Pro, GPT 4o, Claude 3.5) excel at answering queries when the context is sufficient, but often output incorrect answers instead of abstaining when the context is not. |
| Researcher Affiliation | Collaboration | Hailey Joren UC San Diego EMAIL Jianyi Zhang Duke University EMAIL Chun-Sung Ferng Google EMAIL Da-Cheng Juan Google EMAIL Ankur Taly Google EMAIL Cyrus Rashtchian Google EMAIL |
| Pseudocode | No | The paper describes methods in prose but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code for our selective generation method and the prompts used in our autorater analysis are available on our github. |
| Open Datasets | Yes | We consider FreshQA, Musique-Ans, and HotpotQA as a representative spread of open-book QA datasets. FreshQA (Vu et al., 2023)... Musique-Ans (Trivedi et al., 2022)... HotpotQA (Yang et al., 2018)... |
| Dataset Splits | Yes | We used either a 2,000-example random subset sampled from the training set of the Musique-Ans dataset or from the development set of the HotpotQA dataset |
| Hardware Specification | No | The paper lists various LLM models used (e.g., GPT-4o, Gemini 1.5 Pro) and the LoRA adaptation technique for fine-tuning, but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) on which these experiments were conducted. |
| Software Dependencies | No | The paper mentions specific language models and retrieval models used, such as 'gpt-4o-2024-08-06 model', 'gemini-1.5-pro-0514 model', 'claude-3-5-sonnet-20240620 model', 'FLAMe-RM-24B model', and 'intfloat/e5-base-v2'. However, it does not provide specific version numbers for underlying software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries. |
| Experiment Setup | Yes | For the LoRA parameters, we set the rank to 4 and alpha to 8 for all experiments. The models were fine-tuned over 2 epochs with a batch size of 16 and a learning rate of 1 × 10⁻⁵. |
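The reported LoRA hyperparameters (rank 4, alpha 8) fix the shape and scaling of the low-rank weight update. As a minimal illustration of that arithmetic, not of the paper's actual fine-tuning code, here is a NumPy sketch; the layer dimensions and initializations are illustrative assumptions:

```python
import numpy as np

# Hyperparameters reported in the paper.
rank, alpha = 4, 8
d_in, d_out = 64, 64  # illustrative layer size, not from the paper

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def lora_forward(x):
    """Forward pass with the LoRA update: W x + (alpha / rank) * B (A x)."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, the adapter starts as an exact no-op on W.
assert np.allclose(lora_forward(x), W @ x)
```

With rank 4 and alpha 8 the effective scaling factor alpha / rank is 2, and only the A and B matrices (a small fraction of W's parameters) would be updated during fine-tuning.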