Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Sufficient Context: A New Lens on Retrieval Augmented Generation Systems
Authors: Hailey Joren, Jianyi Zhang, Chun-Sung Ferng, Da-Cheng Juan, Ankur Taly, Cyrus Rashtchian
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To shed light on this, we develop a new notion of sufficient context, along with a method to classify instances that have enough information to answer the query. We then use sufficient context to analyze several models and datasets. By stratifying errors based on context sufficiency, we find that larger models with higher baseline performance (Gemini 1.5 Pro, GPT 4o, Claude 3.5) excel at answering queries when the context is sufficient, but often output incorrect answers instead of abstaining when the context is not. |
| Researcher Affiliation | Collaboration | Hailey Joren UC San Diego EMAIL Jianyi Zhang Duke University EMAIL Chun-Sung Ferng Google EMAIL Da-Cheng Juan Google EMAIL Ankur Taly Google EMAIL Cyrus Rashtchian Google EMAIL |
| Pseudocode | No | The paper describes methods in prose but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code for our selective generation method and the prompts used in our autorater analysis are available on our github. |
| Open Datasets | Yes | We consider FreshQA, Musique-Ans, and HotpotQA as a representative spread of open-book QA datasets. FreshQA (Vu et al., 2023)... Musique-Ans (Trivedi et al., 2022)... HotpotQA (Yang et al., 2018)... |
| Dataset Splits | Yes | We used either a 2,000-example random subset sampled from the training set of the Musique-Ans dataset or from the development set of the HotpotQA dataset |
| Hardware Specification | No | The paper lists various LLM models used (e.g., GPT-4o, Gemini 1.5 Pro) and the LoRA adaptation technique for fine-tuning, but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) on which these experiments were conducted. |
| Software Dependencies | No | The paper mentions specific language models and retrieval models used, such as 'gpt-4o-2024-08-06 model', 'gemini-1.5-pro-0514 model', 'claude-3-5-sonnet-20240620 model', 'FLAMe-RM-24B model', and 'intfloat/e5-base-v2'. However, it does not provide specific version numbers for underlying software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries. |
| Experiment Setup | Yes | For the LoRA parameters, we set the rank to 4 and alpha to 8 for all experiments. The models were fine-tuned over 2 epochs with a batch size of 16 and a learning rate of 1 × 10⁻⁵. |
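The reported LoRA hyperparameters (rank 4, alpha 8) fix the shape and scaling of the low-rank weight update. As a minimal illustration of that arithmetic, not of the paper's actual fine-tuning code, here is a NumPy sketch; the layer dimensions and initializations are illustrative assumptions:

```python
import numpy as np

# Hyperparameters reported in the paper.
rank, alpha = 4, 8
d_in, d_out = 64, 64  # illustrative layer size, not from the paper

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def lora_forward(x):
    """Forward pass with the LoRA update: W x + (alpha / rank) * B (A x)."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, the adapter starts as an exact no-op on W.
assert np.allclose(lora_forward(x), W @ x)
```

With rank 4 and alpha 8 the effective scaling factor alpha / rank is 2, and only the A and B matrices (a small fraction of W's parameters) would be updated during fine-tuning.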