Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Understanding Synthetic Context Extension via Retrieval Heads

Authors: Xinyu Zhao, Fangcong Yin, Greg Durrett

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of needle concepts to be retrieved and diversity of the surrounding haystack context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. Although models trained on synthetic data underperform models trained on the real data, the impacts of both training settings can be understood via a shared feature of the attention computation, retrieval heads (Wu et al., 2025). The retrieval heads learned from synthetic data have high overlap with retrieval heads learned on real data. Furthermore, there is a strong correlation between the recall of heads learned and the downstream performance of a model, allowing us to interpret and predict the performance of models trained in different settings. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world LLM capabilities over long contexts.
Researcher Affiliation Academia Department of Computer Science, The University of Texas at Austin, Texas, USA.
Pseudocode No The paper describes methods and procedures in paragraph text and provides prompts used for data generation, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code No The paper mentions using existing libraries and tools like Huggingface TRL, PEFT, LoRA, Baukit, Flash Attention 2, and DeepSpeed for fine-tuning and analysis. However, it does not explicitly state that the authors are releasing their own implementation code for the methodology described in this paper, nor does it provide a direct link to a repository for their specific code. The provided link to Baukit is for a tool they used, not their own project's code.
Open Datasets Yes We use the following datasets. MDQA (Liu et al., 2024a): MDQA is a multi-document question answering (QA) dataset... MuSiQue (Trivedi et al., 2022): MuSiQue is a multi-hop QA dataset... SummHay Citation (Laban et al., 2024): Summary of a Haystack (SummHay) is a long-context retrieval dataset... We extend the original MDQA dataset in 4K context to 32K context by retrieving additional distractor paragraphs from Natural Questions-Open (Kwiatkowski et al., 2019; Lee et al., 2019) with Contriever (Izacard et al., 2022).
Dataset Splits Yes We use 1400 examples for training MDQA models, 400 examples for MuSiQue models, and 400 examples for SummHay Citation models. Each dataset is partitioned into a 90/10 train/validation split.
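The reported partition can be sketched as follows; this is an illustration, not the authors' code, and the use of integer truncation for the 90% share is an assumption.

```python
# Sketch of the reported 90/10 train/validation partition over the stated
# training-set sizes (1400 MDQA, 400 MuSiQue, 400 SummHay Citation).
# Assumption: the 90% train share is computed by integer truncation.
sizes = {"MDQA": 1400, "MuSiQue": 400, "SummHay Citation": 400}

def split_90_10(n):
    """Return (train, validation) example counts for a 90/10 partition of n."""
    train = int(n * 0.9)
    return train, n - train

splits = {name: split_90_10(n) for name, n in sizes.items()}
print(splits)  # e.g. MDQA -> (1260, 140)
```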
Hardware Specification Yes We enable Flash Attention 2 and DeepSpeed and use a single NVIDIA H100 GPU (96GB) for each training run.
Software Dependencies No For fine-tuning, we use the Huggingface TRL (von Werra et al., 2020) and PEFT (Mangrulkar et al., 2022) libraries to fine-tune attention heads with LoRA (Hu et al., 2022) (rank = 8 and alpha = 8) using a batch size of 1 and 4 gradient accumulation steps. We enable Flash Attention 2 and DeepSpeed. While these libraries are mentioned, specific version numbers for TRL, PEFT, LoRA, Flash Attention 2, or DeepSpeed as used in their experiments are not provided in the text.
Experiment Setup Yes For fine-tuning, we use the Huggingface TRL (von Werra et al., 2020) and PEFT (Mangrulkar et al., 2022) libraries to fine-tune attention heads with LoRA (Hu et al., 2022) (rank = 8 and alpha = 8) using a batch size of 1 and 4 gradient accumulation steps. To extend models from their original 8K pretrained context length to 32K, we follow Gradient (2024) in calculating new RoPE (Su et al., 2024) theta values, using 6315088 for Llama-3-8B-Instruct and 59300 for Mistral-7B-Instruct-v0.1. We scale the sliding window accordingly for Mistral-7B-Instruct-v0.1 to 16K context.
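The reported hyperparameters map naturally onto a Hugging Face PEFT configuration. The sketch below is an assumption-laden illustration, not the authors' released code: the `target_modules` names assume Llama-style attention projections, since the paper states attention heads were fine-tuned but the exact module list is not quoted here.

```python
# Hedged sketch of the reported fine-tuning configuration using Hugging Face
# PEFT (not the authors' code; target_modules is an assumption for
# Llama-style attention projection layers).
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,           # reported LoRA rank
    lora_alpha=8,  # reported LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

# Reported effective batch size: per-device batch of 1 with 4 gradient
# accumulation steps (effective batch size 4). The RoPE theta values
# (6315088 for Llama-3-8B-Instruct, 59300 for Mistral-7B-Instruct-v0.1)
# would be set in the model config when extending context to 32K.
```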