Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers

Authors: Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our extensive evaluations demonstrate REFORM's effectiveness across various long-context understanding tasks. In needle-in-a-haystack tests, REFORM achieves perfect recall for contexts up to 1 million tokens at various depths. On more complex benchmarks, REFORM significantly outperforms existing methods, achieving over 52% performance gain on RULER and 34% on BABILong at 1M context lengths with the Mistral-NeMo-Instruct-2407 [2] model, compared to the best-performing baselines. On ∞-Bench, REFORM achieves 50.2% average accuracy with the same model, substantially exceeding the baseline performance of 37.6%. REFORM also outperforms the baselines in RepoEval, scoring 65.3% performance with Qwen2.5-Coder-1.5B-Instruct on API-level code completion while the best baseline gives 61.8%.
Researcher Affiliation	Collaboration	Woomin Song1, Sai Muralidhar Jayanthi2, Srikanth Ronanki2, Kanthashree Mysore Sathyendra2, Jinwoo Shin1, Aram Galstyan2, Shubham Katiyar2, Sravan Babu Bodapati2 (1: KAIST, 2: Amazon AGI)
Pseudocode	Yes	A REFORM algorithm: We illustrate the overall process of REFORM through a pseudocode in Algorithm 1. Algorithm 1: Overview of REFORM
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We do not provide the code at the submission time.
Open Datasets	Yes	Our extensive evaluations demonstrate REFORM's effectiveness across various long-context understanding tasks. In needle-in-a-haystack tests, REFORM achieves perfect recall for contexts up to 1 million tokens at various depths. On more complex benchmarks, REFORM significantly outperforms existing methods, achieving over 52% performance gain on RULER and 34% on BABILong at 1M context lengths with the Mistral-NeMo-Instruct-2407 [2] model, compared to the best-performing baselines. On ∞-Bench, REFORM achieves 50.2% average accuracy with the same model, substantially exceeding the baseline performance of 37.6%. REFORM also outperforms the baselines in RepoEval, scoring 65.3% performance with Qwen2.5-Coder-1.5B-Instruct on API-level code completion while the best baseline gives 61.8%.
Dataset Splits	Yes	To evaluate the precise retrieval performance of our approach, we employ the Needle-in-a-Haystack (NIAH) benchmark [22]. In Figure 3, we measure the performance of our method at different context lengths and needle depths. Our method demonstrates perfect performance across all setups up to 1M tokens, highlighting our method's robustness in handling extremely long contexts while maintaining precise retrieval performance. ... We report the averaged performance of all tasks at different context lengths. ... To evaluate the embeddings, we constructed a synthetic dataset based on multi-hop question answering. In this setup, we embedded documents from the HotPotQA dataset [45] at random positions within a long text corpus derived from the WikiText dataset [16]. Each question was appended at the end of the context, and token-level labels were created, where tokens from the golden documents were marked as ground truth. All samples were designed to be 8k tokens long, which is within the context window of the Mistral-7B-Instruct-v0.2 model. ... Performance measurement. Retrieval performance was quantified using the Mean Normalized Rank (MNR), which is calculated as the average normalized rank of the golden tokens. Lower scores correspond to higher performance, as the golden tokens have a high rank.
Hardware Specification	Yes	(b) Efficiency Analysis. Time (s) Memory (GB) ... All measurements are made with the Mistral-NeMo-Instruct-2407 model on a single H100 GPU, and are averaged over 10 samples. ... All measurements are made on a single H100 GPU, and we apply Flash Attention 2 [48] for all measurements.
Software Dependencies	Yes	All measurements are made on a single H100 GPU, and we apply Flash Attention 2 [48] for all measurements.
Experiment Setup	Yes	For all text-based experiments, we use Mistral-NeMo-Instruct-2407 [2] and Qwen2.5-7B-Instruct [21] models. For code completion experiments, we use Qwen2.5-Coder-1.5B and 7B models. For multi-modal experiments, we use Pixtral-12B-2409 [9]. All recurrent baselines operate with a KV cache budget and chunk size of 32k tokens, and InfLLM also uses 32k active KV cache budget. We always keep the initial and recent 256 tokens in cache for all baselines and REFORM to maintain the coherency of the text. For REFORM, we use a recomputation budget of 8k tokens for Mistral-NeMo-Instruct-2407 and 16k tokens for all other models. We provide more details in Section C.1.
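
Illustrative note on the Mean Normalized Rank (MNR) quoted in the Dataset Splits row: the excerpt defines MNR as the average normalized rank of the golden tokens, with lower values indicating better retrieval. The sketch below is one plausible reading of that definition, not the authors' implementation. It assumes tokens are ranked by descending retrieval score, that ranks are 1-based, and that a rank is normalized by dividing by the total number of tokens. The function name `mean_normalized_rank` and its parameters are hypothetical.

```python
from typing import Sequence


def mean_normalized_rank(scores: Sequence[float], is_golden: Sequence[bool]) -> float:
    """Average normalized rank of golden tokens (lower is better).

    Assumptions (not confirmed by the source): rank 1 goes to the
    highest-scoring token, and each rank is divided by the token count.
    """
    if len(scores) != len(is_golden):
        raise ValueError("scores and is_golden must have the same length")
    n = len(scores)

    # Token indices sorted from highest to lowest score (stable for ties).
    order = sorted(range(n), key=lambda i: -scores[i])
    # Map each token index to its 1-based rank.
    rank_of = {idx: pos + 1 for pos, idx in enumerate(order)}

    golden = [i for i in range(n) if is_golden[i]]
    if not golden:
        raise ValueError("at least one golden token is required")

    return sum(rank_of[i] / n for i in golden) / len(golden)


# Golden tokens receive the two highest scores, so the MNR is low.
good = mean_normalized_rank([0.9, 0.1, 0.8, 0.2], [True, False, True, False])
# Golden tokens receive the two lowest scores, so the MNR is high.
bad = mean_normalized_rank([0.9, 0.1, 0.8, 0.2], [False, True, False, True])
print(good, bad)
```

Under these assumptions, a retriever that places every golden token at the top of the ranking approaches the minimum possible MNR, while one that places them at the bottom approaches 1.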