Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Knowing When to Stop: Efficient Context Processing via Latent Sufficiency Signals

Authors: Roy Xie, Junlin Wang, Paul Rosu, Chunyuan Deng, Bolun Sun, Zihao Lin, Bhuwan Dhingra

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate 3.4% accuracy improvement while achieving 1.33× token reduction on average.
Researcher Affiliation	Academia	Roy Xie, Junlin Wang, Paul Rosu, Chunyuan Deng, Bolun Sun, Zihao Lin, Bhuwan Dhingra. Duke, Rice, JHU, UC Davis.
Pseudocode	No	The paper describes the methodology for dynamic context cutoff and probing LLMs for context sufficiency, explaining the process in detail with figures like Figure 2 illustrating the workflow. However, it does not present a formal 'Pseudocode' or 'Algorithm' block with structured steps in a code-like format.
Open Source Code	Yes	Code is available at https://github.com/ruoyuxie/when-to-stop.
Open Datasets	Yes	For single-hop reasoning, where answers are typically found within a single passage requiring minimal context dependency, we use SQuAD [19], a widely used dataset with questions based on Wikipedia passages; Natural Questions [11], containing questions derived from real-world search queries with answers located in a single but longer passage; and a Code Understanding dataset, where we use GPT-4o to synthetically generate multiple single-function code snippets as distractors, and use the original PCSD [26] data to create a QA task dataset requiring to first locate and then understand the relevant code. For multi-hop reasoning, which requires combining information from multiple parts of the context to arrive at the correct answer, we use HotpotQA [31], a popular dataset with multi-hop questions requiring reasoning across multiple paragraphs from Wikipedia; MUSIQUE [24], a dataset with compositional and nested questions requiring multi-step reasoning across multiple documents; and Multi-hop Key-Value Retrieval [36], a widely adopted synthetic dataset for evaluating long-context LLMs that requires exact retrieval of dependent key-value pairs across multiple documents.
Dataset Splits	Yes	The dataset is split into training and validation sets (4:1 ratio) per task. ... Each dataset contains 600 data points, and the train-validation-test split is 80%, 10%, and 10%, respectively.
Hardware Specification	Yes	Table 10: GPU configurations used for different models in our experiments. Model: GPUs Used. LLaMA 3.2-1B: 2 Nvidia A5000; Mistral 8B: 4 Nvidia A5000; Qwen 2.5-14B: 4 Nvidia A5000; LLaMA 3.3-70B: 4 Nvidia A6000.
Software Dependencies	No	The paper mentions using GPT-4o Mini for evaluation ('gpt-4o-mini-2024-07-18') but does not specify programming languages or library versions (e.g., Python, PyTorch versions) used for their own implementation.
Experiment Setup	Yes	The proposed dynamic context cutoff method involves three hyperparameters: the classification threshold τ, the number of attention heads used for training, and the number of classifiers in the ensemble. ... Specifically, we set k = 5 for attention heads with the highest F1 scores and train 8 lightweight classifiers for each head, selecting the top 4 with the highest AUC scores to form the ensemble. ... For all methods, including the proposed dynamic context cutoff method, we evaluate using percentage-based chunking with a 10% incremental threshold, meaning each chunk contains 10% more of the full context than the previous one. ... We fine-tune meta-llama/Llama-3.2-1B to predict the context cutoff point ... We optimize using the AdamW optimizer with a learning rate of 8.0e-05 and a batch size of 32, employing a cosine learning rate schedule with linear warmup.
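
The percentage-based chunking and threshold τ quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the default `tau=0.5`, and the `sufficiency_score` callable are hypothetical stand-ins. In the paper the sufficiency signal comes from an ensemble of lightweight classifiers over selected attention heads; here it is reduced to any function that maps a context prefix to a score.

```python
from typing import Callable, List, Sequence


def cumulative_chunks(tokens: Sequence[str], step: float = 0.10) -> List[Sequence[str]]:
    """Return prefixes covering 10%, 20%, ..., 100% of the context.

    Each prefix contains `step` more of the full context than the previous one.
    """
    n = len(tokens)
    n_steps = round(1 / step)
    prefixes = []
    for i in range(1, n_steps + 1):
        end = max(1, (n * i) // n_steps)  # keep at least one token per prefix
        prefixes.append(tokens[:end])
    return prefixes


def dynamic_cutoff(
    tokens: Sequence[str],
    sufficiency_score: Callable[[Sequence[str]], float],  # hypothetical scorer
    tau: float = 0.5,  # assumed threshold value, for illustration only
    step: float = 0.10,
) -> Sequence[str]:
    """Return the shortest prefix whose sufficiency score reaches tau."""
    for prefix in cumulative_chunks(tokens, step):
        if sufficiency_score(prefix) >= tau:
            return prefix
    # No prefix was judged sufficient: fall back to the full context.
    return tokens


def toy_score(prefix: Sequence[str]) -> float:
    """Toy scorer: the context is 'sufficient' once it contains the answer token."""
    return 1.0 if "answer" in prefix else 0.0


context = ["w"] * 20
context[6] = "answer"
kept = dynamic_cutoff(context, toy_score)
print(len(kept), "of", len(context), "tokens processed")
```

With the toy scorer, processing stops at the first 10% increment that includes the answer token, so later chunks are never read. Raising τ in a real scorer would trade token savings for a lower risk of stopping too early.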