Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Inference-Time Hyper-Scaling with KV Cache Compression

Authors: Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo Maria Ponti

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate inference-time scaling capabilities of efficient attention (including DMS) on reasoning datasets such as MATH-500 (Hendrycks et al., 2021b) and AIME 2024 for math, GPQA Diamond (Rein et al., 2024) for hard sciences, and LiveCodeBench (Jain et al., 2025) for coding. We find that DMS significantly improves the Pareto frontiers across various model sizes and datasets, outperforming vanilla LLMs in both memory reads (which is a proxy for runtime) and peak memory use.
Researcher Affiliation | Collaboration | Authors: Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti. Affiliations: NVIDIA, University of Warsaw, University of Edinburgh.
Pseudocode | No | The paper describes the Dynamic Memory Sparsification (DMS) procedure with textual explanations and a conceptual diagram (Figure 2), but does not present a formal pseudocode or algorithm block.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: While we do not provide code with sufficient instructions for reproduction, we attach to the submission code showing how to implement DMS using the public codebase for Dynamic Memory Compression.
Open Datasets | Yes | We evaluate inference-time scaling capabilities of efficient attention (including DMS) on reasoning datasets such as MATH-500 (Hendrycks et al., 2021b) and AIME 2024 for math, GPQA Diamond (Rein et al., 2024) for hard sciences, and LiveCodeBench (Jain et al., 2025) for coding. ... For GSM8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2021a), ARC-Challenge (Clark et al., 2018), and HellaSwag (Zellers et al., 2019), we use the Language Model Evaluation Harness (Gao et al., 2024), version 0.4.3.
Dataset Splits | No | The paper uses standard benchmark datasets and mentions few-shot prompting for evaluation (e.g., "8-shot prompting on GSM8K"), but does not explicitly describe the training, validation, and test splits used in its experiments. For instance, it gives no percentages, counts, or references to a predefined split methodology beyond naming the datasets.
Hardware Specification | Yes | Next, we investigate how the reduced memory reads and peak memory of DMS translate into latency and throughput, measured on an NVIDIA H100 SXM GPU in 16-bit precision... All models are trained on NVIDIA H100 GPUs using Megatron-LM (NVIDIA, 2024) in bfloat16 precision, with optimizer states stored in FP32.
Software Dependencies | Yes | For GSM8K (Cobbe et al., 2021), MMLU (Hendrycks et al., 2021a), ARC-Challenge (Clark et al., 2018), and HellaSwag (Zellers et al., 2019), we use the Language Model Evaluation Harness (Gao et al., 2024), version 0.4.3. ... For AIME24, GPQA Diamond (Rein et al., 2024), and MATH-500 (Lightman et al., 2024), we evaluate models using the Search and Learn framework (version 0.1.0) (Snell et al., 2024; Beeching et al., 2024), with math-parsing functionality derived from Math Verify (version 1.0.0) (Kydlíček and Gandenberger, 2025).
Experiment Setup | Yes | We use a batch size of 1024 following the original Llama recipe (Touvron et al., 2023). Context lengths are set to 4096 tokens for Llama 3.2 1B Instruct and Llama 2 7B models, and 8192 tokens for Llama 3.1 8B and R1-distilled models... For Qwen3-8B we use 256 batch size and 32K context length. ... Default DMS Configuration Unless otherwise specified, all DMS models use delayed eviction with a sliding window of 256 tokens and increment the compression ratio by 1 every 100 training steps. ... During training we follow DMC and apply a one-sided ℓ1 loss term... the target compression α is linearly annealed from 0 to 1 − 1/CR. We train the model using a logit distillation loss L_D (Hinton et al., 2015)...
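The compression schedule quoted in the Experiment Setup row can be illustrated with a short Python sketch. It is one plausible reading of the quoted text, not the authors' code. It assumes the running compression ratio starts at 1 and rises by 1 every 100 steps up to a final value, that the target eviction fraction is derived from it as 1 − 1/CR, and that the one-sided ℓ1 term penalizes only a shortfall of the mean eviction probability below that target. All function and parameter names are hypothetical, and the paper's exact interpolation and loss normalization may differ.

```python
from typing import Sequence


def running_compression_ratio(step: int, final_cr: int,
                              steps_per_increment: int = 100) -> int:
    """Compression ratio after `step` training steps.

    Assumed schedule: start at 1 (no compression), add 1 every
    `steps_per_increment` steps, and stop growing at `final_cr`.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return min(1 + step // steps_per_increment, final_cr)


def target_eviction_fraction(step: int, final_cr: int,
                             steps_per_increment: int = 100) -> float:
    """Target fraction of evicted tokens, alpha = 1 - 1/CR.

    A ratio of CR means keeping 1/CR of the KV cache, so the
    remaining 1 - 1/CR of the tokens would be evicted.
    """
    cr = running_compression_ratio(step, final_cr, steps_per_increment)
    return 1.0 - 1.0 / cr


def one_sided_l1(eviction_probs: Sequence[float], target: float) -> float:
    """One-sided L1 penalty (assumed form).

    Zero once the mean eviction probability reaches the target;
    otherwise equal to the shortfall below the target.
    """
    mean_evicted = sum(eviction_probs) / len(eviction_probs)
    return max(target - mean_evicted, 0.0)


if __name__ == "__main__":
    for step in (0, 100, 300, 1000):
        cr = running_compression_ratio(step, final_cr=4)
        alpha = target_eviction_fraction(step, final_cr=4)
        print(f"step={step:5d}  CR={cr}  target alpha={alpha:.3f}")
```

Under these assumptions a final ratio of 4 is reached after 300 steps. From then on the penalty is active only while the model evicts, on average, fewer than three quarters of its tokens.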