Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

Authors: Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that RPC improves generation throughput of QwQ-32B by up to 1.60× compared to the inference with full KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. ... 4 Experiments
Researcher Affiliation | Academia | ¹Seoul National University, ²Sungkyunkwan University. EMAIL {yulhwakim}@skku.edu
Pseudocode | Yes | Algorithm 1: Important token selection algorithm of RPC
    Input: generation step t, query of step t q_t, KV cache C_KV, selector query cache C_Q
    Output: updated C_KV, updated C_Q
    // Cache selector queries
    if (t − R) ≥ 0 and (t − R) mod P < R then
        Append q_t to C_Q
    // Compress KV cache every P steps
    if (t − R) ≥ 0 and (t − R) mod P = 0 then
        s ← Importance of tokens in C_KV    // Compute importance score
        C_tmp ← KV cache with top N × P / c importance scores    // Retain important KV cache
        C_KV ← C_tmp ∪ C_KV[−R:]    // Retain KV cache of selector window
        C_Q ← []    // Reset selector query cache
    return C_KV, C_Q
Open Source Code | Yes | Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.
Open Datasets | Yes | Our evaluation covers three reasoning-intensive benchmarks: American Invitational Mathematics Examination (AIME) 2024 for mathematical reasoning, LiveCodeBench [25] for coding tasks, and IFEval [31] for instruction following.
Dataset Splits | No | Our evaluation covers three reasoning-intensive benchmarks: American Invitational Mathematics Examination (AIME) 2024 for mathematical reasoning, LiveCodeBench [25] for coding tasks, and IFEval [31] for instruction following. We sample k completions per instance to compute pass@1, where k = 8 for AIME 2024, k = 4 for LiveCodeBench, and k = 1 for IFEval, respectively.
Hardware Specification | Yes | Throughput and memory measurements for DeepSeek-R1-Distill-Qwen-7B are obtained on a single NVIDIA H100 SXM GPU, while QwQ-32B evaluations are conducted on four H100 SXM GPUs.
Software Dependencies | Yes | Our implementation uses FlashAttention-2 [32] as the attention kernel for all decoding layers and is built on top of Hugging Face Transformers v4.45 [33].
Experiment Setup | Yes | All outputs are generated using nucleus sampling with temperature = 0.6 and top-p = 0.95. For QwQ-32B, we additionally set top-k = 40 following the model's recommended decoding configuration. The maximum number of generated tokens is capped at 32768, following the default settings of tested models. ... Unless otherwise specified, we use the following default RPC hyperparameters: We set the selector window size R to 32 and apply local pooling with window size w = 3 for importance smoothing. The compression interval P is set to 1024 or 4096. The target compression ratio is set to 4 by default.
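
The periodic compression described in the Pseudocode and Experiment Setup rows can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The function names (`rpc_step`, `local_pool`, `dot`), the dot-product importance score, the simplified schedule (compress whenever t is a multiple of P, with t counted from 1), and the budget of t // c retained tokens are all assumptions made for illustration. Only the roles of R (selector window), P (compression interval), w (pooling window), and the compression ratio come from the rows above.

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def local_pool(scores, w):
    """Smooth scores by averaging each one with its neighbours in a window of size w."""
    half = w // 2
    pooled = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        pooled.append(sum(scores[lo:hi]) / (hi - lo))
    return pooled


def rpc_step(t, q_t, kv_t, kv_cache, q_cache, R=32, P=1024, c=4, w=3):
    """Append step t (1-indexed) to the caches and compress every P steps.

    kv_cache is a list of (key, value) pairs; q_cache holds recent queries.
    Returns the updated (kv_cache, q_cache).
    """
    kv_cache = kv_cache + [kv_t]
    # Assumption: the selector queries are simply the last R queries seen.
    q_cache = (q_cache + [q_t])[-R:]
    if t % P != 0 or len(kv_cache) <= R:
        return kv_cache, q_cache

    # Split into compressible tokens and the selector window (always kept).
    older, window = kv_cache[:-R], kv_cache[-R:]
    # Assumption: importance = summed query-key dot products, then local pooling.
    raw = [sum(dot(q, key) for q in q_cache) for key, _ in older]
    scores = local_pool(raw, w)
    # Assumption: keep 1/c of all tokens generated so far, plus the window.
    budget = min(len(older), t // c)
    top = sorted(range(len(older)), key=lambda i: scores[i], reverse=True)[:budget]
    kept = [older[i] for i in sorted(top)]  # preserve generation order
    return kept + window, []  # reset the selector query cache


# Toy run: 16 steps, one "important" key at step 4, tiny R/P for readability.
kv, qs = [], []
for t in range(1, 17):
    key = [10.0] if t == 4 else [1.0]
    kv, qs = rpc_step(t, [1.0], (key, t), kv, qs, R=2, P=8, c=4, w=3)
steps_kept = [step for _, step in kv]
```

In the toy run the value stored with each key is just its step number, so `steps_kept` shows which generation steps survive two rounds of compression. The most recent R steps are always retained, and the high-scoring step and its pooled neighbours fill the remaining budget.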