Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs
Authors: Yuxiang Zhang, Zhengxu Yu, Weihang Pan, Zhongming Jin, Qiang Fu, Deng Cai, Binbin Lin, Jieping Ye
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper includes sections like '4 Experiments', '4.1 Experimental Setup', '4.2 Evaluation on General Reasoning Benchmarks', and '4.3 Ablation Study', along with tables and figures presenting experimental results and comparisons, indicating empirical studies and data analysis. |
| Researcher Affiliation | Collaboration | The authors are affiliated with 'Zhejiang University' (academic) and 'Alibaba Cloud' (industry), representing a mix of academic and industry affiliations. |
| Pseudocode | No | The paper describes its methodology in prose and through mathematical equations and diagrams (e.g., Figure 1), but it does not contain explicit pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Our code is available at https://github.com/zhangyx1122/TokenSqueeze. |
| Open Datasets | Yes | We evaluate model performance on four benchmark datasets: AIME24, MATH500 [13], AIME25, and LiveCodeBench [23]. |
| Dataset Splits | No | The paper mentions evaluating on benchmark datasets like AIME24, MATH500, AIME25, and LiveCodeBench, specifying a date range for LiveCodeBench problems. However, it does not explicitly provide percentages or sample counts for training/validation/test splits for the primary datasets used in the training of TokenSqueeze (which leverages self-generated data). |
| Hardware Specification | Yes | Training is conducted using the PyTorch framework on computing nodes equipped with 8 NVIDIA Tesla A100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'PyTorch framework' and building on the 'DPO pipeline from the LLaMAFactory framework' but does not provide specific version numbers for these or other software components. |
| Experiment Setup | Yes | All models are optimized with a learning rate of 5 × 10⁻⁶ and a batch size of 128, using the Adam optimizer. We set the learning rate to 5 × 10⁻⁶, use a batch size of 128, and configure the maximum context length to 9000 tokens. The length penalty coefficient is λ = 1, and η = 0.5. A KL divergence constraint of 0.005 is used, and α is set to 0.2. Decoding temperatures of 0.6 and 0.2 are specified for different benchmarks. |
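
For anyone attempting a reproduction, the values reported in the Experiment Setup row can be gathered into a single configuration object. The sketch below is only an illustration: the class name `ReportedSetup` and every field name are hypothetical and are not taken from the TokenSqueeze code or from LLaMAFactory. Only the numeric values come from the table above. The summary does not say which decoding temperature belongs to which benchmark, so both are stored without an assignment.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters as reported in the table above (field names are hypothetical)."""

    learning_rate: float = 5e-6          # "learning rate of 5 x 10^-6"
    batch_size: int = 128
    optimizer: str = "Adam"
    max_context_length: int = 9000       # tokens
    length_penalty_lambda: float = 1.0   # λ
    eta: float = 0.5                     # η
    kl_constraint: float = 0.005         # KL divergence constraint
    alpha: float = 0.2                   # α
    # Reported for "different benchmarks"; the mapping is not given here.
    decoding_temperatures: tuple = (0.6, 0.2)
    gpus_per_node: int = 8               # NVIDIA A100 GPUs per computing node


setup = ReportedSetup()

# Print the configuration as simple key/value lines for a run log.
for name, value in asdict(setup).items():
    print(f"{name}: {value}")
```

Freezing the dataclass keeps the recorded values from being changed by accident during a run. Software versions are deliberately absent, since the paper does not report them.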