Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs

Authors: Yuxiang Zhang, Zhengxu Yu, Weihang Pan, Zhongming Jin, Qiang Fu, Deng Cai, Binbin Lin, Jieping Ye

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The paper includes sections like '4 Experiments', '4.1 Experimental Setup', '4.2 Evaluation on General Reasoning Benchmarks', and '4.3 Ablation Study', along with tables and figures presenting experimental results and comparisons, indicating empirical studies and data analysis.
Researcher Affiliation	Collaboration	The authors are affiliated with 'Zhejiang University' (academic) and 'Alibaba Cloud' (industry), representing a mix of academic and industry affiliations.
Pseudocode	No	The paper describes its methodology in prose and through mathematical equations and diagrams (e.g., Figure 1), but it does not contain explicit pseudocode or algorithm blocks with structured steps.
Open Source Code	Yes	Our code is available at https://github.com/zhangyx1122/TokenSqueeze.
Open Datasets	Yes	We evaluate model performance on four benchmark datasets: AIME24, MATH500 [13], AIME25, and LiveCodeBench [23].
Dataset Splits	No	The paper mentions evaluating on benchmark datasets like AIME24, MATH500, AIME25, and LiveCodeBench, specifying a date range for LiveCodeBench problems. However, it does not explicitly provide percentages or sample counts for training/validation/test splits for the primary datasets used in the training of TokenSqueeze (which leverages self-generated data).
Hardware Specification	Yes	Training is conducted using the PyTorch framework on computing nodes equipped with 8 NVIDIA Tesla A100 GPUs.
Software Dependencies	No	The paper mentions using the 'PyTorch framework' and building on the 'DPO pipeline from the LLaMAFactory framework' but does not provide specific version numbers for these or other software components.
Experiment Setup	Yes	All models are optimized with a learning rate of 5 × 10⁻⁶ and a batch size of 128, using the Adam optimizer. We set the learning rate to 5 × 10⁻⁶, use a batch size of 128, and configure the maximum context length to 9000 tokens. The length penalty coefficient is λ = 1, and η = 0.5. A KL divergence constraint of 0.005 is used, and α is set to 0.2. Decoding temperatures of 0.6 and 0.2 are specified for different benchmarks.