Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains
Authors: Wenhui Tan, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Ruihua Song, Jian Luan
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to the explicit CoT method. |
| Researcher Affiliation | Collaboration | ¹Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; ²MiLM Plus, Xiaomi Inc., Beijing, China |
| Pseudocode | No | The paper describes the methodology in text and through diagrams (Figure 2) but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project page: https://github.com/xiaomi-research/colar. |
| Open Datasets | Yes | Our extensive evaluations on four grade-school level mathematical reasoning datasets (GSM8k [4], GSM8k-hard [9], SVAMP [20], and MultiArith [21]) demonstrate that CoLaR achieves a 14.1% improvement in accuracy compared to state-of-the-art baseline methods at comparable compression ratios. |
| Dataset Splits | Yes | GSM8k-Aug comprises approximately 385k training samples and 1k test samples. ... Since the original MATH dataset does not provide an official validation set, we randomly shuffle the training set and allocate 10% of the samples for validation purposes. |
| Hardware Specification | Yes | For SFT experiments, we leverage Distributed Data Parallel across eight A100 GPUs with a total batch size of 256. The RL experiments are conducted on a single A100 GPU with a rollout batch size of 8, optimizer step batch size of 4, group size G of 8, and clip ε of 0.2. |
| Software Dependencies | No | The paper mentions 'Python, CUDA, PyTorch, and NumPy' as libraries used for reproducibility but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We utilize the AdamW optimizer with a weight decay of 1e-2 throughout our experiments. The learning rate is set at 1e-4 for supervised fine-tuning (SFT) and 1e-6 for reinforcement learning (RL). For SFT experiments, we leverage Distributed Data Parallel across eight A100 GPUs with a total batch size of 256. The RL experiments are conducted on a single A100 GPU with a rollout batch size of 8, optimizer step batch size of 4, group size G of 8, and clip ε of 0.2. |
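
The Dataset Splits and Experiment Setup rows can be read as a small configuration plus a shuffle-and-hold-out step. The sketch below is illustrative only and is not taken from the paper's repository: the names `StageConfig`, `per_gpu_batch_size`, and `train_val_split` are hypothetical, the random seed is an assumption (the quoted text does not state one), and the per-GPU batch size assumes no gradient accumulation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters quoted in the table; field names are hypothetical."""
    learning_rate: float
    weight_decay: float = 1e-2  # AdamW weight decay, reported as used throughout
    num_gpus: int = 1


# Values from the Experiment Setup row.
SFT = StageConfig(learning_rate=1e-4, num_gpus=8)
RL = StageConfig(learning_rate=1e-6, num_gpus=1)
SFT_TOTAL_BATCH_SIZE = 256


def per_gpu_batch_size(total_batch_size: int, num_gpus: int) -> int:
    """Per-device batch size under data parallelism.

    Assumes no gradient accumulation, which the quoted text does not confirm.
    """
    if total_batch_size % num_gpus != 0:
        raise ValueError("total batch size must divide evenly across GPUs")
    return total_batch_size // num_gpus


def train_val_split(samples, val_fraction=0.1, seed=0):
    """Shuffle a copy of `samples` and hold out `val_fraction` for validation.

    The seed is an assumption made here for repeatability.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    train, val = train_val_split(range(1000))
    print(len(train), len(val))
    print(per_gpu_batch_size(SFT_TOTAL_BATCH_SIZE, SFT.num_gpus))
```

Fixing the seed makes the 10% validation hold-out repeatable across runs, which matters here because the summary notes that the original MATH dataset ships no official validation set.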