Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains

Authors: Wenhui Tan, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Ruihua Song, Jian Luan

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to explicit CoT method.
Researcher Affiliation	Collaboration	(1) Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; (2) MiLM Plus, Xiaomi Inc., Beijing, China
Pseudocode	No	The paper describes the methodology in text and through diagrams (Figure 2) but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Project page: https://github.com/xiaomi-research/colar.
Open Datasets	Yes	Our extensive evaluations on four grade-school level mathematical reasoning datasets (GSM8k [4], GSM8k-hard [9], SVAMP [20], and MultiArith [21]) demonstrate that CoLaR achieves a 14.1% improvement in accuracy compared to state-of-the-art baseline methods at comparable compression ratios.
Dataset Splits	Yes	GSM8k-Aug comprises approximately 385k training samples and 1k test samples. ... Since the original MATH dataset does not provide an official validation set, we randomly shuffle the training set and allocate 10% of the samples for validation purposes.
Hardware Specification	Yes	For SFT experiments, we leverage Distributed Data Parallel across eight A100 GPUs with a total batch size of 256. The RL experiments are conducted on a single A100 GPU with a rollout batch size of 8, optimizer step batch size of 4, group size G of 8, and clip ε of 0.2.
Software Dependencies	No	The paper mentions 'Python, CUDA, PyTorch, and NumPy' as libraries used for reproducibility but does not provide specific version numbers for these software components.
Experiment Setup	Yes	We utilize the AdamW optimizer with a weight decay of 1e-2 throughout our experiments. The learning rate is set at 1e-4 for supervised fine-tuning (SFT) and 1e-6 for reinforcement learning (RL). For SFT experiments, we leverage Distributed Data Parallel across eight A100 GPUs with a total batch size of 256. The RL experiments are conducted on a single A100 GPU with a rollout batch size of 8, optimizer step batch size of 4, group size G of 8, and clip ε of 0.2.
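
For readers who want the values from the Dataset Splits and Experiment Setup rows in one place, the following is a minimal sketch that collects them in a config object and illustrates a shuffled 90/10 train/validation split. It is not taken from the authors' repository: the class name, field names, the `split_train_val` helper, and the random seed are all hypothetical, and the per-GPU batch size assumes the total batch of 256 is divided evenly across the eight GPUs with no gradient accumulation, which the table does not state.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the table above; all field names are hypothetical."""
    optimizer: str = "AdamW"
    weight_decay: float = 1e-2
    sft_lr: float = 1e-4                   # supervised fine-tuning learning rate
    rl_lr: float = 1e-6                    # reinforcement learning learning rate
    sft_num_gpus: int = 8                  # A100 GPUs, Distributed Data Parallel
    sft_total_batch_size: int = 256
    rl_rollout_batch_size: int = 8
    rl_optimizer_step_batch_size: int = 4
    rl_group_size: int = 8                 # group size G
    rl_clip_epsilon: float = 0.2

    @property
    def sft_per_gpu_batch_size(self) -> int:
        # Assumption: even split across GPUs, no gradient accumulation.
        return self.sft_total_batch_size // self.sft_num_gpus


def split_train_val(samples, val_fraction=0.1, seed=0):
    """Shuffle the samples and hold out a fraction for validation.

    The seed is an assumption; the table reports no seed for the split.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    cfg = TrainConfig()
    print("per-GPU SFT batch size:", cfg.sft_per_gpu_batch_size)
    train, val = split_train_val(range(100))
    print("train/val sizes:", len(train), len(val))
```

Fixing the seed makes the held-out 10% repeatable across runs; whether the authors fixed one is not reported in the table.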