Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space

Authors: Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Xin Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations on diverse mathematical and coding benchmarks consistently demonstrate the effectiveness and efficiency of Soft Thinking, improving pass@1 accuracy by up to 2.48 points while simultaneously reducing token usage by up to 22.4% compared to standard CoT. Qualitative analysis further reveals that Soft Thinking outputs remain highly interpretable and readable, highlighting the potential of Soft Thinking to break the inherent bottleneck of discrete language-based reasoning.
Researcher Affiliation	Academia	Zhen Zhang¹, Xuehai He², Weixiang Yan¹, Ao Shen⁴, Chenyang Zhao³,⁵, Xin Eric Wang¹. ¹University of California, Santa Barbara; ²University of California, Santa Cruz; ³University of California, Los Angeles; ⁴Purdue University; ⁵LMSYS Org.
Pseudocode	Yes	5. schedule_batch.py, scheduler.py, scheduler_output_processor_mixin.py. Purpose: State management and output tracking for soft thinking. Key changes (pseudocode for Cold Stop): if entropy < entropy_threshold: low_entropy_steps += 1; else: low_entropy_steps = 0; if low_entropy_steps >= length_threshold: insert the end-of-thinking token and switch to answer mode via self.output_ids[-1] = self.sampling_params.think_end_str_id
Open Source Code	Yes	Code is available here. NeurIPS Paper Checklist...5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We add all the data, and code results in the supplementary material, which makes it easy to reproduce our work. We make them public on GitHub. Besides, we have implemented our code in the popular open-source inference framework SGLang and highlighted the key code content in the appendix A.3 to facilitate reproduction.
Open Datasets	Yes	Benchmarks. We conduct a comprehensive evaluation of our method on eight benchmark tasks, including Math500 [31], AIME 2024 [32], GSM8K [33], and GPQA-Diamond [34] in the mathematics domain, as well as HumanEval [35], MBPP [36], and LiveCodeBench [37] in the programming domain. Detailed descriptions of these benchmarks are provided in Appendix A.2.
Dataset Splits	No	The paper primarily evaluates pre-trained LLMs on established benchmarks and describes sampling outputs for evaluation (e.g., "evaluated using 16 samples per problem"), but it does not provide explicit train/test/validation dataset splits (percentages or counts) for model training or custom data partitioning within the paper's text. It uses existing test sets of known benchmarks.
Hardware Specification	Yes	We implement our Soft Thinking on SGLang [39], enabling fast inference (see Appendix A.3 for implementation details). We evaluate our method on a server equipped with eight NVIDIA H100 80GB GPUs.
Software Dependencies	Yes	In this appendix, we describe the engineering modifications made to the SGLang inference engine (v0.4.6.post1) to support our proposed Soft Thinking method.
Experiment Setup	Yes	For all experiments, the maximum generation length was set to 32,768, the temperature to 0.6, top-k to 30, and top-p to 0.95, unless specified otherwise. The Standard CoT baseline was evaluated using 16 samples per problem to calculate Pass@1 accuracy, whereas the greedy CoT approach utilized a temperature of 0 with a single sample. For Soft Thinking, the concept token was determined using the top-n tokens, where n ∈ {5, 10, 15, 20, 30}, along with an entropy threshold τ chosen from {0.01, 0.05, 0.1, 0.2} and a length threshold k selected from {128, 256, 512, 1024}. All other settings were kept consistent. We find that n = 15 yields the best performance for QwQ-32B [13], while n = 10 is optimal for DeepSeek-R1 models [38]. Results are reported based on the best-performing combinations of τ and k.
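
For readers who want to see how the Cold Stop counter quoted in the Pseudocode row interacts with the thresholds in the Experiment Setup row, the following is a minimal, standalone Python sketch. It is not the authors' SGLang patch. The class name ColdStop, the helper shannon_entropy, the constant SEARCH_GRID, and the choice of natural-log entropy are assumptions made for illustration; only the counter logic and the candidate values for n, τ, and k come from the table above.

```python
import itertools
import math
from dataclasses import dataclass


def shannon_entropy(probs):
    """Shannon entropy in nats (the log base is an assumption, not stated in the table)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


@dataclass
class ColdStop:
    """Hypothetical helper mirroring the quoted Cold Stop pseudocode."""

    entropy_threshold: float  # tau in the Experiment Setup row
    length_threshold: int     # k in the Experiment Setup row
    low_entropy_steps: int = 0

    def step(self, probs):
        """Update the counter for one decoding step.

        Returns True once the entropy has stayed below the threshold for
        length_threshold consecutive steps, i.e. when the end-of-thinking
        token would be inserted.
        """
        if shannon_entropy(probs) < self.entropy_threshold:
            self.low_entropy_steps += 1
        else:
            self.low_entropy_steps = 0  # any high-entropy step resets the streak
        return self.low_entropy_steps >= self.length_threshold


# Candidate values reported in the Experiment Setup row: (top-n, tau, k).
SEARCH_GRID = list(
    itertools.product(
        [5, 10, 15, 20, 30],
        [0.01, 0.05, 0.1, 0.2],
        [128, 256, 512, 1024],
    )
)

# Toy run with a deliberately small k so the behaviour is visible.
confident = [0.999, 0.001]  # entropy of roughly 0.008 nats
uncertain = [0.5, 0.5]      # entropy of roughly 0.693 nats
stopper = ColdStop(entropy_threshold=0.1, length_threshold=3)
decisions = [
    stopper.step(p)
    for p in [confident, confident, uncertain, confident, confident, confident]
]
print(decisions)         # [False, False, False, False, False, True]
print(len(SEARCH_GRID))  # 80
```

In the toy run the single uncertain step resets the streak, so the stop only fires after three further confident steps. If every combination of the listed values were tried, the grid would contain 80 settings per model; the table does not say whether the search was exhaustive.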