Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens

Authors: Yinhan He, Wendy Zheng, Yaochen Zhu, Zaiyi Zheng, Lin Su, Sriram Vasudevan, Qi Guo, Liangjie Hong, Jundong Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate the superior performance of SemCoT compared to state-of-the-art methods in both efficiency and effectiveness. Our code can be found at https://github.com/YinhanHe123/SemCoT.
Researcher Affiliation	Collaboration	Yinhan He University of Virginia Charlottesville, VA EMAIL Wendy Zheng University of Virginia Charlottesville, VA EMAIL Yaochen Zhu University of Virginia Charlottesville, VA EMAIL Zaiyi Zheng University of Virginia Charlottesville, VA EMAIL Lin Su LinkedIn Inc. Sunnyvale, CA EMAIL Sriram Vasudevan LinkedIn Inc. Sunnyvale, CA EMAIL Qi Guo LinkedIn Inc. Sunnyvale, CA EMAIL Liangjie Hong LinkedIn Inc. Sunnyvale, CA EMAIL Jundong Li University of Virginia Charlottesville, VA EMAIL
Pseudocode	No	The paper describes the methodology in Section 3 and illustrates it with Figure 2, but it does not include a distinct pseudocode or algorithm block.
Open Source Code	Yes	Our code can be found at https://github.com/YinhanHe123/SemCoT.
Open Datasets	Yes	Datasets. We adopt five representative datasets from three different semantic domains used for benchmarking CoT performance. Specifically, we apply mathematical reasoning datasets GSM8K [5], SVAMP [36], MultiArith [4, 39], commonsense reasoning dataset CommonsenseQA [44], and symbolic reasoning dataset Coin Flip [21]. Please see Appendix C for the metadata of the five used datasets. ...Table 4: License Information for Hugging Face Datasets (Dataset: License Information). SVAMP: MIT License; MultiArith: CC BY 4.0; CommonsenseQA: MIT License; Coin Flip: MIT License; GSM8K: MIT License.
Dataset Splits	Yes	Table 3: Metadata for benchmark datasets (Dataset: Train Size / Test Size / Reasoning Type). GSM8K [5]: 7,500 / 1,000 / Arithmetic; SVAMP [36]: 700 / 300 / Arithmetic; MultiArith [4, 39]: 420 / 180 / Arithmetic; CommonsenseQA [44]: 9,741 / 1,140 / Commonsense; Coin Flip [21]: 20,000 / 3,330 / Symbolic.
Hardware Specification	Yes	Hardware Information. We perform all experiments on multiple machines with NVIDIA H100 80GB GPUs running CUDA 12.4.
Software Dependencies	Yes	Our SemCoT is implemented with PyTorch [35] and Huggingface [50] training pipeline. We list the hyperparameters settings in the GitHub repository (found in utils/utils.py).
Experiment Setup	Yes	Implementation Details. We set the output embedding dimension of the customized sentence transformer to 768. The number of implicit tokens during training is five, and, during evaluation, it is set to one. We optimize both the customized sentence transformer and the implicit reasoning generator with AdamW [27] using the best hyperparameters found. For inference, we allow up to thirty answer tokens to be generated to enforce the LLM to generate the answer instead of the CoT. ...Our SemCoT is implemented with PyTorch [35] and Huggingface [50] training pipeline. We list the hyperparameters settings in the GitHub repository (found in utils/utils.py). Table 1 shows the average accuracy and time measurements over three independent rounds for each method. During training, the baselines and SemCoT are allotted to five implicit reasoning tokens.
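
For readers who want to check the reported setup against the repository, the values quoted in the Dataset Splits and Experiment Setup rows can be collected in one place. The sketch below is only an illustration: the class name `ReportedSetup`, its field names, and the helper `held_out_fraction` are hypothetical and are not taken from the SemCoT code base (the authors' actual hyperparameters are said to live in utils/utils.py). Only the numbers come from the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row.

    Field names are illustrative assumptions, not identifiers
    from the SemCoT repository.
    """
    sentence_embedding_dim: int = 768   # output dimension of the sentence transformer
    train_implicit_tokens: int = 5      # implicit tokens during training
    eval_implicit_tokens: int = 1       # implicit tokens during evaluation
    max_answer_tokens: int = 30         # cap on generated answer tokens at inference
    independent_rounds: int = 3         # rounds averaged in Table 1


# (train size, test size) as listed in Table 3 of the paper.
DATASET_SPLITS = {
    "GSM8K": (7500, 1000),
    "SVAMP": (700, 300),
    "MultiArith": (420, 180),
    "CommonsenseQA": (9741, 1140),
    "Coin Flip": (20000, 3330),
}


def held_out_fraction(name: str) -> float:
    """Return the share of examples reserved for testing in a dataset."""
    train, test = DATASET_SPLITS[name]
    return test / (train + test)


if __name__ == "__main__":
    print(ReportedSetup())
    for name in DATASET_SPLITS:
        print(f"{name}: {held_out_fraction(name):.1%} held out for testing")
```

Comparing these values with the repository's configuration is one quick way to confirm that a local run matches the setup the paper reports.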