Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Rethinking Chain-of-Thought from the Perspective of Self-Training
Authors: Zongqian Wu, Baoduo Xu, Ruochen Cui, Mengmeng Zhan, Xiaofeng Zhu, Lei Feng
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that the proposed method achieves significant advantages in both performance and computational efficiency. Our code is available at: https://github.com/zongqianwu/ST-COT. 5. Experiments We evaluate our CoT framework on 10 reasoning datasets |
| Researcher Affiliation | Academia | 1School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China 2School of Computer Science and Technology, Hainan University, Haikou, China 3School of Computer Science and Engineering, Southeast University, Nanjing, China. Correspondence to: Xiaofeng Zhu <EMAIL>, Lei Feng <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Self-Training. Require: training dataset S = {x_i^(t)}, 1 ≤ i ≤ B, 0 ≤ t ≤ T−1; step size η; temperature σ > 0; initial pseudo-labeler β_init. 1: β_0 ← β_init / ‖β_init‖ 2: for t = 0, 1, …, T−1 do 3: generate pseudo-labels ŷ_i^(t) = sgn(β_t^⊤ x_i^(t)) for the batch {x_i^(t)}, 1 ≤ i ≤ B 4: β_{t+1} ← β_t − (η/B) Σ_{i=1}^B ∇ℓ((1/σ) ŷ_i^(t) β_t^⊤ x_i^(t)) 5: β_{t+1} ← β_{t+1} / ‖β_{t+1}‖ 6: end for 7: return β_{T−1} |
| Open Source Code | Yes | Our code is available at: https://github.com/zongqianwu/ST-COT. |
| Open Datasets | Yes | We evaluate our CoT framework on 10 reasoning datasets, including six arithmetic datasets (i.e., MultiArith (Roy & Roth, 2016), GSM8K (Cobbe et al., 2021), SingleEq (Koncel-Kedziorski et al., 2015), AddSub (Hosseini et al., 2014), AQuA (Ling et al., 2017), and SVAMP (Patel et al., 2021)), two commonsense reasoning datasets (i.e., StrategyQA (Geva et al., 2021) and CommonsenseQA (Talmor et al., 2018)), and two symbolic reasoning datasets (i.e., Last Letter and Coin Flip (Wei et al., 2022)). |
| Dataset Splits | Yes | We followed literature (Kojima et al., 2022) to construct zero-shot reasoning tasks across all 10 datasets, and performed few-shot reasoning tasks on the MultiArith and GSM8K datasets. The results of these experiments are presented in Table 1 and Table 2, respectively. Table 4. Detailed description of the datasets used in our experiments, highlighting their diversity and structure. (1) The Answer Format column indicates the type of responses expected for each dataset: N represents a numerical answer, M corresponds to selecting one option from multiple choices, Y indicates a binary answer (Yes or No), and F stands for free-form answers. (2) The Avg # words column represents the average number of words in the question texts, providing an estimate of their complexity. Data split (filename): SingleEq: questions.json; AddSub: AddSub.json; MultiArith: MultiArith.json; GSM8K: test.jsonl; AQUA: test.jsonl; SVAMP: SVAMP.json; CommonsenseQA: dev_rand_split.jsonl; StrategyQA: task.json; Last Letters: -; Coin Flip: - |
| Hardware Specification | No | No specific hardware details are provided in the paper. The text only mentions "GPT3.5-turbo-0125 as the foundation model" but not the hardware it was run on for experiments. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided. The paper mentions "GPT3.5-turbo-0125 as the foundation model" which is a specific model version, but not general software dependencies like Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | If e_i ≤ δ, the prediction A_i, derived from reasoning process R_i, is accepted as the final output. When the uncertainty is high and further iterations are required, the issue of high similarity between consecutive iterations may arise. ...the Jaccard index is introduced to quantify this diversity. If insufficient diversity is detected, the reasoning process is resampled until predefined conditions are met. ...the maximum iteration count T is reached. ...the number of self-consistency (SC) samples is fixed at 3 for all cases. ...We investigate the sensitivity of our method to these hyper-parameters on the AQuA dataset... First, we fixed N = 3 and varied T across the range {1, 2, …, 5}. ...Next, we set T = 3 and varied N over {1, 2, …, 7}. |
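The Algorithm 1 excerpt quoted in the Pseudocode row can be sketched in NumPy. This is a minimal illustration, not the authors' code: the batch layout, the logistic surrogate loss used in the test, and the function name `self_training` are assumptions; only the update rule (pseudo-label with sgn(β_t^⊤ x), take a gradient step on ℓ(ŷ β^⊤ x / σ), renormalize β) follows the quoted pseudocode.

```python
import numpy as np

def self_training(batches, eta, sigma, beta_init, grad_loss):
    """Sketch of the quoted Algorithm 1 (self-training a linear pseudo-labeler).

    batches   : list of T arrays, each of shape (B, d) -- the x_i^{(t)}
    eta       : step size
    sigma     : temperature (> 0)
    beta_init : initial pseudo-labeler, shape (d,)
    grad_loss : derivative l'(z) of the surrogate loss (assumed interface)
    """
    beta = beta_init / np.linalg.norm(beta_init)        # beta_0 = beta_init / ||beta_init||
    for X in batches:                                   # t = 0, 1, ..., T-1
        y_hat = np.sign(X @ beta)                       # pseudo-labels sgn(beta_t^T x_i)
        margins = y_hat * (X @ beta) / sigma            # arguments of the loss
        # batch-averaged gradient of l(y_hat * beta^T x / sigma) w.r.t. beta
        grad = (grad_loss(margins) * y_hat) @ X / (sigma * len(X))
        beta = beta - eta * grad                        # gradient step
        beta = beta / np.linalg.norm(beta)              # renormalize
    return beta                                         # final normalized pseudo-labeler
```

With the logistic loss ℓ(z) = log(1 + e^{-z}), the required derivative is ℓ'(z) = -1/(1 + e^z), which can be passed in as `grad_loss`.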
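The Experiment Setup row describes resampling a reasoning chain when the Jaccard index signals insufficient diversity between consecutive iterations. A small sketch of that check, under assumptions not stated in the excerpt: word-set Jaccard over the chain text, a similarity threshold `tau`, and a retry cap `max_tries` are all illustrative choices, not the paper's settings.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index between the word sets of two reasoning chains."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0  # two empty chains are identical by convention
    return len(sa & sb) / len(sa | sb)

def sample_diverse(sample_fn, prev: str, tau: float = 0.8, max_tries: int = 5) -> str:
    """Resample until the new chain is sufficiently different from `prev`.

    sample_fn : zero-argument callable producing a new reasoning chain
    tau       : similarity threshold (hypothetical value)
    max_tries : retry cap (hypothetical value); last candidate is kept if exceeded
    """
    cand = prev
    for _ in range(max_tries):
        cand = sample_fn()
        if jaccard(prev, cand) <= tau:  # diverse enough: accept
            return cand
    return cand
```

For example, `jaccard("a b c", "b c d")` is 2/4 = 0.5, so with `tau=0.3` that candidate would be rejected and resampled.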