Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Rethinking Chain-of-Thought from the Perspective of Self-Training
Authors: Zongqian Wu, Baoduo Xu, Ruochen Cui, Mengmeng Zhan, Xiaofeng Zhu, Lei Feng
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that the proposed method achieves significant advantages in both performance and computational efficiency. Our code is available at: https://github.com/zongqianwu/ST-COT. 5. Experiments We evaluate our CoT framework on 10 reasoning datasets |
| Researcher Affiliation | Academia | 1School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China 2School of Computer Science and Technology, Hainan University, Haikou, China 3School of Computer Science and Engineering, Southeast University, Nanjing, China. Correspondence to: Xiaofeng Zhu <EMAIL>, Lei Feng <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Self-Training. Require: training dataset S = {x_i^(t)}, 1 ≤ i ≤ B, 0 ≤ t ≤ T−1; step size η; temperature σ > 0; initial pseudo-labeler β_init. 1: β_0 ← β_init / ‖β_init‖ 2: for t = 0, 1, …, T−1 do 3: generate pseudo-labels ŷ_i^(t) = sgn(β_t^⊤ x_i^(t)) for the batch {x_i^(t)}, 1 ≤ i ≤ B 4: β_{t+1} ← β_t − (η/B) Σ_{i=1}^B ∇ℓ((1/σ) ŷ_i^(t) β_t^⊤ x_i^(t)) 5: β_{t+1} ← β_{t+1} / ‖β_{t+1}‖ 6: end for 7: return β_{T−1} |
| Open Source Code | Yes | Our code is available at: https://github.com/zongqianwu/ST-COT. |
| Open Datasets | Yes | We evaluate our CoT framework on 10 reasoning datasets, including six arithmetic datasets (i.e., MultiArith (Roy & Roth, 2016), GSM8K (Cobbe et al., 2021), SingleEq (Koncel-Kedziorski et al., 2015), AddSub (Hosseini et al., 2014), AQuA (Ling et al., 2017), and SVAMP (Patel et al., 2021)), two commonsense reasoning datasets (i.e., StrategyQA (Geva et al., 2021) and CommonsenseQA (Talmor et al., 2018)), and two symbolic reasoning datasets (i.e., Last Letter and Coin Flip (Wei et al., 2022)). |
| Dataset Splits | Yes | We followed literature (Kojima et al., 2022) to construct zero-shot reasoning tasks across all 10 datasets, and performed few-shot reasoning tasks on the MultiArith and GSM8K datasets. The results of these experiments are presented in Table 1 and Table 2, respectively. Table 4. Detailed description of the datasets used in our experiments, highlighting their diversity and structure. (1) The Answer Format column indicates the type of responses expected for each dataset: N represents a numerical answer, M corresponds to selecting one option from multiple choices, Y indicates a binary answer (Yes or No), and F stands for free-form answers. (2) The Avg # words column represents the average number of words in the question texts, providing an estimate of their complexity. Data split (filename): SingleEq: questions.json; AddSub: AddSub.json; MultiArith: MultiArith.json; GSM8K: test.jsonl; AQUA: test.jsonl; SVAMP: SVAMP.json; CommonsenseQA: dev_rand_split.jsonl; StrategyQA: task.json; Last Letters: -; Coin Flip: - |
| Hardware Specification | No | No specific hardware details are provided in the paper. The text only mentions "GPT3.5-turbo-0125 as the foundation model" but not the hardware it was run on for experiments. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided. The paper mentions "GPT3.5-turbo-0125 as the foundation model" which is a specific model version, but not general software dependencies like Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | If e_i ≤ δ, the prediction A_i, derived from reasoning process R_i, is accepted as the final output. When the uncertainty is high and further iterations are required, the issue of high similarity between consecutive iterations may arise. ...the Jaccard index is introduced to quantify this diversity. If insufficient diversity is detected, the reasoning process is resampled until predefined conditions are met. ...the maximum iteration count T is reached. ...the number of self-consistency (SC) samples is fixed at 3 for all cases. ...We investigate the sensitivity of our method to these hyper-parameters on the AQuA dataset... First, we fixed N = 3 and varied T across the range {1, 2, …, 5}. ...Next, we set T = 3 and varied N over {1, 2, …, 7}. |
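The Algorithm 1 excerpt quoted in the Pseudocode row can be sketched in NumPy. This is a minimal illustration, not the authors' code: the batch layout, the logistic surrogate loss used in the test, and the function name `self_training` are assumptions; only the update rule (pseudo-label with sgn(β_t^⊤ x), take a gradient step on ℓ(ŷ β^⊤ x / σ), renormalize β) follows the quoted pseudocode.

```python
import numpy as np

def self_training(batches, eta, sigma, beta_init, grad_loss):
    """Sketch of the quoted Algorithm 1 (self-training a linear pseudo-labeler).

    batches   : list of T arrays, each of shape (B, d) -- the x_i^{(t)}
    eta       : step size
    sigma     : temperature (> 0)
    beta_init : initial pseudo-labeler, shape (d,)
    grad_loss : derivative l'(z) of the surrogate loss (assumed interface)
    """
    beta = beta_init / np.linalg.norm(beta_init)        # beta_0 = beta_init / ||beta_init||
    for X in batches:                                   # t = 0, 1, ..., T-1
        y_hat = np.sign(X @ beta)                       # pseudo-labels sgn(beta_t^T x_i)
        margins = y_hat * (X @ beta) / sigma            # arguments of the loss
        # batch-averaged gradient of l(y_hat * beta^T x / sigma) w.r.t. beta
        grad = (grad_loss(margins) * y_hat) @ X / (sigma * len(X))
        beta = beta - eta * grad                        # gradient step
        beta = beta / np.linalg.norm(beta)              # renormalize
    return beta                                         # final normalized pseudo-labeler
```

With the logistic loss ℓ(z) = log(1 + e^{-z}), the required derivative is ℓ'(z) = -1/(1 + e^z), which can be passed in as `grad_loss`.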
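The Experiment Setup row describes resampling a reasoning chain when the Jaccard index signals insufficient diversity between consecutive iterations. A small sketch of that check, under assumptions not stated in the excerpt: word-set Jaccard over the chain text, a similarity threshold `tau`, and a retry cap `max_tries` are all illustrative choices, not the paper's settings.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index between the word sets of two reasoning chains."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0  # two empty chains are identical by convention
    return len(sa & sb) / len(sa | sb)

def sample_diverse(sample_fn, prev: str, tau: float = 0.8, max_tries: int = 5) -> str:
    """Resample until the new chain is sufficiently different from `prev`.

    sample_fn : zero-argument callable producing a new reasoning chain
    tau       : similarity threshold (hypothetical value)
    max_tries : retry cap (hypothetical value); last candidate is kept if exceeded
    """
    cand = prev
    for _ in range(max_tries):
        cand = sample_fn()
        if jaccard(prev, cand) <= tau:  # diverse enough: accept
            return cand
    return cand
```

For example, `jaccard("a b c", "b c d")` is 2/4 = 0.5, so with `tau=0.3` that candidate would be rejected and resampled.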