Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Self-Verifying Reflection Helps Transformers with CoT Reasoning

Authors: Zhongwei Yu, Wannian Xia, Xue Yan, Bo Xu, Haifeng Zhang, Yali Du, Jun Wang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimentally, we show that tiny transformers, with only a few million parameters, benefit from self-verification in both training and reflective execution, reaching remarkable LLM-level performance in integer multiplication and Sudoku.
Researcher Affiliation	Academia	(1) The Hong Kong University of Science and Technology (Guangzhou); (2) Institute of Automation, Chinese Academy of Sciences; (3) King's College London; (4) University College London
Pseudocode	Yes	Specifically, we analyze two basic variants of reflective reasoning in this paper: the reflective MTP and the reflective trace-back search, as described below (see pseudo-code in Appendix D.1).
Open Source Code	Yes	The source code is available at https://github.com/zwyu-ai/self-verifying-reflection-reasoning.
Open Datasets	Yes	We test tiny transformers in two reasoning tasks: The integer multiplication task [7] (Mult for short) computes the product of two integers x and y; the Sudoku task [3] fills numbers into blank positions of a 9 × 9 matrix, such that each row, column, or 3 × 3 block is a permutation of {1, ..., 9}.
Dataset Splits	Yes	For both tasks, we divide queries into 3 levels of difficulties: The in-distribution (ID) Easy, ID Hard, and out-of-distribution (OOD) Hard. The models are trained on ID-Easy and ID-Hard problems, while tested additionally on OOD-Hard cases. ... Specifically, we have 1 ≤ d ≤ 5 or 9 ≤ b < 36 for ID Easy, 6 ≤ d ≤ 8 or 36 ≤ b < 54 for ID Hard, and 9 ≤ d ≤ 10 or 54 ≤ b < 63 for OOD Hard.
Hardware Specification	Yes	To run multiple experiments simultaneously, we utilize cloud servers with a total of 5 GPUs (one NVIDIA RTX-3090 GPU and four NVIDIA A10 GPUs).
Software Dependencies	Yes	Our implementation derives the model architectures, pretraining, and SFT from LitGPT [1] (version 0.4.12) under Apache License 2.0.
Experiment Setup	Yes	Table 3: The main hyper-parameters used in this work. Task: Mult, Sudoku; Model size: 1M, 4M, 16M (Mult), 1M, 4M, 16M (Sudoku); Training CoT examples (N_CoT): 32K, 36K; Total pretraining tokens (N_pre_tok): 1B; Pretraining batch size (B_pre): 128; Pretraining learning rate (η_pre): 0.001, 0.00006; SFT batch size (B_SFT): 128; SFT learning rate (η_SFT): 0.001, 0.00006; Non-reflective SFT epochs (E_SFT): 5; Reflective sampling temperature, solving (τ_refl:s): 0.75; Reflective sampling temperature, proposing (τ_refl:p): 1, 1.25, 1.5, 1, 1.25, 1.5; Reflective SFT epochs (E_RSFT): 3; PPO replay buffer size (N_PPO:buf): 512; GRPO replay buffer size (N_GRPO:buf): 1024; RL sampling interval (E_RL:int): 4; RL sampling temperature, planning (τ_RL:π): 1.25, 1, 1.25, 1.25; RL sampling temperature, feedback (τ_RL:πf): 1; RL clipping factor (ε): 0.1; RL KL-divergence factor (β): 0.1; GRPO group size (G): 8; RL total epochs (E_RL): 512; RL learning rate (η_RL): 0.00005, 0.00001; PPO warm-up epochs (E_PPO:warmup): 64; Testing first-attempt temperature (τ_π:first): 0, 1; Testing revision temperature (τ_π:rev): 1; Testing verification temperature (τ_πf): 0; Testing non-reflective steps (T): 32; Testing reflective steps (T): 64; RTBS width (m): 4
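
The split ranges quoted in the Dataset Splits row and the Sudoku constraint quoted in the Open Datasets row can be sketched in Python as below. This is an illustrative sketch only and is not taken from the authors' repository. All names (D_LEVELS, B_LEVELS, level_for, in_training_set, is_valid_sudoku_solution) are hypothetical. The quoted excerpt elides the definitions of d and b, so the sketch treats them as plain integers and makes no claim about what they measure.

```python
from typing import Dict, List, Optional

# Ranges copied from the quoted split description.
# Assumption: d and b are integers; their meaning is not defined in the excerpt.
D_LEVELS: Dict[str, range] = {
    "ID Easy": range(1, 6),     # 1 <= d <= 5
    "ID Hard": range(6, 9),     # 6 <= d <= 8
    "OOD Hard": range(9, 11),   # 9 <= d <= 10
}
B_LEVELS: Dict[str, range] = {
    "ID Easy": range(9, 36),    # 9 <= b < 36
    "ID Hard": range(36, 54),   # 36 <= b < 54
    "OOD Hard": range(54, 63),  # 54 <= b < 63
}


def level_for(value: int, levels: Dict[str, range]) -> Optional[str]:
    """Return the difficulty level whose range contains value, or None."""
    for name, rng in levels.items():
        if value in rng:
            return name
    return None


def in_training_set(level: Optional[str]) -> bool:
    """Per the quoted text, models are trained on ID Easy and ID Hard only."""
    return level in ("ID Easy", "ID Hard")


def is_valid_sudoku_solution(grid: List[List[int]]) -> bool:
    """Check that every row, column, and 3x3 block is a permutation of 1..9."""
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    target = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    blocks = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(group == target for group in rows + cols + blocks)
```

Under these assumptions, level_for(7, D_LEVELS) returns "ID Hard", and in_training_set(level_for(60, B_LEVELS)) returns False because that value falls in the OOD Hard range, which the quoted text reserves for testing.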