Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Chain of Execution Supervision Promotes General Reasoning in Large Language Models

Authors: Nuo Chen, Zehua Li, Keqin Bao, Junyang Lin, Dayiheng Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements.
Researcher Affiliation	Collaboration	Nuo Chen, Zehua Li, Keqin Bao, Junyang Lin, Dayiheng Liu; Qwen Team, Alibaba; Hong Kong University of Science and Technology (Guangzhou); University of Science and Technology of China; EMAIL
Pseudocode	Yes	Figure 2: A classical DFS algorithm example of CoE in TracePile. More cases are in Appendix C.
Open Source Code	No	The paper states 'all training is conducted using the LLaMA-Factory framework' but does not provide a specific link or explicit statement for the release of their own source code for the TracePile methodology.
Open Datasets	No	The paper introduces 'TracePile' as a new large-scale dataset constructed by the authors, describing its sources and composition (Section 2, Table 2). While it cites public datasets used to build TracePile, it does not provide concrete access information (e.g., URL, DOI) for the TracePile dataset itself.
Dataset Splits	Yes	The paper evaluates across 20 benchmarks spanning four major reasoning domains, including well-known datasets such as GSM8K, MATH, and MMLU-STEM. It also states: 'We aggregate these datasets through three public evaluation toolkits: OpenCompass [9], Qwen2.5-Math [50], and ZeroEval [28], ensuring consistency and reproducibility across experiments.'
Hardware Specification	Yes	For all training experiments including both continue-pretraining and instruction tuning we use 16 H800-80GB GPUs with a batch size of 512, a maximum sequence length of 8192 tokens, and 3 training epochs.
Software Dependencies	No	The paper mentions 'all training is conducted using the LLaMA-Factory framework' and that datasets are aggregated 'through three public evaluation toolkits: OpenCompass [9], Qwen2.5-Math [50], and ZeroEval [28],' but it does not specify version numbers for these software components or any other libraries.
Experiment Setup	Yes	For all training experiments including both continue-pretraining and instruction tuning we use 16 H800-80GB GPUs with a batch size of 512, a maximum sequence length of 8192 tokens, and 3 training epochs. The learning rate is set to 1e-5, and all training is conducted using the LLaMA-Factory framework.
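
The figures quoted in the Experiment Setup row leave one detail open that matters when attempting a reproduction: how the batch of 512 is divided across the 16 GPUs. The sketch below only does the arithmetic implied by the reported numbers. It assumes that 512 is the global batch size measured in sequences, which the quoted text does not state explicitly. The class and method names are hypothetical and do not come from the paper or from LLaMA-Factory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSetup:
    """Reported values from the Experiment Setup row (names are illustrative)."""
    num_gpus: int = 16
    global_batch_size: int = 512   # assumed to be a global count of sequences
    max_seq_len: int = 8192
    epochs: int = 3
    learning_rate: float = 1e-5

    def sequences_per_gpu(self) -> int:
        """Sequences each GPU must process per optimizer step."""
        if self.global_batch_size % self.num_gpus != 0:
            raise ValueError("global batch size must divide evenly across GPUs")
        return self.global_batch_size // self.num_gpus

    def max_tokens_per_step(self) -> int:
        """Upper bound on tokens per optimizer step (every sequence at max length)."""
        return self.global_batch_size * self.max_seq_len

    def micro_batch_options(self) -> list[tuple[int, int]]:
        """All (per-device batch, gradient-accumulation steps) pairs that
        multiply to the per-GPU share. The paper does not report which
        pair was used."""
        share = self.sequences_per_gpu()
        return [(b, share // b) for b in range(1, share + 1) if share % b == 0]


setup = TrainSetup()
print(setup.sequences_per_gpu())
print(setup.max_tokens_per_step())
print(setup.micro_batch_options())
```

Under this reading, each GPU handles 32 sequences per optimizer step, and any of the listed per-device batch and accumulation pairs would give the same effective batch size. Which pair fits in 80 GB of memory at a sequence length of 8192 depends on model size and memory optimizations that the quoted text does not specify.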