Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Chain of Execution Supervision Promotes General Reasoning in Large Language Models

Authors: Nuo Chen, Zehua Li, Keqin Bao, Junyang Lin, Dayiheng Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements.
Researcher Affiliation | Collaboration | Nuo Chen, Zehua Li, Keqin Bao, Junyang Lin, Dayiheng Liu; Qwen Team, Alibaba; Hong Kong University of Science and Technology (Guangzhou); University of Science and Technology of China
Pseudocode | Yes | Figure 2: A classical DFS algorithm example of CoE in TracePile. More cases are in Appendix C.
Open Source Code | No | The paper states 'all training is conducted using the LLaMA-Factory framework' but does not provide a specific link or explicit statement for the release of their own source code for the TracePile methodology.
Open Datasets | No | The paper introduces 'TracePile' as a new large-scale dataset constructed by the authors, describing its sources and composition (Section 2, Table 2). While it cites public datasets used to build TracePile, it does not provide concrete access information (e.g., URL, DOI) for the TracePile dataset itself.
Dataset Splits | Yes | The paper evaluates across 20 benchmarks spanning four major reasoning domains, including well-known datasets such as GSM8K, MATH, and MMLU-STEM. It also states: 'We aggregate these datasets through three public evaluation toolkits: OpenCompass [9], Qwen2.5-Math [50], and ZeroEval [28], ensuring consistency and reproducibility across experiments.'
Hardware Specification | Yes | For all training experiments including both continue-pretraining and instruction tuning we use 16 H800-80GB GPUs with a batch size of 512, a maximum sequence length of 8192 tokens, and 3 training epochs.
Software Dependencies | No | The paper mentions 'all training is conducted using the LLaMA-Factory framework' and that datasets are aggregated 'through three public evaluation toolkits: OpenCompass [9], Qwen2.5-Math [50], and ZeroEval [28],' but it does not specify version numbers for these software components or any other libraries.
Experiment Setup | Yes | For all training experiments including both continue-pretraining and instruction tuning we use 16 H800-80GB GPUs with a batch size of 512, a maximum sequence length of 8192 tokens, and 3 training epochs. The learning rate is set to 1e-5, and all training is conducted using the LLaMA-Factory framework.
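
For readers who want to sanity-check the reported setup, the sketch below records the hyperparameters quoted in the Experiment Setup row and derives a few quantities from them. It is an illustration only, not the authors' configuration: it assumes the "batch size of 512" is the global batch size and that training is purely data-parallel across the 16 GPUs, neither of which the quoted text states. The class name `TrainingSetup` and the `micro_batch_size` argument are hypothetical, since the per-device batch size is not reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hyperparameters quoted in the Experiment Setup row."""
    num_gpus: int = 16
    global_batch_size: int = 512   # ASSUMPTION: the reported 512 is the global batch
    max_seq_len: int = 8192
    num_epochs: int = 3
    learning_rate: float = 1e-5

    def sequences_per_gpu(self) -> int:
        """Sequences each GPU handles per optimizer step (pure data parallelism assumed)."""
        if self.global_batch_size % self.num_gpus != 0:
            raise ValueError("global batch size must be divisible by the GPU count")
        return self.global_batch_size // self.num_gpus

    def grad_accum_steps(self, micro_batch_size: int) -> int:
        """Accumulation steps needed for a given (hypothetical) per-device micro-batch."""
        per_gpu = self.sequences_per_gpu()
        if micro_batch_size <= 0 or per_gpu % micro_batch_size != 0:
            raise ValueError("micro-batch size must evenly divide the per-GPU share")
        return per_gpu // micro_batch_size

    def max_tokens_per_step(self) -> int:
        """Upper bound on tokens per optimizer step (every sequence at full length)."""
        return self.global_batch_size * self.max_seq_len


setup = TrainingSetup()
print(setup.sequences_per_gpu())                   # 32
print(setup.grad_accum_steps(micro_batch_size=4))  # 8, for an illustrative micro-batch of 4
print(setup.max_tokens_per_step())                 # 4194304
```

Under these assumptions each GPU processes 32 sequences per optimizer step, and one step covers at most about 4.2 million tokens. The actual split between per-device batch size and gradient accumulation would have to come from the authors' released configuration, which the report notes is not available.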