Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Chain of Execution Supervision Promotes General Reasoning in Large Language Models
Authors: Nuo Chen, Zehua Li, Keqin Bao, Junyang Lin, Dayiheng Liu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. |
| Researcher Affiliation | Collaboration | Nuo Chen, Zehua Li, Keqin Bao, Junyang Lin, Dayiheng Liu. Qwen Team, Alibaba; Hong Kong University of Science and Technology (Guangzhou); University of Science and Technology of China. |
| Pseudocode | Yes | Figure 2: A classical DFS algorithm example of CoE in TracePile. More cases are in Appendix C. |
| Open Source Code | No | The paper states 'all training is conducted using the LLaMA-Factory framework' but does not provide a specific link or explicit statement for the release of their own source code for the TracePile methodology. |
| Open Datasets | No | The paper introduces 'TracePile' as a new large-scale dataset constructed by the authors, describing its sources and composition (Section 2, Table 2). While it cites public datasets used to build TracePile, it does not provide concrete access information (e.g., URL, DOI) for the TracePile dataset itself. |
| Dataset Splits | Yes | The paper evaluates across 20 benchmarks spanning four major reasoning domains, including well-known datasets such as GSM8K, MATH, and MMLU-STEM. It also states: 'We aggregate these datasets through three public evaluation toolkits: OpenCompass [9], Qwen2.5-Math [50], and ZeroEval [28], ensuring consistency and reproducibility across experiments.' |
| Hardware Specification | Yes | For all training experiments, including both continue-pretraining and instruction tuning, we use 16 H800-80GB GPUs with a batch size of 512, a maximum sequence length of 8192 tokens, and 3 training epochs. |
| Software Dependencies | No | The paper mentions 'all training is conducted using the LLaMA-Factory framework' and that datasets are aggregated 'through three public evaluation toolkits: OpenCompass [9], Qwen2.5-Math [50], and ZeroEval [28],' but it does not specify version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | For all training experiments, including both continue-pretraining and instruction tuning, we use 16 H800-80GB GPUs with a batch size of 512, a maximum sequence length of 8192 tokens, and 3 training epochs. The learning rate is set to 1e-5, and all training is conducted using the LLaMA-Factory framework. |
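
As a rough sanity check on the reported experiment setup, the Python sketch below derives two per-step quantities from the figures quoted in the table. It assumes that the batch size of 512 is a global batch counted in sequences and split evenly across the 16 GPUs, which the summary does not state. The names `TrainSetup` and `derived_quantities` are illustrative only and are not part of the paper or of LLaMA-Factory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSetup:
    """Values quoted in the Experiment Setup row above."""
    num_gpus: int = 16
    global_batch_size: int = 512   # assumption: global batch, in sequences
    max_seq_len: int = 8192        # tokens
    epochs: int = 3
    learning_rate: float = 1e-5


def derived_quantities(setup: TrainSetup) -> dict:
    """Derive per-step figures under the even-split assumption."""
    if setup.global_batch_size % setup.num_gpus != 0:
        raise ValueError("global batch size must divide evenly across GPUs")
    return {
        # Sequences each GPU handles per optimizer step (micro-batch size
        # times gradient-accumulation steps, whatever that split may be).
        "sequences_per_gpu_per_step": setup.global_batch_size // setup.num_gpus,
        # Upper bound on tokens per optimizer step, reached only if every
        # sequence is filled to the maximum length.
        "max_tokens_per_step": setup.global_batch_size * setup.max_seq_len,
    }


if __name__ == "__main__":
    for name, value in derived_quantities(TrainSetup()).items():
        print(f"{name}: {value}")
```

Under these assumptions each GPU accounts for 32 sequences per optimizer step, and a step covers at most 4,194,304 tokens. How the 32 sequences are divided between micro-batch size and gradient accumulation is not given in the summary.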