Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities

Authors: Jiayi Kuang, Haojing Huang, Yinghui Li, Xinnian Liang, Zhikun Xu, Yangning Li, Xiaoyu Tan, Chao Qu, Meishan Zhang, Ying Shen, Philip S Yu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We propose corresponding training and evaluation datasets for each atomic capability unit, and conduct extensive experiments about how different atomic capabilities influence others, to explore the strategies to elicit the required specific atomic capability. Evaluation and experimental results on advanced models show many interesting discoveries and inspirations about the different performances of models on various atomic capabilities and the interactions between atomic capabilities.
Researcher Affiliation	Collaboration	Jiayi Kuang1, Haojing Huang2, Yinghui Li2, Xinnian Liang3, Zhikun Xu4, Yangning Li2, Xiaoyu Tan5, Chao Qu6, Meishan Zhang7, Ying Shen1,8, Philip S. Yu9; 1Sun Yat-sen University, 2Tsinghua University, 3ByteDance Inc., 4Arizona State University, 5Tencent Youtu Lab, 6Fudan University, 7Harbin Institute of Technology (Shenzhen), 8Pengcheng Laboratory, 9University of Illinois Chicago
Pseudocode	No	No explicit pseudocode or algorithm blocks were found in the paper. The methodology is described in natural language.
Open Source Code	Yes	We provide our data and code in the supplementary materials for review. We will further disclose the relevant data and code in GitHub and Huggingface to facilitate subsequent researchers to reproduce our results.
Open Datasets	Yes	For field capabilities, we collect data from current benchmarks such as MATH [66], GSM8K [67], Gaokao-Bench [70], OlympiadBench [68], AIME, MMLU [71], and DeepMath [72]... For conceptual understanding, we extract math definitions and axioms from NaturalProofs [73]... For backward reasoning, we use a counterexample-driven reasoning statement from CounterMath [74].
Dataset Splits	Yes	Finally, we randomly divide the training set and test set with a ratio of 3:1. For conceptual understanding, we extract math definitions and axioms from NaturalProofs [73] and generate fill-in-the-blank questions. For forward reasoning, we collect questions and proofs with formal language such as Lean Workbook, and filter the questions with definite answers. For backward reasoning, we use a counterexample-driven reasoning statement from CounterMath [74]. Given data scarcity, we maintain a 1:1 train-test split for logic atomic capability to ensure evaluation robustness. Table 7: Data statistics of FIELD atomic capabilities (Train/Test). Algebra: Level 1 3813/1277, Level 2 4517/1505; Geometry: Level 1 3351/1117, Level 2 3391/1331; Analysis: Level 1 3276/1092, Level 2 4077/1358; Topology: Level 1 3336/1112, Level 2 3176/1058. Table 8: Data statistics of LOGICAL atomic capabilities (Train/Test). Conceptual Understanding: Description 1225/1217, Definition 1683/1661; Forward Reasoning: Formal Language 1061/1032; Backward Reasoning: Counter-example 1225/1217.
Hardware Specification	Yes	Open-source evaluations are conducted on 4 L20 48GB GPUs, while proprietary models are accessed via official APIs. To examine interactions among atomic abilities, we fine-tune Qwen2.5-Math-Instruct-7B using supervised LoRA training [76] on 4 L20 48GB GPUs, with a learning rate of 1.0e-5.
Software Dependencies	No	The paper mentions 'supervised LoRA training [76]' as a technique used, but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). It cites a paper for LoRA, but not the software implementation.
Experiment Setup	Yes	To examine interactions among atomic abilities, we fine-tune Qwen2.5-Math-Instruct-7B using supervised LoRA training [76] on 4 L20 48GB GPUs, with a learning rate of 1.0e-5.
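
The Dataset Splits row quotes a random 3:1 train/test division for the field capabilities and a 1:1 division for the logic capabilities. As a rough illustration only, a seeded random split by ratio might look like the sketch below. The helper name `split_train_test`, its parameters, and the seed are assumptions made for this example and are not taken from the paper or its code; only the Algebra Level 1 counts come from the quoted Table 7.

```python
import random


def split_train_test(items, train_parts=3, test_parts=1, seed=0):
    """Shuffle items and split them by a train:test ratio.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    shuffled = list(items)
    # A dedicated, seeded generator keeps the split repeatable.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:n_train], shuffled[n_train:]


# Train/test counts quoted in Table 7 for Algebra Level 1.
reported_train, reported_test = 3813, 1277

# Stand-in item IDs; the real items would be the collected questions.
train, test = split_train_test(range(reported_train + reported_test))

print("sketch split:", len(train), len(test))
print("reported ratio:", round(reported_train / reported_test, 2))
```

The quoted counts are close to, but not exactly, what a strict 3:1 cut of the total would give, so this sketch should not be expected to reproduce the paper's splits item for item. Passing `train_parts=1, test_parts=1` gives the 1:1 variant described for the logic capabilities.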