Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties

Authors: Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our findings reveal that distilled reasoning models (e.g., DeepSeek-R1-Distill-Qwen-32B) exhibit significantly more recurrent cycles (about 5 per sample), substantially larger graph diameters, and pronounced small-world characteristics (about 6x) compared to their base counterparts. Notably, these structural advantages grow with task difficulty and model capacity, with cycle detection peaking at the 14B scale and exploration diameter maximized in the 32B variant, correlating positively with accuracy. Furthermore, we show that supervised fine-tuning on an improved dataset systematically expands reasoning graph diameters in tandem with performance gains, offering concrete guidelines for dataset design aimed at boosting reasoning capabilities.
Researcher Affiliation	Collaboration	1 The University of Tokyo, 2 Google DeepMind
Pseudocode	Yes	Simple implementations of each method are provided in Appendix C.
Open Source Code	Yes	Implementation available here: https://github.com/gouki510/Topology_of_Reasoning
Open Datasets	Yes	across multiple tasks (GSM8K, MATH500, AIME 2024). We employ the GSM8K [7], MATH500 [23], and AIME 2024 [11] datasets for constructing the reasoning graphs. Additional analyses on non-mathematical tasks, including StrategyQA [22] and Logical Deduction from BIG-Bench [54], are provided in Appendix E. Our experiments utilized two versions of the dataset: the original version (s1-v1.0; https://huggingface.co/datasets/simplescaling/s1K) and an updated version (s1-v1.1; https://huggingface.co/datasets/simplescaling/s1K-1.1), each consisting of 1000 training samples. To further investigate the relationship between data quality and reasoning-graph properties, we compared two supervised fine-tuning (SFT) datasets: LIMO [72] and s1 v1.0 [44].
Dataset Splits	Yes	Our experiments utilized two versions of the dataset: the original version (s1-v1.0) and an updated version (s1-v1.1), each consisting of 1000 training samples. We employ the GSM8K [7], MATH500 [23], and AIME 2024 [11] datasets for constructing the reasoning graphs.
Hardware Specification	Yes	The training was executed on a computing node equipped with 8 NVIDIA H200 GPUs for training and a single NVIDIA H200 GPU for inference.
Software Dependencies	No	The paper does not state versions for the operating system, programming language, or libraries (e.g., Python, PyTorch, CUDA). It only mentions model-related configurations such as 'Optimizer AdamW' and 'bf16 precision' in Table 3.
Experiment Setup	Yes	Table 3: Detailed training configuration for the SFT experiments. Base Model: Qwen2.5-32B-Instruct; Dataset: simplescaling/s1K or simplescaling/s1K-1.1; Number of Epochs: 5; Learning Rate: 1 × 10^-5; Learning Rate Scheduler: Cosine (minimum LR: 0); Batch Size: 8 (effective: 8 GPUs × micro-batch size 1); Gradient Accumulation Steps: 1; Weight Decay: 1 × 10^-4; Optimizer: AdamW (β1 = 0.9, β2 = 0.95); Warmup Ratio: 0.05; Precision: bf16; Gradient Checkpointing: Enabled; FSDP Configuration: Full Shard Data Parallel (auto-wrap); Block Size: 32768 tokens
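
The Experiment Setup row implies a step count and a learning-rate curve that are not spelled out. The sketch below works them out in plain Python from the quoted Table 3 values and the 1000-sample dataset size. It is a minimal illustration, not the authors' training code: the names SFT_CONFIG, effective_batch_size, total_steps, and learning_rate_at are hypothetical, and the linear shape of the warmup is an assumption, since the excerpt states only a warmup ratio and a cosine scheduler with a minimum of 0.

```python
import math

# Values transcribed from the Table 3 excerpt and the Dataset Splits row.
SFT_CONFIG = {
    "base_model": "Qwen2.5-32B-Instruct",
    "num_epochs": 5,
    "learning_rate": 1e-5,        # 1 x 10^-5, as read from the Learning Rate entry
    "min_learning_rate": 0.0,     # "Cosine (minimum LR: 0)"
    "num_gpus": 8,
    "micro_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "weight_decay": 1e-4,         # 1 x 10^-4, as read from the Weight Decay entry
    "adam_betas": (0.9, 0.95),
    "warmup_ratio": 0.05,
    "train_samples": 1000,        # size of s1K and s1K-1.1
}


def effective_batch_size(cfg):
    """Samples consumed per optimizer step across all GPUs."""
    return (cfg["num_gpus"]
            * cfg["micro_batch_size"]
            * cfg["gradient_accumulation_steps"])


def total_steps(cfg):
    """Optimizer steps over the whole run (a partial last batch counts as a step)."""
    steps_per_epoch = math.ceil(cfg["train_samples"] / effective_batch_size(cfg))
    return steps_per_epoch * cfg["num_epochs"]


def learning_rate_at(step, cfg):
    """Assumed schedule: linear warmup, then cosine decay to the minimum LR."""
    n = total_steps(cfg)
    warmup = int(cfg["warmup_ratio"] * n)
    peak = cfg["learning_rate"]
    floor = cfg["min_learning_rate"]
    if step < warmup:
        # Hypothetical warmup shape; the source gives only the ratio.
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, n - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size(SFT_CONFIG))
    print("total optimizer steps:", total_steps(SFT_CONFIG))
    for step in (0, 31, 312, 625):
        print(step, learning_rate_at(step, SFT_CONFIG))
```

Under these assumptions the run takes 125 optimizer steps per epoch (1000 samples / batch of 8) and 625 steps in total, with roughly the first 31 steps spent in warmup. A reproduction that uses a different warmup shape or drops incomplete batches would see slightly different numbers.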