Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

GRIP: A Graph-Based Reasoning Instruction Producer

Authors: Jiankang Wang, Jianjun Xu, Xiaorui Wang, Yuxin Wang, Mengting Xing, Shancheng Fang, Hongtao Xie

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	On mathematical reasoning benchmarks, models trained with GRIP-MATH demonstrate substantial improvements over their base models and significantly outperform previous data synthesis methods. Furthermore, GRIP-MATH also enhances scientific reasoning performance, highlighting the strong generalization ability of GRIP. Section 4: Experiments, Training Setup, Evaluation Datasets, Main Results, Results on Scientific Reasoning Benchmark, Ablation Studies about GRIP.
Researcher Affiliation	Collaboration	1 University of Science and Technology of China; 2 Meta Stone Technology, Beijing, China
Pseudocode	No	The paper describes its methodology in Section 3, titled 'Proposed Method', detailing four steps: Knowledge Base Construction, Key Concepts Relationship Graph Construction, Synthesis Based on Diverse Key Concept Combinations, and Multi-Model Evaluation. While these steps describe a process, they are presented in narrative text and diagrams (Figure 2), not as a structured pseudocode or algorithm block.
Open Source Code	No	We will also open-source the code and dataset. Due to company intellectual property limits, we can only provide part of the dataset during the review process, but all code and datasets will be open-sourced immediately after the review period ends.
Open Datasets	No	We use 7.5K question-solution pairs from the MATH training set as seed data and synthesize a new dataset, GRIP-MATH, which contains over 2.1 million math question-solution pairs. We use GRIP-MATH to train large language models (LLMs) with diverse architectures and parameter sizes... Due to company intellectual property limits, we can only provide part of the dataset in the supplementary materials during the review process, but all code and datasets will be open-sourced immediately after the review period ends.
Dataset Splits	No	We use 7.5K question-solution pairs from the MATH training set as seed data and synthesize a new dataset, GRIP-MATH, which contains over 2.1 million math question-solution pairs... The fine-tuning is performed using the LLaMAFactory [44] framework over 2 epochs... trained all of them exclusively on the GRIP-MATH dataset. The paper mentions external evaluation datasets with implicit splits, but does not provide specific training, validation, or test splits for the GRIP-MATH dataset itself.
Hardware Specification	Yes	The synthesis with GRIP was completed in 36 hours using 8 NVIDIA A100 GPUs and vLLM [18]... The cost of using one NVIDIA RTX A100 (80G) is $0.42 per hour.
Software Dependencies	No	The fine-tuning is performed using the LLaMAFactory [44] framework over 2 epochs... For expedited and efficient training, we leveraged DeepSpeed [25] ZeRO Stage 3 and Flash Attention 2 [9]. The paper mentions frameworks and tools used but does not provide specific version numbers for them.
Experiment Setup	Yes	The fine-tuning is performed using the LLaMAFactory [44] framework over 2 epochs, with a learning rate of 5e-6, a global batch size of 128, and a maximum sequence length of 4096. A cosine schedule with a 3% warm-up ratio is adopted to regulate the learning rate.
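
The learning-rate schedule quoted in the Experiment Setup row can be illustrated with a minimal sketch. Only the peak learning rate (5e-6) and the 3% warm-up ratio come from the quoted text. The linear warm-up starting from zero, the decay floor of zero, the function name lr_at_step, and the total step count used in the example are assumptions made for illustration. This is not the paper's code, and the training framework may differ in rounding and other details.

```python
import math

PEAK_LR = 5e-6        # learning rate quoted in the Experiment Setup row
WARMUP_RATIO = 0.03   # 3% warm-up ratio quoted in the Experiment Setup row


def lr_at_step(step, total_steps, peak_lr=PEAK_LR,
               warmup_ratio=WARMUP_RATIO, min_lr=0.0):
    """Learning rate at a given optimizer step (hypothetical helper).

    Assumes linear warm-up from 0 to peak_lr, then cosine decay to min_lr.
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warm-up phase.
        return peak_lr * step / warmup_steps
    # Cosine decay phase: progress runs from 0.0 to 1.0.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical step count; the real value depends on dataset size
    for s in (0, 15, 30, 515, 1000):
        print(s, lr_at_step(s, total))
```

With a hypothetical 1000 total steps, the warm-up covers the first 30 steps, the rate peaks at 5e-6 at step 30, and it decays toward zero by the final step.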