Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
GRIP: A Graph-Based Reasoning Instruction Producer
Authors: Jiankang Wang, Jianjun Xu, Xiaorui Wang, Yuxin Wang, Mengting Xing, Shancheng Fang, Hongtao Xie
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On mathematical reasoning benchmarks, models trained with GRIP-MATH demonstrate substantial improvements over their base models and significantly outperform previous data synthesis methods. Furthermore, GRIP-MATH also enhances scientific reasoning performance, highlighting the strong generalization ability of GRIP. Section 4: Experiments, Training Setup, Evaluation Datasets, Main Results, Results on Scientific Reasoning Benchmark, Ablation Studies about GRIP. |
| Researcher Affiliation | Collaboration | ¹University of Science and Technology of China; ²MetaStone Technology, Beijing, China. EMAIL |
| Pseudocode | No | The paper describes its methodology in Section 3, titled 'Proposed Method', detailing four steps: Knowledge Base Construction, Key Concepts Relationship Graph Construction, Synthesis Based on Diverse Key Concept Combinations, and Multi-Model Evaluation. While these steps describe a process, they are presented in narrative text and diagrams (Figure 2), not as a structured pseudocode or algorithm block. |
| Open Source Code | No | We will also open-source the code and dataset. Due to company intellectual property limits, we can only provide part of the dataset during the review process, but all code and datasets will be open-sourced immediately after the review period ends. |
| Open Datasets | No | We use 7.5K question-solution pairs from the MATH training set as seed data and synthesize a new dataset, GRIP-MATH, which contains over 2.1 million math question-solution pairs. We use GRIP-MATH to train large language models (LLMs) with diverse architectures and parameter sizes... Due to company intellectual property limits, we can only provide part of the dataset in the supplementary materials during the review process, but all code and datasets will be open-sourced immediately after the review period ends. |
| Dataset Splits | No | We use 7.5K question-solution pairs from the MATH training set as seed data and synthesize a new dataset, GRIP-MATH, which contains over 2.1 million math question-solution pairs... The fine-tuning is performed using the LLaMAFactory [44] framework over 2 epochs... trained all of them exclusively on the GRIP-MATH dataset. The paper mentions external evaluation datasets with implicit splits, but does not provide specific training, validation, or test splits for the GRIP-MATH dataset itself. |
| Hardware Specification | Yes | The synthesis with GRIP was completed in 36 hours using 8 NVIDIA A100 GPUs and vLLM [18]... The cost of using one NVIDIA RTX A100 (80G) is $0.42 per hour. |
| Software Dependencies | No | The fine-tuning is performed using the LLaMAFactory [44] framework over 2 epochs... For expedited and efficient training, we leveraged DeepSpeed [25] ZeRO Stage 3 and FlashAttention 2 [9]. The paper mentions the frameworks and tools used but does not provide specific version numbers for them. |
| Experiment Setup | Yes | The fine-tuning is performed using the LLaMAFactory [44] framework over 2 epochs, with a learning rate of 5e-6, a global batch size of 128, and a maximum sequence length of 4096. A cosine schedule with a 3% warm-up ratio is adopted to regulate the learning rate. |
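
The learning-rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: only the peak rate (5e-6) and the 3% warm-up ratio come from the table. The linear shape of the warm-up, the decay floor of zero, the helper name `lr_at`, and the step count used in the example are all assumptions made here for illustration.

```python
import math

PEAK_LR = 5e-6        # learning rate reported in the table
WARMUP_RATIO = 0.03   # 3% warm-up ratio reported in the table


def lr_at(step: int, total_steps: int,
          peak_lr: float = PEAK_LR,
          warmup_ratio: float = WARMUP_RATIO,
          min_lr: float = 0.0) -> float:
    """Learning rate at `step` under warm-up followed by cosine decay.

    Assumptions (not stated in the source): the warm-up is linear from 0,
    and the cosine decays to `min_lr` (0 by default) at `total_steps`.
    """
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 15, 30, 515, 1000):
        print(s, lr_at(s, total))
```

With a hypothetical 1,000 optimizer steps, the rate would ramp up over the first 30 steps, reach 5e-6, and then follow a half cosine down to the floor. The real step count depends on the dataset size, the 2 epochs, and the global batch size of 128, and is not given in the table.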