Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

SATURN: SAT-based Reinforcement Learning to Unleash LLMs Reasoning

Authors: Huanyu Liu, Ge Li, Jia Li, Hao Zhu, Kechi Zhang, Yihong Dong

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We apply SATURN to DeepSeek-R1-Distill-Qwen and obtain SATURN-1.5B and SATURN-7B. We achieve several notable results: On SAT problems, SATURN-1.5B and SATURN-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. On math and programming tasks, SATURN-1.5B and SATURN-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). We release the source code, data, and models to support future research at https://github.com/gtxygyzb/Saturn-code.
Researcher Affiliation	Academia	Huanyu Liu Peking University EMAIL; Ge Li Peking University EMAIL; Jia Li Tsinghua University jia_li@mail.tsinghua.edu.cn; Hao Zhu Peking University EMAIL; Kechi Zhang Peking University EMAIL; Yihong Dong Peking University EMAIL
Pseudocode	Yes	Appendix A: Pseudocode of SATURN Algorithm; Algorithm 1 SATURN Learning_Loop(n, k, l, πθ)
Open Source Code	Yes	We release the source code, data, and models to support future research at https://github.com/gtxygyzb/Saturn-code.
Open Datasets	Yes	We introduce the SATURN-2.6k dataset, consisting of 1,500 training instances, 160 test instances at the same difficulty as the training set, and 1,000 test instances from 10 harder unseen difficulty levels. We release SAT construction scripts alongside the dataset, which enable the creation of virtually unlimited SAT instances. For math and programming tasks, following DeepSeek-AI [11], we use AIME 24/25 [2], AMC 22/23 [1], MATH-500 [19], GPQA Diamond [35], and LiveCodeBench v4_v5 subset [22].
Dataset Splits	Yes	We introduce the SATURN-2.6k dataset, consisting of 1,500 training instances, 160 test instances at the same difficulty as the training set, and 1,000 test instances from 10 harder unseen difficulty levels, with 100 instances per level. Table 6: SATURN Hyperparameters: Training set size per step (Train_size) 250, Validation set size per step (Val_size) 40.
Hardware Specification	Yes	All experiments are conducted on NVIDIA 8 A100 (40GB) GPUs. We conduct all experiments on 8 NVIDIA A100 40GB GPUs.
Software Dependencies	No	We use the OpenRLHF framework for GRPO training. We use the Hugging Face lighteval library for math and programming evaluations. These are specific tools but no version numbers are provided.
Experiment Setup	Yes	For SATURN-1.5B and SATURN-7B, we set the initial SAT instance parameters (n, k, l) to (3, 5, 5) and (3, 5, 13), respectively. In Curriculum Estimation Loop, the ϵ threshold is set to 0.5 for the 1.5B model and 0.75 for the 7B model. In LLMs Training Loop, we evaluate the pass@k with a step size of 250 training samples. The total number of curriculum iterations is set to 2. Detailed hyperparameters are provided in Appendix A. Table 6: SATURN Hyperparameters. Table 8: OpenRLHF Training Hyperparameters. Table 9: Evaluation Hyperparameters for Hugging Face-Open-R1.
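
Note on pass@k: The table quotes pass@3 and pass@k results, but this page does not say how the paper computes them. As a rough illustration only, the sketch below assumes the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The function name pass_at_k and its arguments are hypothetical and are not taken from the SATURN code base.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for one problem (assumed setup, not from the paper)
    c: number of those samples that are correct
    k: number of attempts allowed
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k incorrect samples, every k-subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of k-subsets made up only of incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 4 samples, 2 correct, 2 attempts allowed.
    print(pass_at_k(n=4, c=2, k=2))
```

A benchmark-level score would then be the mean of this value over all problems; whether the paper averages in this way is likewise an assumption.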