Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning

Authors: Peixian Ma, Xialie Zhuang, Chengjin Xu, Xuhui Jiang, Ran Chen, Jian Guo

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In existing experiments, SQL-R1 achieves execution accuracy of 88.6% and 67.1% on the benchmark Spider and BIRD, respectively. 3 Experiments
Researcher Affiliation	Collaboration	(1) IDEA Research, International Digital Economy Academy; (2) The Hong Kong University of Science and Technology (Guangzhou); (3) University of Chinese Academy of Sciences; (4) Data Arc Tech Ltd.
Pseudocode	No	The paper describes the training process and algorithms but does not provide structured pseudocode or algorithm blocks. It shows prompt templates in Figures 11 and 12, but these are not pseudocode for the methodology.
Open Source Code	Yes	The code is available at https://github.com/IDEA-FinAI/SQL-R1.
Open Datasets	Yes	Currently, we utilize the SynSQL-2.5M [22] dataset as primary data source... We evaluated the proposed SQL-R1 and related NL2SQL models on two benchmarks, Spider [24] and BIRD [25].
Dataset Splits	Yes	SFT Dataset. ... we utilized a dataset comprising 200,000 samples drawn from the SynSQL-2.5M for the SFT training, whose sample size is uniform across different difficulty levels, with each level comprising 50,000 samples. ... RL Dataset. ... We randomly sampled 5K NL-SQL pairs from SynSQL-2.5M... Evaluation Benchmark. We evaluated the proposed SQL-R1 and related NL2SQL models on two benchmarks, Spider [24] and BIRD [25].
Hardware Specification	Yes	Environment. All experiments conducted in this study are performed on a server operating under the Ubuntu 20.04 Linux distribution. This server is equipped with Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60 GHz CPU, and is complemented by 512 GB of system memory. The environment for training open-source LLMs comprises a configuration of 8 GPUs, each possessing 80 GB of memory and delivering a performance capacity of 312 TFLOPS when utilizing BF16 precision.
Software Dependencies	No	Environment. All experiments conducted in this study are performed on a server operating under the Ubuntu 20.04 Linux distribution. This only specifies the operating system, not other key software dependencies with specific version numbers for frameworks or libraries.
Experiment Setup	Yes	For the SFT training, we set the learning rate as 5e-5; batch size as 1. For the RL training, we set the learning rate as 3e-7, rollout of actor model as 8; max response length as 2048. For inference, we set the count of SQL candidates as 8 and the temperature as 0.8.
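
The reported setup can be collected into a small configuration sketch. The Python below is illustrative only and is not taken from the SQL-R1 repository: the class names, field names, the `sample_uniform_by_level` helper, and the "difficulty" record key are all hypothetical, and only the numeric values come from the table above. It also works out the arithmetic implied by the Dataset Splits row: 200,000 SFT samples at 50,000 per difficulty level implies four levels.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical containers; only the values are taken from the table above.
@dataclass(frozen=True)
class SFTConfig:
    learning_rate: float = 5e-5
    batch_size: int = 1


@dataclass(frozen=True)
class RLConfig:
    learning_rate: float = 3e-7
    rollouts_per_prompt: int = 8       # "rollout of actor model as 8"
    max_response_length: int = 2048


@dataclass(frozen=True)
class InferenceConfig:
    num_sql_candidates: int = 8
    temperature: float = 0.8


def sample_uniform_by_level(records, per_level, seed=0):
    """Draw `per_level` records from every difficulty level.

    `records` is an iterable of dicts carrying a "difficulty" key
    (an assumed field name, not one confirmed by the source).
    """
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for record in records:
        by_level[record["difficulty"]].append(record)

    sample = []
    for level in sorted(by_level):
        pool = by_level[level]
        if len(pool) < per_level:
            raise ValueError(f"level {level!r} has only {len(pool)} records")
        # Sampling without replacement keeps each level equally represented.
        sample.extend(rng.sample(pool, per_level))
    return sample


# Arithmetic implied by the Dataset Splits row.
SFT_TOTAL = 200_000
SFT_PER_LEVEL = 50_000
NUM_LEVELS = SFT_TOTAL // SFT_PER_LEVEL
```

Calling `sample_uniform_by_level(records, SFT_PER_LEVEL)` on a corpus with four difficulty levels would yield a 200,000-sample set of the shape described, under the assumption that each level holds at least 50,000 records.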