Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

Authors: Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, Sercan Arik

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We present comprehensive evaluations on the efficacy of proposed methodologies of CHASE-SQL. Our innovative candidate generation approaches demonstrate superior performance compared to traditional generic CoT prompts, illustrating their capability in guiding LLMs through the decomposition of complex problems into manageable intermediate steps. Furthermore, the proposed selection agent significantly outperforms conventional consistency-based methods, contributing to the state-of-the-art results. Specifically, CHASE-SQL reaches an execution accuracy of 73.01% and 73.0% on the development set and test set of the challenging BIRD Text-to-SQL dataset which outperforms all of the published and undisclosed methods on this benchmark, by a large margin.
Researcher Affiliation	Collaboration	(1) Google Cloud, Sunnyvale, CA, USA; (2) Stanford University, Stanford, CA, USA
Pseudocode	Yes	Algorithm 1 Divide and Conquer Chain-of-Thought (CoT) Strategy for Text-to-SQL. Algorithm 2 Online Synthetic example generation strategy for Text-to-SQL. Algorithm 3 Picking the final SQL query from a pool of candidates. Algorithm 4 Query fixing method.
Open Source Code	No	The paper does not provide an explicit statement or link to the open-source code for the methodology described in this paper.
Open Datasets	Yes	We evaluate the performance of the proposed CHASE-SQL framework on two widely-recognized cross-domain datasets: BIRD (Li et al., 2024c) and Spider (Yu et al., 2018).
Dataset Splits	Yes	The Spider dataset is divided into non-overlapping training, development, and test sets similar to BIRD.
Hardware Specification	No	The paper mentions using Gemini and Claude models and training a Gemini 1.5 Flash model using Vertex AI tuning API, but does not provide specific hardware details such as GPU/CPU models or memory specifications.
Software Dependencies	Yes	Moreover, by leveraging entirely open-source models Mistral Large Model (AI, 2024) as the candidate generator and a fine-tuned Qwen-2.5-coder model (Team, 2024) as the selector our method achieved a state-of-the-art performance of 70.33 on the BIRD development set with open-source models.
Experiment Setup	Yes	The Gemini 1.5 Flash model is trained for 10 epochs using a LoRA adapter with a rank of 16 using Vertex AI tuning API.
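
The Research Type row above quotes results as execution accuracy. As a rough illustration of what that metric measures, the sketch below runs predicted and reference queries against a database and counts the pairs whose results agree. It is not the paper's or the BIRD benchmark's evaluation code: the function names, the toy `singer` table, and the choice to compare rows as an order-insensitive multiset are all assumptions made for this example, and the official evaluation scripts may compare results differently.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute a query and return its rows as a multiset, or None on error."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) SQL pairs that return the same rows.

    Assumption: rows are compared ignoring order; a failing prediction
    counts as incorrect.
    """
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold_rows = run_query(conn, gold_sql)
        predicted_rows = run_query(conn, predicted_sql)
        if gold_rows is not None and predicted_rows == gold_rows:
            correct += 1
    return correct / len(pairs)


# Hypothetical toy database, used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 45), (3, 'Cy', 52);
    """
)

pairs = [
    # Different SQL text, same result rows: counted as correct.
    ("SELECT name FROM singer WHERE age > 40",
     "SELECT name FROM singer WHERE age >= 41"),
    # Valid SQL, different rows: counted as incorrect.
    ("SELECT name FROM singer",
     "SELECT name FROM singer WHERE age > 40"),
    # Invalid column name: execution fails, counted as incorrect.
    ("SELECT nme FROM singer",
     "SELECT name FROM singer"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2%}")
```

The first pair shows why the metric compares execution results rather than query text: two differently written queries can be equally correct.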