DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction
Authors: Mohammadreza Pourreza, Davood Rafiei
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments with three LLMs show that this approach consistently improves their simple few-shot performance by roughly 10%, pushing the accuracy of LLMs towards SOTA or surpassing it. On the holdout test set of Spider, the SOTA, in terms of execution accuracy, was 79.9 and the new SOTA at the time of this writing using our approach is 85.3. Our approach with in-context learning beats many heavily fine-tuned models by at least 5%. Additionally, when evaluated on the BIRD benchmark, our approach achieved an execution accuracy of 55.9%, setting a new SOTA on its holdout test set. |
| Researcher Affiliation | Academia | Mohammadreza Pourreza Department of Computer Science University of Alberta Edmonton, CA pourreza@ualberta.ca Davood Rafiei Department of Computer Science University of Alberta Edmonton, CA drafiei@ualberta.ca |
| Pseudocode | No | The paper describes its methodology in detail but does not present any pseudocode or algorithm blocks. |
| Open Source Code | Yes | To replicate the reported results, visit our GitHub repository (https://github.com/MohammadrezaPourreza/Few-shot-NL2SQL-with-prompting) for access to the prompts, results, and the code. |
| Open Datasets | Yes | Our evaluation was conducted on two cross-domain challenging datasets, Spider and BIRD. Spider consists of 10,181 questions and 5,693 unique complex SQL queries across 200 databases, covering 138 domains, each containing multiple tables. The standard protocol for this dataset divides it into 8,659 training examples across 146 databases, 1,034 development examples across 20 databases, and a holdout of 2,147 test examples across 34 databases. |
| Dataset Splits | Yes | Our evaluation was conducted on two cross-domain challenging datasets, Spider and BIRD. Spider consists of 10,181 questions and 5,693 unique complex SQL queries across 200 databases, covering 138 domains, each containing multiple tables. The standard protocol for this dataset divides it into 8,659 training examples across 146 databases, 1,034 development examples across 20 databases, and a holdout of 2,147 test examples across 34 databases. |
| Hardware Specification | No | All models were accessed via the OpenAI API. The paper does not specify the underlying hardware used by OpenAI for running the models. |
| Software Dependencies | No | All models were accessed via the OpenAI API. No specific software dependencies with version numbers (e.g., Python, PyTorch, libraries) are mentioned beyond the general API access. |
| Experiment Setup | Yes | All models were accessed via the OpenAI API. Greedy decoding was used to generate the output by setting the temperature to zero. The max tokens was set to 350 for the self-correction module and 600 for all other modules. The stopping token sequence was set to `#;\n\n` for the self-correction module and `Q:` for all other modules. |
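The decoding settings reported in the Experiment Setup row can be sketched as keyword arguments for an OpenAI-style completion call. This is a minimal, hedged reconstruction: the function name `decoding_params` and the module labels are illustrative, not taken from the authors' repository, and the exact whitespace inside the self-correction stop string is an assumption based on the quoted text.

```python
def decoding_params(module: str) -> dict:
    """Return the decoding settings the paper reports for each DIN-SQL module.

    Hypothetical helper; parameter names follow the OpenAI completion API
    convention (temperature, max_tokens, stop).
    """
    if module == "self_correction":
        return {
            "temperature": 0,      # greedy decoding, as stated in the paper
            "max_tokens": 350,     # limit reported for the self-correction module
            "stop": ["#;\n\n"],    # stop sequence as quoted; exact whitespace assumed
        }
    # All other modules (schema linking, decomposition, generation)
    return {
        "temperature": 0,          # greedy decoding
        "max_tokens": 600,         # limit reported for all other modules
        "stop": ["Q:"],            # stop at the next question marker
    }

print(decoding_params("self_correction")["max_tokens"])  # 350
print(decoding_params("generation")["stop"])             # ['Q:']
```

In practice these dictionaries would be splatted into the API call, e.g. `client.completions.create(model=..., prompt=..., **decoding_params("generation"))`.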