DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction
Authors: Mohammadreza Pourreza, Davood Rafiei
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments with three LLMs show that this approach consistently improves their simple few-shot performance by roughly 10%, pushing the accuracy of LLMs towards SOTA or surpassing it. On the holdout test set of Spider, the SOTA, in terms of execution accuracy, was 79.9 and the new SOTA at the time of this writing using our approach is 85.3. Our approach with in-context learning beats many heavily fine-tuned models by at least 5%. Additionally, when evaluated on the BIRD benchmark, our approach achieved an execution accuracy of 55.9%, setting a new SOTA on its holdout test set. |
| Researcher Affiliation | Academia | Mohammadreza Pourreza Department of Computer Science University of Alberta Edmonton, CA pourreza@ualberta.ca Davood Rafiei Department of Computer Science University of Alberta Edmonton, CA drafiei@ualberta.ca |
| Pseudocode | No | The paper describes its methodology in detail but does not present any pseudocode or algorithm blocks. |
| Open Source Code | Yes | To replicate the reported results, visit our GitHub repository (https://github.com/MohammadrezaPourreza/Few-shot-NL2SQL-with-prompting) for access to the prompts, results, and the code. |
| Open Datasets | Yes | Our evaluation was conducted on two cross-domain challenging datasets, Spider and BIRD. Spider consists of 10,181 questions and 5,693 unique complex SQL queries across 200 databases, covering 138 domains, each containing multiple tables. The standard protocol for this dataset divides it into 8,659 training examples across 146 databases, 1,034 development examples across 20 databases, and a holdout of 2,147 test examples across 34 databases. |
| Dataset Splits | Yes | Our evaluation was conducted on two cross-domain challenging datasets, Spider and BIRD. Spider consists of 10,181 questions and 5,693 unique complex SQL queries across 200 databases, covering 138 domains, each containing multiple tables. The standard protocol for this dataset divides it into 8,659 training examples across 146 databases, 1,034 development examples across 20 databases, and a holdout of 2,147 test examples across 34 databases. |
| Hardware Specification | No | All models were accessed via the OpenAI API. The paper does not specify the underlying hardware used by OpenAI for running the models. |
| Software Dependencies | No | All models were accessed via the OpenAI API. No specific software dependencies with version numbers (e.g., Python, PyTorch, libraries) are mentioned beyond the general API access. |
| Experiment Setup | Yes | All models were accessed via the OpenAI API. Greedy decoding was used to generate the output by setting the temperature to zero. The max tokens was set to 350 for the self-correction module and 600 for all other modules. The stopping token sequence was set to `#;\n\n` for the self-correction module and `Q:` for all other modules. |
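The decoding settings reported in the Experiment Setup row can be sketched as keyword arguments for an OpenAI-style completion call. This is a minimal, hedged reconstruction: the function name `decoding_params` and the module labels are illustrative, not taken from the authors' repository, and the exact whitespace inside the self-correction stop string is an assumption based on the quoted text.

```python
def decoding_params(module: str) -> dict:
    """Return the decoding settings the paper reports for each DIN-SQL module.

    Hypothetical helper; parameter names follow the OpenAI completion API
    convention (temperature, max_tokens, stop).
    """
    if module == "self_correction":
        return {
            "temperature": 0,      # greedy decoding, as stated in the paper
            "max_tokens": 350,     # limit reported for the self-correction module
            "stop": ["#;\n\n"],    # stop sequence as quoted; exact whitespace assumed
        }
    # All other modules (schema linking, decomposition, generation)
    return {
        "temperature": 0,          # greedy decoding
        "max_tokens": 600,         # limit reported for all other modules
        "stop": ["Q:"],            # stop at the next question marker
    }

print(decoding_params("self_correction")["max_tokens"])  # 350
print(decoding_params("generation")["stop"])             # ['Q:']
```

In practice these dictionaries would be splatted into the API call, e.g. `client.completions.create(model=..., prompt=..., **decoding_params("generation"))`.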