Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness
Authors: Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu, Alexander Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien, Steve Ash, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, Bing Xiang
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a diagnostic study of the state-of-the-art models on the robustness set. Experimental results reveal that even the most robust model suffers from a 14.0% performance drop overall and a 50.7% performance drop on the most challenging perturbation. We also present a breakdown analysis regarding text-to-SQL model designs and provide insights for improving model robustness. |
| Researcher Affiliation | Collaboration | Shuaichen Chang1, Jun Wang2, Mingwen Dong2, Lin Pan2, Henghui Zhu2, Alexander Hanbo Li2, Wuwei Lan2, Sheng Zhang2, Jiarong Jiang2, Joseph Lilien2, Steve Ash2, William Wang2, Zhiguo Wang2, Vittorio Castelli2, Bing Xiang2, Patrick Ng2 — 1 Ohio State University, 2 AWS AI Labs |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | Our data and code are available at https://github.com/awslabs/diagnostic-robustness-text-to-sql. |
| Open Datasets | Yes | Dr.Spider is based on Spider, a cross-domain text-to-SQL benchmark. "We apply task-specific perturbations to create our benchmark based on the Spider development set, as the Spider test set is not public." |
| Dataset Splits | Yes | We evaluate multiple representative text-to-SQL models on Dr.Spider, which are trained on the Spider training set: ... We apply task-specific perturbations to create our benchmark based on the Spider development set, as the Spider test set is not public. Dr.Spider contains 3 DB perturbation test sets, 9 NLQ perturbation test sets, and 5 SQL perturbation test sets to simulate various task-specific phenomena. |
| Hardware Specification | Yes | The experiments were done on 8 Nvidia Tesla V100 with 32G memory for about 5 days. |
| Software Dependencies | No | "We implement the paraphrase generation with Huggingface (Wolf et al., 2019)." No specific version numbers for software dependencies are provided. |
| Experiment Setup | Yes | We choose the OPT (Zhang et al., 2022) with 66B parameters as the PLM model. ... For each question, we run the OPT model 4 times with the hyperparameters top_p in {0.9, 1.0} and temperature in {0.7, 1.0}, and 5 paraphrases are returned each time (Keskar et al., 2019). |
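The quoted setup implies a small sampling grid: one generation run per (top_p, temperature) pair, with 5 paraphrases returned per run, giving 20 candidate paraphrases per question. A minimal sketch of that grid follows; the function and variable names are illustrative, not taken from the paper's released code.

```python
from itertools import product

# Hyperparameter grid from the quoted setup: the OPT model is run once per
# (top_p, temperature) combination, returning 5 paraphrases each time.
TOP_P_VALUES = (0.9, 1.0)
TEMPERATURE_VALUES = (0.7, 1.0)
PARAPHRASES_PER_RUN = 5

def sampling_configs():
    """Enumerate the 4 sampling configurations used per question."""
    return [
        {"top_p": p, "temperature": t, "num_return_sequences": PARAPHRASES_PER_RUN}
        for p, t in product(TOP_P_VALUES, TEMPERATURE_VALUES)
    ]

configs = sampling_configs()
total_candidates = sum(c["num_return_sequences"] for c in configs)
print(len(configs), total_candidates)  # 4 runs, 20 candidate paraphrases per question
```

Each config dict maps directly onto the sampling arguments of a Huggingface `generate` call, which is presumably how the paper's paraphrase generation was driven.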