Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness

Authors: Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu, Alexander Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien, Steve Ash, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, Bing Xiang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a diagnostic study of the state-of-the-art models on the robustness set. Experimental results reveal that even the most robust model suffers from a 14.0% performance drop overall and a 50.7% performance drop on the most challenging perturbation. We also present a breakdown analysis regarding text-to-SQL model designs and provide insights for improving model robustness.
Researcher Affiliation | Collaboration | Shuaichen Chang1, Jun Wang2, Mingwen Dong2, Lin Pan2, Henghui Zhu2, Alexander Hanbo Li2, Wuwei Lan2, Sheng Zhang2, Jiarong Jiang2, Joseph Lilien2, Steve Ash2, William Wang2, Zhiguo Wang2, Vittorio Castelli2, Bing Xiang2, Patrick Ng2 — 1 Ohio State University, 2 AWS AI Labs
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | Our data and code are available at https://github.com/awslabs/diagnostic-robustness-text-to-sql.
Open Datasets | Yes | ... based on Spider, a cross-domain text-to-SQL benchmark ... We apply task-specific perturbations to create our benchmark based on the Spider development set, as the Spider test set is not public.
Dataset Splits | Yes | We evaluate multiple representative text-to-SQL models on Dr.Spider, which are trained on the Spider training set: ... We apply task-specific perturbations to create our benchmark based on the Spider development set, as the Spider test set is not public. Dr.Spider contains 3 DB perturbation test sets, 9 NLQ perturbation test sets, and 5 SQL perturbation test sets to simulate various task-specific phenomena.
Hardware Specification | Yes | The experiments were done on 8 Nvidia Tesla V100 with 32G memory for about 5 days.
Software Dependencies | No | We implement the paraphrase generation with Huggingface (Wolf et al., 2019). No specific version numbers for software dependencies were provided.
Experiment Setup | Yes | We choose the OPT (Zhang et al., 2022) with 66B parameters as the PLM model. ... For each question, we run the OPT model 4 times with the hyperparameters top_p in {0.9, 1.0} and temperature in {0.7, 1.0}, and 5 paraphrases are returned each time (Keskar et al., 2019).
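The sampling setup quoted above implies a small grid: each question is run once per (top_p, temperature) combination, yielding 4 runs of 5 paraphrases each. A minimal sketch of that grid, assuming the standard Huggingface `generate` sampling parameters (the commented-out driver, prompt handling, and model loading are illustrative assumptions, not details from the paper):

```python
from itertools import product

# Hyperparameter grid quoted in the paper's experiment setup:
# top_p in {0.9, 1.0}, temperature in {0.7, 1.0}, 5 paraphrases per run.
TOP_P = [0.9, 1.0]
TEMPERATURE = [0.7, 1.0]
NUM_RETURN_SEQUENCES = 5

def sampling_configs():
    """Yield the four sampling configurations used per question."""
    for top_p, temperature in product(TOP_P, TEMPERATURE):
        yield {
            "do_sample": True,
            "top_p": top_p,
            "temperature": temperature,
            "num_return_sequences": NUM_RETURN_SEQUENCES,
        }

# Hypothetical driver over a Huggingface causal LM (model/tokenizer/prompt
# loading omitted; only `generate`'s keyword arguments are real API):
# for cfg in sampling_configs():
#     outputs = model.generate(**inputs, max_new_tokens=64, **cfg)
```

Under this reading, the grid produces 4 configurations and 4 × 5 = 20 candidate paraphrases per question before any filtering.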