Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing
Authors: Longxu Dou, Yan Gao, Mingyang Pan, Dingzirui Wang, Wanxiang Che, Dechen Zhan, Jian-Guang Lou
AAAI 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results under three typical settings (zero-shot, monolingual and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages. Qualitative and quantitative analyses are conducted to understand the reason for the performance drop of each language. |
| Researcher Affiliation | Collaboration | Longxu Dou1, Yan Gao2, Mingyang Pan1, Dingzirui Wang1, Wanxiang Che1, Dechen Zhan1, Jian-Guang Lou2 1 Harbin Institute of Technology 2 Microsoft Research Asia |
| Pseudocode | No | The paper describes methods in prose and flowcharts (Figures 4 and 5) but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/microsoft/ContextualSP |
| Open Datasets | Yes | We build MULTISPIDER based on Spider (Yu et al. 2018), a large-scale cross-database text-to-SQL dataset in English. We also collect data from the CSpider (Min and Zhang 2019) and VSpider (Tuan Nguyen, Dao, and Nguyen 2020), which are also free and open text-to-SQL dataset. |
| Dataset Splits | Yes | Only 9691 questions and 5263 SQL queries over 166 databases (train-set and dev-set) are publicly available. |
| Hardware Specification | No | The paper does not explicitly provide details about the specific hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions specific models and frameworks (e.g., mBERT, XLM-Roberta-Large, mBART, RAT-SQL) with citations but does not provide specific version numbers for software dependencies or libraries used in their implementation. |
| Experiment Setup | Yes | Training with Augmented Data: During the training phase, we first adopt the augmented data to warm up the model three epochs to alleviate the noise in augmented data, then fine-tune the model with original high-quality training data. |
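
The two-stage schedule quoted under Experiment Setup (three warm-up epochs on augmented data, then fine-tuning on the original training data) could be sketched roughly as follows. This is a minimal, framework-free illustration and not the authors' implementation: the function name `two_stage_training`, the `train_one_epoch` callback, and the `finetune_epochs` value are hypothetical, since the quoted passage does not state how many fine-tuning epochs were used.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (natural-language question, SQL query)


def two_stage_training(
    augmented: Sequence[Example],
    original: Sequence[Example],
    train_one_epoch: Callable[[Sequence[Example]], None],
    warmup_epochs: int = 3,    # "three epochs" per the quoted setup
    finetune_epochs: int = 1,  # hypothetical: not stated in the quote
) -> List[str]:
    """Warm up on augmented data, then fine-tune on original data.

    Returns a log naming the data source used in each epoch.
    """
    log: List[str] = []
    # Stage 1: warm-up on the (noisier) augmented data.
    for _ in range(warmup_epochs):
        train_one_epoch(augmented)
        log.append("augmented")
    # Stage 2: fine-tune on the original high-quality training data.
    for _ in range(finetune_epochs):
        train_one_epoch(original)
        log.append("original")
    return log


# Usage with a stand-in training step that only records dataset sizes.
seen_sizes: List[int] = []
log = two_stage_training(
    augmented=[("q_aug_1", "SELECT 1"), ("q_aug_2", "SELECT 2")],
    original=[("q_orig", "SELECT 3")],
    train_one_epoch=lambda data: seen_sizes.append(len(data)),
    finetune_epochs=2,  # illustrative value only
)
print(log)
```

In a real setup, `train_one_epoch` would wrap the model's optimizer loop; the sketch only fixes the order in which the two data sources are consumed.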