DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
Authors: Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih, Daniel Fried, Sida Wang, Tao Yu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from Stack Overflow. Second, our automatic evaluation is highly specific (reliable): across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% of them are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original Stack Overflow source; consequently, models cannot answer them correctly by memorizing the solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. (A sketch of the multi-criteria check appears after this table.) |
| Researcher Affiliation | Collaboration | The University of Hong Kong, Peking University, Stanford University, UC Berkeley, University of Washington, Meta AI, Carnegie Mellon University. |
| Pseudocode | No | The paper includes code examples and diagrams illustrating processes, but it does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | We release our benchmark at https://ds1000-code-gen.github.io. |
| Open Datasets | Yes | We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. We release our benchmark at https://ds1000-code-gen.github.io. |
| Dataset Splits | Yes | We evaluate our multi-criteria automatic metric by checking whether it can reject incorrect solutions. We randomly sampled 10 problems from each library and sampled 40 predictions from Codex-002 for each problem (2,800 problem-code examples in total). |
| Hardware Specification | No | For DS-1000, evaluating generated code does not require special computational resources like GPUs. |
| Software Dependencies | Yes | We fixed the evaluation environment to include the latest versions of libraries that can be installed with Python 3.7.10 and present the detailed documentation in Appendix A.1. Table 7: The versions of software in DS-1000 |
| Experiment Setup | Yes | We generate 40 samples for each DS-1000 problem with temperature set to 0.2, top-p cutoff set to 0.95, and max generation length set to 1024. We set the stop sequence tokens to `</code>` and `# SOLUTION END`. (A sampling sketch using these settings follows the table.) |
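
The multi-criteria evaluation quoted above combines running a problem's test cases with surface-form checks on the generated program. The sketch below illustrates that idea only; the function names, the substring-based constraint check, and the toy NumPy problem are our own assumptions, not the benchmark's actual implementation.

```python
# A minimal sketch of a multi-criteria check in the spirit of DS-1000's
# automatic evaluation. All names below are illustrative, not the
# benchmark's actual API.

def passes_functional_tests(program: str, test_code: str) -> bool:
    """Run the predicted program and then the problem's test cases;
    any exception (including an AssertionError) counts as a failure."""
    namespace = {}
    try:
        exec(program, namespace)    # execute the predicted solution
        exec(test_code, namespace)  # execute the tests in the same namespace
        return True
    except Exception:
        return False

def passes_surface_form_constraints(program, required=(), banned=()):
    """Check surface-form constraints by simple substring matching,
    e.g. require a specific API call or forbid explicit loops."""
    return (all(kw in program for kw in required)
            and not any(kw in program for kw in banned))

def is_accepted(program, test_code, required=(), banned=()):
    """A prediction counts as correct only if it satisfies both criteria."""
    return (passes_surface_form_constraints(program, required, banned)
            and passes_functional_tests(program, test_code))

# Toy usage: a problem that asks for a vectorized NumPy cumulative sum.
candidate = "import numpy as np\nresult = np.cumsum(np.arange(5))"
tests = "assert result[-1] == 10 and result[0] == 0"
print(is_accepted(candidate, tests, required=("np.cumsum",), banned=("for ",)))
```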
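The decoding configuration in the last row can be expressed as a single API call. The sketch below assumes the legacy `openai` Python SDK (pre-1.0) and the `code-davinci-002` (Codex-002) engine, which is no longer publicly served; it is illustrative, not the authors' released evaluation harness.

```python
import openai  # legacy SDK (< 1.0); Codex endpoints are no longer available

def sample_solutions(prompt: str, n_samples: int = 40):
    """Draw n_samples completions for one DS-1000 problem using the
    decoding hyperparameters reported above."""
    response = openai.Completion.create(
        engine="code-davinci-002",            # Codex-002
        prompt=prompt,
        temperature=0.2,                      # sampling temperature
        top_p=0.95,                           # nucleus (top-p) cutoff
        max_tokens=1024,                      # max generation length
        n=n_samples,                          # 40 samples per problem
        stop=["</code>", "# SOLUTION END"],   # stop sequences
    )
    return [choice.text for choice in response.choices]
```

Each of the 40 samples per problem would then be scored with a multi-criteria check like the one sketched above.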