DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
Authors: Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih, Daniel Fried, Sida Wang, Tao Yu
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from Stack Overflow. Second, our automatic evaluation is highly specific (reliable): across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% of them are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original Stack Overflow source; consequently, models cannot answer them correctly by memorizing the solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. (A sketch of the multi-criteria check appears after this table.) |
| Researcher Affiliation | Collaboration | The University of Hong Kong, Peking University, Stanford University, UC Berkeley, University of Washington, Meta AI, Carnegie Mellon University. |
| Pseudocode | No | The paper includes code examples and diagrams illustrating processes, but it does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | We release our benchmark at https://ds1000-code-gen.github.io. |
| Open Datasets | Yes | We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. We release our benchmark at https://ds1000-code-gen.github.io. |
| Dataset Splits | Yes | We evaluate our multi-criteria automatic metric by checking whether it can reject incorrect solutions. We randomly sampled 10 problems from each library and sampled 40 predictions from Codex-002 for each problem (2,800 problem-code examples in total). |
| Hardware Specification | No | For DS-1000, evaluating generated code does not require special computational resources like GPUs. |
| Software Dependencies | Yes | We fixed the evaluation environment to include the latest versions of libraries that can be installed with Python 3.7.10 and present the detailed documentation in Appendix A.1. Table 7: The versions of software in DS-1000 |
| Experiment Setup | Yes | We generate 40 samples for each DS-1000 problem with temperature set to 0.2, top-p cutoff set to 0.95, and max generation length set to 1024. We set the stop sequence tokens to `</code>` and `# SOLUTION END`. (A sampling sketch using these settings follows the table.) |
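
The multi-criteria evaluation quoted above combines running a problem's test cases with surface-form checks on the generated program. The sketch below illustrates that idea only; the function names, the substring-based constraint check, and the toy NumPy problem are our own assumptions, not the benchmark's actual implementation.

```python
# A minimal sketch of a multi-criteria check in the spirit of DS-1000's
# automatic evaluation. All names below are illustrative, not the
# benchmark's actual API.

def passes_functional_tests(program: str, test_code: str) -> bool:
    """Run the predicted program and then the problem's test cases;
    any exception (including an AssertionError) counts as a failure."""
    namespace = {}
    try:
        exec(program, namespace)    # execute the predicted solution
        exec(test_code, namespace)  # execute the tests in the same namespace
        return True
    except Exception:
        return False

def passes_surface_form_constraints(program, required=(), banned=()):
    """Check surface-form constraints by simple substring matching,
    e.g. require a specific API call or forbid explicit loops."""
    return (all(kw in program for kw in required)
            and not any(kw in program for kw in banned))

def is_accepted(program, test_code, required=(), banned=()):
    """A prediction counts as correct only if it satisfies both criteria."""
    return (passes_surface_form_constraints(program, required, banned)
            and passes_functional_tests(program, test_code))

# Toy usage: a problem that asks for a vectorized NumPy cumulative sum.
candidate = "import numpy as np\nresult = np.cumsum(np.arange(5))"
tests = "assert result[-1] == 10 and result[0] == 0"
print(is_accepted(candidate, tests, required=("np.cumsum",), banned=("for ",)))
```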
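The decoding configuration in the last row can be expressed as a single API call. The sketch below assumes the legacy `openai` Python SDK (pre-1.0) and the `code-davinci-002` (Codex-002) engine, which is no longer publicly served; it is illustrative, not the authors' released evaluation harness.

```python
import openai  # legacy SDK (< 1.0); Codex endpoints are no longer available

def sample_solutions(prompt: str, n_samples: int = 40):
    """Draw n_samples completions for one DS-1000 problem using the
    decoding hyperparameters reported above."""
    response = openai.Completion.create(
        engine="code-davinci-002",            # Codex-002
        prompt=prompt,
        temperature=0.2,                      # sampling temperature
        top_p=0.95,                           # nucleus (top-p) cutoff
        max_tokens=1024,                      # max generation length
        n=n_samples,                          # 40 samples per problem
        stop=["</code>", "# SOLUTION END"],   # stop sequences
    )
    return [choice.text for choice in response.choices]
```

Each of the 40 samples per problem would then be scored with a multi-criteria check like the one sketched above.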