Can LLMs Implicitly Learn Numeric Parameter Constraints in Data Science APIs?

Authors: Yinlin Deng, Chunqiu Steven Xia, Zhezhen Cao, Meiziniu Li, Lingming Zhang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we empirically investigate the proficiency of LLMs to handle these implicit numerical constraints when generating DS programs. We studied 28 widely used APIs from PyTorch and NumPy, and scrutinized the LLMs' generation performance at different levels of granularity: full programs, all parameters, and individual parameters of a single API. We evaluated both state-of-the-art open-source and closed-source models. (See the constraint sketch after this table.)
Researcher Affiliation | Academia | Yinlin Deng, Chunqiu Steven Xia, Zhezhen Cao, Meiziniu Li, Lingming Zhang; University of Illinois Urbana-Champaign; Southern University of Science and Technology; The Hong Kong University of Science and Technology. {yinlind2,chunqiu2,lingming}@illinois.edu, 12110529@mail.sustech.edu.cn, mlick@cse.ust.hk
Pseudocode | No | The paper describes procedures but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | The code was not yet released at the time of assessment; the NeurIPS Paper Checklist (Question 5, Justification) only promises a future release: 'Yes. The data and code will be made publicly available soon, along with detailed instructions to replicate the main experimental results.'
Open Datasets | No | Based on our experimental findings, we constructed DSEVAL, the first benchmark for systematically evaluating LLMs' capabilities in understanding the important numerical API constraints for popular DS libraries. DSEVAL contains 19,600 different problems... The data and code will be made publicly available soon. (See the evaluation sketch after this table.)
Dataset Splits | No | The paper describes generating input problems for evaluation under 'difficulty settings' (e.g., '14 difficulty settings, each with 200 different inputs per API') for pre-trained LLMs, but it does not specify a train/validation/test split for a dataset used to train or fine-tune a model presented in the paper.
Hardware Specification | Yes | We perform both LLM generation and evaluation on a 64-core workstation with 256 GB RAM running Ubuntu 20.04.5 LTS. For local open-source LLMs, we use NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper mentions software like 'PyTorch', 'NumPy', 'Z3 [13]' and various LLMs (e.g., 'DeepSeek Coder-33b', 'GPT-4-Turbo (2024-04-09)'), but does not provide specific version numbers for general programming languages or libraries required for replication (e.g., Python version, PyTorch version).
Experiment Setup | Yes | To support our analysis, we systematically created 3 generation settings: full program, all parameters, and individual parameters... For the all parameters setting, we have 14 difficulty settings, each with 200 different inputs per API, and use greedy decoding to obtain the LLM solutions... 'Unless otherwise stated, we use greedy decoding (i.e., temperature = 0) and temperature of 1 when sampling for diversity evaluation' and 'We use greedy decoding and set max_new_tokens to 512 for all models and all APIs, except for torch.nn.Fold we use max_new_tokens=1024'. (See the decoding-configuration sketch below.)
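
As an illustration of the kind of implicit numeric constraint the paper studies (the specific API and values below are our own example, not taken from the paper), torch.nn.Conv2d silently requires that `groups` evenly divide both `in_channels` and `out_channels`; violating this is only reported at call time.

```python
import torch

# Illustrative implicit numeric constraint (example chosen by us, not from the paper):
# torch.nn.Conv2d requires that `groups` evenly divides both
# `in_channels` and `out_channels`.

# Valid: 16 % 4 == 0 and 32 % 4 == 0, so construction succeeds.
conv_ok = torch.nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, groups=4)

# Invalid: 16 % 3 != 0, so PyTorch raises a ValueError at construction time.
try:
    conv_bad = torch.nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, groups=3)
except ValueError as err:
    print("constraint violated:", err)
```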
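The DSEVAL harness itself is not reproduced on this page; the following minimal sketch only assumes that an LLM-generated argument assignment can be judged by executing the API call and checking whether it raises. The helper name `check_call` and the numpy.reshape example are hypothetical choices for illustration.

```python
import numpy as np


def check_call(api_fn, *args, **kwargs):
    """Hypothetical execution-based check: run the API call with the
    generated arguments and report whether its numeric constraints hold
    (i.e., the call completes without raising)."""
    try:
        api_fn(*args, **kwargs)
        return True
    except Exception:
        return False


# numpy.reshape implicitly requires the target shape to preserve the
# total number of elements: 12 == 3 * 4, but 12 != 5 * 2.
print(check_call(np.reshape, np.zeros(12), (3, 4)))  # True
print(check_call(np.reshape, np.zeros(12), (5, 2)))  # False
```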
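The paper's generation harness is likewise not included here; this is a minimal sketch of the stated decoding configuration (greedy decoding, max_new_tokens=512) using Hugging Face transformers. The checkpoint name is only a placeholder for the open-source models the paper evaluates, and the prompt is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper evaluates several open-source code LLMs.
model_name = "deepseek-ai/deepseek-coder-33b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Complete the call with valid arguments: torch.nn.Conv2d("  # illustrative only
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding (temperature = 0 corresponds to do_sample=False here),
# with max_new_tokens=512 as stated in the experiment setup.
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```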