PAC Prediction Sets for Large Language Models of Code
Authors: Adam Khakhar, Stephen Mell, Osbert Bastani
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on PICARD (a T5 model for SQL semantic parsing) and Codex (a GPT model for over a dozen programming languages, including Python), demonstrating that our approach generates compact PAC prediction sets. This is the first research contribution that generates PAC prediction sets for generative code models. ... We empirically evaluate our approach on both PICARD (Scholak et al., 2021), a state-of-the-art semantic parser based on T5 (Raffel et al., 2019), trained on the Spider dataset (Yu et al., 2018) for SQL semantic parsing, as well as Codex (Chen et al., 2021). ... Our experiments demonstrate that our approach can generate prediction sets of the desired form that satisfy the PAC guarantee, while significantly outperforming a natural baseline in terms of a measure of prediction set size. |
| Researcher Affiliation | Academia | University of Pennsylvania, USA. Correspondence to: Adam Khakhar <ak@alumni.upenn.edu>. |
| Pseudocode | Yes | Algorithm 1 Our structured PAC prediction set algorithm. |
| Open Source Code | Yes | Our code is available at https://github.com/adamkhakhar/python-pac-code-prediction-set. |
| Open Datasets | Yes | This model is trained on the Spider dataset (Yu et al., 2018), a large multi-domain and cross-database dataset for SQL semantic parsing. ... For our experiments, we use natural language to Python code datasets including APPS (Hendrycks et al., 2021) and Human Eval: Hand-Written Evaluation Set (Chen et al., 2021). |
| Dataset Splits | No | The paper mentions using a 'validation set' and states 'For our experiments, we use 7,000 examples from Spider to construct prediction sets.' However, it does not provide specific breakdown percentages or counts for training, validation, and test splits for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, memory, or cloud instance specifications). |
| Software Dependencies | No | The paper mentions the PICARD (Scholak et al., 2021) and Codex (Chen et al., 2021) models, but it does not specify software dependencies such as programming language or library versions (e.g., specific Python, PyTorch, or TensorFlow versions) needed for reproducibility. |
| Experiment Setup | Yes | The main hyperparameter is the choice of search space over τ; we used the simple and natural choice of covering uniformly in increments of 0.01. In addition, we choose δ = 0.01; we vary ϵ in our results. ... We include constraints bounding how many subtrees we remove, and enforcing that if we remove a subtree, we remove all nodes in that subtree (Eq. 4: we remove at most m subtrees). |
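
The Experiment Setup row quotes the paper's key calibration hyperparameters: a uniform 0.01 grid over the threshold τ, δ = 0.01, and a varying ϵ. As a minimal sketch of how such a PAC threshold search is typically implemented, the snippet below certifies candidate τ values with a binomial-tail (Clopper-Pearson-style) bound; the `errors_at` callback, the `certified_taus` function, and the toy error curve are illustrative assumptions, not code from the paper or its repository.

```python
from scipy.stats import binom


def certified_taus(errors_at, n_cal, eps, delta=0.01, step=0.01):
    """Return candidate thresholds tau that satisfy a PAC-style guarantee.

    errors_at(tau): hypothetical callback giving the number of the n_cal
        calibration examples whose true program is NOT covered by the
        prediction set built at threshold tau (in the paper, prediction
        sets are partial programs obtained by removing code subtrees).
    eps, delta: target error rate and failure probability.
    """
    accepted = []
    tau = 0.0
    while tau <= 1.0 + 1e-9:
        k = errors_at(tau)
        # Certify tau if P[Binomial(n_cal, eps) <= k] <= delta, i.e. observing
        # only k calibration errors implies (with confidence 1 - delta) that
        # the true error rate is at most eps.
        if binom.cdf(k, n_cal, eps) <= delta:
            accepted.append(round(tau, 2))
        tau += step
    return accepted


# Illustrative usage with a toy, monotonically decreasing error curve and a
# calibration-set size of 7,000 (the Spider figure quoted above).
if __name__ == "__main__":
    toy_errors = lambda tau: int(700 * (1.0 - tau))
    print(certified_taus(toy_errors, n_cal=7000, eps=0.1))
```

Among the certified thresholds, one would then pick the τ that yields the smallest prediction sets subject to the paper's structural constraints (e.g., removing at most m subtrees); that selection step is not shown here.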