PAC Prediction Sets for Large Language Models of Code
Authors: Adam Khakhar, Stephen Mell, Osbert Bastani
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on PICARD (a T5 model for SQL semantic parsing) and Codex (a GPT model for over a dozen programming languages, including Python), demonstrating that our approach generates compact PAC prediction sets. This is the first research contribution that generates PAC prediction sets for generative code models. ... We empirically evaluate our approach on both PICARD (Scholak et al., 2021), a state-of-the-art semantic parser based on T5 (Raffel et al., 2019), trained on the Spider dataset (Yu et al., 2018) for SQL semantic parsing, as well as Codex (Chen et al., 2021). ... Our experiments demonstrate that our approach can generate prediction sets of the desired form that satisfy the PAC guarantee, while significantly outperforming a natural baseline in terms of a measure of prediction set size. |
| Researcher Affiliation | Academia | University of Pennsylvania, USA. Correspondence to: Adam Khakhar <ak@alumni.upenn.edu>. |
| Pseudocode | Yes | Algorithm 1 Our structured PAC prediction set algorithm. |
| Open Source Code | Yes | Our code is available at https://github.com/adamkhakhar/python-pac-code-prediction-set. |
| Open Datasets | Yes | This model is trained on the Spider dataset (Yu et al., 2018), a large multi-domain and cross-database dataset for SQL semantic parsing. ... For our experiments, we use natural language to Python code datasets including APPS (Hendrycks et al., 2021) and Human Eval: Hand-Written Evaluation Set (Chen et al., 2021). |
| Dataset Splits | No | The paper mentions using a 'validation set' and states 'For our experiments, we use 7,000 examples from Spider to construct prediction sets.' However, it does not provide specific breakdown percentages or counts for training, validation, and test splits for reproducibility. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, memory, or cloud instance specifications). |
| Software Dependencies | No | The paper mentions the PICARD (Scholak et al., 2021) and Codex (Chen et al., 2021) models, but it does not specify software dependencies such as programming language or library versions (e.g., specific Python, PyTorch, or TensorFlow versions) needed for reproducibility. |
| Experiment Setup | Yes | The main hyperparameter is the choice of search space over τ; we used the simple and natural choice of covering uniformly in increments of 0.01. In addition, we choose δ = 0.01; we vary ϵ in our results. ... We include constraints bounding how many subtrees we remove, and enforcing that if we remove a subtree, we remove all nodes in that subtree (Eq. 4: we remove at most m subtrees). |
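
The Experiment Setup row quotes the paper's key calibration hyperparameters: a uniform 0.01 grid over the threshold τ, δ = 0.01, and a varying ϵ. As a minimal sketch of how such a PAC threshold search is typically implemented, the snippet below certifies candidate τ values with a binomial-tail (Clopper-Pearson-style) bound; the `errors_at` callback, the `certified_taus` function, and the toy error curve are illustrative assumptions, not code from the paper or its repository.

```python
from scipy.stats import binom


def certified_taus(errors_at, n_cal, eps, delta=0.01, step=0.01):
    """Return candidate thresholds tau that satisfy a PAC-style guarantee.

    errors_at(tau): hypothetical callback giving the number of the n_cal
        calibration examples whose true program is NOT covered by the
        prediction set built at threshold tau (in the paper, prediction
        sets are partial programs obtained by removing code subtrees).
    eps, delta: target error rate and failure probability.
    """
    accepted = []
    tau = 0.0
    while tau <= 1.0 + 1e-9:
        k = errors_at(tau)
        # Certify tau if P[Binomial(n_cal, eps) <= k] <= delta, i.e. observing
        # only k calibration errors implies (with confidence 1 - delta) that
        # the true error rate is at most eps.
        if binom.cdf(k, n_cal, eps) <= delta:
            accepted.append(round(tau, 2))
        tau += step
    return accepted


# Illustrative usage with a toy, monotonically decreasing error curve and a
# calibration-set size of 7,000 (the Spider figure quoted above).
if __name__ == "__main__":
    toy_errors = lambda tau: int(700 * (1.0 - tau))
    print(certified_taus(toy_errors, n_cal=7000, eps=0.1))
```

Among the certified thresholds, one would then pick the τ that yields the smallest prediction sets subject to the paper's structural constraints (e.g., removing at most m subtrees); that selection step is not shown here.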