Prompt Sketching for Large Language Models
Authors: Luca Beurer-Kellner, Mark Niklas Mueller, Marc Fischer, Martin Vechev
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that in a zero-shot setting, prompt sketching outperforms existing, sequential prompting schemes such as direct asking or chain-of-thought on 7 out of 8 LLM benchmarking tasks, including state tracking, arithmetic reasoning, and general question answering. |
| Researcher Affiliation | Academia | Luca Beurer-Kellner¹, Mark Niklas Mueller¹, Marc Fischer¹, Martin Vechev¹; ¹Department of Computer Science, ETH Zürich, Switzerland. Correspondence to: Luca Beurer-Kellner <luca.beurerkellner@inf.ethz.ch>. |
| Pseudocode | Yes | See App. A, for a pseudo-code implementation of VAR. A pseudo-code implementation of BEAMVAR can be found in App. A. |
| Open Source Code | Yes (an illustrative sketch follows below the table) | To facilitate future use, we release a number of generic, yet effective sketches applicable to many tasks, and an open source library called dclib, powering our sketch-aware decoders as part of https://github.com/eth-sri/lmql. |
| Open Datasets | Yes | AQuA (Ling et al., 2017), Strategy QA (Geva et al., 2021), and GSM8K (Cobbe et al., 2021); all are standard datasets. |
| Dataset Splits | No | The paper evaluates on samples from existing benchmarks but does not specify custom training/validation/test splits, nor does it reference how the underlying datasets are partitioned. |
| Hardware Specification | Yes | For Llama-2, on the other hand, we run all of our experiments on 1000 samples per task (or the full datasets), using a single NVIDIA H100 GPU with 80GB memory. |
| Software Dependencies | No | The paper mentions the LMQL and dclib libraries but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes (a worked scoring example follows below the table) | For BEAM, VAR, and BEAMVAR we use a beam width of n = 2 and rely on length normalized scoring in line with previous work (Wu et al., 2016), using β = 0 and α = 0.7. |
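
To make the sketching idea behind the Open Source Code row concrete, below is a minimal, illustrative Python sketch of how a template that interleaves fixed text with `[VARIABLE]` placeholders can be filled left to right. The function name `fill_sketch` and the `generate(prompt, stop=...)` callable are hypothetical stand-ins for an LLM call; this is not the dclib or LMQL API, and the paper's sketch-aware decoders (VAR, BEAMVAR) implement more sophisticated, beam-style decoding over such templates.

```python
import re

def fill_sketch(template: str, generate, **inputs) -> dict:
    """Fill every [VARIABLE] placeholder in `template`, left to right."""
    parts = re.split(r"\[([A-Z_]+)\]", template.format(**inputs))
    prompt, values = "", {}
    for i, part in enumerate(parts):
        if i % 2 == 0:                   # even entries are fixed template text
            prompt += part
        else:                            # odd entries are variable names
            stop = parts[i + 1] or None  # stop once the next fixed chunk starts
            values[part] = generate(prompt, stop=stop)
            prompt += values[part]
    return values

# Example sketch in the spirit of the paper's chain-of-thought templates:
sketch = (
    "Q: {question}\n"
    "A: Let's think step by step.\n"
    "[REASONING]\n"
    "Therefore, the answer is [ANSWER]"
)
# values = fill_sketch(sketch, generate=my_llm_call, question="What is 17 * 3?")
```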
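
The length-normalized scoring referenced in the Experiment Setup row follows Wu et al. (2016), which divides a candidate's summed log-probability by the length penalty lp(Y) = ((5 + |Y|) / 6)^α; with β = 0 the coverage-penalty term vanishes. A small worked example follows, assuming |Y| is the raw generated-token count (how the authors count length is not stated in the quoted setup):

```python
def length_penalty(length: int, alpha: float = 0.7) -> float:
    # lp(Y) = ((5 + |Y|) / 6) ** alpha, as in Wu et al. (2016)
    return ((5 + length) / 6) ** alpha

def normalized_score(logprob_sum: float, length: int, alpha: float = 0.7) -> float:
    # With beta = 0 the coverage-penalty term drops out, leaving only
    # the length-normalized log-probability.
    return logprob_sum / length_penalty(length, alpha)

# The same summed log-probability is penalized less when spread over more tokens:
print(normalized_score(-12.0, length=20))  # ~ -4.42
print(normalized_score(-12.0, length=40))  # ~ -2.93
```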