Prompt Sketching for Large Language Models
Authors: Luca Beurer-Kellner, Mark Niklas Mueller, Marc Fischer, Martin Vechev
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that in a zero-shot setting, prompt sketching outperforms existing, sequential prompting schemes such as direct asking or chain-of-thought on 7 out of 8 LLM benchmarking tasks, including state tracking, arithmetic reasoning, and general question answering. |
| Researcher Affiliation | Academia | Luca Beurer-Kellner¹, Mark Niklas Mueller¹, Marc Fischer¹, Martin Vechev¹; ¹Department of Computer Science, ETH Zürich, Switzerland. Correspondence to: Luca Beurer-Kellner <luca.beurerkellner@inf.ethz.ch>. |
| Pseudocode | Yes | See App. A, for a pseudo-code implementation of VAR. A pseudo-code implementation of BEAMVAR can be found in App. A. |
| Open Source Code | Yes (an illustrative sketch follows below the table) | To facilitate future use, we release a number of generic, yet effective sketches applicable to many tasks, and an open source library called dclib, powering our sketch-aware decoders as part of https://github.com/eth-sri/lmql. |
| Open Datasets | Yes | AQuA (Ling et al., 2017), Strategy QA (Geva et al., 2021), and GSM8K (Cobbe et al., 2021); all are standard datasets. |
| Dataset Splits | No | The paper evaluates on samples from existing benchmarks but does not specify custom training/validation/test splits, nor does it reference how the underlying datasets are partitioned. |
| Hardware Specification | Yes | For Llama-2, on the other hand, we run all of our experiments on 1000 samples per task (or the full datasets), using a single NVIDIA H100 GPU with 80GB memory. |
| Software Dependencies | No | The paper mentions the LMQL and dclib libraries but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes (a worked scoring example follows below the table) | For BEAM, VAR, and BEAMVAR we use a beam width of n = 2 and rely on length normalized scoring in line with previous work (Wu et al., 2016), using β = 0 and α = 0.7. |
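
To make the sketching idea behind the Open Source Code row concrete, below is a minimal, illustrative Python sketch of how a template that interleaves fixed text with `[VARIABLE]` placeholders can be filled left to right. The function name `fill_sketch` and the `generate(prompt, stop=...)` callable are hypothetical stand-ins for an LLM call; this is not the dclib or LMQL API, and the paper's sketch-aware decoders (VAR, BEAMVAR) implement more sophisticated, beam-style decoding over such templates.

```python
import re

def fill_sketch(template: str, generate, **inputs) -> dict:
    """Fill every [VARIABLE] placeholder in `template`, left to right."""
    parts = re.split(r"\[([A-Z_]+)\]", template.format(**inputs))
    prompt, values = "", {}
    for i, part in enumerate(parts):
        if i % 2 == 0:                   # even entries are fixed template text
            prompt += part
        else:                            # odd entries are variable names
            stop = parts[i + 1] or None  # stop once the next fixed chunk starts
            values[part] = generate(prompt, stop=stop)
            prompt += values[part]
    return values

# Example sketch in the spirit of the paper's chain-of-thought templates:
sketch = (
    "Q: {question}\n"
    "A: Let's think step by step.\n"
    "[REASONING]\n"
    "Therefore, the answer is [ANSWER]"
)
# values = fill_sketch(sketch, generate=my_llm_call, question="What is 17 * 3?")
```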
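
The length-normalized scoring referenced in the Experiment Setup row follows Wu et al. (2016), which divides a candidate's summed log-probability by the length penalty lp(Y) = ((5 + |Y|) / 6)^α; with β = 0 the coverage-penalty term vanishes. A small worked example follows, assuming |Y| is the raw generated-token count (how the authors count length is not stated in the quoted setup):

```python
def length_penalty(length: int, alpha: float = 0.7) -> float:
    # lp(Y) = ((5 + |Y|) / 6) ** alpha, as in Wu et al. (2016)
    return ((5 + length) / 6) ** alpha

def normalized_score(logprob_sum: float, length: int, alpha: float = 0.7) -> float:
    # With beta = 0 the coverage-penalty term drops out, leaving only
    # the length-normalized log-probability.
    return logprob_sum / length_penalty(length, alpha)

# The same summed log-probability is penalized less when spread over more tokens:
print(normalized_score(-12.0, length=20))  # ~ -4.42
print(normalized_score(-12.0, length=40))  # ~ -2.93
```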