Grammar Prompting for Domain-Specific Language Generation with Large Language Models

Authors: Bailin Wang, Zi Wang, Xuezhi Wang, Yuan Cao, Rif A. Saurous, Yoon Kim

NeurIPS 2023

Reproducibility assessment: each entry below gives the reproducibility variable, the result, and the LLM's supporting response.
Research Type: Experimental. Experiments demonstrate that grammar prompting can enable LLMs to perform competitively on a diverse set of DSL generation tasks, including semantic parsing (SMCalFlow, Overnight, GeoQuery), PDDL planning, and SMILES-based molecule generation. (A minimal sketch of the two-stage prompting scheme appears after these entries.)
Researcher Affiliation: Collaboration. Bailin Wang, Zi Wang, Xuezhi Wang, Yuan Cao, Rif A. Saurous, Yoon Kim; Massachusetts Institute of Technology, Google DeepMind, Google Research. {bailinw, yoonkim}@mit.edu, {wangzi, xuezhiw, yuancao, rif}@google.com
Pseudocode: Yes. Algorithm 1: Earley-based Constrained Generation. (An illustrative sketch of grammar-constrained decoding appears after these entries.)
Open Source Code: Yes. Code and data available at: https://github.com/berlino/grammar-prompting.
Open Datasets: Yes. "We test our approach on standard semantic parsing benchmarks involving complex DSLs: SMCalFlow [6], which features human-generated utterances about calendar management (see Figure 2); GeoQuery [99], which features queries against a US geography database; and Overnight-Blocks [81]... The data contains 32 Acrylates, 11 Chain Extenders, and 11 Isocyanates (see appendix G of Guo et al. [29])."
Dataset Splits: No. The paper specifies training and test counts (e.g., "GeoQuery and Overnight-Blk use 32 in-context examples, and SMCalFlow uses 16 examples," and Table 6 reports "Train" and "Test" counts), but it does not explicitly describe validation sets or how the splits were constructed.
Hardware Specification: No. The paper mentions using specific LLM APIs such as Codex (code-davinci-002) [13], GPT-3.5, GPT-4, and PaLM 2-L [7]. While these models run on provider hardware, the paper does not report the local hardware (e.g., GPU or CPU models, memory) used to interact with the APIs or to run any local computations.
Software Dependencies: No. The paper mentions several software tools and models, such as an Earley parser [18], Sentence-BERT [59], the Retro* model [12], and Pyperplan [5], but it does not specify version numbers for any of them. It also mentions Python in a figure, but without a version. (A sketch of exemplar retrieval with Sentence-BERT appears after these entries.)
Experiment Setup: Yes. Table 8: "Hyperparameters for sampling specialized grammars Ĝ (top) and the molecules ŷ in grammar prompting for molecule generation. Standard prompting uses the same hyperparameters for y." The table specifies temperature, presence penalty, and frequency penalty. (A sketch of how these settings map onto an API call appears below.)
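
For readers assessing reproducibility, the following is a minimal sketch of the paper's two-stage grammar prompting scheme: the LLM is shown the full DSL grammar in BNF plus exemplars annotated with specialized grammars, then first predicts a specialized grammar Ĝ for the new input and then the program ŷ. The toy grammar, prompt layout, and the `complete(prompt) -> str` LLM wrapper are illustrative assumptions, not the paper's exact template.

```python
# Minimal sketch of grammar prompting. `complete` is an assumed
# text-completion wrapper around any LLM API; the BNF fragment is a toy
# example in the style of GeoQuery, not the paper's grammar.

FULL_GRAMMAR = """\
query ::= "answer" "(" expr ")"
expr  ::= state | city | "loc_2" "(" expr ")"
state ::= "stateid" "(" NAME ")"
city  ::= "cityid" "(" NAME ")"
"""

def grammar_prompt(exemplars, utterance, complete):
    """exemplars: list of (utterance, specialized_grammar, program) triples."""
    parts = ["Full DSL grammar (BNF):", FULL_GRAMMAR]
    for x, g, y in exemplars:
        parts += [f"utterance: {x}", f"grammar: {g}", f"program: {y}"]
    parts += [f"utterance: {utterance}", "grammar:"]
    # Stage 1: predict a specialized grammar for this input.
    g_hat = complete("\n".join(parts))
    # Stage 2: generate the program conditioned on the predicted grammar.
    y_hat = complete("\n".join(parts) + " " + g_hat + "\nprogram:")
    return g_hat, y_hat
```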
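The review quotes only the name of Algorithm 1 (Earley-based Constrained Generation). Below is an illustrative sketch of grammar-constrained decoding, not the paper's Algorithm 1 verbatim: it assumes a `valid_prefix(grammar, text)` check, which could be built on an Earley parser (e.g., the one in the `lark` package), and a `propose(prefix, k)` stand-in for the LLM's top-k candidate continuations.

```python
# Sketch of constrained generation: at each step, accept the highest-ranked
# candidate token whose extension of the output is still a valid prefix of
# some string in the grammar. `propose` and `valid_prefix` are assumed
# interfaces, not the paper's implementation.

def constrained_generate(grammar, propose, valid_prefix, max_len=128):
    out = ""
    for _ in range(max_len):
        accepted = None
        for tok in propose(out, k=5):          # candidates, best first
            if valid_prefix(grammar, out + tok):
                accepted = tok
                break
        if accepted is None:
            break  # no grammatical continuation survives the prefix check
        out += accepted
        if accepted == "<eos>":
            break
    return out
```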
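Since the paper selects in-context exemplars with Sentence-BERT but pins no versions, here is a sketch of similarity-based exemplar retrieval using the `sentence-transformers` package. The checkpoint name is a placeholder, not taken from the paper; k=16 mirrors the quoted SMCalFlow setting.

```python
# Sketch of retrieving in-context exemplars by embedding similarity, in the
# spirit of the paper's Sentence-BERT retriever. Checkpoint is a placeholder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint

def retrieve_exemplars(query, pool, k=16):
    """pool: list of (utterance, program) training pairs."""
    q = model.encode(query, convert_to_tensor=True)
    docs = model.encode([u for u, _ in pool], convert_to_tensor=True)
    hits = util.semantic_search(q, docs, top_k=k)[0]
    return [pool[h["corpus_id"]] for h in hits]
```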
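Finally, a hedged sketch of how the Table 8 sampling knobs (temperature, presence penalty, frequency penalty) map onto an LLM API call, using the OpenAI client. The model name and numeric values are placeholders, not the paper's settings.

```python
# Sketch only: Table 8 reports the actual values; the numbers below are
# placeholders for illustration.
from openai import OpenAI

client = OpenAI()
prompt = "utterance: ...\ngrammar:"  # assembled as in the sketch above

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",   # placeholder; the paper also uses GPT-4 etc.
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,         # Table 8: Temperature
    presence_penalty=0.0,    # Table 8: Presence Penalty
    frequency_penalty=0.0,   # Table 8: Frequency Penalty
)
text = resp.choices[0].message.content
```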