SatLM: Satisfiability-Aided Language Models Using Declarative Prompting

Authors: Xi Ye, Qiaochu Chen, Isil Dillig, Greg Durrett

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SATLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
Researcher Affiliation | Academia | Xi Ye, Qiaochu Chen, Isil Dillig, Greg Durrett; Department of Computer Science, The University of Texas at Austin; {xiye,qchen,isil,gdurrett}@cs.utexas.edu
Pseudocode | Yes | An example of the specification is presented on the right of Figure 1. Our specification is declarative since we do not explicitly generate the r_i from the LLM at this stage.
Open Source Code | Yes | Code available at https://github.com/xiye17/SAT-LM.
Open Datasets | Yes | We list all dataset statistics in Appendix A. For arithmetic reasoning, we use GSM (Cobbe et al., 2021), GSM-SYS, and ALGEBRA (He-Yueya et al., 2023).
Dataset Splits | No | The paper discusses few-shot exemplars and test sets but does not explicitly specify a separate validation split with percentages or counts for model training/tuning.
Hardware Specification | No | We conduct our main experiments and analysis on code-davinci-002 (Chen et al., 2021), a state-of-the-art LLM for code and code-adjacent tasks.
Software Dependencies | No | The paper mentions using 'Python syntax' and the 'Z3 SMT solver' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | We evaluate the performance with both greedy decoding and self-consistency decoding (Wang et al., 2022b). Following past work (Gao et al., 2023), we use 40 samples on all datasets except for LSAT, BOARDGAMEQA, and PROOFWRITER; we use 5 samples on these datasets involving long prompts and high computation cost. For COT and PROGLM, we use a temperature of 0.7; for SATLM, we use a higher temperature of 0.9, which we find to work better.
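The "Pseudocode" row above refers to the paper's declarative specifications, which are handed to an off-the-shelf SAT/SMT solver rather than executed step by step by the LLM. The following is a minimal illustrative sketch of what such a specification can look like with the Z3 Python bindings; the word problem, variable names, and constraints are hypothetical examples, not taken from the paper or its prompts.

```python
# Minimal sketch of the declarative paradigm SatLM targets: constraints are stated,
# and the Z3 solver (z3-solver package) derives the answer. Illustrative only.
from z3 import Ints, Solver, sat

# Hypothetical word problem: "Alice has 3 more apples than Bob. Together they have 11."
alice, bob = Ints("alice bob")

s = Solver()
s.add(alice == bob + 3)        # declarative constraint; no solution procedure spelled out
s.add(alice + bob == 11)

if s.check() == sat:
    m = s.model()
    print(m[alice], m[bob])    # -> 7 4
```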
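The "Experiment Setup" row describes self-consistency decoding: the LLM is sampled several times at a nonzero temperature, each sampled specification is solved, and the final answers are aggregated by majority vote. The sketch below is a rough illustration under assumed interfaces; `sample_fn` and `solve_fn` are hypothetical stand-ins for the LLM call and the solver run, not the authors' code.

```python
# Sketch of self-consistency decoding (Wang et al., 2022b) as used in the paper's setup.
from collections import Counter

def self_consistency(prompt, sample_fn, solve_fn, n_samples=40, temperature=0.9):
    """Sample n specifications from the LLM, solve each, and majority-vote the answers.

    sample_fn(prompt, temperature) -> generated specification (hypothetical LLM call)
    solve_fn(spec) -> final answer, or None if the solver fails (e.g., a Z3 run)
    """
    answers = []
    for _ in range(n_samples):
        spec = sample_fn(prompt, temperature=temperature)
        answer = solve_fn(spec)
        if answer is not None:          # discard samples the solver cannot interpret
            answers.append(answer)
    # the most frequent answer across samples wins the self-consistency vote
    return Counter(answers).most_common(1)[0][0] if answers else None
```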