SatLM: Satisfiability-Aided Language Models Using Declarative Prompting
Authors: Xi Ye, Qiaochu Chen, Isil Dillig, Greg Durrett
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SATLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm. |
| Researcher Affiliation | Academia | Xi Ye, Qiaochu Chen, Isil Dillig, Greg Durrett; Department of Computer Science, The University of Texas at Austin; {xiye,qchen,isil,gdurrett}@cs.utexas.edu |
| Pseudocode | Yes | An example of the specification is presented on the right of Figure 1. Our specification is declarative since we do not explicitly generate the r_i from the LLM at this stage. |
| Open Source Code | Yes | 1Code available at https://github.com/xiye17/SAT-LM. |
| Open Datasets | Yes | We list all dataset statistics in Appendix A. For arithmetic reasoning, we use GSM (Cobbe et al., 2021), GSM-SYS, and ALGEBRA (He-Yueya et al., 2023). |
| Dataset Splits | No | The paper discusses few-shot exemplars and test sets but does not explicitly specify a separate validation dataset split with percentages or counts for model training/tuning. |
| Hardware Specification | No | We conduct our main experiments and analysis on code-davinci-002 (Chen et al., 2021), a state-of-the-art LLM for code and code-adjacent tasks. |
| Software Dependencies | No | The paper mentions using 'Python syntax' and 'Z3 SMT solver' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We evaluate the performance with both greedy decoding and self-consistency decoding (Wang et al., 2022b). Following past work (Gao et al., 2023), we use 40 samples on all datasets except for LSAT, BOARDGAMEQA, and PROOFWRITER; we use 5 samples on these datasets involving long prompts and high computation cost. For COT and PROGLM, we use a temperature of 0.7; for SATLM, we use a higher temperature of 0.9, which we find to work better. |
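To make the Pseudocode and Software Dependencies rows above concrete, the following is a minimal illustrative sketch of a declarative specification in the style SATLM targets, written with the z3-solver Python package. The toy word problem, variable names, and constraints are assumptions chosen for illustration; this is not the paper's actual prompt format or released code.

```python
# Illustrative sketch only: a hand-written Z3 specification in the declarative
# style SATLM prompts the LLM to produce (z3-solver Python package).
# The toy problem and names here are assumptions, not from the paper or repo.
from z3 import Int, Solver, sat

# Toy word problem: "Alice has 3 more apples than Bob; together they have 11."
alice, bob = Int("alice"), Int("bob")

solver = Solver()
solver.add(alice == bob + 3)    # declarative facts only; no solution procedure is spelled out
solver.add(alice + bob == 11)

if solver.check() == sat:
    model = solver.model()
    print("alice =", model[alice], "bob =", model[bob])  # alice = 7, bob = 4
```

The point of the declarative style is visible here: the specification states constraints and leaves the derivation of the answer entirely to the solver, rather than asking the LLM to execute the reasoning steps itself.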
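As a reading aid for the Experiment Setup row, here is a minimal, hypothetical sketch of the aggregation step in self-consistency decoding: each sampled specification is solved independently, and the final prediction is the majority answer across samples (e.g., 40 samples at temperature 0.9 for SATLM). The `majority_vote` helper and the toy per-sample answers are assumptions for illustration, not functions or data from the paper's code release.

```python
# Minimal sketch of self-consistency aggregation: majority vote over the
# answers obtained by solving each independently sampled specification.
from collections import Counter

def majority_vote(answers):
    """Return the most frequent non-None answer, or None if all samples failed."""
    valid = [a for a in answers if a is not None]
    return Counter(valid).most_common(1)[0][0] if valid else None

# Toy per-sample solver outputs; None marks an unsatisfiable or unparseable specification.
sampled_answers = [7, 7, None, 6, 7]
print(majority_vote(sampled_answers))  # -> 7
```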