reproducibilityindex.ai

SatLM: Satisfiability-Aided Language Models Using Declarative Prompting

Authors: Xi Ye, Qiaochu Chen, Isil Dillig, Greg Durrett

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate SATLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
Researcher Affiliation	Academia	Xi Ye Qiaochu Chen Isil Dillig Greg Durrett Department of Computer Science The University of Texas at Austin {xiye,qchen,isil,gdurrett}@cs.utexas.edu
Pseudocode	Yes	An example of the specification is presented on the right of Figure 1. Our specification is declarative since we do not explicitly generate the ri from the LLM at this stage.
Open Source Code	Yes	1Code available at https://github.com/xiye17/SAT-LM.
Open Datasets	Yes	We list all dataset statistics in Appendix A. For arithmetic reasoning, we use GSM (Cobbe et al., 2021), GSM-SYS, and ALGEBRA (He-Yueya et al., 2023).
Dataset Splits	No	The paper discusses few-shot exemplars and test sets but does not explicitly specify a separate validation dataset split with percentages or counts for model training/tuning.
Hardware Specification	No	We conduct our main experiments and analysis on code-davinci-002 (Chen et al., 2021), a state-of-art LLM for code and code-adjacent tasks.
Software Dependencies	No	The paper mentions using 'Python syntax' and 'Z3 SMT solver' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup	Yes	We evaluate the performance with both greedy decoding and self-consistency decoding (Wang et al., 2022b). Following past work (Gao et al., 2023), we use 40 samples on all datasets except for LSAT, BOARDGAMEQA, and PROOFWRITER; we use 5 samples on these datasets involving long prompts and high computation cost. For COT and PROGLM, we use a temperature of 0.7; for SATLM, we use a higher temperature of 0.9, which we find to work better.