Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
SatLM: Satisfiability-Aided Language Models Using Declarative Prompting
Authors: Xi Ye, Qiaochu Chen, Isil Dillig, Greg Durrett
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SATLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm. |
| Researcher Affiliation | Academia | Xi Ye Qiaochu Chen Isil Dillig Greg Durrett Department of Computer Science The University of Texas at Austin EMAIL |
| Pseudocode | Yes | An example of the specification is presented on the right of Figure 1. Our specification is declarative since we do not explicitly generate the ri from the LLM at this stage. |
| Open Source Code | Yes | 1Code available at https://github.com/xiye17/SAT-LM. |
| Open Datasets | Yes | We list all dataset statistics in Appendix A. For arithmetic reasoning, we use GSM (Cobbe et al., 2021), GSM-SYS, and ALGEBRA (He-Yueya et al., 2023). |
| Dataset Splits | No | The paper discusses few-shot exemplars and test sets but does not explicitly specify a separate validation dataset split with percentages or counts for model training/tuning. |
| Hardware Specification | No | We conduct our main experiments and analysis on code-davinci-002 (Chen et al., 2021), a state-of-art LLM for code and code-adjacent tasks. |
| Software Dependencies | No | The paper mentions using 'Python syntax' and 'Z3 SMT solver' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We evaluate the performance with both greedy decoding and self-consistency decoding (Wang et al., 2022b). Following past work (Gao et al., 2023), we use 40 samples on all datasets except for LSAT, BOARDGAMEQA, and PROOFWRITER; we use 5 samples on these datasets involving long prompts and high computation cost. For COT and PROGLM, we use a temperature of 0.7; for SATLM, we use a higher temperature of 0.9, which we find to work better. |