Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization
Authors: Jin Peng Zhou, Charles E. Staats, Wenda Li, Christian Szegedy, Kilian Q. Weinberger, Yuhuai Wu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on GSM8K, MATH and MultiArith datasets and demonstrate that our approach provides a consistently better heuristic than vanilla majority voting, the previously best method to identify correct answers, by more than 12% on GSM8K. In our experiments it improves results consistently across all datasets and LLM model sizes. |
| Researcher Affiliation | Collaboration | Jin Peng Zhou (Cornell University), Charles Staats, Wenda Li (University of Edinburgh), Christian Szegedy (xAI), Kilian Q. Weinberger (Cornell University), Yuhuai Wu (xAI) |
| Pseudocode | No | The paper describes the method using prose and a diagram (Figure 1), but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code can be found at https://github.com/jinpz/dtv. |
| Open Datasets | Yes | We evaluate DTV on three quantitative reasoning datasets: GSM8K (Cobbe et al., 2021), three subsets of MATH (Hendrycks et al., 2021) following prior work (Zheng et al., 2021), and MultiArith (Roy & Roth, 2016). |
| Dataset Splits | No | The paper evaluates on existing datasets (GSM8K, MATH, Multi Arith) which are referred to as 'evaluation sets', but does not specify the explicit training, validation, and test splits (e.g., percentages or counts) for their own experiments or how data was partitioned for reproduction. |
| Hardware Specification | No | The paper mentions the use of Minerva models and GPT3.5, and refers to model sizes (e.g., 8B, 62B, 540B), but does not specify the exact hardware (e.g., GPU models, CPU models, memory) on which the experiments were run. |
| Software Dependencies | No | The paper mentions Isabelle, Portal-to-ISAbelle, and Sledgehammer, along with various tactics, but does not specify version numbers for these software components. |
| Experiment Setup | Yes | We use the default sampling configuration (T = 0.6, nucleus sampling (Holtzman et al., 2019) with p = 0.95) reported in Lewkowycz et al. (2022) for Minerva models. We prepare 25 paired (informal statement, Isabelle formal statement) examples for each of the three categories from the MATH dataset, plus the GSM8K and MultiArith datasets, as the candidates for few-shot demonstrations. When prompting the model to formalize an informal statement, we randomly draw and permute 10 examples to form the few-shot prompt. |
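The few-shot prompting procedure in the Experiment Setup row can be sketched as follows. This is a minimal illustration of drawing and permuting 10 of the 25 (informal, formal) example pairs to build an autoformalization prompt; the function name, prompt template, and tuple layout are assumptions for illustration, not taken from the paper's code.

```python
import random

def build_fewshot_prompt(paired_examples, query_informal, k=10, rng=None):
    """Assemble a few-shot autoformalization prompt by randomly drawing k
    (informal statement, Isabelle formal statement) pairs without replacement.

    Sampling without replacement already yields a random permutation of the
    chosen examples, matching the "randomly draw and permute" description.
    The prompt template below is a hypothetical stand-in.
    """
    rng = rng or random.Random()
    chosen = rng.sample(paired_examples, k)  # random subset in random order
    parts = [
        f"Informal statement: {informal}\nFormal statement in Isabelle: {formal}"
        for informal, formal in chosen
    ]
    # The query goes last, with the formal statement left for the model to complete.
    parts.append(f"Informal statement: {query_informal}\nFormal statement in Isabelle:")
    return "\n\n".join(parts)
```

A usage example: with a pool of 25 candidate pairs, each call produces a prompt containing 10 demonstrations followed by the new informal statement to formalize.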