Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization

Authors: Jin Peng Zhou, Charles E. Staats, Wenda Li, Christian Szegedy, Kilian Q. Weinberger, Yuhuai Wu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our method on GSM8K, MATH and MultiArith datasets and demonstrate that our approach provides a consistently better heuristic than vanilla majority voting, the previously best method to identify correct answers, by more than 12% on GSM8K. In our experiments it improves results consistently across all datasets and LLM model sizes."
Researcher Affiliation | Collaboration | Jin Peng Zhou (Cornell University), Charles E. Staats, Wenda Li (University of Edinburgh), Christian Szegedy (xAI), Kilian Q. Weinberger (Cornell University), Yuhuai Wu (xAI)
Pseudocode | No | The paper describes the method in prose and a diagram (Figure 1) but includes no explicit pseudocode or algorithm blocks. (A hedged sketch of the described pipeline follows this table.)
Open Source Code | Yes | "The code can be found at https://github.com/jinpz/dtv."
Open Datasets | Yes | "We evaluate DTV on three quantitative reasoning datasets: GSM8K (Cobbe et al., 2021), three subsets of MATH (Hendrycks et al., 2021) following prior work (Zheng et al., 2021), and MultiArith (Roy & Roth, 2016)." (A loading sketch follows this table.)
Dataset Splits | No | The paper evaluates on existing datasets (GSM8K, MATH, MultiArith), referred to as "evaluation sets", but does not specify explicit training, validation, and test splits (e.g., percentages or counts) or how data was partitioned for reproduction.
Hardware Specification | No | The paper mentions Minerva models and GPT-3.5 and refers to model sizes (e.g., 8B, 62B, 540B), but does not specify the hardware (e.g., GPU or CPU models, memory) on which the experiments were run.
Software Dependencies | No | The paper mentions Isabelle, Portal-to-ISAbelle, and Sledgehammer, along with various proof tactics, but does not give version numbers for these software components.
Experiment Setup | Yes | "We use the default sampling configuration (T = 0.6, nucleus sampling (Holtzman et al., 2019) p = 0.95) reported in Lewkowycz et al. (2022) for Minerva models. We prepare 25 paired (informal statement, Isabelle formal statement) examples for each of the three categories from the MATH, GSM8K, and MultiArith datasets as the candidates for few-shot demonstrations. When prompting the model to formalize an informal statement, we randomly draw and permute 10 examples to form the few-shot prompt." (A prompt-construction sketch follows this table.)
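
Since the paper contains no pseudocode, the following is a minimal sketch of the pipeline the responses above describe: sample solutions, autoformalize each into an Isabelle statement, keep only answers whose statements are proven, and majority-vote over the survivors. The `formalize` and `prove` callables are hypothetical stand-ins for the paper's LLM formalizer and Isabelle/Sledgehammer check, not its actual API, and the fallback to vanilla voting when nothing verifies is an assumption of this sketch.

```python
from collections import Counter

def majority_vote(answers):
    """Vanilla majority voting: the most frequent final answer wins."""
    return Counter(answers).most_common(1)[0][0]

def dtv_select(problem, samples, formalize, prove):
    """Verified voting in the spirit of DTV (a sketch, not the paper's code).

    samples: list of (solution_text, final_answer) pairs sampled from the LLM.
    formalize(problem, text): returns an Isabelle statement string, or None.
    prove(statement): returns True if the statement is formally proven.
    """
    verified = [ans for text, ans in samples
                if (stmt := formalize(problem, text)) is not None and prove(stmt)]
    if verified:
        # Vote only among answers whose formalizations were proven.
        return majority_vote(verified)
    # Assumed fallback: vanilla majority voting when nothing verifies.
    return majority_vote([ans for _, ans in samples])
```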
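
The evaluation sets themselves are publicly available. Below is a hedged loading example for GSM8K via the Hugging Face `datasets` library; the Hub identifier and field names are assumptions, since the paper does not state how the data was obtained, and the MATH and MultiArith identifiers vary by mirror.

```python
from datasets import load_dataset  # pip install datasets

# The Hub identifier and field names below are assumptions; the paper
# does not specify a loading mechanism for its evaluation sets.
gsm8k = load_dataset("openai/gsm8k", "main", split="test")
example = gsm8k[0]
print(example["question"])  # natural-language word problem
print(example["answer"])    # worked solution ending in "#### <final answer>"
```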
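
The quoted setup (draw and permute 10 of the 25 paired examples per category; T = 0.6, nucleus p = 0.95) can be sketched as follows. All function names and the demonstration formatting are illustrative, not taken from the released code.

```python
import random

# Sampling configuration quoted from the paper (Lewkowycz et al., 2022 defaults).
SAMPLING = {"temperature": 0.6, "top_p": 0.95}

def build_fewshot_prompt(pool, question, k=10, rng=random):
    """Randomly draw and permute k (informal, Isabelle formal) demonstrations.

    pool: the 25 curated (informal statement, formal statement) pairs
    for the relevant dataset category.
    """
    demos = rng.sample(pool, k)  # sample without replacement; order is random
    parts = [f"Informal statement: {inf}\nIsabelle statement: {form}"
             for inf, form in demos]
    parts.append(f"Informal statement: {question}\nIsabelle statement:")
    return "\n\n".join(parts)
```

Note that `random.sample` already returns its selection in random order, so a single call covers both the draw and the permutation described in the quote.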