Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization

Authors: Jin Peng Zhou, Charles E. Staats, Wenda Li, Christian Szegedy, Kilian Q. Weinberger, Yuhuai Wu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our method on GSM8K, MATH and MultiArith datasets and demonstrate that our approach provides a consistently better heuristic than vanilla majority voting, the previously best method to identify correct answers, by more than 12% on GSM8K. In our experiments it improves results consistently across all datasets and LLM model sizes."
Researcher Affiliation | Collaboration | Jin Peng Zhou (Cornell University), Charles E. Staats, Wenda Li (University of Edinburgh), Christian Szegedy (xAI), Kilian Q. Weinberger (Cornell University), Yuhuai Wu (xAI)
Pseudocode | No | The paper describes the method in prose and a diagram (Figure 1) but includes no explicit pseudocode or algorithm blocks. (A hedged sketch of the described pipeline follows this table.)
Open Source Code | Yes | "The code can be found at https://github.com/jinpz/dtv."
Open Datasets | Yes | "We evaluate DTV on three quantitative reasoning datasets: GSM8K (Cobbe et al., 2021), three subsets of MATH (Hendrycks et al., 2021) following prior work (Zheng et al., 2021), and MultiArith (Roy & Roth, 2016)." (A loading sketch follows this table.)
Dataset Splits | No | The paper evaluates on existing datasets (GSM8K, MATH, MultiArith), referred to as "evaluation sets", but does not specify explicit training, validation, and test splits (e.g., percentages or counts) or how data was partitioned for reproduction.
Hardware Specification | No | The paper mentions Minerva models and GPT-3.5 and refers to model sizes (e.g., 8B, 62B, 540B), but does not specify the hardware (e.g., GPU or CPU models, memory) on which the experiments were run.
Software Dependencies | No | The paper mentions Isabelle, Portal-to-ISAbelle, and Sledgehammer, along with various proof tactics, but does not give version numbers for these software components.
Experiment Setup | Yes | "We use the default sampling configuration (T = 0.6, nucleus sampling (Holtzman et al., 2019) p = 0.95) reported in Lewkowycz et al. (2022) for Minerva models. We prepare 25 paired (informal statement, Isabelle formal statement) examples for each of the three categories from the MATH, GSM8K, and MultiArith datasets as the candidates for few-shot demonstrations. When prompting the model to formalize an informal statement, we randomly draw and permute 10 examples to form the few-shot prompt." (A prompt-construction sketch follows this table.)
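
Since the paper contains no pseudocode, the following is a minimal sketch of the pipeline the responses above describe: sample solutions, autoformalize each into an Isabelle statement, keep only answers whose statements are proven, and majority-vote over the survivors. The `formalize` and `prove` callables are hypothetical stand-ins for the paper's LLM formalizer and Isabelle/Sledgehammer check, not its actual API, and the fallback to vanilla voting when nothing verifies is an assumption of this sketch.

```python
from collections import Counter

def majority_vote(answers):
    """Vanilla majority voting: the most frequent final answer wins."""
    return Counter(answers).most_common(1)[0][0]

def dtv_select(problem, samples, formalize, prove):
    """Verified voting in the spirit of DTV (a sketch, not the paper's code).

    samples: list of (solution_text, final_answer) pairs sampled from the LLM.
    formalize(problem, text): returns an Isabelle statement string, or None.
    prove(statement): returns True if the statement is formally proven.
    """
    verified = [ans for text, ans in samples
                if (stmt := formalize(problem, text)) is not None and prove(stmt)]
    if verified:
        # Vote only among answers whose formalizations were proven.
        return majority_vote(verified)
    # Assumed fallback: vanilla majority voting when nothing verifies.
    return majority_vote([ans for _, ans in samples])
```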
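
The evaluation sets themselves are publicly available. Below is a hedged loading example for GSM8K via the Hugging Face `datasets` library; the Hub identifier and field names are assumptions, since the paper does not state how the data was obtained, and the MATH and MultiArith identifiers vary by mirror.

```python
from datasets import load_dataset  # pip install datasets

# The Hub identifier and field names below are assumptions; the paper
# does not specify a loading mechanism for its evaluation sets.
gsm8k = load_dataset("openai/gsm8k", "main", split="test")
example = gsm8k[0]
print(example["question"])  # natural-language word problem
print(example["answer"])    # worked solution ending in "#### <final answer>"
```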
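
The quoted setup (draw and permute 10 of the 25 paired examples per category; T = 0.6, nucleus p = 0.95) can be sketched as follows. All function names and the demonstration formatting are illustrative, not taken from the released code.

```python
import random

# Sampling configuration quoted from the paper (Lewkowycz et al., 2022 defaults).
SAMPLING = {"temperature": 0.6, "top_p": 0.95}

def build_fewshot_prompt(pool, question, k=10, rng=random):
    """Randomly draw and permute k (informal, Isabelle formal) demonstrations.

    pool: the 25 curated (informal statement, formal statement) pairs
    for the relevant dataset category.
    """
    demos = rng.sample(pool, k)  # sample without replacement; order is random
    parts = [f"Informal statement: {inf}\nIsabelle statement: {form}"
             for inf, form in demos]
    parts.append(f"Informal statement: {question}\nIsabelle statement:")
    return "\n\n".join(parts)
```

Note that `random.sample` already returns its selection in random order, so a single call covers both the draw and the permutation described in the quote.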