Solving Quantitative Reasoning Problems with Language Models

Authors: Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra

NeurIPS 2022

Reproducibility variables, with the assessed result and the LLM response supporting each assessment:

Research Type: Experimental
  "We introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them."

Researcher Affiliation: Industry
  Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra (Google Research)

Pseudocode: No
  The paper does not contain structured pseudocode or algorithm blocks.

Open Source Code: No
  The code and models are proprietary.

Open Datasets: No
  "Our models were trained on a dataset of 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server." The mathematical webpages dataset is proprietary; the authors state they plan to release it as part of an open-source dataset to be detailed in an upcoming manuscript.

Dataset Splits: No
  The paper uses existing datasets such as MATH, GSM8k, and MMLU-STEM for few-shot evaluation, but it does not provide train/validation/test split percentages or sample counts for reproduction.

Hardware Specification: Yes
  Model training and inference were performed on Google Tensor Processing Units (TPU), v3 and v4.

Software Dependencies: No
  The paper mentions using the SymPy library for correctness evaluation but does not provide version numbers for any software dependencies required to replicate the experiments.

Experiment Setup: Yes
  Table 2 (model architecture and continued training hyperparameters):

  Model         Layers  Heads  d_model  Parameters  Steps  Tokens
  Minerva 8B    32      16     4096     8.63B       624k   164B
  Minerva 62B   64      32     8192     62.50B      416k   109B
  Minerva 540B  118     48     18432    540.35B     399k   26B

  For evaluation, inputs are truncated from the left to 1024 tokens and the model generates up to 512 tokens. When sampling once per problem, sampling is greedy; when sampling multiple times per problem, nucleus sampling [42] is used with temperature T = 0.6 and top-p = 0.95.
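The decoding setup described in the Experiment Setup row (greedy decoding for a single sample; nucleus sampling with T = 0.6, p = 0.95 for multiple samples) can be sketched in pure Python. This is an illustrative implementation only, not the paper's code: the function name `nucleus_sample` and the raw-logit input are assumptions for the sketch.

```python
import math
import random

def nucleus_sample(logits, temperature=0.6, top_p=0.95, rng=None):
    """Sample one token index using temperature + nucleus (top-p) filtering.

    Sketch of the decoding described above: (1) scale logits by 1/T and
    softmax, (2) keep the smallest set of highest-probability tokens whose
    cumulative mass reaches top_p, (3) renormalize and sample from that set.
    """
    rng = rng or random.Random()
    # Temperature-scaled softmax (shift by the max for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Sort token indices by probability and keep the top-p nucleus.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize within the nucleus and draw a single token.
    total = sum(probs[i] for i in nucleus)
    r = rng.random() * total
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

Note that at T = 1 and top_p = 1 this reduces to plain ancestral sampling; the paper's greedy single-sample setting corresponds instead to taking the argmax of the logits directly.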
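The SymPy-based correctness evaluation mentioned in the Software Dependencies row can be approximated with a small symbolic-equivalence check. This is a minimal sketch, assuming candidate and target answers parse as SymPy expressions; the paper does not specify its exact normalization pipeline, and the helper name `answers_equivalent` is hypothetical.

```python
import sympy
from sympy.parsing.sympy_parser import parse_expr

def answers_equivalent(candidate: str, target: str) -> bool:
    """Return True if two answer strings are symbolically equivalent.

    Parses both strings with SymPy and tests whether their difference
    simplifies to zero, so e.g. "1/2" matches "0.5" and factored forms
    match expanded ones. Falls back to exact string comparison when a
    string cannot be parsed.
    """
    try:
        diff = sympy.simplify(parse_expr(candidate) - parse_expr(target))
        return bool(diff == 0)
    except (sympy.SympifyError, SyntaxError, TypeError):
        # Unparseable answer: fall back to a literal comparison.
        return candidate.strip() == target.strip()
```

Checking `simplify(candidate - target) == 0` rather than string equality is what makes this kind of grading robust to superficially different but mathematically equal final answers.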