Solving Quantitative Reasoning Problems with Language Models
Authors: Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them. |
| Researcher Affiliation | Industry | Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra (Google Research) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The code and models are proprietary. |
| Open Datasets | No | Our models were trained on a dataset of 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server. The mathematical webpages dataset is proprietary. We plan to release these as part of an open-source dataset which will be detailed in an upcoming manuscript. |
| Dataset Splits | No | The paper mentions using existing datasets such as MATH, GSM8k, and MMLU-STEM for few-shot evaluation, but it does not provide the train/validation/test split percentages or sample counts needed for reproduction. |
| Hardware Specification | Yes | Model training and inference were performed on Google Tensor Processing Units (TPU) v3 and v4. |
| Software Dependencies | No | The paper mentions using the SymPy library for correctness evaluation but does not provide specific version numbers for any software dependencies required to replicate the experiment (see the equivalence-check sketch below the table). |
| Experiment Setup | Yes | Table 2 (model architecture and continued-training hyperparameters): Minerva 8B has 32 layers, 16 heads, d_model = 4096, 8.63B parameters, 624k steps, 164B tokens; Minerva 62B has 64 layers, 32 heads, d_model = 8192, 62.50B parameters, 416k steps, 109B tokens; Minerva 540B has 118 layers, 48 heads, d_model = 18432, 540.35B parameters, 399k steps, 26B tokens. For evaluation, inputs are truncated from the left to 1024 tokens and the model generates up to 512 tokens. When sampling once per problem, sampling is greedy; when sampling multiple times per problem, nucleus sampling [42] is used with temperature T = 0.6 and p = 0.95 (see the decoding sketch below). |
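The Software Dependencies row notes that SymPy was used for correctness evaluation, since a generated answer can be mathematically equal to the target while being written differently. The paper does not pin a SymPy version or describe its exact normalization pipeline, so the following is a minimal sketch only; the function name, parsing choices, and string-match fallback are assumptions, not the authors' implementation.

```python
import sympy
from sympy.parsing.sympy_parser import parse_expr


def answers_equivalent(predicted: str, target: str) -> bool:
    """Return True when two answer strings simplify to the same expression.

    Sketch under assumed behavior: the paper reports using SymPy to score
    answers for mathematical equivalence, but its exact parsing and
    normalization steps are not specified.
    """
    try:
        # If the difference of the two expressions simplifies to zero,
        # the answers are mathematically equivalent.
        diff = sympy.simplify(parse_expr(predicted) - parse_expr(target))
        return diff == 0
    except (sympy.SympifyError, SyntaxError, TypeError):
        # Fall back to exact string comparison when parsing fails.
        return predicted.strip() == target.strip()


# Example: two different surface forms of the same answer.
assert answers_equivalent("2*x + 2", "2*(x + 1)")
```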
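The decoding settings in the Experiment Setup row (left-truncation to 1024 tokens, up to 512 generated tokens, greedy decoding for single samples, nucleus sampling with T = 0.6 and p = 0.95 for multiple samples) map directly onto standard generation parameters. Below is a minimal sketch using the Hugging Face transformers API; the checkpoint name is a placeholder (Minerva's weights are not public) and the sample count is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: Minerva itself was not released.
model_name = "some-org/causal-lm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Truncate the prompt from the left to 1024 tokens, as in the paper.
tokenizer.truncation_side = "left"
inputs = tokenizer("Problem: ...\nSolution:", return_tensors="pt",
                   truncation=True, max_length=1024)

# Single sample per problem: greedy decoding, up to 512 new tokens.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=512)

# Multiple samples per problem: nucleus sampling with T = 0.6, p = 0.95.
# The sample count here (16) is illustrative; the paper reports majority
# voting over as many as 256 samples per problem.
samples = model.generate(**inputs, do_sample=True, temperature=0.6,
                         top_p=0.95, max_new_tokens=512,
                         num_return_sequences=16)
```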