Solving Quantitative Reasoning Problems with Language Models

Authors: Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra

NeurIPS 2022

Reproducibility variables, with the assessed result and the LLM response supporting each assessment:

Research Type: Experimental
  "We introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them."

Researcher Affiliation: Industry
  Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra (Google Research)

Pseudocode: No
  The paper does not contain structured pseudocode or algorithm blocks.

Open Source Code: No
  The code and models are proprietary.

Open Datasets: No
  "Our models were trained on a dataset of 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server." The mathematical webpages dataset is proprietary; the authors state they plan to release it as part of an open-source dataset to be detailed in an upcoming manuscript.

Dataset Splits: No
  The paper uses existing datasets such as MATH, GSM8k, and MMLU-STEM for few-shot evaluation, but it does not provide train/validation/test split percentages or sample counts for reproduction.

Hardware Specification: Yes
  Model training and inference were performed on Google Tensor Processing Units (TPU), v3 and v4.

Software Dependencies: No
  The paper mentions using the SymPy library for correctness evaluation but does not provide version numbers for any software dependencies required to replicate the experiments.

Experiment Setup: Yes
  Table 2 (model architecture and continued training hyperparameters):

  Model         Layers  Heads  d_model  Parameters  Steps  Tokens
  Minerva 8B    32      16     4096     8.63B       624k   164B
  Minerva 62B   64      32     8192     62.50B      416k   109B
  Minerva 540B  118     48     18432    540.35B     399k   26B

  For evaluation, inputs are truncated from the left to 1024 tokens and the model generates up to 512 tokens. When sampling once per problem, sampling is greedy; when sampling multiple times per problem, nucleus sampling [42] is used with temperature T = 0.6 and top-p = 0.95.
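The decoding setup described in the Experiment Setup row (greedy decoding for a single sample; nucleus sampling with T = 0.6, p = 0.95 for multiple samples) can be sketched in pure Python. This is an illustrative implementation only, not the paper's code: the function name `nucleus_sample` and the raw-logit input are assumptions for the sketch.

```python
import math
import random

def nucleus_sample(logits, temperature=0.6, top_p=0.95, rng=None):
    """Sample one token index using temperature + nucleus (top-p) filtering.

    Sketch of the decoding described above: (1) scale logits by 1/T and
    softmax, (2) keep the smallest set of highest-probability tokens whose
    cumulative mass reaches top_p, (3) renormalize and sample from that set.
    """
    rng = rng or random.Random()
    # Temperature-scaled softmax (shift by the max for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Sort token indices by probability and keep the top-p nucleus.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize within the nucleus and draw a single token.
    total = sum(probs[i] for i in nucleus)
    r = rng.random() * total
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

Note that at T = 1 and top_p = 1 this reduces to plain ancestral sampling; the paper's greedy single-sample setting corresponds instead to taking the argmax of the logits directly.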
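The SymPy-based correctness evaluation mentioned in the Software Dependencies row can be approximated with a small symbolic-equivalence check. This is a minimal sketch, assuming candidate and target answers parse as SymPy expressions; the paper does not specify its exact normalization pipeline, and the helper name `answers_equivalent` is hypothetical.

```python
import sympy
from sympy.parsing.sympy_parser import parse_expr

def answers_equivalent(candidate: str, target: str) -> bool:
    """Return True if two answer strings are symbolically equivalent.

    Parses both strings with SymPy and tests whether their difference
    simplifies to zero, so e.g. "1/2" matches "0.5" and factored forms
    match expanded ones. Falls back to exact string comparison when a
    string cannot be parsed.
    """
    try:
        diff = sympy.simplify(parse_expr(candidate) - parse_expr(target))
        return bool(diff == 0)
    except (sympy.SympifyError, SyntaxError, TypeError):
        # Unparseable answer: fall back to a literal comparison.
        return candidate.strip() == target.strip()
```

Checking `simplify(candidate - target) == 0` rather than string equality is what makes this kind of grading robust to superficially different but mathematically equal final answers.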