Analysing Mathematical Reasoning Abilities of Neural Models

Authors: David Saxton, Edward Grefenstette, Felix Hill, Pushmeet Kohli

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and find notable differences in their ability to resolve mathematical problems and generalize their knowledge."
Researcher Affiliation | Industry | David Saxton (DeepMind, saxton@google.com), Edward Grefenstette (DeepMind, egrefen@fb.com), Felix Hill (DeepMind, felixhill@google.com), Pushmeet Kohli (DeepMind, pushmeet@google.com)
Pseudocode | No | The paper describes the models examined and their architectures but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | "We release a sequence-to-sequence dataset consisting of many different types of mathematics questions (see Figure 1) for measuring mathematical reasoning, with the provision of both generation code and pre-generated questions." (Footnote: "Dataset will be available at https://github.com/deepmind/mathematics_dataset")
Open Datasets | Yes | "Dataset and generalization tests: We release a sequence-to-sequence dataset consisting of many different types of mathematics questions (see Figure 1) for measuring mathematical reasoning, with the provision of both generation code and pre-generated questions." (Footnote: "Dataset will be available at https://github.com/deepmind/mathematics_dataset"; see the loading sketch after this table.)
Dataset Splits | No | The paper states "Per module, we generate 2 × 10^6 train questions, and 10^5 test (interpolation) questions" and mentions validation performance, but does not specify the size or methodology of a validation split.
Hardware Specification | Yes | "We use a batch size of 1024 split across 8 NVIDIA P100 GPUs for 500k batches"
Software Dependencies | No | The paper mentions tools such as Python/SymPy and the Adam optimizer but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | "We minimize the sum of log probabilities of the correct character via the Adam optimizer (Kingma & Ba, 2014) with learning rate of 6 × 10^-4, β1 = 0.9, β2 = 0.995, ε = 10^-9. We use a batch size of 1024 split across 8 NVIDIA P100 GPUs for 500k batches, with absolute gradient value clipping of 0.1." (See the optimizer sketch after this table.)
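
For concreteness, here is a minimal loading sketch for the pre-generated questions referenced under Open Source Code and Open Datasets. It assumes the released text files alternate one question line with one answer line; the helper name and file name are illustrative and not part of the paper.

    from pathlib import Path

    def load_module(path):
        """Pair question and answer lines from a pre-generated module file.

        Assumes the released files alternate one question line with one
        answer line; this is an assumption about the published format,
        not something stated in the paper itself.
        """
        lines = Path(path).read_text().splitlines()
        return list(zip(lines[0::2], lines[1::2]))

    # Illustrative usage; the file name is hypothetical.
    # pairs = load_module("train-easy/arithmetic__add_or_sub.txt")
    # question, answer = pairs[0]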
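
The Experiment Setup row can likewise be written out as code. The sketch below uses PyTorch purely for illustration (the paper does not name its framework) with a toy stand-in model; only the quoted hyperparameters (Adam with learning rate 6 × 10^-4, β1 = 0.9, β2 = 0.995, ε = 10^-9, batch size 1024, absolute gradient value clipping of 0.1) come from the paper.

    import torch
    import torch.nn as nn

    # Toy stand-in for one of the paper's sequence-to-sequence models;
    # the real models (attentional LSTM, Transformer) are not reproduced here.
    model = nn.Linear(16, 4)
    data = torch.randn(1024, 16)            # batch size 1024, as in the paper
    targets = torch.randint(0, 4, (1024,))

    # Hyperparameters quoted in the paper: Adam with lr = 6e-4,
    # beta1 = 0.9, beta2 = 0.995, eps = 1e-9.
    optimizer = torch.optim.Adam(model.parameters(), lr=6e-4,
                                 betas=(0.9, 0.995), eps=1e-9)

    optimizer.zero_grad()
    # The paper minimizes the sum of log probabilities of the correct
    # character; summed cross-entropy plays that role in this toy example.
    loss = nn.functional.cross_entropy(model(data), targets, reduction="sum")
    loss.backward()
    # Absolute gradient value clipping of 0.1, as described in the paper.
    torch.nn.utils.clip_grad_value_(model.parameters(), 0.1)
    optimizer.step()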