Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Analysing Mathematical Reasoning Abilities of Neural Models
Authors: David Saxton, Edward Grefenstette, Felix Hill, Pushmeet Kohli
ICLR 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we conduct a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and find notable differences in their ability to resolve mathematical problems and generalize their knowledge. |
| Researcher Affiliation | Industry | David Saxton, DeepMind, EMAIL; Edward Grefenstette, DeepMind, EMAIL; Felix Hill, DeepMind, EMAIL; Pushmeet Kohli, DeepMind, EMAIL |
| Pseudocode | No | The paper describes the models examined and their architectures but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release a sequence-to-sequence dataset consisting of many different types of mathematics questions (see Figure 1) for measuring mathematical reasoning, with the provision of both generation code and pre-generated questions. (Footnote 1: Dataset will be available at https://github.com/deepmind/mathematics_dataset) |
| Open Datasets | Yes | Dataset and generalization tests: We release a sequence-to-sequence dataset consisting of many different types of mathematics questions (see Figure 1) for measuring mathematical reasoning, with the provision of both generation code and pre-generated questions. (Footnote 1: Dataset will be available at https://github.com/deepmind/mathematics_dataset) |
| Dataset Splits | No | The paper states 'Per module, we generate 2 × 10^6 train questions, and 10^5 test (interpolation) questions.' and mentions 'validation performance' but does not specify the size or methodology of a validation split. |
| Hardware Specification | Yes | We use a batch size of 1024 split across 8 NVIDIA P100 GPUs for 500k batches |
| Software Dependencies | No | The paper mentions tools like Python/SymPy and the Adam optimizer but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | We minimize the sum of log probabilities of the correct character via the Adam optimizer (Kingma & Ba, 2014) with learning rate of 6 × 10^-4, β1 = 0.9, β2 = 0.995, ε = 10^-9. We use a batch size of 1024 split across 8 NVIDIA P100 GPUs for 500k batches, with absolute gradient value clipping of 0.1. |
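
The hyperparameters quoted in the Hardware Specification and Experiment Setup rows can be collected into a small configuration, shown here as a minimal sketch in plain Python. This is not the authors' training code: the names `CONFIG`, `per_device_batch_size`, `clip_by_value`, and `adam_step` are hypothetical, the even split of the batch across GPUs is an assumption, and the update rule is the textbook Adam step from Kingma & Ba with element-wise gradient clipping applied first.

```python
import math

# Values quoted in the table above; the key names are illustrative only.
CONFIG = {
    "learning_rate": 6e-4,
    "beta1": 0.9,
    "beta2": 0.995,
    "epsilon": 1e-9,
    "batch_size": 1024,
    "num_devices": 8,        # 8 NVIDIA P100 GPUs
    "num_batches": 500_000,  # "500k batches"
    "grad_clip_value": 0.1,  # absolute gradient value clipping
}


def per_device_batch_size(cfg):
    """Examples per GPU, assuming the global batch is split evenly."""
    return cfg["batch_size"] // cfg["num_devices"]


def clip_by_value(grads, limit):
    """Clamp every gradient component to the interval [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


def adam_step(params, grads, m, v, step, cfg):
    """One textbook Adam update, applied after value clipping.

    `step` is the 1-based update count; returns new (params, m, v) lists.
    """
    grads = clip_by_value(grads, cfg["grad_clip_value"])
    b1, b2 = cfg["beta1"], cfg["beta2"]
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = b1 * m_i + (1 - b1) * g          # first-moment estimate
        v_i = b2 * v_i + (1 - b2) * g * g      # second-moment estimate
        m_hat = m_i / (1 - b1 ** step)         # bias correction
        v_hat = v_i / (1 - b2 ** step)
        p = p - cfg["learning_rate"] * m_hat / (math.sqrt(v_hat) + cfg["epsilon"])
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

Under the even-split assumption, each GPU would process 128 examples per batch. Note that value clipping bounds each gradient component independently, which differs from norm clipping, where the whole gradient vector is rescaled.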