Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Analysing Mathematical Reasoning Abilities of Neural Models
Authors: David Saxton, Edward Grefenstette, Felix Hill, Pushmeet Kohli
ICLR 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we conduct a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and find notable differences in their ability to resolve mathematical problems and generalize their knowledge. |
| Researcher Affiliation | Industry | David Saxton (DeepMind), Edward Grefenstette (DeepMind), Felix Hill (DeepMind), Pushmeet Kohli (DeepMind) |
| Pseudocode | No | The paper describes the models examined and their architectures but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release a sequence-to-sequence dataset consisting of many different types of mathematics questions (see Figure 1) for measuring mathematical reasoning, with the provision of both generation code and pre-generated questions. (Footnote 1: Dataset will be available at https://github.com/deepmind/mathematics_dataset) |
| Open Datasets | Yes | Dataset and generalization tests: We release a sequence-to-sequence dataset consisting of many different types of mathematics questions (see Figure 1) for measuring mathematical reasoning, with the provision of both generation code and pre-generated questions. (Footnote 1: Dataset will be available at https://github.com/deepmind/mathematics_dataset; see the loading sketch after the table) |
| Dataset Splits | No | The paper states 'Per module, we generate 2 × 10^6 train questions, and 10^5 test (interpolation) questions.' and mentions 'validation performance' but does not specify the size or methodology of a validation split (one possible split is sketched after the table). |
| Hardware Specification | Yes | We use a batch size of 1024 split across 8 NVIDIA P100 GPUs for 500k batches |
| Software Dependencies | No | The paper mentions tools like Python/SymPy and the Adam optimizer but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | We minimize the sum of log probabilities of the correct character via the Adam optimizer (Kingma & Ba, 2014) with learning rate of 6 × 10^-4, β1 = 0.9, β2 = 0.995, ε = 10^-9. We use a batch size of 1024 split across 8 NVIDIA P100 GPUs for 500k batches, with absolute gradient value clipping of 0.1. (A configuration sketch follows the table.) |
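For context on the Open Datasets row: the pre-generated release ships as plain-text files in which questions and answers alternate line by line. Below is a minimal loading sketch; the directory layout and module file name are illustrative assumptions based on the public v1.0 release, not details taken from the paper.

```python
from pathlib import Path

def load_module(path):
    """Parse one pre-generated module file.

    Questions and answers alternate line by line: line 0 is a
    question, line 1 its answer, and so on.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return list(zip(lines[0::2], lines[1::2]))

# Illustrative path: the v1.0 release groups modules into directories
# such as train-easy/, train-medium/, train-hard/, interpolate/ and
# extrapolate/ (an assumption about the release, not stated in the paper).
pairs = load_module("mathematics_dataset-v1.0/interpolate/arithmetic__add_or_sub.txt")
question, answer = pairs[0]
```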
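On the Dataset Splits row: the paper reports 2 × 10^6 train and 10^5 test questions per module but never specifies how its validation set was built. The sketch below shows one way a reproducer might carve a validation split out of the train questions; it is a workaround assumption, not the authors' procedure.

```python
import random

def train_val_split(pairs, val_size=100_000, seed=0):
    """Hold out `val_size` question-answer pairs for validation.

    A reproducibility workaround, not the authors' method: the paper
    mentions validation performance without describing how the
    validation set was constructed.
    """
    rng = random.Random(seed)
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    val_idx = set(indices[:val_size])
    train = [p for i, p in enumerate(pairs) if i not in val_idx]
    val = [p for i, p in enumerate(pairs) if i in val_idx]
    return train, val
```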
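The Experiment Setup and Hardware rows together pin down the optimisation recipe: Adam with learning rate 6 × 10^-4, β1 = 0.9, β2 = 0.995, ε = 10^-9, a global batch of 1024 across 8 P100s (i.e. 128 examples per GPU), 500k batches, and absolute gradient value clipping at 0.1. The paper does not name a framework, so the following is a minimal PyTorch sketch with a placeholder model, not the authors' implementation.

```python
import torch

# Placeholder model; the paper evaluates LSTM and Transformer
# architectures, but this module is only a stand-in for the sketch.
model = torch.nn.LSTM(input_size=512, hidden_size=2048)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=6e-4,             # learning rate 6 * 10^-4
    betas=(0.9, 0.995),  # beta1, beta2 as reported
    eps=1e-9,
)

GLOBAL_BATCH = 1024      # split across 8 GPUs -> 128 examples per GPU
TOTAL_BATCHES = 500_000

def training_step(loss: torch.Tensor) -> None:
    """One optimisation step with absolute gradient *value* clipping
    at 0.1 (value clipping, not norm clipping, per the quoted setup)."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_value_(model.parameters(), 0.1)
    optimizer.step()
```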