Analysing Mathematical Reasoning Abilities of Neural Models
Authors: David Saxton, Edward Grefenstette, Felix Hill, Pushmeet Kohli
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we conduct a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and find notable differences in their ability to resolve mathematical problems and generalize their knowledge. |
| Researcher Affiliation | Industry | David Saxton (DeepMind) saxton@google.com; Edward Grefenstette (DeepMind) egrefen@fb.com; Felix Hill (DeepMind) felixhill@google.com; Pushmeet Kohli (DeepMind) pushmeet@google.com |
| Pseudocode | No | The paper describes the models examined and their architectures but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release a sequence-to-sequence dataset consisting of many different types of mathematics questions (see Figure 1) for measuring mathematical reasoning, with the provision of both generation code and pre-generated questions. (Footnote 1: 'Dataset will be available at https://github.com/deepmind/mathematics_dataset'; see the loading sketch after the table.) |
| Open Datasets | Yes | From the 'Dataset and generalization tests' section: We release a sequence-to-sequence dataset consisting of many different types of mathematics questions (see Figure 1) for measuring mathematical reasoning, with the provision of both generation code and pre-generated questions. (Footnote 1: 'Dataset will be available at https://github.com/deepmind/mathematics_dataset') |
| Dataset Splits | No | The paper states 'Per module, we generate 2 × 10^6 train questions, and 10^5 test (interpolation) questions.' and mentions 'validation performance' but does not specify the size or methodology of a validation split. (A hedged split sketch follows the table.) |
| Hardware Specification | Yes | We use a batch size of 1024 split across 8 NVIDIA P100 GPUs for 500k batches |
| Software Dependencies | No | The paper mentions tools like Python/SymPy and the Adam optimizer but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | We minimize the sum of log probabilities of the correct character via the Adam optimizer (Kingma & Ba, 2014) with learning rate of 6 × 10^-4, β1 = 0.9, β2 = 0.995, ϵ = 10^-9. We use a batch size of 1024 split across 8 NVIDIA P100 GPUs for 500k batches, with absolute gradient value clipping of 0.1. (An optimizer sketch follows the table.) |
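The 'Open Source Code' and 'Open Datasets' rows point at the released generation code and pre-generated questions. A minimal loading sketch is below; the alternating question/answer line layout and the `train-easy/algebra__linear_1d.txt` path are assumptions about the released files, not details stated in the table.

```python
# Hedged sketch: reading one module's pre-generated file from the
# deepmind/mathematics_dataset release. The alternating question/answer
# line layout is an assumption about the released .txt files.
from pathlib import Path


def load_module(path):
    """Yield (question, answer) pairs from one module's .txt file."""
    lines = Path(path).read_text().splitlines()
    # Assumed layout: questions on odd lines, answers on even lines.
    for question, answer in zip(lines[0::2], lines[1::2]):
        yield question, answer


if __name__ == "__main__":
    # Hypothetical path; module filenames follow the repo's naming scheme.
    pairs = list(load_module("train-easy/algebra__linear_1d.txt"))
    print(pairs[0])
```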
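The 'Dataset Splits' row notes that the validation methodology is unspecified. A reproducer would have to carve a validation set out of the 2 × 10^6 training questions per module themselves; the 1% fraction below is an illustrative assumption, not the authors' procedure.

```python
import random


def split_train_valid(pairs, valid_fraction=0.01, seed=0):
    """Hold out a validation set from the training pairs.

    The paper gives 2e6 train / 1e5 test questions per module but does
    not specify a validation split; valid_fraction is an assumed value.
    """
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]
```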
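The 'Experiment Setup' and 'Hardware Specification' rows pin down the optimizer hyperparameters and the batch layout. The sketch below renders that configuration in PyTorch; the framework choice and the placeholder model are assumptions (the paper does not name a framework), while the numbers come straight from the quoted text.

```python
import torch
import torch.nn as nn

# Placeholder model; the paper's actual architectures are LSTM- and
# Transformer-based sequence-to-sequence models.
model = nn.Linear(512, 512)

# Hyperparameters quoted in the Experiment Setup row.
optimizer = torch.optim.Adam(
    model.parameters(), lr=6e-4, betas=(0.9, 0.995), eps=1e-9)


def training_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # Absolute gradient *value* clipping of 0.1 (not norm clipping).
    torch.nn.utils.clip_grad_value_(model.parameters(), 0.1)
    optimizer.step()

# Batch size 1024 split across 8 P100 GPUs works out to 128 examples
# per GPU per step, for 500k batches in total.
```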