Mathematical Reasoning via Self-supervised Skip-tree Training
Authors: Markus Norman Rabe, Dennis Lee, Kshitij Bansal, Christian Szegedy
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that self-supervised language modeling applied to mathematical formulas enables logical reasoning. To measure the logical reasoning abilities of language models, we formulate several evaluation (downstream) tasks, such as inferring types, suggesting missing assumptions, and completing equalities. For training language models for formal mathematics, we propose a novel skip-tree task. We find that models trained on the skip-tree task show surprisingly strong mathematical reasoning abilities, and outperform models trained on standard skip-sequence tasks. We trained a Transformer architecture on the skip-tree dataset and each of the ablations for up to 1M steps (=parameter updates) with a batch size of 256. This means our models are trained for about 10 epochs (depending on dataset size). Our models have 39M trainable parameters; the hyperparameters are specified in the appendix. We trained them on an 8x8 TPU configuration, which equates to 128 cores. The training runs took between 5 and 15 hours, depending on the average length of the output sequences, which translates to between 1.4 and 4.2 Peta FLOPs days per training run. |
| Researcher Affiliation | Industry | Markus N. Rabe Google Research mrabe@google.com Dennis Lee Google Research ldennis@google.com Kshitij Bansal Google Research kbk@google.com Christian Szegedy Google Research szegedy@google.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not include an explicit statement about releasing the source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We start with the HOList dataset (Bansal et al., 2019), which spans a wide range of mathematical topics, including topology, multivariate calculus, real and complex analysis, geometric algebra, and measure theory, formalized in the HOL Light proof assistant (Harrison, 1996). We work with the HOList dataset by Bansal et al. (2019). |
| Dataset Splits | Yes | We use the same split as used in HOList (Bansal et al., 2019). They first split the theorems into training, validation, and test, and then assign all statements in the entire proof of each theorem to the same split as the theorem. This avoids partially revealing the proofs of theorems in the validation and test sets during training. We derive all training data from the theorems and proofs in the training set, and use only the theorems (not the proofs) for the evaluation tasks. |
| Hardware Specification | Yes | We trained them on an 8x8 TPU configuration, which equates to 128 cores. The training runs took between 5 and 15 hours, depending on the average length of the output sequences, which translates to between 1.4 and 4.2 Peta FLOPs days per training run. |
| Software Dependencies | No | The paper mentions using a "Transformer architecture" and the "DeepHOL theorem prover" but does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | We trained a Transformer architecture on the skip-tree dataset and each of the ablations for up to 1M steps (=parameter updates) with a batch size of 256. Our models have 39M trainable parameters; the hyperparameters are specified in the appendix. Appendix A lists the following hyperparameters: vocabulary size: 1200, embedding size: 128, attention dropout: 0.1, nonlinearity: gelu, hidden layer dropout: 0.1, hidden layer size: 512, initializer range: 0.02, intermediate size: 2048, number of attention heads: 8, number of hidden layers in encoder: 8, number of hidden layers in decoder: 4, learning rate: 1e-4. |
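
To make the skip-tree objective quoted in the Research Type row concrete: a formula is represented as a tree (an S-expression in HOList), one subtree is cut out and replaced by a placeholder, and the model is trained to generate the missing subtree. The paper releases no code, so the snippet below is only a minimal sketch under assumptions made here: the S-expression helpers, the `<PREDICT>` placeholder name, and the restriction to internal subtrees are illustrative, not the authors' data pipeline.

```python
import random

# Illustrative sketch of skip-tree example construction (assumptions noted above).

def tokenize(sexp_string):
    """Split an S-expression string into parenthesis and symbol tokens."""
    return sexp_string.replace("(", " ( ").replace(")", " ) ").split()

def parse_sexp(tokens):
    """Parse a token list into a nested-list tree (consumes the list)."""
    tok = tokens.pop(0)
    if tok == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse_sexp(tokens))
        tokens.pop(0)  # drop the closing ")"
        return node
    return tok

def internal_subtrees(tree, path=()):
    """Yield (path, subtree) for every internal node of the tree."""
    if isinstance(tree, list):
        yield path, tree
        for i, child in enumerate(tree):
            yield from internal_subtrees(child, path + (i,))

def to_string(tree):
    """Render a nested-list tree back into an S-expression string."""
    if isinstance(tree, list):
        return "(" + " ".join(to_string(child) for child in tree) + ")"
    return tree

def skip_tree_example(sexp_string, rng=random):
    """Cut out one random subtree: the input keeps a placeholder token,
    the training target is the subtree that was removed."""
    tree = parse_sexp(tokenize(sexp_string))
    candidates = list(internal_subtrees(tree))[1:]  # exclude the root itself
    path, subtree = rng.choice(candidates)
    target = to_string(subtree)
    parent = tree
    for i in path[:-1]:
        parent = parent[i]
    parent[path[-1]] = "<PREDICT>"
    return to_string(tree), target

# Toy stand-in for a HOList term, roughly "x + 0 = x":
source, target = skip_tree_example("(= (+ x (0)) x)")
print(source, "->", target)  # e.g. "(= <PREDICT> x) -> (+ x (0))"
```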
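
The Dataset Splits row describes a theorem-level split: theorems are first divided into training, validation, and test, and every statement appearing in a theorem's proof inherits that theorem's split, so training never sees any part of a validation or test proof. A minimal sketch of that policy follows; the deterministic hash-based assignment and the 80/10/10 ratios are illustrative choices, not the actual HOList split.

```python
import hashlib

# Sketch of a theorem-level split policy (illustrative; HOList ships its own split).

def split_for(theorem_name, ratios=(0.8, 0.1, 0.1)):
    """Deterministically map a theorem name to train/valid/test."""
    bucket = int(hashlib.sha256(theorem_name.encode()).hexdigest(), 16) % 1000 / 1000.0
    if bucket < ratios[0]:
        return "train"
    if bucket < ratios[0] + ratios[1]:
        return "valid"
    return "test"

def split_statements(proofs):
    """`proofs` maps each theorem name to the statements in its proof.
    Every proof statement is assigned to the same split as its theorem,
    so validation/test proofs are never partially revealed during training."""
    splits = {"train": [], "valid": [], "test": []}
    for theorem, proof_statements in proofs.items():
        split = split_for(theorem)
        splits[split].append(theorem)
        splits[split].extend(proof_statements)
    return splits
```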
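
For reference, the hyperparameters quoted in the Experiment Setup row are collected in one place below. The dictionaries only restate the reported values; the key names and the model/training grouping are choices made here, and the paper does not specify the surrounding framework or training loop.

```python
# Reported hyperparameters gathered into a single config (values from the paper's
# appendix as quoted above; key names are illustrative).

MODEL_CONFIG = {
    "vocab_size": 1200,
    "embedding_size": 128,
    "hidden_size": 512,
    "intermediate_size": 2048,
    "num_attention_heads": 8,
    "num_encoder_layers": 8,
    "num_decoder_layers": 4,
    "hidden_act": "gelu",
    "attention_dropout": 0.1,
    "hidden_dropout": 0.1,
    "initializer_range": 0.02,
}

TRAINING_CONFIG = {
    "learning_rate": 1e-4,
    "batch_size": 256,
    "max_steps": 1_000_000,  # "up to 1M steps (=parameter updates)"
}
```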