Mathematical Reasoning via Self-supervised Skip-tree Training

Authors: Markus Norman Rabe, Dennis Lee, Kshitij Bansal, Christian Szegedy

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that self-supervised language modeling applied to mathematical formulas enables logical reasoning. To measure the logical reasoning abilities of language models, we formulate several evaluation (downstream) tasks, such as inferring types, suggesting missing assumptions, and completing equalities. For training language models for formal mathematics, we propose a novel skip-tree task. We find that models trained on the skip-tree task show surprisingly strong mathematical reasoning abilities, and outperform models trained on standard skip-sequence tasks. We trained a Transformer architecture on the skip-tree dataset and each of the ablations for up to 1M steps (= parameter updates) with a batch size of 256. This means our models are trained for about 10 epochs (depending on dataset size). Our models have 39M trainable parameters; the hyperparameters are specified in the appendix. We trained them on an 8x8 TPU configuration, which equates to 128 cores. The training runs took between 5 and 15 hours, depending on the average length of the output sequences, which translates to between 1.4 and 4.2 Peta FLOPs-days per training run. (A sketch of skip-tree example generation follows after the table.)
Researcher Affiliation | Industry | Markus N. Rabe (Google Research, mrabe@google.com), Dennis Lee (Google Research, ldennis@google.com), Kshitij Bansal (Google Research, kbk@google.com), Christian Szegedy (Google Research, szegedy@google.com)
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not include an explicit statement about releasing the source code for the described methodology or a link to a code repository.
Open Datasets | Yes | We start with the HOList dataset (Bansal et al., 2019), which spans a wide range of mathematical topics, including topology, multivariate calculus, real and complex analysis, geometric algebra, and measure theory, formalized in the HOL Light proof assistant (Harrison, 1996). We work with the HOList dataset by Bansal et al. (2019).
Dataset Splits | Yes | We use the same split as used in HOList (Bansal et al., 2019). They first split the theorems into training, validation, and test, and then assign all statements in the entire proof of each theorem to the same split as the theorem. This avoids partially revealing the proofs of theorems in the validation and test sets during training. We derive all training data from the theorems and proofs in the training set, and use only the theorems (not the proofs) for the evaluation tasks. (A sketch of this split discipline follows after the table.)
Hardware Specification | Yes | We trained them on an 8x8 TPU configuration, which equates to 128 cores. The training runs took between 5 and 15 hours, depending on the average length of the output sequences, which translates to between 1.4 and 4.2 Peta FLOPs-days per training run.
Software Dependencies | No | The paper mentions using the "Transformer architecture" and the "DeepHOL theorem prover" but does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | We trained a Transformer architecture on the skip-tree dataset and each of the ablations for up to 1M steps (= parameter updates) with a batch size of 256. Our models have 39M trainable parameters; the hyperparameters are specified in the appendix. Appendix A lists the following hyperparameters: vocabulary size: 1200, embedding size: 128, attention dropout: 0.1, nonlinearity: gelu, hidden layer dropout: 0.1, hidden layer size: 512, initializer range: 0.02, intermediate size: 2048, number of attention heads: 8, number of hidden layers in encoder: 8, number of hidden layers in decoder: 4, learning rate: 1e-4. (These values are collected into a configuration sketch after the table.)
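
To make the skip-tree objective from the Research Type row concrete, here is a minimal sketch of how a training example could be generated: a random subtree of a formula's syntax tree is cut out, replaced by a placeholder token, and the model is asked to predict the removed subtree. The Node class, the <PREDICT> token, and the sampling scheme below are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of skip-tree example generation (illustrative only; the
# Node class, token names, and sampling scheme are assumptions, not the
# paper's exact implementation).
import random
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def to_sexp(node: Node) -> str:
    """Render a syntax tree as an S-expression string."""
    if not node.children:
        return node.label
    return "(" + node.label + " " + " ".join(to_sexp(c) for c in node.children) + ")"


def subtrees(node: Node) -> List[Node]:
    """Collect all subtrees of a formula's syntax tree."""
    result = [node]
    for child in node.children:
        result.extend(subtrees(child))
    return result


def skip_tree_example(root: Node, rng: random.Random) -> Tuple[str, str]:
    """Cut out a random proper subtree; the model must predict it back.

    Returns (input_sequence, target_sequence) as strings."""
    candidates = subtrees(root)[1:]  # exclude the root itself
    target = rng.choice(candidates)
    original_label, original_children = target.label, target.children
    target_text = to_sexp(target)
    # Replace the chosen subtree by a placeholder token in the input.
    target.label, target.children = "<PREDICT>", []
    input_text = to_sexp(root)
    # Restore the tree so it can be reused for further examples.
    target.label, target.children = original_label, original_children
    return input_text, target_text


# Example: the formula (= (+ a b) (+ b a))
formula = Node("=", [Node("+", [Node("a"), Node("b")]),
                     Node("+", [Node("b"), Node("a")])])
print(skip_tree_example(formula, random.Random(0)))
```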
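
The Dataset Splits row describes a split at the level of theorems, with every statement in a theorem's proof inheriting that theorem's split. The sketch below only illustrates that grouping constraint; the actual HOList split is fixed and published, whereas the random assignment and the 90/5/5 weights here are placeholders.

```python
# Illustrative sketch of theorem-level splitting (the real HOList split is
# fixed and published; the random 90/5/5 assignment below is a placeholder).
import random
from typing import Dict, List


def assign_splits(theorem_names: List[str], seed: int = 0) -> Dict[str, str]:
    """Assign each theorem, and hence its entire proof, to one split."""
    rng = random.Random(seed)
    return {name: rng.choices(["train", "valid", "test"],
                              weights=[0.90, 0.05, 0.05])[0]
            for name in theorem_names}


def training_statements(proofs: Dict[str, List[str]],
                        splits: Dict[str, str]) -> List[str]:
    """Derive training data only from proofs of theorems in the train split,
    so validation/test proofs are never partially revealed during training."""
    return [stmt
            for theorem, statements in proofs.items()
            if splits[theorem] == "train"
            for stmt in statements]
```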
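
For convenience, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. Only the values come from the paper's Appendix A; the class and field names are assumptions.

```python
# Hyperparameters reported in Appendix A, gathered into one config object.
# Field names are assumptions; only the values are taken from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class SkipTreeTransformerConfig:
    vocab_size: int = 1200
    embedding_size: int = 128
    hidden_size: int = 512            # hidden layer size
    intermediate_size: int = 2048
    num_attention_heads: int = 8
    num_encoder_layers: int = 8
    num_decoder_layers: int = 4
    attention_dropout: float = 0.1
    hidden_dropout: float = 0.1
    nonlinearity: str = "gelu"
    initializer_range: float = 0.02
    learning_rate: float = 1e-4
    batch_size: int = 256
    max_train_steps: int = 1_000_000  # "up to 1M steps (= parameter updates)"


config = SkipTreeTransformerConfig()
```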