Mathematical Reasoning via Self-supervised Skip-tree Training

Authors: Markus Norman Rabe, Dennis Lee, Kshitij Bansal, Christian Szegedy

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that self-supervised language modeling applied to mathematical formulas enables logical reasoning. To measure the logical reasoning abilities of language models, we formulate several evaluation (downstream) tasks, such as inferring types, suggesting missing assumptions, and completing equalities. For training language models for formal mathematics, we propose a novel skip-tree task. We find that models trained on the skip-tree task show surprisingly strong mathematical reasoning abilities, and outperform models trained on standard skip-sequence tasks. We trained a Transformer architecture on the skip-tree dataset and each of the ablations for up to 1M steps (= parameter updates) with a batch size of 256. This means our models are trained for about 10 epochs (depending on dataset size). Our models have 39M trainable parameters; the hyperparameters are specified in the appendix. We trained them on an 8x8 TPU configuration, which equates to 128 cores. The training runs took between 5 and 15 hours, depending on the average length of the output sequences, which translates to between 1.4 and 4.2 Peta FLOPs-days per training run. (A sketch of skip-tree example generation follows after the table.)
Researcher Affiliation | Industry | Markus N. Rabe (Google Research, mrabe@google.com), Dennis Lee (Google Research, ldennis@google.com), Kshitij Bansal (Google Research, kbk@google.com), Christian Szegedy (Google Research, szegedy@google.com)
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not include an explicit statement about releasing the source code for the described methodology or a link to a code repository.
Open Datasets | Yes | We start with the HOList dataset (Bansal et al., 2019), which spans a wide range of mathematical topics, including topology, multivariate calculus, real and complex analysis, geometric algebra, and measure theory, formalized in the HOL Light proof assistant (Harrison, 1996). We work with the HOList dataset by Bansal et al. (2019).
Dataset Splits | Yes | We use the same split as used in HOList (Bansal et al., 2019). They first split the theorems into training, validation, and test, and then assign all statements in the entire proof of each theorem to the same split as the theorem. This avoids partially revealing the proofs of theorems in the validation and test sets during training. We derive all training data from the theorems and proofs in the training set, and use only the theorems (not the proofs) for the evaluation tasks. (A sketch of this split discipline follows after the table.)
Hardware Specification | Yes | We trained them on an 8x8 TPU configuration, which equates to 128 cores. The training runs took between 5 and 15 hours, depending on the average length of the output sequences, which translates to between 1.4 and 4.2 Peta FLOPs-days per training run.
Software Dependencies | No | The paper mentions using the "Transformer architecture" and the "DeepHOL theorem prover" but does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | We trained a Transformer architecture on the skip-tree dataset and each of the ablations for up to 1M steps (= parameter updates) with a batch size of 256. Our models have 39M trainable parameters; the hyperparameters are specified in the appendix. Appendix A lists the following hyperparameters: vocabulary size: 1200, embedding size: 128, attention dropout: 0.1, nonlinearity: gelu, hidden layer dropout: 0.1, hidden layer size: 512, initializer range: 0.02, intermediate size: 2048, number of attention heads: 8, number of hidden layers in encoder: 8, number of hidden layers in decoder: 4, learning rate: 1e-4. (These values are collected into a configuration sketch after the table.)
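
To make the skip-tree objective from the Research Type row concrete, here is a minimal sketch of how a training example could be generated: a random subtree of a formula's syntax tree is cut out, replaced by a placeholder token, and the model is asked to predict the removed subtree. The Node class, the <PREDICT> token, and the sampling scheme below are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of skip-tree example generation (illustrative only; the
# Node class, token names, and sampling scheme are assumptions, not the
# paper's exact implementation).
import random
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def to_sexp(node: Node) -> str:
    """Render a syntax tree as an S-expression string."""
    if not node.children:
        return node.label
    return "(" + node.label + " " + " ".join(to_sexp(c) for c in node.children) + ")"


def subtrees(node: Node) -> List[Node]:
    """Collect all subtrees of a formula's syntax tree."""
    result = [node]
    for child in node.children:
        result.extend(subtrees(child))
    return result


def skip_tree_example(root: Node, rng: random.Random) -> Tuple[str, str]:
    """Cut out a random proper subtree; the model must predict it back.

    Returns (input_sequence, target_sequence) as strings."""
    candidates = subtrees(root)[1:]  # exclude the root itself
    target = rng.choice(candidates)
    original_label, original_children = target.label, target.children
    target_text = to_sexp(target)
    # Replace the chosen subtree by a placeholder token in the input.
    target.label, target.children = "<PREDICT>", []
    input_text = to_sexp(root)
    # Restore the tree so it can be reused for further examples.
    target.label, target.children = original_label, original_children
    return input_text, target_text


# Example: the formula (= (+ a b) (+ b a))
formula = Node("=", [Node("+", [Node("a"), Node("b")]),
                     Node("+", [Node("b"), Node("a")])])
print(skip_tree_example(formula, random.Random(0)))
```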
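
The Dataset Splits row describes a split at the level of theorems, with every statement in a theorem's proof inheriting that theorem's split. The sketch below only illustrates that grouping constraint; the actual HOList split is fixed and published, whereas the random assignment and the 90/5/5 weights here are placeholders.

```python
# Illustrative sketch of theorem-level splitting (the real HOList split is
# fixed and published; the random 90/5/5 assignment below is a placeholder).
import random
from typing import Dict, List


def assign_splits(theorem_names: List[str], seed: int = 0) -> Dict[str, str]:
    """Assign each theorem, and hence its entire proof, to one split."""
    rng = random.Random(seed)
    return {name: rng.choices(["train", "valid", "test"],
                              weights=[0.90, 0.05, 0.05])[0]
            for name in theorem_names}


def training_statements(proofs: Dict[str, List[str]],
                        splits: Dict[str, str]) -> List[str]:
    """Derive training data only from proofs of theorems in the train split,
    so validation/test proofs are never partially revealed during training."""
    return [stmt
            for theorem, statements in proofs.items()
            if splits[theorem] == "train"
            for stmt in statements]
```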
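
For convenience, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. Only the values come from the paper's Appendix A; the class and field names are assumptions.

```python
# Hyperparameters reported in Appendix A, gathered into one config object.
# Field names are assumptions; only the values are taken from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class SkipTreeTransformerConfig:
    vocab_size: int = 1200
    embedding_size: int = 128
    hidden_size: int = 512            # hidden layer size
    intermediate_size: int = 2048
    num_attention_heads: int = 8
    num_encoder_layers: int = 8
    num_decoder_layers: int = 4
    attention_dropout: float = 0.1
    hidden_dropout: float = 0.1
    nonlinearity: str = "gelu"
    initializer_range: float = 0.02
    learning_rate: float = 1e-4
    batch_size: int = 256
    max_train_steps: int = 1_000_000  # "up to 1M steps (= parameter updates)"


config = SkipTreeTransformerConfig()
```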