LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning

Authors: Yuhuai Wu, Markus N. Rabe, Wenda Li, Jimmy Ba, Roger B. Grosse, Christian Szegedy

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present results on four large mathematical reasoning tasks that are especially useful in the context of automated theorem proving. Our results show significant gains in learning inductive biases from synthetic tasks. We have selected four tasks to cover various different styles of interactive theorem provers.
Researcher Affiliation | Collaboration | Yuhuai Wu¹ ², Markus Rabe³, Wenda Li⁴, Jimmy Ba¹ ², Roger Grosse¹ ², Christian Szegedy³ (*equal contribution); ¹University of Toronto, Toronto, Canada; ²Vector Institute, Toronto, Canada; ³Google Research; ⁴University of Cambridge, Cambridge, UK.
Pseudocode | Yes | The complete pseudo-code for generating the symbols, Rule, Case, and Result for one task example is provided in Appendix Algorithm 1. (A hedged generation sketch follows the table.)
Open Source Code | Yes | The code for generating LIME tasks is available at https://github.com/tonywu95/LIME.
Open Datasets | Yes | We demonstrate that LIME pretrained models provide significant gains across four large mathematical reasoning benchmarks: IsarStep (Li et al., 2021), HOList skip-tree (Rabe et al., 2021), MetaMathStep (Polu & Sutskever, 2020), and LeanStep (de Moura et al., 2015). For IsarStep: 'The dataset was mined from the public repository of formal proofs of the Isabelle proof assistant (Paulson, 1994).' For LeanStep: 'We extracted 498,624 goal, next-lemma pairs from the Lean mathlib library (mathlib, 2020; Han et al., 2021).'
Dataset Splits | Yes | The IsarStep dataset has '820K, 5000, 5000 sequence pairs for the training, validation, and test sets'. For MetaMathStep: 'We split theorems into train/valid/test theorems of size 35K, 1K, 1K'. For LeanStep: 'We then randomly sampled 8k of lemmas from this set and used the corresponding goal lemma pairs for the validation and the tests (each 4k).'
Hardware Specification | Yes | The pretraining was done on a single Nvidia Tesla T4 GPU with 4 CPU cores for 2 hours. For the downstream IsarStep and MetaMathStep tasks, we used four Nvidia Tesla T4 GPUs with 16 CPU cores for training. For the HOList skip-tree task, we used TPUs for running the experiments.
Software Dependencies | No | The paper mentions 'implemented using fairseq (Ott et al., 2019)' but does not specify a version number for fairseq or any other software dependency.
Experiment Setup | Yes | We set the maximum number of tokens in a batch to 4096 and accumulate four batches of gradients for one parameter update. We used the Adam optimizer (Kingma & Ba, 2015) with learning rate 3 × 10⁻⁴. We used a dropout rate of 0.1 and label smoothing (Szegedy et al., 2016) with coefficient 0.1. All experiments used the transformer base model from Vaswani et al. (2017), i.e., 512 hidden size, 2048 filter size, 8 attention heads. For the IsarStep and MetaMathStep tasks, we used 6 layers for both the encoder and decoder. (A hyperparameter sketch follows the table.)
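The Pseudocode row above points to Appendix Algorithm 1 of the paper, which generates the symbols, Rule, Case, and Result for one synthetic task example. Below is a minimal Python sketch of that kind of generation procedure, not the authors' algorithm: the vocabulary split, string lengths, and the symbol pool used for the substitution strings are illustrative placeholders, and the three primitives (Deduct, Abduct, Induct) are indicated only in comments.

```python
import random
import string

def generate_lime_example(vocab_size=44, n_math_symbols=4,
                          rule_len=10, case_len=3, seed=None):
    """Generate one (Rule, Case, Result) triple in the spirit of LIME's
    synthetic primitives. All sizes here are illustrative, not the
    paper's exact hyperparameters."""
    rng = random.Random(seed)

    # Split a small symbol vocabulary into "math" symbols (placeholders
    # that get substituted) and ordinary rule symbols.
    symbols = list(string.ascii_lowercase + string.ascii_uppercase)[:vocab_size]
    rng.shuffle(symbols)
    math_symbols = symbols[:n_math_symbols]
    rule_symbols = symbols[n_math_symbols:]

    # Rule: a random string mixing rule symbols and math-symbol placeholders.
    rule = [rng.choice(math_symbols + rule_symbols) for _ in range(rule_len)]

    # Case: a substitution mapping each math symbol that occurs in the rule
    # to a short random string (drawn here from rule symbols; which pool the
    # paper samples from is an assumption of this sketch).
    case = {s: [rng.choice(rule_symbols) for _ in range(case_len)]
            for s in set(rule) & set(math_symbols)}

    # Result: the rule with every math-symbol placeholder replaced by
    # its substitution from the case.
    result = []
    for s in rule:
        result.extend(case.get(s, [s]))
    return rule, case, result


if __name__ == "__main__":
    rule, case, result = generate_lime_example(seed=0)
    # The three LIME primitives are then framed as seq2seq tasks, e.g.
    #   Deduct: (Rule, Case)   -> Result
    #   Abduct: (Rule, Result) -> Case
    #   Induct: (Case, Result) -> Rule
    print("Rule  :", " ".join(rule))
    print("Case  :", {k: " ".join(v) for k, v in case.items()})
    print("Result:", " ".join(result))
```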
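The Experiment Setup row lists the seq2seq hyperparameters. The paper's implementation uses fairseq; the snippet below is only a rough stand-in built from plain PyTorch modules that shows how the reported numbers (transformer base, Adam with learning rate 3 × 10⁻⁴, dropout 0.1, label smoothing 0.1, 6 encoder and 6 decoder layers, 4096 tokens per batch with 4-step gradient accumulation) map onto a model, optimizer, and loss. The vocabulary size and padding index are placeholders, and fairseq's learning-rate schedule is omitted.

```python
import torch
import torch.nn as nn

# Illustrative placeholders; the real vocabularies come from the
# LIME pretraining corpora and the downstream task datasets.
VOCAB_SIZE = 1000
PAD_IDX = 0

# "Transformer base" (Vaswani et al., 2017): 512 hidden size,
# 2048 feed-forward (filter) size, 8 attention heads; 6 encoder
# and 6 decoder layers as used for IsarStep and MetaMathStep.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    dropout=0.1,
)
embed = nn.Embedding(VOCAB_SIZE, 512, padding_idx=PAD_IDX)
proj = nn.Linear(512, VOCAB_SIZE)

# Adam with learning rate 3e-4, as reported in the paper.
params = list(model.parameters()) + list(embed.parameters()) + list(proj.parameters())
optimizer = torch.optim.Adam(params, lr=3e-4)

# Label smoothing with coefficient 0.1.
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX, label_smoothing=0.1)

# Gradient accumulation over 4 batches of up to 4096 tokens each,
# matching "accumulate four batches of gradients for one update".
ACCUM_STEPS = 4
MAX_TOKENS_PER_BATCH = 4096
```

In fairseq itself these settings roughly correspond to the --max-tokens 4096 and --update-freq 4 batching options together with the base transformer architecture, though the exact flags used by the authors are not given in the paper.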