LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning
Authors: Yuhuai Wu, Markus N Rabe, Wenda Li, Jimmy Ba, Roger B Grosse, Christian Szegedy
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present results on four large mathematical reasoning tasks that are especially useful in the context of automated theorem proving. Our results show significant gains in learning inductive biases from synthetic tasks. We have selected four tasks to cover various different styles of interactive theorem provers. |
| Researcher Affiliation | Collaboration | Yuhuai Wu 1 2 Markus Rabe 3 Wenda Li 4 Jimmy Ba 1 2 Roger Grosse 1 2 Christian Szegedy 3 *Equal contribution 1University of Toronto, Toronto, Canada 2Vector Institute, Toronto, Canada 3Google Research 4University of Cambridge, Cambridge, UK. |
| Pseudocode | Yes | The complete pseudo-code of generating the symbols, Rule, Case, and Result for one task example is provided in Appendix Algorithm 1. |
| Open Source Code | Yes | The code for generating LIME tasks is available at https: //github.com/tonywu95/LIME. |
| Open Datasets | Yes | We demonstrate that LIME pretrained models provide significant gains across four large mathematical reasoning benchmarks: IsarStep (Li et al., 2021), HOList Skip-tree (Rabe et al., 2021), MetaMathStep (Polu & Sutskever, 2020), and LeanStep (de Moura et al., 2015). For IsarStep: 'The dataset was mined from the public repository of formal proofs of the Isabelle proof assistant (Paulson, 1994).' For LeanStep: 'We extracted 498,624 number of goal, next lemma pairs from Lean mathlib library (mathlib, 2020; Han et al., 2021).' |
| Dataset Splits | Yes | The IsarStep dataset has '820K, 5000, 5000 sequence pairs for the training, validation, and test sets'. For MetaMathStep: 'We split theorems into train/valid/test theorems of size 35K, 1K, 1K'. For LeanStep: 'We then randomly sampled 8k of lemmas from this set and used the corresponding goal lemma pairs for the validation and the tests (each 4k).' |
| Hardware Specification | Yes | The pretraining was done on a single Nvidia Tesla T4 GPU with 4 CPU cores for 2 hours. For the downstream tasks IsarStep and MetaMathStep, we used four Nvidia Tesla T4 GPUs with 16 CPU cores for training. For the HOList skip-tree task, we used TPUs for running the experiments. |
| Software Dependencies | No | The paper mentions 'implemented using fairseq (Ott et al., 2019)' but does not specify a version number for fairseq or any other software dependency. |
| Experiment Setup | Yes | We set the maximum number of tokens in a batch to 4096, and accumulate four batches of gradients for one parameter update. We used the Adam optimizer (Kingma & Ba, 2015) with learning rate 3 × 10⁻⁴. We used a dropout rate of 0.1 and label smoothing (Szegedy et al., 2016) with a coefficient of 0.1. All experiments used the transformer base model from (Vaswani et al., 2017), i.e. 512 hidden size, 2048 filter size, 8 attention heads. For the IsarStep and MetaMathStep tasks, we used 6 layers for both the encoder and decoder. |
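
The Experiment Setup row quotes the full training configuration used for the downstream tasks. Below is a minimal PyTorch sketch of that configuration, not the paper's actual implementation (which is built on fairseq): the vocabulary size, data loader, and `train_epoch` helper are placeholders introduced here for illustration, while the model dimensions, optimizer, dropout, label smoothing, and four-step gradient accumulation follow the quoted values.

```python
# Minimal sketch of the quoted setup: transformer base (512 hidden, 2048 filter,
# 8 heads, 6+6 layers), Adam at 3e-4, dropout 0.1, label smoothing 0.1,
# and gradient accumulation over four batches. Placeholder names (VOCAB_SIZE,
# train_epoch, the batch iterator) are illustrative, not from the paper.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000  # placeholder; the real vocabulary comes from the task data

model = nn.Transformer(
    d_model=512,            # hidden size
    nhead=8,                # attention heads
    num_encoder_layers=6,   # 6 encoder layers (IsarStep / MetaMathStep)
    num_decoder_layers=6,   # 6 decoder layers
    dim_feedforward=2048,   # filter size
    dropout=0.1,
)
embed = nn.Embedding(VOCAB_SIZE, 512)
project = nn.Linear(512, VOCAB_SIZE)

params = list(model.parameters()) + list(embed.parameters()) + list(project.parameters())
optimizer = torch.optim.Adam(params, lr=3e-4)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # requires PyTorch >= 1.10

ACCUM_STEPS = 4  # accumulate four batches of gradients per parameter update

def train_epoch(batches):
    """`batches` yields (src, tgt) LongTensor pairs shaped (seq_len, batch).
    Token-based batching (max 4096 tokens per batch) would be handled by the
    data loader and is not shown here."""
    optimizer.zero_grad()
    for step, (src, tgt) in enumerate(batches, start=1):
        out = model(embed(src), embed(tgt[:-1]))      # teacher forcing
        logits = project(out)                         # (tgt_len-1, batch, vocab)
        loss = criterion(logits.reshape(-1, VOCAB_SIZE), tgt[1:].reshape(-1))
        (loss / ACCUM_STEPS).backward()
        if step % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```

In the paper's fairseq-based implementation, the token-budget batching and gradient accumulation would correspond to fairseq's `--max-tokens 4096` and `--update-freq 4` options, though the exact command line is not reported in the paper.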