LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning

Authors: Yuhuai Wu, Markus N. Rabe, Wenda Li, Jimmy Ba, Roger B. Grosse, Christian Szegedy

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present results on four large mathematical reasoning tasks that are especially useful in the context of automated theorem proving. Our results show significant gains in learning inductive biases from synthetic tasks. We have selected four tasks to cover various different styles of interactive theorem provers.
Researcher Affiliation | Collaboration | Yuhuai Wu¹ ², Markus Rabe³, Wenda Li⁴, Jimmy Ba¹ ², Roger Grosse¹ ², Christian Szegedy³ (*equal contribution); ¹University of Toronto, Toronto, Canada; ²Vector Institute, Toronto, Canada; ³Google Research; ⁴University of Cambridge, Cambridge, UK.
Pseudocode | Yes | The complete pseudo-code for generating the symbols, Rule, Case, and Result for one task example is provided in Appendix Algorithm 1. (A hedged generation sketch follows the table.)
Open Source Code | Yes | The code for generating LIME tasks is available at https://github.com/tonywu95/LIME.
Open Datasets | Yes | We demonstrate that LIME pretrained models provide significant gains across four large mathematical reasoning benchmarks: IsarStep (Li et al., 2021), HOList skip-tree (Rabe et al., 2021), MetaMathStep (Polu & Sutskever, 2020), and LeanStep (de Moura et al., 2015). For IsarStep: 'The dataset was mined from the public repository of formal proofs of the Isabelle proof assistant (Paulson, 1994).' For LeanStep: 'We extracted 498,624 goal, next-lemma pairs from the Lean mathlib library (mathlib, 2020; Han et al., 2021).'
Dataset Splits | Yes | The IsarStep dataset has '820K, 5000, 5000 sequence pairs for the training, validation, and test sets'. For MetaMathStep: 'We split theorems into train/valid/test theorems of size 35K, 1K, 1K'. For LeanStep: 'We then randomly sampled 8k of lemmas from this set and used the corresponding goal lemma pairs for the validation and the tests (each 4k).'
Hardware Specification | Yes | The pretraining was done on a single Nvidia Tesla T4 GPU with 4 CPU cores for 2 hours. For the downstream IsarStep and MetaMathStep tasks, we used four Nvidia Tesla T4 GPUs with 16 CPU cores for training. For the HOList skip-tree task, we used TPUs for running the experiments.
Software Dependencies | No | The paper mentions 'implemented using fairseq (Ott et al., 2019)' but does not specify a version number for fairseq or any other software dependency.
Experiment Setup | Yes | We set the maximum number of tokens in a batch to 4096 and accumulate four batches of gradients for one parameter update. We used the Adam optimizer (Kingma & Ba, 2015) with learning rate 3 × 10⁻⁴. We used a dropout rate of 0.1 and label smoothing (Szegedy et al., 2016) with coefficient 0.1. All experiments used the transformer base model from Vaswani et al. (2017), i.e., 512 hidden size, 2048 filter size, 8 attention heads. For the IsarStep and MetaMathStep tasks, we used 6 layers for both the encoder and decoder. (A hyperparameter sketch follows the table.)
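The Pseudocode row above points to Appendix Algorithm 1 of the paper, which generates the symbols, Rule, Case, and Result for one synthetic task example. Below is a minimal Python sketch of that kind of generation procedure, not the authors' algorithm: the vocabulary split, string lengths, and the symbol pool used for the substitution strings are illustrative placeholders, and the three primitives (Deduct, Abduct, Induct) are indicated only in comments.

```python
import random
import string

def generate_lime_example(vocab_size=44, n_math_symbols=4,
                          rule_len=10, case_len=3, seed=None):
    """Generate one (Rule, Case, Result) triple in the spirit of LIME's
    synthetic primitives. All sizes here are illustrative, not the
    paper's exact hyperparameters."""
    rng = random.Random(seed)

    # Split a small symbol vocabulary into "math" symbols (placeholders
    # that get substituted) and ordinary rule symbols.
    symbols = list(string.ascii_lowercase + string.ascii_uppercase)[:vocab_size]
    rng.shuffle(symbols)
    math_symbols = symbols[:n_math_symbols]
    rule_symbols = symbols[n_math_symbols:]

    # Rule: a random string mixing rule symbols and math-symbol placeholders.
    rule = [rng.choice(math_symbols + rule_symbols) for _ in range(rule_len)]

    # Case: a substitution mapping each math symbol that occurs in the rule
    # to a short random string (drawn here from rule symbols; which pool the
    # paper samples from is an assumption of this sketch).
    case = {s: [rng.choice(rule_symbols) for _ in range(case_len)]
            for s in set(rule) & set(math_symbols)}

    # Result: the rule with every math-symbol placeholder replaced by
    # its substitution from the case.
    result = []
    for s in rule:
        result.extend(case.get(s, [s]))
    return rule, case, result


if __name__ == "__main__":
    rule, case, result = generate_lime_example(seed=0)
    # The three LIME primitives are then framed as seq2seq tasks, e.g.
    #   Deduct: (Rule, Case)   -> Result
    #   Abduct: (Rule, Result) -> Case
    #   Induct: (Case, Result) -> Rule
    print("Rule  :", " ".join(rule))
    print("Case  :", {k: " ".join(v) for k, v in case.items()})
    print("Result:", " ".join(result))
```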
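The Experiment Setup row lists the seq2seq hyperparameters. The paper's implementation uses fairseq; the snippet below is only a rough stand-in built from plain PyTorch modules that shows how the reported numbers (transformer base, Adam with learning rate 3 × 10⁻⁴, dropout 0.1, label smoothing 0.1, 6 encoder and 6 decoder layers, 4096 tokens per batch with 4-step gradient accumulation) map onto a model, optimizer, and loss. The vocabulary size and padding index are placeholders, and fairseq's learning-rate schedule is omitted.

```python
import torch
import torch.nn as nn

# Illustrative placeholders; the real vocabularies come from the
# LIME pretraining corpora and the downstream task datasets.
VOCAB_SIZE = 1000
PAD_IDX = 0

# "Transformer base" (Vaswani et al., 2017): 512 hidden size,
# 2048 feed-forward (filter) size, 8 attention heads; 6 encoder
# and 6 decoder layers as used for IsarStep and MetaMathStep.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    dropout=0.1,
)
embed = nn.Embedding(VOCAB_SIZE, 512, padding_idx=PAD_IDX)
proj = nn.Linear(512, VOCAB_SIZE)

# Adam with learning rate 3e-4, as reported in the paper.
params = list(model.parameters()) + list(embed.parameters()) + list(proj.parameters())
optimizer = torch.optim.Adam(params, lr=3e-4)

# Label smoothing with coefficient 0.1.
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX, label_smoothing=0.1)

# Gradient accumulation over 4 batches of up to 4096 tokens each,
# matching "accumulate four batches of gradients for one update".
ACCUM_STEPS = 4
MAX_TOKENS_PER_BATCH = 4096
```

In fairseq itself these settings roughly correspond to the --max-tokens 4096 and --update-freq 4 batching options together with the base transformer architecture, though the exact flags used by the authors are not given in the paper.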