INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving

Authors: Yuhuai Wu, Albert Jiang, Jimmy Ba, Roger Baker Grosse

ICLR 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the following experiments, we used the proofs generated by the INT generator to perform behavior cloning. We then evaluated the success rates of trained agents in a theorem proving environment. (See the behavior-cloning sketch after this table.) |
| Researcher Affiliation | Academia | Yuhuai Wu, Albert Qiaochu Jiang, Jimmy Ba & Roger Grosse, University of Toronto & Vector Institute, {ywu, ajiang, jba, rgrosse}@cs.toronto.edu |
| Pseudocode | Yes | We provide the theorem generation algorithm in Algorithm 1. |
| Open Source Code | Yes | The code for generating theorems and conducting experiments is available at https://github.com/albertqjiang/INT. |
| Open Datasets | Yes | For each theorem distribution, we first generated a fixed test set of 1000 problems, and then produced training problems in an online fashion, while making sure the training problems were different from the test ones. (See the deduplication sketch after this table.) |
| Dataset Splits | Yes | For each experiment, we ran 2 random seeds and picked the one with the higher validation success rate for test evaluation. (See the seed-selection sketch after this table.) |
| Hardware Specification | Yes | We used one Nvidia P100 or Tesla T4 GPU with 4 CPU cores for training. |
| Software Dependencies | No | The paper cites fairseq (Ott et al., 2019), PyTorch Geometric (Fey and Lenssen, 2019), and DGL (Wang et al., 2019) as libraries used, but does not give version numbers for any of them. |
| Experiment Setup | Yes | We used the Adam optimizer (Kingma and Ba, 2015). We searched over the learning rates {10⁻⁵, 3×10⁻⁵, 10⁻⁴, 3×10⁻⁴} in preliminary experiments and found 10⁻⁴ to be the best choice, which was used for the following experiments. (See the learning-rate search sketch after this table.) |
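
The Research Type row quotes the paper's behavior-cloning setup: agents are trained to imitate the proof steps emitted by the INT generator, then evaluated on held-out theorems. Below is a minimal sketch of one such supervised update; `policy` and the batch keys `observations` and `actions` are hypothetical stand-ins, not names from the INT codebase.

```python
import torch.nn.functional as F

def behavior_cloning_step(policy, batch, optimizer):
    """One behavior-cloning update: fit the policy to the proof steps
    recorded in generator-produced proofs.

    `policy` and the batch keys are hypothetical stand-ins; INT's actual
    agents encode observations and actions differently.
    """
    logits = policy(batch["observations"])            # scores over candidate proof actions
    loss = F.cross_entropy(logits, batch["actions"])  # imitate the generator's proof steps
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```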
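
The Open Datasets row describes generating a fixed test set of 1000 problems and then producing training problems online while guaranteeing no overlap with the test set. One simple way to enforce that exclusion is to track canonical theorem statements in a set, as in this sketch; `generate` is a hypothetical stand-in for the INT generator, and the real codebase may deduplicate differently.

```python
from typing import Callable, Iterator, Set

def online_training_problems(
    generate: Callable[[], str],  # hypothetical stand-in for the INT generator
    test_set: Set[str],           # canonical statements of the 1000 fixed test problems
) -> Iterator[str]:
    """Yield freshly generated training problems, skipping any that
    collide with the test set (or with earlier training problems)."""
    seen = set(test_set)
    while True:
        theorem = generate()
        if theorem in seen:
            continue        # duplicate of a test (or earlier training) problem
        seen.add(theorem)
        yield theorem
```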
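
The Dataset Splits row notes that each experiment was run with 2 random seeds and the run with the higher validation success rate was kept for test evaluation. A sketch of that selection follows; `run_experiment` is a hypothetical callable returning a trained model and its validation success rate.

```python
import random
import torch

def pick_best_run(run_experiment, seeds=(0, 1)):
    """Train once per seed and keep the run with the higher validation
    success rate, as the paper describes. `run_experiment` is hypothetical
    and should return (model, validation_success_rate)."""
    runs = []
    for seed in seeds:
        random.seed(seed)
        torch.manual_seed(seed)
        runs.append(run_experiment())
    return max(runs, key=lambda run: run[1])
```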
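
The Experiment Setup row reports an Adam learning-rate search over {10⁻⁵, 3×10⁻⁵, 10⁻⁴, 3×10⁻⁴}, with 10⁻⁴ selected. The grid-search loop below sketches that procedure; `make_model`, `train`, and `validation_success_rate` are hypothetical helpers, not functions from the INT repository.

```python
import torch

LEARNING_RATES = (1e-5, 3e-5, 1e-4, 3e-4)  # the grid reported in the paper

def search_learning_rate(make_model, train, validation_success_rate):
    """Run one preliminary training job per learning rate and return the
    rate with the best validation success; the paper found 1e-4 best."""
    best_lr, best_score = None, float("-inf")
    for lr in LEARNING_RATES:
        model = make_model()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        train(model, optimizer)
        score = validation_success_rate(model)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr
```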