INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving
Authors: Yuhuai Wu, Albert Jiang, Jimmy Ba, Roger Baker Grosse
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the following experiments, we used the proofs generated by the INT generator to perform behavior cloning. We then evaluated the success rates of trained agents in a theorem proving environment. |
| Researcher Affiliation | Academia | Yuhuai Wu, Albert Qiaochu Jiang, Jimmy Ba & Roger Grosse, University of Toronto & Vector Institute. {ywu, ajiang, jba, rgrosse}@cs.toronto.edu |
| Pseudocode | Yes | We provide the theorem generation algorithm in Algorithm 1. |
| Open Source Code | Yes | The code for generating theorems and conducting experiments is available at https://github.com/albertqjiang/INT. |
| Open Datasets | Yes | For each theorem distribution, we first generated a fixed test set of 1000 problems, and then produced training problems in an online fashion, while making sure the training problems were different from the test ones. |
| Dataset Splits | Yes | For each experiment, we ran 2 random seeds, and picked the one with higher validation success rates for test evaluation. |
| Hardware Specification | Yes | We used one Nvidia P100 or Tesla T4 GPU with 4 CPU cores for training. |
| Software Dependencies | No | The paper mentions 'fairseq (Ott et al., 2019)', 'Pytorch Geometric (Fey and Lenssen, 2019)', and 'DGL (Wang et al., 2019)' as libraries used, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We used the Adam optimizer (Kingma and Ba, 2015). We searched over the learning rates {10⁻⁵, 3×10⁻⁵, 10⁻⁴, 3×10⁻⁴} in preliminary experiments and found 10⁻⁴ to be the best choice, which was used for following experiments. |
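The Experiment Setup row describes a small learning-rate sweep with Adam. As a minimal sketch of that procedure, the snippet below runs Adam (Kingma and Ba, 2015) over the paper's four candidate rates on a toy 1-D quadratic and picks the rate with the lowest final loss. The objective, step count, and function name are hypothetical stand-ins; the paper's actual sweep was over transformer and GNN proving agents, and it selected 10⁻⁴ for its task, which a toy objective will not necessarily reproduce.

```python
import math

def adam_minimize(lr, steps=1000, x0=0.0, b1=0.9, b2=0.999, eps=1e-8):
    """Minimize the toy objective f(x) = (x - 3)^2 with Adam."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2.0 * (x - 3.0)                    # gradient of (x - 3)^2
        m = b1 * m + (1 - b1) * g              # first-moment estimate
        v = b2 * v + (1 - b2) * g * g          # second-moment estimate
        m_hat = m / (1 - b1 ** t)              # bias correction
        v_hat = v / (1 - b2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return (x - 3.0) ** 2                      # final loss

# Sweep the four learning rates reported in the paper.
candidates = [1e-5, 3e-5, 1e-4, 3e-4]
losses = {lr: adam_minimize(lr) for lr in candidates}
best_lr = min(losses, key=losses.get)
print(best_lr, losses[best_lr])
```

On this toy quadratic the largest rate converges fastest within the step budget, so the sweep favors 3×10⁻⁴; the point is the selection mechanism (train each candidate, compare a held-out metric, keep the best), not the specific winner.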