INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving

Authors: Yuhuai Wu, Albert Jiang, Jimmy Ba, Roger Baker Grosse

ICLR 2021

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the following experiments, we used the proofs generated by the INT generator to perform behavior cloning. We then evaluated the success rates of trained agents in a theorem proving environment. (See the behavior-cloning sketch after this table.) |
| Researcher Affiliation | Academia | Yuhuai Wu, Albert Qiaochu Jiang, Jimmy Ba & Roger Grosse, University of Toronto & Vector Institute, {ywu, ajiang, jba, rgrosse}@cs.toronto.edu |
| Pseudocode | Yes | We provide the theorem generation algorithm in Algorithm 1. |
| Open Source Code | Yes | The code for generating theorems and conducting experiments is available at https://github.com/albertqjiang/INT. |
| Open Datasets | Yes | For each theorem distribution, we first generated a fixed test set of 1000 problems, and then produced training problems in an online fashion, while making sure the training problems were different from the test ones. (See the deduplication sketch after this table.) |
| Dataset Splits | Yes | For each experiment, we ran 2 random seeds and picked the one with the higher validation success rate for test evaluation. (See the seed-selection sketch after this table.) |
| Hardware Specification | Yes | We used one Nvidia P100 or Tesla T4 GPU with 4 CPU cores for training. |
| Software Dependencies | No | The paper cites fairseq (Ott et al., 2019), PyTorch Geometric (Fey and Lenssen, 2019), and DGL (Wang et al., 2019) as libraries used, but does not give version numbers for any of them. |
| Experiment Setup | Yes | We used the Adam optimizer (Kingma and Ba, 2015). We searched over the learning rates {10⁻⁵, 3×10⁻⁵, 10⁻⁴, 3×10⁻⁴} in preliminary experiments and found 10⁻⁴ to be the best choice, which was used for the following experiments. (See the learning-rate search sketch after this table.) |
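
The Research Type row quotes the paper's behavior-cloning setup: agents are trained to imitate the proof steps emitted by the INT generator, then evaluated on held-out theorems. Below is a minimal sketch of one such supervised update; `policy` and the batch keys `observations` and `actions` are hypothetical stand-ins, not names from the INT codebase.

```python
import torch.nn.functional as F

def behavior_cloning_step(policy, batch, optimizer):
    """One behavior-cloning update: fit the policy to the proof steps
    recorded in generator-produced proofs.

    `policy` and the batch keys are hypothetical stand-ins; INT's actual
    agents encode observations and actions differently.
    """
    logits = policy(batch["observations"])            # scores over candidate proof actions
    loss = F.cross_entropy(logits, batch["actions"])  # imitate the generator's proof steps
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```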
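
The Open Datasets row describes generating a fixed test set of 1000 problems and then producing training problems online while guaranteeing no overlap with the test set. One simple way to enforce that exclusion is to track canonical theorem statements in a set, as in this sketch; `generate` is a hypothetical stand-in for the INT generator, and the real codebase may deduplicate differently.

```python
from typing import Callable, Iterator, Set

def online_training_problems(
    generate: Callable[[], str],  # hypothetical stand-in for the INT generator
    test_set: Set[str],           # canonical statements of the 1000 fixed test problems
) -> Iterator[str]:
    """Yield freshly generated training problems, skipping any that
    collide with the test set (or with earlier training problems)."""
    seen = set(test_set)
    while True:
        theorem = generate()
        if theorem in seen:
            continue        # duplicate of a test (or earlier training) problem
        seen.add(theorem)
        yield theorem
```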
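
The Dataset Splits row notes that each experiment was run with 2 random seeds and the run with the higher validation success rate was kept for test evaluation. A sketch of that selection follows; `run_experiment` is a hypothetical callable returning a trained model and its validation success rate.

```python
import random
import torch

def pick_best_run(run_experiment, seeds=(0, 1)):
    """Train once per seed and keep the run with the higher validation
    success rate, as the paper describes. `run_experiment` is hypothetical
    and should return (model, validation_success_rate)."""
    runs = []
    for seed in seeds:
        random.seed(seed)
        torch.manual_seed(seed)
        runs.append(run_experiment())
    return max(runs, key=lambda run: run[1])
```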
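
The Experiment Setup row reports an Adam learning-rate search over {10⁻⁵, 3×10⁻⁵, 10⁻⁴, 3×10⁻⁴}, with 10⁻⁴ selected. The grid-search loop below sketches that procedure; `make_model`, `train`, and `validation_success_rate` are hypothetical helpers, not functions from the INT repository.

```python
import torch

LEARNING_RATES = (1e-5, 3e-5, 1e-4, 3e-4)  # the grid reported in the paper

def search_learning_rate(make_model, train, validation_success_rate):
    """Run one preliminary training job per learning rate and return the
    rate with the best validation success; the paper found 1e-4 best."""
    best_lr, best_score = None, float("-inf")
    for lr in LEARNING_RATES:
        model = make_model()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        train(model, optimizer)
        score = validation_success_rate(model)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr
```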