Learning Formal Mathematics From Intrinsic Motivation

Authors: Gabriel Poesia, David Broman, Nick Haber, Noah Goodman

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on 3 axiomatic domains (propositional logic, arithmetic and group theory) demonstrate that our agent can bootstrap from only the axioms, self-improving in generating true and challenging conjectures and in finding proofs.
Researcher Affiliation | Academia | Departments of Computer Science, Psychology and Education, Stanford University, USA; EECS and Digital Futures, KTH Royal Institute of Technology, Sweden
Pseudocode | No | The paper describes the methods in prose but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code | Yes | We provide a zip file containing all of our implementation in our supplementary material, and the configuration files and commands used to run the main experiments. This was also released on GitHub (https://github.com/gpoesia/minimo).
Open Datasets | Yes | For logic, we take the set of 35 theorems of propositional logic from Stephen Kleene's textbook Introduction to Metamathematics [20]. ... For arithmetic, we use the Natural Number Game [6]... We translate the statements into Peano, and evaluate our agents on their success rate on those problems...
Dataset Splits | No | The paper does not explicitly describe training, validation, and test splits with percentages, counts, or standard split names. It mentions "training on collected examples" and "unseen conjectures" but no formal splits.
Hardware Specification | Yes | Our training runs (5 iterations of generating and proving 200 conjectures in each) were done on 2 machines with 5 NVIDIA A40 40GB GPUs each.
Software Dependencies | No | The paper mentions "GPT-2-style character-level Transformer models" and the "AdamW optimizer" but does not provide version numbers for software dependencies such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We represent our agents with GPT-2-style character-level Transformer models totalling approximately 8.45M parameters each (8 layers, 8 attention heads, hidden size 512, feed-forward size 2048, vocabulary size 128, absolute positional embeddings, with maximum context size of 1024). After each training iteration, we train the LM for a fixed number of 2000 steps of the AdamW optimizer (learning rate of 1e-4) with a dynamic batch size of at most 10000 characters (random examples are added until their padded sequence lengths add up to more than 10000).
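
For concreteness, the hyperparameters quoted in the Experiment Setup row can be written down as a short configuration sketch. The snippet below is illustrative only: it assumes Hugging Face's GPT2Config/GPT2LMHeadModel and torch.optim.AdamW, which the paper does not name, and the sample_dynamic_batch helper is a hypothetical reading of the "dynamic batch size" description rather than code from the Minimo repository; whether this exact configuration reproduces the reported ~8.45M parameter count is not verified here.

```python
# Illustrative sketch of the reported setup (assumed libraries: transformers, torch).
import random
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# GPT-2-style character-level Transformer: 8 layers, 8 attention heads,
# hidden size 512, feed-forward size 2048, vocabulary size 128,
# absolute positional embeddings, maximum context size 1024.
config = GPT2Config(
    vocab_size=128,
    n_positions=1024,
    n_embd=512,
    n_layer=8,
    n_head=8,
    n_inner=2048,
)
model = GPT2LMHeadModel(config)

# 2000 AdamW steps per training iteration, learning rate 1e-4.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
NUM_STEPS = 2000

def sample_dynamic_batch(examples, budget=10000):
    """Hypothetical dynamic batching: add random examples until their
    padded sequence lengths (batch size x longest example) exceed `budget`."""
    batch, max_len = [], 0
    while max_len * len(batch) <= budget:
        example = random.choice(examples)
        batch.append(example)
        max_len = max(max_len, len(example))
    return batch
```

Padding every example in a batch to the longest one means the character budget also bounds the padded sequence total per step, which is presumably why batches are sized by characters rather than by a fixed number of examples.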