Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Llemma: An Open Language Model for Mathematics
Authors: Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the MATH benchmark LLEMMA outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, LLEMMA is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments. |
| Researcher Affiliation | Collaboration | (1) Princeton University; (2) EleutherAI; (3) University of Toronto; (4) Vector Institute; (5) University of Cambridge; (6) Carnegie Mellon University; (7) University of Washington |
| Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks. It describes the methods and training procedures in narrative text. |
| Open Source Code | Yes | We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments. |
| Open Datasets | Yes | We publicly release Proof-Pile-2 at hf.co/datasets/EleutherAI/proof-pile-2. |
| Dataset Splits | Yes | We release a canonical train, validation, and test split of the dataset, which we follow in this work. |
| Hardware Specification | Yes | We train all models in bfloat16 mixed precision using the GPT-NeoX library (Andonian et al., 2023) across 256 A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions software such as the GPT-NeoX library, FlashAttention-2, the Language Model Evaluation Harness, and SymPy, but it does not specify version numbers for these components. |
| Experiment Setup | Yes | LLEMMA 7B is trained for 42,000 steps with a global batch size of 4 million tokens and a 4096 token context length. This corresponds to roughly 23,000 A100-hours. The learning rate is warmed up to 1 × 10⁻⁴ over 500 steps, then set to cosine decay to 1/30th of the maximum learning rate over 48,000 steps. |
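
The learning-rate schedule quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration, not the GPT-NeoX implementation: the linear shape of the warmup, and the choice to measure the 48,000-step cosine horizon from step 0 (warmup included), are assumptions the quoted text does not settle. The function name `learning_rate` and the constant names are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
MAX_LR = 1e-4            # peak learning rate
WARMUP_STEPS = 500       # warmup length
DECAY_STEPS = 48_000     # horizon of the cosine decay
MIN_LR = MAX_LR / 30     # decay floor: 1/30th of the peak
TRAIN_STEPS = 42_000     # steps actually trained (LLEMMA 7B)


def learning_rate(step: int) -> float:
    """Hypothetical helper: learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Assumption: warmup is linear from 0 to MAX_LR.
        return MAX_LR * step / WARMUP_STEPS
    # Assumption: the cosine horizon counts from step 0, so the decay
    # itself spans DECAY_STEPS - WARMUP_STEPS steps.
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    progress = min(max(progress, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (MAX_LR - MIN_LR) * cosine


if __name__ == "__main__":
    for step in (0, 250, WARMUP_STEPS, TRAIN_STEPS, DECAY_STEPS):
        print(f"step {step:>6}: lr = {learning_rate(step):.3e}")
```

Under these assumptions, training stops at step 42,000 while the cosine is still above its floor, because the quoted decay horizon (48,000 steps) is longer than the quoted training length.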