Llemma: An Open Language Model for Mathematics

Authors: Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the MATH benchmark LLEMMA outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, LLEMMA is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.
Researcher Affiliation | Collaboration | Princeton University, EleutherAI, University of Toronto, Vector Institute, University of Cambridge, Carnegie Mellon University, University of Washington
Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks; methods and training procedures are described in narrative text.
Open Source Code | Yes | We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.
Open Datasets | Yes | We publicly release Proof-Pile-2 at hf.co/datasets/EleutherAI/proof-pile-2. (A loading sketch follows the table.)
Dataset Splits | Yes | We release a canonical train, validation, and test split of the dataset, which we follow in this work.
Hardware Specification | Yes | We train all models in bfloat16 mixed precision using the GPT-NeoX library (Andonian et al., 2023) across 256 A100 40GB GPUs.
Software Dependencies | No | The paper mentions software such as the GPT-NeoX library, FlashAttention-2, the Language Model Evaluation Harness, and SymPy, but does not specify version numbers for these components. (An evaluation sketch follows the table.)
Experiment Setup | Yes | LLEMMA 7B is trained for 42,000 steps with a global batch size of 4 million tokens and a 4096 token context length. This corresponds to roughly 23,000 A100-hours. The learning rate is warmed up to 1 × 10⁻⁴ over 500 steps, then set to cosine decay to 1/30th of the maximum learning rate over 48,000 steps. (A schedule sketch follows the table.)
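Per the Open Datasets and Dataset Splits rows, Proof-Pile-2 is released on the Hugging Face Hub with canonical train, validation, and test splits. Below is a minimal loading sketch using the Hugging Face `datasets` library; the repository id follows the URL quoted above, while the presence of a default configuration (rather than required named subsets) and of a `text` field are assumptions.

```python
# Minimal sketch: loading Proof-Pile-2 with the Hugging Face `datasets` library.
# The repository id follows the URL quoted in the table; if the dataset defines
# multiple configurations (subsets), a config name may also need to be passed.
from datasets import load_dataset

# Stream the canonical validation split rather than downloading the full corpus.
ds = load_dataset("EleutherAI/proof-pile-2", split="validation", streaming=True)

for example in ds:
    # A "text" field is assumed here, as is typical for language-modeling corpora.
    print(example.get("text", example))
    break
```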
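The Software Dependencies row names the Language Model Evaluation Harness but no version. As an illustration only, the sketch below shows how the released LLEMMA 7B checkpoint could be scored with a v0.4-style release of that harness; the argument names follow recent `lm_eval` versions and may differ elsewhere, and the task (`gsm8k`), few-shot count, and dtype are illustrative choices rather than the paper's exact evaluation configuration.

```python
# Sketch: evaluating the released LLEMMA 7B checkpoint with lm-evaluation-harness.
# Assumes a v0.4-style lm_eval API; argument names may differ in other versions,
# and the task/few-shot/batch settings are illustrative, not the paper's setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/llemma_7b,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=4,
    batch_size=8,
)
print(results["results"])
```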
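The Experiment Setup row fully specifies the endpoints of the learning-rate schedule. The sketch below implements a standard linear-warmup-plus-cosine-decay curve matching those numbers; the exact functional form, and whether warmup steps count toward the decay horizon, are assumptions, since the paper quotes only the peak, the floor, and the step counts.

```python
import math

# Sketch of the LLEMMA 7B schedule: linear warmup to 1e-4 over 500 steps, then
# cosine decay toward 1/30 of the peak over 48,000 steps. The functional form is
# an assumption (standard warmup + cosine); the paper quotes only the endpoints.
MAX_LR = 1e-4
MIN_LR = MAX_LR / 30
WARMUP_STEPS = 500
DECAY_STEPS = 48_000

def learning_rate(step: int) -> float:
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

# Training stops at 42,000 steps, inside the 48,000-step decay horizon, so the
# schedule never quite reaches MIN_LR during training.
print(learning_rate(0), learning_rate(500), learning_rate(42_000))
```

As a quick consistency check on the quoted figures, 42,000 steps at 4 million tokens per step is roughly 168B training tokens, and 23,000 A100-hours spread across 256 GPUs corresponds to roughly 90 wall-clock hours.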