Llemma: An Open Language Model for Mathematics
Authors: Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the MATH benchmark LLEMMA outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, LLEMMA is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments. |
| Researcher Affiliation | Collaboration | Princeton University; EleutherAI; University of Toronto; Vector Institute; University of Cambridge; Carnegie Mellon University; University of Washington |
| Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks. It describes the methods and training procedures in narrative text. |
| Open Source Code | Yes | We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments. |
| Open Datasets | Yes | We publicly release Proof-Pile-2 at hf.co/datasets/EleutherAI/proof-pile-2. (A loading sketch follows the table.) |
| Dataset Splits | Yes | We release a canonical train, validation, and test split of the dataset, which we follow in this work. |
| Hardware Specification | Yes | We train all models in bfloat16 mixed precision using the GPT-NeoX library (Andonian et al., 2023) across 256 A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions software such as the GPT-NeoX library, FlashAttention-2, the Language Model Evaluation Harness, and SymPy, but it does not specify version numbers for these components. |
| Experiment Setup | Yes | LLEMMA 7B is trained for 42,000 steps with a global batch size of 4 million tokens and a 4096-token context length. This corresponds to roughly 23,000 A100-hours. The learning rate is warmed up to 1 × 10⁻⁴ over 500 steps, then set to cosine decay to 1/30th of the maximum learning rate over 48,000 steps. (A schedule sketch follows the table.) |
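
The open-dataset and split claims above can be checked directly against the released corpus. Below is a minimal sketch, assuming the Hugging Face `datasets` library; the dataset id comes from the paper, while the use of a default config and the record schema are assumptions (if the Hub repository exposes multiple configs, a config name would need to be passed as the second argument).

```python
# Minimal sketch (not the authors' code): stream a few records from the
# released Proof-Pile-2 corpus and inspect its canonical validation split.
# Assumptions: a default config exists; the record schema is unknown, so it
# is printed rather than relied upon.
from datasets import load_dataset

# Streaming avoids downloading the full corpus just to peek at a few records.
ds = load_dataset("EleutherAI/proof-pile-2", split="validation", streaming=True)

for i, example in enumerate(ds):
    print(sorted(example.keys()))   # inspect the schema instead of assuming it
    print(str(example)[:200])       # first 200 characters of the record
    if i >= 2:
        break
```

Swapping `split` for `"train"` or `"test"` exercises the other two canonical splits mentioned in the Dataset Splits row.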
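
The Experiment Setup row fully determines the learning-rate schedule, so it can be written down explicitly. The sketch below is an illustrative re-implementation, not the authors' GPT-NeoX configuration; in particular, starting the 48,000-step cosine horizon after the 500 warmup steps is an assumption.

```python
# Illustrative sketch of the LLEMMA 7B learning-rate schedule described above:
# linear warmup to 1e-4 over 500 steps, then cosine decay to 1/30 of the peak
# over 48,000 steps. Starting the decay clock after warmup is an assumption.
import math

MAX_LR = 1e-4
MIN_LR = MAX_LR / 30
WARMUP_STEPS = 500
DECAY_STEPS = 48_000
TRAIN_STEPS = 42_000  # training stops before the cosine reaches its minimum

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return MAX_LR * step / WARMUP_STEPS
    # Cosine decay from MAX_LR toward MIN_LR over DECAY_STEPS.
    progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for step in (0, 500, 21_000, TRAIN_STEPS):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Note that because the decay horizon (48,000 steps) exceeds the 42,000 training steps, the schedule never actually reaches the 1/30 floor during training.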