Llemma: An Open Language Model for Mathematics
Authors: Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the MATH benchmark LLEMMA outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, LLEMMA is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments. |
| Researcher Affiliation | Collaboration | Princeton University; EleutherAI; University of Toronto; Vector Institute; University of Cambridge; Carnegie Mellon University; University of Washington |
| Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks. It describes the methods and training procedures in narrative text. |
| Open Source Code | Yes | We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments. |
| Open Datasets | Yes | We publicly release Proof-Pile-2 at hf.co/datasets/EleutherAI/proof-pile-2. (A loading sketch follows the table.) |
| Dataset Splits | Yes | We release a canonical train, validation, and test split of the dataset, which we follow in this work. |
| Hardware Specification | Yes | We train all models in bfloat16 mixed precision using the GPT-NeoX library (Andonian et al., 2023) across 256 A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions software such as the GPT-NeoX library, FlashAttention-2, the Language Model Evaluation Harness, and SymPy, but it does not specify version numbers for these components. |
| Experiment Setup | Yes | LLEMMA 7B is trained for 42,000 steps with a global batch size of 4 million tokens and a 4096-token context length. This corresponds to roughly 23,000 A100-hours. The learning rate is warmed up to 1 × 10⁻⁴ over 500 steps, then set to cosine decay to 1/30th of the maximum learning rate over 48,000 steps. (A schedule sketch follows the table.) |
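
The open-dataset and split claims above can be checked directly against the released corpus. Below is a minimal sketch, assuming the Hugging Face `datasets` library; the dataset id comes from the paper, while the use of a default config and the record schema are assumptions (if the Hub repository exposes multiple configs, a config name would need to be passed as the second argument).

```python
# Minimal sketch (not the authors' code): stream a few records from the
# released Proof-Pile-2 corpus and inspect its canonical validation split.
# Assumptions: a default config exists; the record schema is unknown, so it
# is printed rather than relied upon.
from datasets import load_dataset

# Streaming avoids downloading the full corpus just to peek at a few records.
ds = load_dataset("EleutherAI/proof-pile-2", split="validation", streaming=True)

for i, example in enumerate(ds):
    print(sorted(example.keys()))   # inspect the schema instead of assuming it
    print(str(example)[:200])       # first 200 characters of the record
    if i >= 2:
        break
```

Swapping `split` for `"train"` or `"test"` exercises the other two canonical splits mentioned in the Dataset Splits row.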
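
The Experiment Setup row fully determines the learning-rate schedule, so it can be written down explicitly. The sketch below is an illustrative re-implementation, not the authors' GPT-NeoX configuration; in particular, starting the 48,000-step cosine horizon after the 500 warmup steps is an assumption.

```python
# Illustrative sketch of the LLEMMA 7B learning-rate schedule described above:
# linear warmup to 1e-4 over 500 steps, then cosine decay to 1/30 of the peak
# over 48,000 steps. Starting the decay clock after warmup is an assumption.
import math

MAX_LR = 1e-4
MIN_LR = MAX_LR / 30
WARMUP_STEPS = 500
DECAY_STEPS = 48_000
TRAIN_STEPS = 42_000  # training stops before the cosine reaches its minimum

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return MAX_LR * step / WARMUP_STEPS
    # Cosine decay from MAX_LR toward MIN_LR over DECAY_STEPS.
    progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for step in (0, 500, 21_000, TRAIN_STEPS):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Note that because the decay horizon (48,000 steps) exceeds the 42,000 training steps, the schedule never actually reaches the 1/30 floor during training.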