OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

Authors: Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, Jimmy Ba

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Additionally, we run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data.
Researcher Affiliation | Collaboration | Keiran Paster (University of Toronto; Vector Institute for Artificial Intelligence), Marco Dos Santos (University of Cambridge), Zhangir Azerbayev (Princeton University), Jimmy Ba (University of Toronto; Vector Institute for Artificial Intelligence)
Pseudocode | No | The paper includes a pipeline diagram (Figure 1) and describes the processing steps in text, but it does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We open-source the code needed to reproduce our results. ... Is the software that was used to preprocess/clean/label the data available? Yes. See supplementary materials.
Open Datasets | Yes | We publicly release OpenWebMath, a dataset of 14.7B tokens of high-quality mathematical web text. Our dataset can be found at https://huggingface.co/datasets/open-webmath/open-web-math on the Hugging Face Hub. (A loading sketch follows the table.)
Dataset Splits | No | The paper describes training models and evaluating them on benchmarks, but it does not specify training, validation, or test splits for its own data; evaluation relies on external benchmarks, and the paper does not detail how any splits were created or used in its experiments.
Hardware Specification | Yes | We train the model using the GPT-NeoX library (Andonian et al., 2023) on 8 A100 80GB GPUs.
Software Dependencies | No | The paper mentions using the LLaMA tokenizer, the Pythia architecture, and the GPT-NeoX library, but it does not give exact version numbers for these components, which reproducibility requires.
Experiment Setup | Yes | We use a batch size of 1M tokens and the same hyperparameters as Pythia otherwise. Table 10 lists the specific hyperparameters: model size 1.4B, 24 layers, model dimension 2048, 16 attention heads, learning rate 2.0 × 10^-4, batch size 1M tokens. (A hedged configuration sketch follows the table.)
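
For the Open Datasets row, the following is a minimal sketch of how one might inspect the released data by streaming a few documents from the Hugging Face Hub. It assumes the dataset identifier quoted above ("open-webmath/open-web-math") resolves on the Hub and that each record exposes "text" and "url" fields; the field names are assumptions for illustration, not confirmed by the quoted text.

```python
# Minimal sketch: stream a few OpenWebMath documents instead of downloading
# the full 14.7B-token corpus up front.
from datasets import load_dataset

# Dataset id taken from the quoted row above; the Hub listing may differ.
ds = load_dataset("open-webmath/open-web-math", split="train", streaming=True)

for i, doc in enumerate(ds):
    # "url" and "text" are assumed field names for a web-text corpus.
    print(doc["url"])
    print(doc["text"][:300])
    if i == 2:
        break
```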
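For the Hardware Specification, Software Dependencies, and Experiment Setup rows, the reported 1.4B training setup can be restated compactly as a plain Python dictionary. This is not the authors' GPT-NeoX configuration file (which the paper does not reproduce); it only collects the numbers and components quoted above in one place.

```python
# Illustrative summary of the reported 1.4B-parameter training setup.
# All values come from the quoted rows (Table 10, hardware, and library mentions);
# the dictionary keys are chosen for readability, not taken from any config schema.
train_setup = {
    "architecture": "Pythia-style decoder-only transformer",
    "model_size_params": 1.4e9,
    "num_layers": 24,
    "hidden_dim": 2048,
    "num_attention_heads": 16,
    "learning_rate": 2.0e-4,
    "batch_size_tokens": 1_000_000,   # "batch size of 1M tokens"
    "tokenizer": "LLaMA tokenizer",
    "training_library": "GPT-NeoX (Andonian et al., 2023)",
    "hardware": "8x NVIDIA A100 80GB",
    "dataset_tokens": 14.7e9,         # OpenWebMath token count
}
```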