OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text
Authors: Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, Jimmy Ba
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Additionally, we run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. |
| Researcher Affiliation | Collaboration | Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, Jimmy Ba (University of Toronto; Vector Institute for Artificial Intelligence; University of Cambridge; Princeton University) |
| Pseudocode | No | The paper includes a pipeline diagram (Figure 1) and describes steps in text, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source the code needed to reproduce our results. ... Is the software that was used to preprocess/clean/label the data available? Yes. See supplementary materials. |
| Open Datasets | Yes | We publicly release OpenWebMath, a dataset of 14.7B tokens of high-quality mathematical web text. Our dataset can be found at https://huggingface.co/datasets/open-webmath/open-web-math on the Hugging Face Hub. (A loading sketch follows the table.) |
| Dataset Splits | No | The paper describes training models and evaluating them on benchmarks, but it does not specify training, validation, or test dataset splits for its own data. It uses external benchmarks for evaluation without detailing how splits were created or used for its specific experiments. |
| Hardware Specification | Yes | We train the model using the GPT-NeoX library (Andonian et al., 2023) on 8 A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions using the LLaMA tokenizer, Pythia architecture, and GPT-NeoX library. However, it does not specify exact version numbers for these software components, which would be required for reproducibility. |
| Experiment Setup | Yes | We use a batch size of 1M tokens and the same hyperparameters as Pythia otherwise. Table 10 provides specific hyperparameters: Model Size: 1.4B, Layers: 24, Model Dim: 2048, Heads: 16, Learning Rate: 2.0 × 10⁻⁴, Batch Size: 1M. (A consolidated configuration sketch follows the table.) |
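The Open Datasets row above points to the release on the Hugging Face Hub. Below is a minimal sketch of streaming the dataset with the Hugging Face `datasets` library; the repository identifier is taken from the URL quoted in the table, and the `text` field name is an assumption rather than something confirmed by this summary.

```python
# Minimal sketch: stream OpenWebMath from the Hugging Face Hub.
# Assumptions: the repository id matches the URL quoted above, and each
# record exposes a "text" field with the extracted page content.
from datasets import load_dataset

# Streaming avoids downloading the full 14.7B-token dataset up front.
ds = load_dataset("open-webmath/open-web-math", split="train", streaming=True)

# Peek at one document to check the available fields.
first_doc = next(iter(ds))
print(sorted(first_doc.keys()))
print(first_doc["text"][:500])  # assumed field name for the page text
```

Streaming mode is used here only to keep the example light; dropping `streaming=True` would download and cache the dataset locally for training-scale use.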
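The Hardware Specification and Experiment Setup rows report the training configuration piecemeal. The dictionary below is a hedged sketch that collects the quoted values in one place; it is not the authors' actual GPT-NeoX YAML configuration, and anything not quoted above (sequence length, warmup schedule, optimizer settings beyond the learning rate) is deliberately left out.

```python
# Sketch of the reported 1.4B-parameter training setup, assembled from the
# table above. Unreported hyperparameters are omitted rather than guessed.
train_config = {
    "model_size_params": 1.4e9,      # 1.4B parameters, Pythia-style architecture
    "num_layers": 24,
    "model_dim": 2048,
    "num_heads": 16,
    "learning_rate": 2.0e-4,
    "batch_size_tokens": 1_000_000,  # "batch size of 1M tokens"
    "training_tokens": 14.7e9,       # 14.7B tokens of OpenWebMath, per the abstract quote
    "hardware": "8x A100 80GB GPUs",
    "framework": "GPT-NeoX library (Andonian et al., 2023)",
    "tokenizer": "LLaMA tokenizer",
}
```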