OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

Authors: Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, Jimmy Ba

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Additionally, we run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data.
Researcher Affiliation | Collaboration | Keiran Paster (University of Toronto; Vector Institute for Artificial Intelligence), Marco Dos Santos (University of Cambridge), Zhangir Azerbayev (Princeton University), Jimmy Ba (University of Toronto; Vector Institute for Artificial Intelligence)
Pseudocode | No | The paper includes a pipeline diagram (Figure 1) and describes the processing steps in text, but it does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We open-source the code needed to reproduce our results. ... Is the software that was used to preprocess/clean/label the data available? Yes. See supplementary materials.
Open Datasets | Yes | We publicly release OpenWebMath, a dataset of 14.7B tokens of high-quality mathematical web text. Our dataset can be found at https://huggingface.co/datasets/open-webmath/open-web-math on the Hugging Face Hub. (A loading sketch follows the table.)
Dataset Splits | No | The paper describes training models and evaluating them on benchmarks, but it does not specify training, validation, or test splits for its own data; evaluation relies on external benchmarks, and the paper does not detail how any splits were created or used in its experiments.
Hardware Specification | Yes | We train the model using the GPT-NeoX library (Andonian et al., 2023) on 8 A100 80GB GPUs.
Software Dependencies | No | The paper mentions using the LLaMA tokenizer, the Pythia architecture, and the GPT-NeoX library, but it does not give exact version numbers for these components, which reproducibility requires.
Experiment Setup | Yes | We use a batch size of 1M tokens and the same hyperparameters as Pythia otherwise. Table 10 lists the specific hyperparameters: model size 1.4B, 24 layers, model dimension 2048, 16 attention heads, learning rate 2.0 × 10^-4, batch size 1M tokens. (A hedged configuration sketch follows the table.)
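
For the Open Datasets row, the following is a minimal sketch of how one might inspect the released data by streaming a few documents from the Hugging Face Hub. It assumes the dataset identifier quoted above ("open-webmath/open-web-math") resolves on the Hub and that each record exposes "text" and "url" fields; the field names are assumptions for illustration, not confirmed by the quoted text.

```python
# Minimal sketch: stream a few OpenWebMath documents instead of downloading
# the full 14.7B-token corpus up front.
from datasets import load_dataset

# Dataset id taken from the quoted row above; the Hub listing may differ.
ds = load_dataset("open-webmath/open-web-math", split="train", streaming=True)

for i, doc in enumerate(ds):
    # "url" and "text" are assumed field names for a web-text corpus.
    print(doc["url"])
    print(doc["text"][:300])
    if i == 2:
        break
```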
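For the Hardware Specification, Software Dependencies, and Experiment Setup rows, the reported 1.4B training setup can be restated compactly as a plain Python dictionary. This is not the authors' GPT-NeoX configuration file (which the paper does not reproduce); it only collects the numbers and components quoted above in one place.

```python
# Illustrative summary of the reported 1.4B-parameter training setup.
# All values come from the quoted rows (Table 10, hardware, and library mentions);
# the dictionary keys are chosen for readability, not taken from any config schema.
train_setup = {
    "architecture": "Pythia-style decoder-only transformer",
    "model_size_params": 1.4e9,
    "num_layers": 24,
    "hidden_dim": 2048,
    "num_attention_heads": 16,
    "learning_rate": 2.0e-4,
    "batch_size_tokens": 1_000_000,   # "batch size of 1M tokens"
    "tokenizer": "LLaMA tokenizer",
    "training_library": "GPT-NeoX (Andonian et al., 2023)",
    "hardware": "8x NVIDIA A100 80GB",
    "dataset_tokens": 14.7e9,         # OpenWebMath token count
}
```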