Scaling Data-Constrained Language Models
Authors: Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, Colin A. Raffel
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. |
| Researcher Affiliation | Collaboration | Hugging Face, Harvard University, University of Turku |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations. |
| Open Datasets | Yes | Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations. |
| Dataset Splits | No | The paper mentions training models on 'subsets of C4' and reporting 'validation loss' and 'held-out test set' but does not provide specific train/validation/test dataset splits (e.g., percentages or exact sample counts) for reproducibility. |
| Hardware Specification | No | The paper mentions using 'generous computational resources on the LUMI supercomputer' but does not provide specific hardware details like GPU or CPU models, or memory specifications. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as Python, PyTorch/TensorFlow, or CUDA. |
| Experiment Setup | Yes | We use cosine learning rate schedules that decay 10× over the course of training for each model... we do not use early stopping... Other hyperparameters are based on prior work [89, 42] and detailed in Appendix S. (A schedule sketch follows the table.) |
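The Experiment Setup row quotes a cosine learning-rate schedule that decays 10× over training. Below is a minimal sketch of such a schedule, with a hypothetical peak learning rate and step count used purely for illustration; the paper's actual hyperparameters are those listed in its Appendix S.

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 2e-4) -> float:
    """Cosine learning-rate schedule that decays 10x over training.

    `peak_lr` and `total_steps` are illustrative placeholders, not the
    paper's actual hyperparameters.
    """
    min_lr = peak_lr / 10  # final LR is one tenth of the peak ("decay 10x")
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: LR at the start, midpoint, and end of a hypothetical 10,000-step run.
for s in (0, 5_000, 10_000):
    print(s, cosine_lr(s, total_steps=10_000))
```

At step 0 the schedule returns the peak learning rate, and at the final step it returns one tenth of the peak, matching the "decay 10×" description quoted above.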