QuRating: Selecting High-Quality Data for Training Language Models
Authors: Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. |
| Researcher Affiliation | Academia | Department of Computer Science & Princeton Language and Intelligence (PLI), Princeton University. |
| Pseudocode | No | The paper describes its methods in prose and mathematical equations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | To encourage further research, we release our prompts, models, and data at https://github.com/princeton-nlp/QuRating. |
| Open Datasets | Yes | We publicly release QuRatedPajama and make it available at huggingface.co/datasets/princeton-nlp/QuRatedPajama-260B. (A loading sketch follows the table.) |
| Dataset Splits | Yes | We measure the perplexity over 50M tokens from SlimPajama's held-out validation split. (A perplexity sketch follows the table.) |
| Hardware Specification | Yes | Each model is trained on 8x NVIDIA H100, which costs 200 GPU hours for 30B tokens. |
| Software Dependencies | No | The paper names specific models (Sheared-LLaMA-1.3B, Llama-2-7B), activation functions (SwiGLU), and optimizers (Adam) with citations, but it does not specify version numbers for core software dependencies such as Python, PyTorch, or other deep learning libraries. |
| Experiment Setup | Yes | We use a global batch size of 2048 sequences and a learning rate of 5e-4 with a cosine learning rate decay to 5e-5 and a linear warmup for the first 5% of training steps. Each model is trained on 8x NVIDIA H100, which costs 200 GPU hours for 30B tokens. We use a weight decay of 0.1 and train with Adam (Kingma & Ba, 2015) with hyperparameters β = (0.9, 0.95). (An optimizer/schedule sketch follows the table.) |
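
For the Open Datasets row, the following is a minimal sketch for streaming the released QuRatedPajama data with the Hugging Face `datasets` library; the library choice, split name, and field inspection are assumptions and are not taken from the paper.

```python
# Hypothetical loading sketch; the split and field names are assumptions and
# are not stated in the paper.
from datasets import load_dataset

# Stream the corpus instead of downloading all 260B tokens up front.
dataset = load_dataset(
    "princeton-nlp/QuRatedPajama-260B",  # dataset name quoted in the row above
    split="train",                       # assumed split name
    streaming=True,
)

# Peek at a few records to inspect the available fields
# (e.g. raw text and per-document quality ratings).
for example in dataset.take(3):
    print(example.keys())
```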
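For the Dataset Splits row, the sketch below shows a simplified per-sequence perplexity computation, assuming the Hugging Face `transformers` library; the model name is a stand-in and this is not the paper's evaluation code, which aggregates perplexity over 50M held-out tokens rather than per sequence.

```python
# Hypothetical perplexity sketch, assuming Hugging Face transformers.
# The model below is a stand-in, not the model evaluated in the paper.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "princeton-nlp/Sheared-LLaMA-1.3B"  # stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sequence_perplexity(text: str) -> float:
    """Perplexity of a single sequence under the causal LM (simplified)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        # With labels set, the model returns the mean token cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

print(sequence_perplexity("QuRating selects training data by quality ratings."))
```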
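For the Experiment Setup row, the following is a minimal sketch of the quoted optimizer and learning-rate schedule, assuming PyTorch; the placeholder model, step budget, and the use of decoupled weight decay (AdamW) are assumptions, and the 2048-sequence global batch and data pipeline are omitted.

```python
# Minimal sketch of the optimizer and learning-rate schedule quoted in the
# Experiment Setup row, assuming PyTorch. The model, total step count, and
# use of decoupled weight decay (AdamW) are assumptions, not from the paper.
import math
import torch

model = torch.nn.Linear(2048, 2048)      # placeholder for the 1.3B-parameter LM

total_steps = 10_000                     # placeholder step budget
warmup_steps = int(0.05 * total_steps)   # linear warmup for the first 5% of steps
peak_lr, final_lr = 5e-4, 5e-5           # cosine decay from 5e-4 down to 5e-5

optimizer = torch.optim.AdamW(           # paper cites Adam; AdamW assumed here
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def lr_factor(step: int) -> float:
    """Multiplier on peak_lr: linear warmup, then cosine decay to final_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return (final_lr + (peak_lr - final_lr) * cosine) / peak_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
```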