QuRating: Selecting High-Quality Data for Training Language Models

Authors: Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. (see the selection sketch below)
Researcher Affiliation | Academia | Department of Computer Science & Princeton Language and Intelligence (PLI), Princeton University.
Pseudocode | No | The paper describes its methods in prose and mathematical equations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | To encourage further research, we release our prompts, models, and data at https://github.com/princeton-nlp/QuRating.
Open Datasets | Yes | We publicly release QuRatedPajama and make it available at huggingface.co/datasets/princeton-nlp/QuRatedPajama-260B.
Dataset Splits | Yes | We measure the perplexity over 50M tokens from SlimPajama's held-out validation split. (see the perplexity sketch below)
Hardware Specification | Yes | Each model is trained on 8x NVIDIA H100, which costs 200 GPU hours for 30B tokens.
Software Dependencies | No | The paper names specific models (Sheared-LLaMA-1.3B, Llama-2-7B), activation functions (SwiGLU), and optimizers (Adam) with citations, but it does not specify version numbers for broader software dependencies such as Python or PyTorch.
Experiment Setup | Yes | We use a global batch size of 2048 sequences and a learning rate of 5e-4 with a cosine learning rate decay to 5e-5 and a linear warmup for the first 5% of training steps. Each model is trained on 8x NVIDIA H100, which costs 200 GPU hours for 30B tokens. We use a weight decay of 0.1 and train with Adam (Kingma & Ba, 2015) with hyperparameters β = (0.9, 0.95). (see the training-config sketch below)
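The Research Type row quotes the core experimental recipe: select 30B tokens according to quality ratings and train 1.3B-parameter models on the selection. The sketch below illustrates one temperature-scaled way to sample documents with probability proportional to exp(rating / τ) until a token budget is reached, which is the flavor of selection the paper studies; the field names, the τ value, and the Gumbel top-k trick are illustrative assumptions, not a copy of the released pipeline.

```python
import numpy as np

def select_by_quality(docs, token_budget, tau=2.0, seed=0):
    """Sample documents without replacement with prob ∝ exp(rating / tau),
    stopping once the selection covers the token budget.

    `docs` is a list of dicts with hypothetical keys "rating" and "num_tokens".
    """
    rng = np.random.default_rng(seed)
    ratings = np.array([d["rating"] for d in docs], dtype=np.float64)
    # Gumbel top-k trick: adding Gumbel noise to the logits and sorting in
    # descending order samples without replacement from softmax(ratings / tau).
    gumbel = rng.gumbel(size=len(docs))
    order = np.argsort(-(ratings / tau + gumbel))

    selected, tokens = [], 0
    for i in order:
        if tokens >= token_budget:
            break
        selected.append(docs[i])
        tokens += docs[i]["num_tokens"]
    return selected

# Toy usage: a tiny rated corpus and a small token budget stand in for the
# paper's 30B-token selection over a large pretraining corpus.
corpus = [
    {"id": 0, "rating": 1.3, "num_tokens": 900},
    {"id": 1, "rating": -0.2, "num_tokens": 1200},
    {"id": 2, "rating": 0.7, "num_tokens": 700},
]
print([d["id"] for d in select_by_quality(corpus, token_budget=1500)])
```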
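The Dataset Splits row refers to measuring perplexity over 50M tokens from SlimPajama's held-out validation split. Below is a minimal sketch of token-level perplexity with a Hugging Face causal LM; the checkpoint name and the toy input are placeholders, not the paper's exact evaluation pipeline.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper evaluates its own 1.3B-parameter models.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def perplexity(texts, max_length=1024):
    """Exponentiated average next-token cross-entropy over held-out texts."""
    total_nll, total_targets = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        # Passing labels=input_ids returns the mean loss over the shifted targets,
        # i.e. over (sequence length - 1) predicted tokens.
        out = model(**enc, labels=enc["input_ids"])
        n_targets = enc["input_ids"].size(1) - 1
        total_nll += out.loss.item() * n_targets
        total_targets += n_targets
    return math.exp(total_nll / total_targets)

print(perplexity(["The quick brown fox jumps over the lazy dog."]))
```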
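The Experiment Setup row lists concrete hyperparameters: global batch size 2048, peak learning rate 5e-4 with cosine decay to 5e-5, linear warmup over the first 5% of steps, weight decay 0.1, and Adam with β = (0.9, 0.95). The PyTorch sketch below wires these together; the stand-in model, the total step count, the use of AdamW for decoupled weight decay, and the LambdaLR-based schedule are assumptions, not the authors' training code.

```python
import math
import torch

# Hyperparameters quoted in the paper's experiment setup.
PEAK_LR = 5e-4
MIN_LR = 5e-5
WEIGHT_DECAY = 0.1
BETAS = (0.9, 0.95)
WARMUP_FRAC = 0.05

# `model` and `total_steps` are placeholders; the paper trains 1.3B-parameter
# models on 30B tokens with a global batch size of 2048 sequences.
model = torch.nn.Linear(16, 16)
total_steps = 1000
warmup_steps = int(WARMUP_FRAC * total_steps)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=PEAK_LR, betas=BETAS, weight_decay=WEIGHT_DECAY
)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay down to MIN_LR."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (MIN_LR + (PEAK_LR - MIN_LR) * cosine) / PEAK_LR

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```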