QuRating: Selecting High-Quality Data for Training Language Models

Authors: Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. (see the selection sketch below)
Researcher Affiliation | Academia | Department of Computer Science & Princeton Language and Intelligence (PLI), Princeton University.
Pseudocode | No | The paper describes its methods in prose and mathematical equations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | To encourage further research, we release our prompts, models, and data at https://github.com/princeton-nlp/QuRating.
Open Datasets | Yes | We publicly release QuRatedPajama and make it available at huggingface.co/datasets/princeton-nlp/QuRatedPajama-260B.
Dataset Splits | Yes | We measure the perplexity over 50M tokens from SlimPajama's held-out validation split. (see the perplexity sketch below)
Hardware Specification | Yes | Each model is trained on 8x NVIDIA H100, which costs 200 GPU hours for 30B tokens.
Software Dependencies | No | The paper names specific models (Sheared-LLaMA-1.3B, Llama-2-7B), activation functions (SwiGLU), and optimizers (Adam) with citations, but it does not specify version numbers for broader software dependencies such as Python or PyTorch.
Experiment Setup | Yes | We use a global batch size of 2048 sequences and a learning rate of 5e-4 with a cosine learning rate decay to 5e-5 and a linear warmup for the first 5% of training steps. Each model is trained on 8x NVIDIA H100, which costs 200 GPU hours for 30B tokens. We use a weight decay of 0.1 and train with Adam (Kingma & Ba, 2015) with hyperparameters β = (0.9, 0.95). (see the training-config sketch below)
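The Research Type row quotes the core experimental recipe: select 30B tokens according to quality ratings and train 1.3B-parameter models on the selection. The sketch below illustrates one temperature-scaled way to sample documents with probability proportional to exp(rating / τ) until a token budget is reached, which is the flavor of selection the paper studies; the field names, the τ value, and the Gumbel top-k trick are illustrative assumptions, not a copy of the released pipeline.

```python
import numpy as np

def select_by_quality(docs, token_budget, tau=2.0, seed=0):
    """Sample documents without replacement with prob ∝ exp(rating / tau),
    stopping once the selection covers the token budget.

    `docs` is a list of dicts with hypothetical keys "rating" and "num_tokens".
    """
    rng = np.random.default_rng(seed)
    ratings = np.array([d["rating"] for d in docs], dtype=np.float64)
    # Gumbel top-k trick: adding Gumbel noise to the logits and sorting in
    # descending order samples without replacement from softmax(ratings / tau).
    gumbel = rng.gumbel(size=len(docs))
    order = np.argsort(-(ratings / tau + gumbel))

    selected, tokens = [], 0
    for i in order:
        if tokens >= token_budget:
            break
        selected.append(docs[i])
        tokens += docs[i]["num_tokens"]
    return selected

# Toy usage: a tiny rated corpus and a small token budget stand in for the
# paper's 30B-token selection over a large pretraining corpus.
corpus = [
    {"id": 0, "rating": 1.3, "num_tokens": 900},
    {"id": 1, "rating": -0.2, "num_tokens": 1200},
    {"id": 2, "rating": 0.7, "num_tokens": 700},
]
print([d["id"] for d in select_by_quality(corpus, token_budget=1500)])
```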
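The Dataset Splits row refers to measuring perplexity over 50M tokens from SlimPajama's held-out validation split. Below is a minimal sketch of token-level perplexity with a Hugging Face causal LM; the checkpoint name and the toy input are placeholders, not the paper's exact evaluation pipeline.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper evaluates its own 1.3B-parameter models.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def perplexity(texts, max_length=1024):
    """Exponentiated average next-token cross-entropy over held-out texts."""
    total_nll, total_targets = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        # Passing labels=input_ids returns the mean loss over the shifted targets,
        # i.e. over (sequence length - 1) predicted tokens.
        out = model(**enc, labels=enc["input_ids"])
        n_targets = enc["input_ids"].size(1) - 1
        total_nll += out.loss.item() * n_targets
        total_targets += n_targets
    return math.exp(total_nll / total_targets)

print(perplexity(["The quick brown fox jumps over the lazy dog."]))
```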
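The Experiment Setup row lists concrete hyperparameters: global batch size 2048, peak learning rate 5e-4 with cosine decay to 5e-5, linear warmup over the first 5% of steps, weight decay 0.1, and Adam with β = (0.9, 0.95). The PyTorch sketch below wires these together; the stand-in model, the total step count, the use of AdamW for decoupled weight decay, and the LambdaLR-based schedule are assumptions, not the authors' training code.

```python
import math
import torch

# Hyperparameters quoted in the paper's experiment setup.
PEAK_LR = 5e-4
MIN_LR = 5e-5
WEIGHT_DECAY = 0.1
BETAS = (0.9, 0.95)
WARMUP_FRAC = 0.05

# `model` and `total_steps` are placeholders; the paper trains 1.3B-parameter
# models on 30B tokens with a global batch size of 2048 sequences.
model = torch.nn.Linear(16, 16)
total_steps = 1000
warmup_steps = int(WARMUP_FRAC * total_steps)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=PEAK_LR, betas=BETAS, weight_decay=WEIGHT_DECAY
)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay down to MIN_LR."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (MIN_LR + (PEAK_LR - MIN_LR) * cosine) / PEAK_LR

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```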