MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Authors: Zichun Yu, Spandan Das, Chenyan Xiong

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments of pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks.
Researcher Affiliation | Academia | Zichun Yu, Spandan Das, Chenyan Xiong, School of Computer Science, Carnegie Mellon University, {zichunyu, spandand, cx}@andrew.cmu.edu
Pseudocode | Yes | Algorithm 1: Model-Aware Data Selection
Open Source Code | Yes | Our code is open-sourced at https://github.com/cxcscmu/MATES.
Open Datasets | Yes | We pretrain 410M/1B models with Pythia [5] architecture from scratch on the C4 dataset [50] as our pretraining model M.
Dataset Splits | No | The paper mentions a '10% hold-out validation set' for the data influence model, but does not specify the train/validation/test splits for the main C4 pretraining dataset.
Hardware Specification | Yes | We run all experiments on 8 A6000 GPUs, which will take 2 days for 410M models and 4 days for 1B models.
Software Dependencies | No | The paper mentions software such as the 'Pythia architecture', 'BERT-base', and the 'lm-evaluation-harness' codebase, but does not specify their version numbers.
Experiment Setup | Yes | Table 5: Experimental configurations. Pretraining Dataset: C4; Tokens: 25B; Model: Pythia-410M/1B (randomly initialized); Steps: 50k; Sequence length: 1024; Batch size: 512; Max learning rate: 0.001 ... For MATES selection, we sample 20% data with their influence scores as weights at each pretraining stage (10k steps). The sampling temperature τ is set to 1.0 to balance the data quality and the diversity. The update step U is also set to 10k...
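The selection step reported above (sample 20% of each stage's candidate pool, using predicted influence scores as weights at temperature τ = 1.0) can be illustrated with the minimal sketch below. This is not the authors' released implementation: the function name, the softmax-style normalization of scores into sampling weights, and the toy pool size are assumptions made purely for illustration.

# Minimal sketch of influence-weighted data sampling, assuming a
# temperature-scaled softmax converts predicted influence scores into
# sampling weights (the released MATES code may normalize differently).
import numpy as np

def sample_stage_data(influence_scores, select_ratio=0.2, tau=1.0, seed=0):
    """Return indices of the candidates selected for the next pretraining stage."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(influence_scores, dtype=np.float64) / tau
    logits -= logits.max()            # numerical stability before exp()
    weights = np.exp(logits)
    weights /= weights.sum()          # probabilities over the candidate pool
    n_select = int(select_ratio * len(weights))
    # Weighted sampling without replacement: higher predicted influence makes a
    # candidate more likely to be picked; tau > 1 flattens the weights (more
    # diversity), tau < 1 sharpens them (more quality-focused).
    return rng.choice(len(weights), size=n_select, replace=False, p=weights)

# Toy usage: random scores stand in for data influence model predictions.
scores = np.random.randn(10_000)
selected = sample_stage_data(scores)   # ~2,000 indices (20% of the pool)

In the paper's setup this selection would be repeated every U = 10k steps, with scores re-predicted by the data influence model for the current state of the pretraining model.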