MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Authors: Zichun Yu, Spandan Das, Chenyan Xiong

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments of pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks.
Researcher Affiliation | Academia | Zichun Yu, Spandan Das, Chenyan Xiong, School of Computer Science, Carnegie Mellon University, {zichunyu, spandand, cx}@andrew.cmu.edu
Pseudocode | Yes | Algorithm 1: Model-Aware Data Selection
Open Source Code | Yes | Our code is open-sourced at https://github.com/cxcscmu/MATES.
Open Datasets | Yes | We pretrain 410M/1B models with Pythia [5] architecture from scratch on the C4 dataset [50] as our pretraining model M.
Dataset Splits | No | The paper mentions a '10% hold-out validation set' for the data influence model, but does not specify the train/validation/test splits for the main C4 pretraining dataset.
Hardware Specification | Yes | We run all experiments on 8 A6000 GPUs, which will take 2 days for 410M models and 4 days for 1B models.
Software Dependencies | No | The paper mentions software such as the 'Pythia architecture', 'BERT-base', and the 'lm-evaluation-harness' codebase, but does not specify their version numbers.
Experiment Setup | Yes | Table 5: Experimental configurations. Pretraining Dataset: C4; Tokens: 25B; Model: Pythia-410M/1B (randomly initialized); Steps: 50k; Sequence length: 1024; Batch size: 512; Max learning rate: 0.001 ... For MATES selection, we sample 20% data with their influence scores as weights at each pretraining stage (10k steps). The sampling temperature τ is set to 1.0 to balance the data quality and the diversity. The update step U is also set to 10k...
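The selection step reported above (sample 20% of each stage's candidate pool, using predicted influence scores as weights at temperature τ = 1.0) can be illustrated with the minimal sketch below. This is not the authors' released implementation: the function name, the softmax-style normalization of scores into sampling weights, and the toy pool size are assumptions made purely for illustration.

# Minimal sketch of influence-weighted data sampling, assuming a
# temperature-scaled softmax converts predicted influence scores into
# sampling weights (the released MATES code may normalize differently).
import numpy as np

def sample_stage_data(influence_scores, select_ratio=0.2, tau=1.0, seed=0):
    """Return indices of the candidates selected for the next pretraining stage."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(influence_scores, dtype=np.float64) / tau
    logits -= logits.max()            # numerical stability before exp()
    weights = np.exp(logits)
    weights /= weights.sum()          # probabilities over the candidate pool
    n_select = int(select_ratio * len(weights))
    # Weighted sampling without replacement: higher predicted influence makes a
    # candidate more likely to be picked; tau > 1 flattens the weights (more
    # diversity), tau < 1 sharpens them (more quality-focused).
    return rng.choice(len(weights), size=n_select, replace=False, p=weights)

# Toy usage: random scores stand in for data influence model predictions.
scores = np.random.randn(10_000)
selected = sample_stage_data(scores)   # ~2,000 indices (20% of the pool)

In the paper's setup this selection would be repeated every U = 10k steps, with scores re-predicted by the data influence model for the current state of the pretraining model.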