MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Authors: Zichun Yu, Spandan Das, Chenyan Xiong
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments of pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks. |
| Researcher Affiliation | Academia | Zichun Yu, Spandan Das, Chenyan Xiong; School of Computer Science, Carnegie Mellon University; {zichunyu, spandand, cx}@andrew.cmu.edu |
| Pseudocode | Yes | Algorithm 1 Model-Aware Data Selection |
| Open Source Code | Yes | Our code is open-sourced at https://github.com/cxcscmu/MATES. |
| Open Datasets | Yes | We pretrain 410M/1B models with Pythia [5] architecture from scratch on the C4 dataset [50] as our pretraining model M |
| Dataset Splits | No | The paper mentions a '10% hold-out validation set' for the data influence model, but does not specify the train/validation/test splits for the main C4 pretraining dataset. |
| Hardware Specification | Yes | We run all experiments on 8 A6000 GPUs, which will take 2 days for 410M models and 4 days for 1B models. |
| Software Dependencies | No | The paper mentions software like 'Pythia architecture', 'BERT-base', and 'lm-evaluation-harness codebase' but does not specify their version numbers. |
| Experiment Setup | Yes | Table 5: Experimental configurations. Pretraining dataset: C4; Tokens: 25B; Model: Pythia-410M/1B (randomly initialized); Steps: 50k; Sequence length: 1024; Batch size: 512; Max learning rate: 0.001; ... For MATES selection, we sample 20% of the data with their influence scores as weights at each pretraining stage (10k steps). The sampling temperature τ is set to 1.0 to balance data quality and diversity. The update step U is also set to 10k... |
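
To make the quoted MATES selection setup concrete, below is a minimal sketch of influence-weighted subset sampling for one pretraining stage. It is not the authors' implementation: the function name `select_stage_data` and the softmax-with-temperature mapping from influence scores to sampling probabilities are illustrative assumptions; only the 20% selection ratio, the temperature τ = 1.0, and the weighted-sampling idea come from the setup quoted above.

```python
import numpy as np

def select_stage_data(influence_scores: np.ndarray,
                      select_ratio: float = 0.2,
                      temperature: float = 1.0,
                      seed: int = 0) -> np.ndarray:
    """Sample a subset of the candidate pool for one pretraining stage,
    weighting each example by its predicted data influence score.

    select_ratio=0.2 and temperature=1.0 mirror the quoted settings; the
    softmax weighting below is an assumption about how scores become
    sampling probabilities, not the paper's exact procedure.
    """
    rng = np.random.default_rng(seed)
    n_select = int(len(influence_scores) * select_ratio)

    # Map influence scores to a sampling distribution (temperature-scaled softmax).
    logits = influence_scores / temperature
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()

    # Sample without replacement: higher-influence examples are favored,
    # but lower-scored ones keep some probability, preserving diversity.
    return rng.choice(len(influence_scores), size=n_select,
                      replace=False, p=probs)

# Example: pick 20% of a candidate pool for the next 10k-step stage.
scores = np.random.randn(100_000).astype(np.float32)  # placeholder influence scores
selected_indices = select_stage_data(scores)
```

The temperature controls how strongly selection concentrates on the highest-influence examples: lowering it pushes sampling toward greedy top-k selection, while raising it approaches uniform random selection.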