Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
Authors: Zichun Yu, Spandan Das, Chenyan Xiong
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments of pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks. |
| Researcher Affiliation | Academia | Zichun Yu, Spandan Das, Chenyan Xiong; School of Computer Science, Carnegie Mellon University; EMAIL |
| Pseudocode | Yes | Algorithm 1 Model-Aware Data Selection |
| Open Source Code | Yes | Our code is open-sourced at https://github.com/cxcscmu/MATES. |
| Open Datasets | Yes | We pretrain 410M/1B models with Pythia [5] architecture from scratch on the C4 dataset [50] as our pretraining model M |
| Dataset Splits | No | The paper mentions a '10% hold-out validation set' for the data influence model, but does not specify the train/validation/test splits for the main C4 pretraining dataset. |
| Hardware Specification | Yes | We run all experiments on 8 A6000 GPUs, which will take 2 days for 410M models and 4 days for 1B models. |
| Software Dependencies | No | The paper mentions software like 'Pythia architecture', 'BERT-base', and 'lm-evaluation-harness codebase' but does not specify their version numbers. |
| Experiment Setup | Yes | Table 5: Experimental configurations. Pretraining Dataset: C4; Tokens: 25B; Model: Pythia-410M/1B (randomly initialized); Steps: 50k; Sequence length: 1024; Batch size: 512; Max learning rate: 0.001 ... For MATES selection, we sample 20% data with their influence scores as weights at each pretraining stage (10k steps). The sampling temperature τ is set to 1.0 to balance the data quality and the diversity. The update step U is also set to 10k... |
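
The Experiment Setup row describes sampling 20% of the data with influence scores as weights and a temperature τ of 1.0. The sketch below shows one way such a step could look. It is not taken from the MATES repository: the function name `sample_by_influence`, the Gumbel-top-k sampling scheme, and the convention that a higher score means a more useful example are all assumptions made for illustration.

```python
import math
import random


def sample_by_influence(scores, fraction=0.2, temperature=1.0, seed=0):
    """Pick round(fraction * n) distinct indices, favouring higher scores.

    Assumption (not from the paper): examples are drawn without replacement
    with weights proportional to exp(score / temperature), implemented with
    the Gumbel-top-k trick. Flip the sign of the scores if lower is better.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not scores:
        return []

    rng = random.Random(seed)
    k = max(1, round(fraction * len(scores)))

    keys = []
    for idx, score in enumerate(scores):
        # Clamp away from 0 so the double logarithm is always defined.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        # A lower temperature makes the score dominate the noise.
        keys.append((score / temperature + gumbel, idx))

    # The k largest perturbed keys form the weighted sample.
    keys.sort(reverse=True)
    return sorted(idx for _, idx in keys[:k])


if __name__ == "__main__":
    # Hypothetical influence scores for ten examples.
    demo_scores = [0.1, -0.3, 0.8, 0.0, 1.2, -1.0, 0.4, 0.2, -0.5, 0.9]
    print(sample_by_influence(demo_scores, fraction=0.2, temperature=1.0))
```

With τ = 1.0 the noise and the scores are on a comparable scale, so lower-scored examples still have a chance of being picked. As τ approaches 0 the selection approaches a plain top-20% cut.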