Data Selection for Language Models via Importance Resampling

Authors: Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy S. Liang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose Data Selection with Importance Resampling (DSIR), an efficient and scalable framework that estimates importance weights in a reduced feature space for tractability and selects data with importance resampling according to these weights. We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents from the full Pile dataset in 4.5 hours. To measure whether hashed n-gram features preserve the aspects of the data that are relevant to the target, we define KL reduction, a data metric that measures the proximity between the selected pretraining data and the target on some feature space. Across 8 data selection methods (including expert selection), KL reduction on hashed n-gram features highly correlates with average downstream accuracy (r = 0.82). When selecting data for continued pretraining on a specific domain, DSIR performs comparably to expert curation across 8 target distributions. When pretraining general-domain models (target is Wikipedia and books), DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark. (Illustrative sketches of the importance-resampling procedure and of the KL reduction metric are given after the table.)
Researcher Affiliation | Academia | Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy Liang, Department of Computer Science, Stanford University, {xie, shibani, tengyuma, pliang}@cs.stanford.edu
Pseudocode | No | The paper describes the steps of its framework in prose and lists steps in the appendices, but it does not include formally labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | Code, selected data, and pretrained models are available at https://github.com/p-lambda/dsir.
Open Datasets | Yes | We consider selecting data from The Pile (1.6B examples) for continued pretraining of domain-specific LMs and training general-domain LMs from scratch. On 8 datasets from 4 domains (CS papers, biomedical papers, news, reviews), DSIR improves over RoBERTa (no continued pretraining) by 2% on average and is even comparable to continued pretraining on expert-curated data from Gururangan et al. [24]. For general-domain LMs (Section 7), the data selection target is formal, clean text from Wikipedia and books, following GPT-3 [10].
Dataset Splits | Yes | We reserve chunk 0 for validation purposes and only consider the last 29 chunks. We train the importance weight estimator or fasttext classifier from The Pile validation set, where the target is Wikipedia + BookCorpus2 + Gutenberg + Books3 and the raw data come from the rest of the data sources in The Pile.
Hardware Specification | Yes | DSIR selects 100M documents from the full Pile dataset in 4.5 hours on 1 CPU node with 96 cores. GPUs: 4 Titan RTX (Tables 6, 7, 10, 11).
Software Dependencies | No | The paper mentions software components like 'fasttext' and uses specific model architectures (e.g., 'BERT-base', 'RoBERTa-base'), but it does not provide specific version numbers for software libraries or frameworks (e.g., PyTorch, NLTK).
Experiment Setup | Yes | Table 6: Hyperparameters for training general-domain LMs from scratch. Table 7: Hyperparameters for continued pretraining of general-domain LMs. Table 8: Dataset-specific hyperparameters for fine-tuning LMs on GLUE. Table 9: Shared hyperparameters for fine-tuning LMs on GLUE. Table 10: Hyperparameters for continued pretraining on domain-specific data. Table 11: Hyperparameters for fine-tuning on domain-specific data.
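
As a rough illustration of the DSIR recipe quoted in the Research Type row, the sketch below hashes unigrams and bigrams into a fixed number of buckets, fits bag-of-n-grams distributions for the target and the raw pool, scores each raw example with a log importance weight, and resamples with the Gumbel top-k trick. All names, the 10,000-bucket size, and the smoothing choice are illustrative assumptions; this is not the API of the released p-lambda/dsir package.

```python
"""Minimal sketch of hashed n-gram importance resampling (assumed details)."""
import hashlib
import numpy as np

NUM_BUCKETS = 10_000  # assumed bucket count for hashed unigram + bigram features


def hashed_ngram_features(text, num_buckets=NUM_BUCKETS):
    """Count unigrams and bigrams of a document, hashed into fixed buckets."""
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = np.zeros(num_buckets)
    for ng in ngrams:
        bucket = int(hashlib.md5(ng.encode()).hexdigest(), 16) % num_buckets
        counts[bucket] += 1
    return counts


def fit_bag_of_ngrams(texts, smoothing=1.0):
    """Fit a smoothed categorical distribution over hashed n-gram buckets."""
    total = np.full(NUM_BUCKETS, smoothing)
    for text in texts:
        total += hashed_ngram_features(text)
    return total / total.sum()


def dsir_select(raw_texts, target_texts, k, seed=0):
    """Pick k raw examples by importance resampling toward the target."""
    rng = np.random.default_rng(seed)
    log_p = np.log(fit_bag_of_ngrams(target_texts))  # target feature distribution
    log_q = np.log(fit_bag_of_ngrams(raw_texts))     # raw feature distribution
    # Log importance weight of each raw example under the bag-of-n-grams model.
    log_w = np.array([hashed_ngram_features(t) @ (log_p - log_q) for t in raw_texts])
    # Gumbel top-k trick: sample k examples without replacement, with
    # probability proportional to their importance weights.
    gumbel_noise = rng.gumbel(size=len(raw_texts))
    return np.argsort(-(log_w + gumbel_noise))[:k]
```

This single-threaded sketch only conveys the structure of the computation; the reported 4.5-hour selection of 100M documents (Hardware Specification row) relies on parallelizing featurization and weighting across the 96 CPU cores of one node.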
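The abstract also cites KL reduction as the metric that correlates with downstream accuracy (r = 0.82). Under one natural reading of that metric, namely how much the selected data reduces the KL divergence to the target's hashed n-gram distribution relative to the raw pool, it could be computed as below. This reuses fit_bag_of_ngrams from the previous sketch; the exact definition should be checked against the paper.

```python
"""Sketch of KL reduction on the hashed n-gram feature space (assumed definition)."""
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) between two smoothed distributions over hashed buckets."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def kl_reduction(target_texts, raw_texts, selected_texts):
    """Assumed form: KL(target || raw) - KL(target || selected)."""
    p_target = fit_bag_of_ngrams(target_texts)
    p_raw = fit_bag_of_ngrams(raw_texts)
    p_selected = fit_bag_of_ngrams(selected_texts)
    return kl_divergence(p_target, p_raw) - kl_divergence(p_target, p_selected)
```

A larger value would indicate that the selected data sits closer to the target than a random draw from the raw pool does, which is the sense in which the quoted text says the metric tracks average downstream accuracy across the 8 data selection methods.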