Data Selection for Language Models via Importance Resampling
Authors: Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy S. Liang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose Data Selection with Importance Resampling (DSIR), an efficient and scalable framework that estimates importance weights in a reduced feature space for tractability and selects data with importance resampling according to these weights. We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents from the full Pile dataset in 4.5 hours. To measure whether hashed n-gram features preserve the aspects of the data that are relevant to the target, we define KL reduction, a data metric that measures the proximity between the selected pretraining data and the target on some feature space. Across 8 data selection methods (including expert selection), KL reduction on hashed n-gram features highly correlates with average downstream accuracy (r = 0.82). When selecting data for continued pretraining on a specific domain, DSIR performs comparably to expert curation across 8 target distributions. When pretraining general-domain models (target is Wikipedia and books), DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark. A minimal code sketch of the selection step is given after the table. |
| Researcher Affiliation | Academia | Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy Liang Department of Computer Science Stanford University {xie, shibani, tengyuma, pliang}@cs.stanford.edu |
| Pseudocode | No | The paper describes the steps of its framework in prose and lists steps in appendices, but it does not include formally labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | Code, selected data, and pretrained models are available at https://github.com/p-lambda/dsir. |
| Open Datasets | Yes | We consider selecting data from The Pile (1.6B examples) for continued pretraining of domain-specific LMs and training general-domain LMs from scratch. On 8 datasets from 4 domains (CS papers, biomedical papers, news, reviews), DSIR improves over RoBERTa (no continued pretraining) by 2% on average and is even comparable to continued pretraining on expert-curated data from Gururangan et al. [24]. For general-domain LMs (Section 7), the data selection target is formal, clean text from Wikipedia and books, following GPT-3 [10]. |
| Dataset Splits | Yes | We reserve chunk 0 for validation purposes and only consider the last 29 chunks. We train the importance weight estimator or fasttext classifier from The Pile validation set, where the target is Wikipedia + BookCorpus2 + Gutenberg + Books3 and the raw data come from the rest of the data sources in The Pile. |
| Hardware Specification | Yes | DSIR selects 100M documents from the full Pile dataset in 4.5 hours on 1 CPU node with 96 cores. Pretraining and fine-tuning use 4 Titan RTX GPUs (Tables 6, 7, 10, and 11). |
| Software Dependencies | No | The paper mentions software components like 'fasttext' and uses specific model architectures (e.g., 'BERT-base', 'RoBERTa-base'), but it does not provide specific version numbers for software libraries or frameworks (e.g., PyTorch, NLTK). |
| Experiment Setup | Yes | Table 6: Hyperparameters for training general-domain LMs from scratch. Table 7: Hyperparameters for continued pretraining of general-domain LMs. Table 8: Dataset-specific hyperparameters for fine-tuning LMs on GLUE. Table 9: Shared hyperparameters for fine-tuning LMs on GLUE. Table 10: Hyperparameters for continued pretraining on domain-specific data. Table 11: Hyperparameters for fine-tuning on domain-specific data. |
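
For a concrete picture of the selection step summarized in the table, below is a minimal sketch of hashed n-gram importance resampling and the KL reduction metric in the spirit of DSIR. It is not the authors' released implementation (see the linked repository for that): the bucket count, whitespace tokenizer, MD5 hashing, add-one smoothing, and the helper names (`hashed_ngram_features`, `dsir_select`, `kl_reduction`) are illustrative assumptions made here for a self-contained example.

```python
import hashlib
from collections import Counter

import numpy as np

NUM_BUCKETS = 10_000  # assumed number of hash buckets for n-gram features


def hashed_ngram_features(text, num_buckets=NUM_BUCKETS):
    """Map a document to counts over hashed unigram and bigram buckets."""
    tokens = text.lower().split()  # simple whitespace tokenizer (assumption)
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(
        int(hashlib.md5(ng.encode()).hexdigest(), 16) % num_buckets for ng in ngrams
    )


def fit_bag_of_ngrams(docs, num_buckets=NUM_BUCKETS, smoothing=1.0):
    """Estimate a smoothed categorical distribution over hashed n-gram buckets."""
    totals = np.full(num_buckets, smoothing)
    for doc in docs:
        for bucket, count in hashed_ngram_features(doc, num_buckets).items():
            totals[bucket] += count
    return totals / totals.sum()


def log_importance_weights(docs, p_target, p_raw):
    """log w(x) = sum over features of count_f(x) * (log p_target(f) - log p_raw(f))."""
    log_ratio = np.log(p_target) - np.log(p_raw)
    weights = np.empty(len(docs))
    for i, doc in enumerate(docs):
        counts = hashed_ngram_features(doc)
        weights[i] = sum(c * log_ratio[b] for b, c in counts.items())
    return weights


def dsir_select(raw_docs, target_docs, k, seed=0):
    """Select k documents from raw_docs by importance resampling (Gumbel top-k, without replacement)."""
    p_raw = fit_bag_of_ngrams(raw_docs)
    p_target = fit_bag_of_ngrams(target_docs)
    log_w = log_importance_weights(raw_docs, p_target, p_raw)
    rng = np.random.default_rng(seed)
    keys = log_w + rng.gumbel(size=len(raw_docs))  # Gumbel noise makes top-k a weighted sample
    top_k = np.argsort(keys)[-k:]
    return [raw_docs[i] for i in top_k]


def kl(p, q):
    """KL divergence between two categorical distributions (both smoothed, so no zeros)."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def kl_reduction(target_docs, raw_docs, selected_docs):
    """KL(target || raw) - KL(target || selected) on the hashed n-gram feature space."""
    p_target = fit_bag_of_ngrams(target_docs)
    return kl(p_target, fit_bag_of_ngrams(raw_docs)) - kl(p_target, fit_bag_of_ngrams(selected_docs))
```

In this sketch, `dsir_select(raw_docs, target_docs, k=...)` returns the k raw documents whose hashed n-gram profiles most resemble the target under the estimated importance weights; adding Gumbel noise to the log weights before taking the top k approximates sampling without replacement rather than a deterministic top-k. `kl_reduction` then quantifies how much closer the selected set is to the target than the raw pool on the same feature space, which is the quantity the paper reports as correlating with downstream accuracy.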